Daily arXiv Papers - 2025-08-06

Summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Clinically Grounded Agent-based Report Evaluation: An Interpretable Metric for Radiology Report Generation

Radhika Dua, Young Joon, Kwon, Siddhant Dogra, Daniel Freedman, Diana Ruan, Motaz Nashawaty, Danielle Rigau, Daniel Alexander Alber, Kang Zhang, Kyunghyun Cho, Eric Karl Oermann

Main category: cs.CL

TL;DR: ICARE is an interpretable framework for evaluating radiology reports using LLM agents and MCQA, outperforming existing metrics in clinical alignment.

DetailsMotivation: Existing metrics for radiology report evaluation lack interpretability and clinical relevance, necessitating a more reliable method.

Method: ICARE uses two LLM agents to generate and answer clinically meaningful questions, comparing agreement as a proxy for precision and recall.

Result: ICARE aligns better with expert judgment, shows sensitivity to clinical content, and provides interpretable error patterns.

Conclusion: ICARE offers a transparent and clinically grounded approach to evaluating radiology reports, improving reliability and interpretability.

Abstract: Radiological imaging is central to diagnosis, treatment planning, and clinical decision-making. Vision-language foundation models have spurred interest in automated radiology report generation (RRG), but safe deployment requires reliable clinical evaluation of generated reports. Existing metrics often rely on surface-level similarity or behave as black boxes, lacking interpretability. We introduce ICARE (Interpretable and Clinically-grounded Agent-based Report Evaluation), an interpretable evaluation framework leveraging large language model agents and dynamic multiple-choice question answering (MCQA). Two agents, each with either the ground-truth or generated report, generate clinically meaningful questions and quiz each other. Agreement on answers captures preservation and consistency of findings, serving as interpretable proxies for clinical precision and recall. By linking scores to question-answer pairs, ICARE enables transparent, and interpretable assessment. Clinician studies show ICARE aligns significantly more with expert judgment than prior metrics. Perturbation analyses confirm sensitivity to clinical content and reproducibility, while model comparisons reveal interpretable error patterns.

[2] Modeling Annotator Disagreement with Demographic-Aware Experts and Synthetic Perspectives

Yinuo Xu, Veronica Derricks, Allison Earl, David Jurgens

Main category: cs.CL

TL;DR: DEM-MoE models annotator disagreement in NLP tasks using demographic-aware routing, performs well across groups, and uses synthetic data for imputation, improving diverse perspective representation.

DetailsMotivation: To better capture annotator disagreement and group-level variation in subjective NLP tasks, addressing sparse demographic coverage.

Method: DEM-MoE routes inputs to expert subnetworks based on annotator demographics. Uses LLM-generated synthetic annotations for data imputation and blends real/synthetic data with tailored strategies.

Result: DEM-MoE performs competitively across demographic groups, especially in high-disagreement datasets. Synthetic data aligns moderately with human annotations.

Conclusion: DEM-MoE and synthetic data blending improve diverse perspective representation, with optimal strategies depending on dataset structure.

Abstract: We present an approach to modeling annotator disagreement in subjective NLP tasks through both architectural and data-centric innovations. Our model, DEM-MoE (Demographic-Aware Mixture of Experts), routes inputs to expert subnetworks based on annotator demographics, enabling it to better represent structured, group-level variation compared to prior models. DEM-MoE consistently performs competitively across demographic groups, and shows especially strong results on datasets with high annotator disagreement. To address sparse demographic coverage, we test whether LLM-generated synthetic annotations via zero-shot persona prompting can be used for data imputation. We show these synthetic judgments align moderately well with human annotations on our data and offer a scalable way to potentially enrich training data. We then propose and evaluate approaches for blending real and synthetic data using strategies tailored to dataset structure. We find that the optimal strategies depend on dataset structure. Together, these contributions improve the representation of diverse perspectives.

[3] Highlight & Summarize: RAG without the jailbreaks

Giovanni Cherubin, Andrew Paverd

Main category: cs.CL

TL;DR: The paper introduces Highlight & Summarize (H&S), a design pattern for retrieval-augmented generation (RAG) systems to prevent jailbreaking and hijacking of LLMs by avoiding direct exposure of user queries to the generative model.

DetailsMotivation: Existing defenses against malicious prompts in LLMs are probabilistic and easily bypassed. H&S aims to prevent attacks by design.

Method: H&S splits the RAG pipeline into a highlighter (extracts relevant passages) and a summarizer (generates answers from highlights), avoiding direct query exposure to the LLM.

Result: H&S responses, especially with an LLM-based highlighter, often outperform standard RAG in correctness, relevance, and quality.

Conclusion: H&S provides a robust, design-based solution to prevent LLM misuse while maintaining or improving response quality.

Abstract: Preventing jailbreaking and model hijacking of Large Language Models (LLMs) is an important yet challenging task. For example, when interacting with a chatbot, malicious users can input specially crafted prompts to cause the LLM to generate undesirable content or perform a completely different task from its intended purpose. Existing mitigations for such attacks typically rely on hardening the LLM’s system prompt or using a content classifier trained to detect undesirable content or off-topic conversations. However, these probabilistic approaches are relatively easy to bypass due to the very large space of possible inputs and undesirable outputs. In this paper, we present and evaluate Highlight & Summarize (H&S), a new design pattern for retrieval-augmented generation (RAG) systems that prevents these attacks by design. The core idea is to perform the same task as a standard RAG pipeline (i.e., to provide natural language answers to questions, based on relevant sources) without ever revealing the user’s question to the generative LLM. This is achieved by splitting the pipeline into two components: a highlighter, which takes the user’s question and extracts relevant passages (“highlights”) from the retrieved documents, and a summarizer, which takes the highlighted passages and summarizes them into a cohesive answer. We describe several possible instantiations of H&S and evaluate their generated responses in terms of correctness, relevance, and response quality. Surprisingly, when using an LLM-based highlighter, the majority of H&S responses are judged to be better than those of a standard RAG pipeline.

[4] Merge-based syntax is mediated by distinct neurocognitive mechanisms: A clustering analysis of comprehension abilities in 84,000 individuals with language deficits across nine languages

Elliot Murphy, Rohan Venkatesh, Edward Khokhlovich, Andrey Vyshedskiy

Main category: cs.CL

TL;DR: The paper explores the computational operation ‘Merge’ in syntax, identifying three distinct cognitive mechanisms for processing different Merge-based linguistic structures, supported by behavioral evidence.

DetailsMotivation: To investigate whether the elementary operation 'Merge' in syntax is supported by distinct cognitive mechanisms for different linguistic structures.

Method: Systematic study of participants’ comprehension of sentences with varying syntactic complexity, using clustering analyses to identify behavioral patterns.

Result: Behavioral evidence revealed three distinct structural types, suggesting different cognitive mechanisms for processing Merge-based objects.

Conclusion: While Merge may have evolved suddenly, its processing involves distinct cognitive mechanisms for different linguistic structures, potentially emerging at different developmental stages and subject to selective impairment.

Abstract: In the modern language sciences, the core computational operation of syntax, ‘Merge’, is defined as an operation that combines two linguistic units (e.g., ‘brown’, ‘cat’) to form a categorized structure (‘brown cat’, a Noun Phrase). This can then be further combined with additional linguistic units based on this categorial information, respecting non-associativity such that abstract grouping is respected. Some linguists have embraced the view that Merge is an elementary, indivisible operation that emerged in a single evolutionary step. From a neurocognitive standpoint, different mental objects constructed by Merge may be supported by distinct mechanisms: (1) simple command constructions (e.g., “eat apples”); (2) the merging of adjectives and nouns (“red boat”); and (3) the merging of nouns with spatial prepositions (“laptop behind the sofa”). Here, we systematically investigate participants’ comprehension of sentences with increasing levels of syntactic complexity. Clustering analyses revealed behavioral evidence for three distinct structural types, which we discuss as potentially emerging at different developmental stages and subject to selective impairment. While a Merge-based syntax may still have emerged suddenly in evolutionary time, responsible for the structured symbolic turn our species took, different cognitive mechanisms seem to underwrite the processing of various types of Merge-based objects.

[5] Coherent Multimodal Reasoning with Iterative Self-Evaluation for Vision-Language Models

Wenjie Luo, Ruocheng Li, Shanshan Zhu, Julian Perry

Main category: cs.CL

TL;DR: The paper introduces the Coherent Multimodal Reasoning Framework (CMRF) to improve cross-modal reasoning in LVLMs, achieving state-of-the-art performance with a 69.4% accuracy.

DetailsMotivation: Current LVLMs lack deep, deliberative reasoning, relying on superficial associations. CMRF aims to enhance common sense reasoning through iterative, self-correcting inference.

Method: CMRF decomposes queries into sub-questions (RDU), performs contextual inference (CIE), and evaluates coherence (CAM), refining reasoning iteratively.

Result: CMRF outperforms baselines by +2.4%, excelling in complex reasoning tasks like VCR and A-OKVQA.

Conclusion: CMRF’s modules and iterative refinement significantly improve reasoning coherence and accuracy, validated by ablation studies and human evaluations.

Abstract: Despite significant advancements, current large language models (LLMs) and vision-language models (LVLMs) continue to struggle with complex, multi-step, cross-modal common sense reasoning tasks, often exhibiting a lack of “deliberative thinking.” They tend to rely on superficial associations rather than deep, chained inference, particularly when integrating visual information with abstract concepts. To address this, we propose the Coherent Multimodal Reasoning Framework (CMRF), a novel approach that enhances LVLMs’ common sense reasoning capabilities through an iterative, self-evaluating inference mechanism. CMRF mimics human problem-solving by decomposing complex queries, generating step-by-step inferences, and self-correcting errors. Our framework integrates three key modules: a Reasoning Decomposition Unit (RDU) for breaking down problems into sub-questions, a Contextual Inference Engine (CIE) for contextual inference, and a Coherence Assessment Module (CAM) for evaluating logical consistency and confidence. Coupled with an Adaptive Iterative Refinement strategy, CMRF systematically refines its reasoning paths. Built upon LLaVA-1.6-34B and trained on a novel Multimodal Daily Activity Reasoning (MDAR) dataset, CMRF achieves state-of-the-art performance among open-source LVLMs on challenging benchmarks like VCR, A-OKVQA, and DailyLife-MRC. It attains an average accuracy of 69.4%, surpassing the best open-source baseline by +2.4 percentage points, with particular strength in complex reasoning scenarios. Extensive ablation studies and human evaluations confirm the critical contributions of each module and the effectiveness of iterative refinement in fostering more coherent and accurate reasoning.

[6] SLIM-LLMs: Modeling of Style-Sensory Language RelationshipsThrough Low-Dimensional Representations

Osama Khalid, Sanvesh Srivastava, Padmini Srinivasan

Main category: cs.CL

TL;DR: The paper explores sensorial language’s relationship with stylistic features using Reduced-Rank Ridge Regression (R4) and introduces SLIM-LLMs for efficient modeling.

DetailsMotivation: To understand how sensorial language connects with stylistic features and improve modeling efficiency.

Method: Uses Reduced-Rank Ridge Regression (R4) and introduces SLIM-LLMs for non-linear relationships.

Result: Low-rank LIWC features (r=24) match full feature set (r=74) performance, reducing parameters by 80%.

Conclusion: SLIM-LLMs effectively model sensorial language with fewer parameters, maintaining performance.

Abstract: Sensorial language – the language connected to our senses including vision, sound, touch, taste, smell, and interoception, plays a fundamental role in how we communicate experiences and perceptions. We explore the relationship between sensorial language and traditional stylistic features, like those measured by LIWC, using a novel Reduced-Rank Ridge Regression (R4) approach. We demonstrate that low-dimensional latent representations of LIWC features r = 24 effectively capture stylistic information for sensorial language prediction compared to the full feature set (r = 74). We introduce Stylometrically Lean Interpretable Models (SLIM-LLMs), which model non-linear relationships between these style dimensions. Evaluated across five genres, SLIM-LLMs with low-rank LIWC features match the performance of full-scale language models while reducing parameters by up to 80%.

[7] Can LLMs Generate High-Quality Task-Specific Conversations?

Shengqi Li, Amarnath Gupta

Main category: cs.CL

TL;DR: A framework for controlling conversation quality in LLMs using nine key parameters across six dimensions, showing significant improvements in dialogue properties.

DetailsMotivation: Address challenges in conversation generation like topic coherence, knowledge progression, character consistency, and control granularity.

Method: Parameter-based control framework tested with state-of-the-art LLMs.

Result: Statistically significant differences in generated conversation properties.

Conclusion: The framework standardizes conversation quality control for applications in education, therapy, customer service, and entertainment, with future work on expanding parameters and benchmarks.

Abstract: This paper introduces a parameterization framework for controlling conversation quality in large language models. We explore nine key parameters across six dimensions that enable precise specification of dialogue properties. Through experiments with state-of-the-art LLMs, we demonstrate that parameter-based control produces statistically significant differences in generated conversation properties. Our approach addresses challenges in conversation generation, including topic coherence, knowledge progression, character consistency, and control granularity. The framework provides a standardized method for conversation quality control with applications in education, therapy, customer service, and entertainment. Future work will focus on implementing additional parameters through architectural modifications and developing benchmark datasets for evaluation.

[8] CoCoTen: Detecting Adversarial Inputs to Large Language Models through Latent Space Features of Contextual Co-occurrence Tensors

Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis

Main category: cs.CL

TL;DR: A novel method using Contextual Co-occurrence Matrices and Tensors effectively detects adversarial and jailbreak prompts in LLMs, achieving high F1 scores with minimal labeled data and significant speed improvements.

DetailsMotivation: The complexity of LLMs makes them vulnerable to attacks like jailbreaks, necessitating robust detection methods for safe use.

Method: Leverages latent space characteristics of Contextual Co-occurrence Matrices and Tensors to identify adversarial prompts.

Result: Achieves an F1 score of 0.83 with only 0.5% labeled prompts, a 96.6% improvement over baselines, and is 2.3 to 128.4 times faster.

Conclusion: The method is highly effective and efficient for detecting adversarial prompts in LLMs, especially in data-scarce scenarios.

Abstract: The widespread use of Large Language Models (LLMs) in many applications marks a significant advance in research and practice. However, their complexity and hard-to-understand nature make them vulnerable to attacks, especially jailbreaks designed to produce harmful responses. To counter these threats, developing strong detection methods is essential for the safe and reliable use of LLMs. This paper studies this detection problem using the Contextual Co-occurrence Matrix, a structure recognized for its efficacy in data-scarce environments. We propose a novel method leveraging the latent space characteristics of Contextual Co-occurrence Matrices and Tensors for the effective identification of adversarial and jailbreak prompts. Our evaluations show that this approach achieves a notable F1 score of 0.83 using only 0.5% of labeled prompts, which is a 96.6% improvement over baselines. This result highlights the strength of our learned patterns, especially when labeled data is scarce. Our method is also significantly faster, speedup ranging from 2.3 to 128.4 times compared to the baseline models. To support future research and reproducibility, we have made our implementation publicly available.

[9] When Algorithms Meet Artists: Topic Modeling the AI-Art Debate, 2013-2025

Ariya Mukherjee-Gandhi, Oliver Muellerklein

Main category: cs.CL

TL;DR: A 12-year analysis of AI-generated art discourse reveals a misalignment between artists’ concerns and media narratives, highlighting gatekeeping via technical jargon.

DetailsMotivation: To address the marginalization of artists' voices in AI-art discourse and analyze the disconnect between their concerns and dominant narratives.

Method: Analyzed 439 curated 500-word excerpts from diverse sources (2013-2025) using a reproducible BERTopic-based methodology to identify thematic clusters.

Result: Identified five stable thematic clusters and found technical jargon often sidelines artists’ urgent issues.

Conclusion: Calls for transparency-driven engagement with artist perspectives in AI-creative discussions and provides a methodological baseline for future research.

Abstract: As generative AI continues to reshape artistic production and alternate modes of human expression, artists whose livelihoods are most directly affected have raised urgent concerns about consent, transparency, and the future of creative labor. However, the voices of artists are often marginalized in dominant public and scholarly discourse. This study presents a twelve-year analysis, from 2013 to 2025, of English-language discourse surrounding AI-generated art. It draws from 439 curated 500-word excerpts sampled from opinion articles, news reports, blogs, legal filings, and spoken-word transcripts. Through a reproducible methodology, we identify five stable thematic clusters and uncover a misalignment between artists’ perceptions and prevailing media narratives. Our findings highlight how the use of technical jargon can function as a subtle form of gatekeeping, often sidelining the very issues artists deem most urgent. Our work provides a BERTopic-based methodology and a multimodal baseline for future research, alongside a clear call for deeper, transparency-driven engagement with artist perspectives in the evolving AI-creative landscape.

[10] ADS-Edit: A Multimodal Knowledge Editing Dataset for Autonomous Driving Systems

Chenxi Wang, Jizhan Fang, Xiang Chen, Bozhong Tian, Ziwen Xu, Huajun Chen, Ningyu Zhang

Main category: cs.CL

TL;DR: The paper proposes Knowledge Editing and introduces ADS-Edit, a dataset for improving Large Multimodal Models in Autonomous Driving Systems, addressing challenges like traffic knowledge gaps and complex road conditions.

DetailsMotivation: Challenges in applying Large Multimodal Models to Autonomous Driving Systems, such as traffic knowledge misunderstandings and diverse road conditions, motivate the need for targeted model behavior modifications.

Method: The authors propose Knowledge Editing for targeted model adjustments and introduce ADS-Edit, a multimodal dataset tailored for Autonomous Driving Systems.

Result: Comprehensive experiments were conducted, yielding insights into the effectiveness of Knowledge Editing in autonomous driving applications.

Conclusion: The work aims to advance knowledge editing applications in autonomous driving, with code and data made publicly available.

Abstract: Recent advancements in Large Multimodal Models (LMMs) have shown promise in Autonomous Driving Systems (ADS). However, their direct application to ADS is hindered by challenges such as misunderstanding of traffic knowledge, complex road conditions, and diverse states of vehicle. To address these challenges, we propose the use of Knowledge Editing, which enables targeted modifications to a model’s behavior without the need for full retraining. Meanwhile, we introduce ADS-Edit, a multimodal knowledge editing dataset specifically designed for ADS, which includes various real-world scenarios, multiple data types, and comprehensive evaluation metrics. We conduct comprehensive experiments and derive several interesting conclusions. We hope that our work will contribute to the further advancement of knowledge editing applications in the field of autonomous driving. Code and data are available in https://github.com/zjunlp/EasyEdit/blob/main/examples/ADSEdit.md.

[11] Privacy-Aware Decoding: Mitigating Privacy Leakage of Large Language Models in Retrieval-Augmented Generation

Haoran Wang, Xiongxiao Xu, Baixiang Huang, Kai Shu

Main category: cs.CL

TL;DR: PAD introduces a lightweight, inference-time defense for RAG systems to prevent private data leakage by injecting calibrated noise during generation, ensuring privacy without retraining.

DetailsMotivation: To address privacy risks in RAG systems when handling sensitive data, preventing extraction attacks that leak confidential information.

Method: Proposes Privacy-Aware Decoding (PAD), which injects Gaussian noise into token logits, uses confidence-based screening, sensitivity estimation, and context-aware noise calibration, with RDP accounting for privacy guarantees.

Result: PAD significantly reduces private data leakage while maintaining response utility, outperforming existing defenses in experiments on real-world datasets.

Conclusion: PAD offers a scalable, model-agnostic solution for privacy in RAG systems, advancing privacy-preserving decoding strategies.

Abstract: Retrieval-Augmented Generation (RAG) enhances the factual accuracy of large language models (LLMs) by conditioning outputs on external knowledge sources. However, when retrieval involves private or sensitive data, RAG systems are susceptible to extraction attacks that can leak confidential information through generated responses. We propose Privacy-Aware Decoding (PAD), a lightweight, inference-time defense that adaptively injects calibrated Gaussian noise into token logits during generation. PAD integrates confidence-based screening to selectively protect high-risk tokens, efficient sensitivity estimation to minimize unnecessary noise, and context-aware noise calibration to balance privacy with generation quality. A \renyi Differential Privacy (RDP) accountant rigorously tracks cumulative privacy loss, enabling explicit per-response $(\varepsilon, \delta)$-DP guarantees for sensitive outputs. Unlike prior approaches requiring retraining or corpus-level filtering, PAD is model-agnostic and operates entirely at decoding time with minimal computational overhead. Experiments on three real-world datasets demonstrate that PAD substantially reduces private information leakage while preserving response utility, outperforming existing retrieval- and post-processing-based defenses. Our work takes an important step toward mitigating privacy risks in RAG via decoding strategies, paving the way for universal and scalable privacy solutions in sensitive domains. Our code is available: https://github.com/wang2226/PAD.

[12] Token-Level Precise Attack on RAG: Searching for the Best Alternatives to Mislead Generation

Zizhong Li, Haopeng Zhang, Jiawei Zhang

Main category: cs.CL

TL;DR: The paper introduces TPARAG, a token-level attack framework targeting RAG systems, outperforming prior methods in both retrieval and generation stages.

DetailsMotivation: Addressing security vulnerabilities in RAG systems, particularly the risk of malicious content manipulation, which prior methods inadequately tackled.

Method: Proposes TPARAG, using a lightweight white-box LLM to iteratively optimize malicious passages at the token level for retrievability and attack success.

Result: TPARAG outperforms existing approaches in retrieval and end-to-end attack effectiveness on open-domain QA datasets.

Conclusion: Reveals critical RAG vulnerabilities and provides insights for improving robustness against such attacks.

Abstract: While large language models (LLMs) have achieved remarkable success in providing trustworthy responses for knowledge-intensive tasks, they still face critical limitations such as hallucinations and outdated knowledge. To address these issues, the retrieval-augmented generation (RAG) framework enhances LLMs with access to external knowledge via a retriever, enabling more accurate and real-time outputs about the latest events. However, this integration brings new security vulnerabilities: the risk that malicious content in the external database can be retrieved and used to manipulate model outputs. Although prior work has explored attacks on RAG systems, existing approaches either rely heavily on access to the retriever or fail to jointly consider both retrieval and generation stages, limiting their effectiveness, particularly in black-box scenarios. To overcome these limitations, we propose Token-level Precise Attack on the RAG (TPARAG), a novel framework that targets both white-box and black-box RAG systems. TPARAG leverages a lightweight white-box LLM as an attacker to generate and iteratively optimize malicious passages at the token level, ensuring both retrievability and high attack success in generation. Extensive experiments on open-domain QA datasets demonstrate that TPARAG consistently outperforms previous approaches in retrieval-stage and end-to-end attack effectiveness. These results further reveal critical vulnerabilities in RAG pipelines and offer new insights into improving their robustness.

[13] Cross-lingual Opinions and Emotions Mining in Comparable Documents

Motaz Saad, David Langlois, Kamel Smaili

Main category: cs.CL

TL;DR: The paper studies sentiment and emotion differences in English-Arabic comparable texts, using cross-lingual annotation and bilingual lexicons, finding alignment varies by news source.

DetailsMotivation: To understand how sentiments and emotions differ across topic-aligned but non-translated English-Arabic documents, a gap in prior research.

Method: Annotate texts with sentiment/emotion labels using cross-lingual methods and bilingual lexicons, then statistically compare alignment in document pairs.

Result: Sentiment and emotion annotations align when documents are from the same news agency but diverge when from different sources.

Conclusion: The method is language-independent and applicable to other language pairs, revealing source-dependent sentiment/emotion alignment.

Abstract: Comparable texts are topic-aligned documents in multiple languages that are not direct translations. They are valuable for understanding how a topic is discussed across languages. This research studies differences in sentiments and emotions across English-Arabic comparable documents. First, texts are annotated with sentiment and emotion labels. We apply a cross-lingual method to label documents with opinion classes (subjective/objective), avoiding reliance on machine translation. To annotate with emotions (anger, disgust, fear, joy, sadness, surprise), we manually translate the English WordNet-Affect (WNA) lexicon into Arabic, creating bilingual emotion lexicons used to label the comparable corpora. We then apply a statistical measure to assess the agreement of sentiments and emotions in each source-target document pair. This comparison is especially relevant when the documents originate from different sources. To our knowledge, this aspect has not been explored in prior literature. Our study includes English-Arabic document pairs from Euronews, BBC, and Al-Jazeera (JSC). Results show that sentiment and emotion annotations align when articles come from the same news agency and diverge when they come from different ones. The proposed method is language-independent and generalizable to other language pairs.

[14] Listening to the Unspoken: Exploring “365” Aspects of Multimodal Interview Performance Assessment

Jia Li, Yang Wang, Wenhao Qian, Jialong Hu, Zhenzhen Hu, Richang Hong, Meng Wang

Main category: cs.CL

TL;DR: A novel framework for interview performance assessment using multimodal data (video, audio, text) achieves comprehensive and unbiased evaluations, winning the AVI Challenge 2025.

DetailsMotivation: To ensure holistic and fair evaluations of candidates by capturing explicit and implicit cues from multimodal data.

Method: Integrates three modalities, six responses per candidate, and five evaluation dimensions. Uses modality-specific feature extractors and a Shared Compression Multilayer Perceptron for fusion, followed by a two-level ensemble learning strategy.

Result: Achieved a multi-dimensional average MSE of 0.1824 and secured first place in the AVI Challenge 2025.

Conclusion: The framework effectively advances automated and multimodal interview performance assessment, demonstrating robustness and fairness.

Abstract: Interview performance assessment is essential for determining candidates’ suitability for professional positions. To ensure holistic and fair evaluations, we propose a novel and comprehensive framework that explores ``365’’ aspects of interview performance by integrating \textit{three} modalities (video, audio, and text), \textit{six} responses per candidate, and \textit{five} key evaluation dimensions. The framework employs modality-specific feature extractors to encode heterogeneous data streams and subsequently fused via a Shared Compression Multilayer Perceptron. This module compresses multimodal embeddings into a unified latent space, facilitating efficient feature interaction. To enhance prediction robustness, we incorporate a two-level ensemble learning strategy: (1) independent regression heads predict scores for each response, and (2) predictions are aggregated across responses using a mean-pooling mechanism to produce final scores for the five target dimensions. By listening to the unspoken, our approach captures both explicit and implicit cues from multimodal data, enabling comprehensive and unbiased assessments. Achieving a multi-dimensional average MSE of 0.1824, our framework secured first place in the AVI Challenge 2025, demonstrating its effectiveness and robustness in advancing automated and multimodal interview performance assessment. The full implementation is available at https://github.com/MSA-LMC/365Aspects.

[15] Long Story Generation via Knowledge Graph and Literary Theory

Ge Shi, Kaiyu Huang, Guochen Feng

Main category: cs.CL

TL;DR: Proposes a multi-agent Story Generator using LLMs to address theme drift and incoherent plots in long text generation, improving story quality.

DetailsMotivation: Previous outline-based methods suffer from theme drift and unappealing plots, prompting a need for a better approach.

Method: Uses multi-agent LLMs with long/short-term memory storage, a story theme obstacle framework, and multi-agent interaction for revisions.

Result: Generates higher-quality long stories compared to previous methods.

Conclusion: The proposed method effectively improves long story generation by addressing key issues in coherence and appeal.

Abstract: The generation of a long story consisting of several thousand words is a sub-task in the field of long text generation~(LTG). Previous research has addressed this challenge through outline-based generation, which employs a multi-stage method for generating outlines into stories. However, this approach suffers from two common issues: almost inevitable theme drift caused by the loss of memory of previous outlines, and tedious plots with incoherent logic that are less appealing to human readers. In this paper, we propose the multi-agent Story Generator structure to improve the multi-stage method, using large language models~(LLMs) as the core components of agents. To avoid theme drift, we introduce a memory storage model comprising two components: a long-term memory storage that identifies the most important memories, thereby preventing theme drift; and a short-term memory storage that retains the latest outlines from each generation round. To incorporate engaging elements into the story, we design a story theme obstacle framework based on literary narratology theory that introduces uncertain factors and evaluation criteria to generate outline. This framework calculates the similarity of the former storyline and enhances the appeal of the story by building a knowledge graph and integrating new node content. Additionally, we establish a multi-agent interaction stage to simulate writer-reader interaction through dialogue and revise the story text according to feedback, to ensure it remains consistent and logical. Evaluations against previous methods demonstrate that our approach can generate higher-quality long stories.

[16] RCP-Merging: Merging Long Chain-of-Thought Models with Domain-Specific Models by Considering Reasoning Capability as Prior

Junyao Yang, Jianwei Wang, Huiping Zhuang, Cen Chen, Ziqian Zeng

Main category: cs.CL

TL;DR: RCP-Merging is a novel framework for merging domain-specific LLMs with long CoT reasoning models, improving domain task performance without degrading reasoning capabilities.

DetailsMotivation: To create a dual-capability model combining long CoT reasoning and domain-specific knowledge efficiently, avoiding the pitfalls of current merging methods.

Method: RCP-Merging treats reasoning model weights as prior, uses a reasoning capability indicator, and selectively merges domain-specific weights.

Result: Experiments show RCP-Merging improves domain task performance by 9.5% and 9.2% over state-of-the-art methods, preserving long CoT reasoning.

Conclusion: RCP-Merging successfully integrates domain-specific and reasoning models, offering a resource-efficient solution with enhanced performance.

Abstract: Large Language Models (LLMs) with long chain-of-thought (CoT) capability, termed Reasoning Models, demonstrate superior intricate problem-solving abilities through multi-step long CoT reasoning. To create a dual-capability model with long CoT capability and domain-specific knowledge without substantial computational and data costs, model merging emerges as a highly resource-efficient method. However, significant challenges lie in merging domain-specific LLMs with long CoT ones since nowadays merging methods suffer from reasoning capability degradation, even gibberish output and output collapse. To overcome this, we introduce RCP-Merging: Merging Long Chain-of-Thought Models with Domain-Specific Models by Considering Reasoning Capability as Prior, a novel merging framework designed to integrate domain-specific LLMs with long CoT capability, meanwhile maintaining model performance in the original domain. Treating reasoning model weights as foundational prior, our method utilizes a reasoning capability indicator to preserve core long CoT capability model weights while selectively merging essential domain-specific weights. We conducted extensive experiments on Qwen2.5-7B, Llama3.1-8B, and Qwen2.5-1.5B models in BioMedicine and Finance domains. Our results show that RCP-Merging successfully merges a reasoning model with domain-specific ones, improving domain task performance by 9.5% and 9.2% over state-of-the-art methods, without significantly harming the original long CoT reasoning capability.

[17] What Makes a Good Speech Tokenizer for LLM-Centric Speech Generation? A Systematic Study

Xiaoran Fan, Zhichao Sun, Yangfan Gao, Jingfei Xiong, Hang Yan, Yifei Cao, Jiajun Sun, Shuo Li, Zhihao Zhang, Zhiheng Xi, Yuhao Zhou, Senjie Jin, Changhao Jiang, Junjie Ye, Ming Zhang, Rui Zheng, Zhenhua Han, Yunke Zhang, Demei Yan, Shaokang Dong, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang

Main category: cs.CL

TL;DR: The paper explores speech tokenizer designs in LLM-centric speech-language models (SLMs), finding decoupled tokenization improves alignment and synthesis. Multi-token prediction (MTP) boosts decoding speed and reduces word error rates. A speaker-aware generation paradigm and new benchmark enhance knowledge understanding and speaker consistency.

DetailsMotivation: To unify speech and text understanding/generation by addressing cross-modal alignment and speech generation challenges in SLMs.

Method: Systematically compares coupled, semi-decoupled, and fully decoupled speech tokenizers; introduces multi-token prediction (MTP) and a speaker-aware generation paradigm.

Result: Decoupled tokenization improves alignment/synthesis; MTP speeds decoding 12x and reduces word error rate (6.07 to 3.01). Speaker-aware methods enhance knowledge/speaker consistency.

Conclusion: Decoupled tokenization and MTP significantly advance SLM performance, while speaker-aware methods improve contextual and speaker-specific generation.

Abstract: Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. However, challenges remain in achieving effective cross-modal alignment and high-quality speech generation. In this work, we systematically investigate the role of speech tokenizer designs in LLM-centric SLMs, augmented by speech heads and speaker modeling. We compare coupled, semi-decoupled, and fully decoupled speech tokenizers under a fair SLM framework and find that decoupled tokenization significantly improves alignment and synthesis quality. To address the information density mismatch between speech and text, we introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens. This leads to up to 12$\times$ faster decoding and a substantial drop in word error rate (from 6.07 to 3.01). Furthermore, we propose a speaker-aware generation paradigm and introduce RoleTriviaQA, a large-scale role-playing knowledge QA benchmark with diverse speaker identities. Experiments demonstrate that our methods enhance both knowledge understanding and speaker consistency.

[18] Light-IF: Endowing LLMs with Generalizable Reasoning via Preview and Self-Checking for Complex Instruction Following

Chenyang Wang, Liang Wen, Shousheng Jia, Xiangzheng Zhang, Liang Xu

Main category: cs.CL

TL;DR: The paper addresses inconsistent instruction adherence in LLMs by proposing a framework for rigorous reasoning, involving preview and self-checking, and demonstrates improved performance.

DetailsMotivation: LLMs struggle with complex instructions due to lazy reasoning during the thinking stage, leading to poor adherence.

Method: The framework includes generating complex instructions, filtering prompts, rejection sampling, and fine-tuning with Entropy-SFT and TEA-RL for reasoning transformation.

Result: The Light-IF-32B model outperforms larger open-source and closed-source models on instruction-following benchmarks.

Conclusion: The proposed framework effectively enhances LLMs’ reasoning and instruction adherence, achieving superior performance.

Abstract: While advancements in the reasoning abilities of LLMs have significantly enhanced their performance in solving mathematical problems, coding tasks, and general puzzles, their effectiveness in accurately adhering to instructions remains inconsistent, particularly with more complex directives. Our investigation identifies lazy reasoning during the thinking stage as the primary factor contributing to poor instruction adherence. To mitigate this issue, we propose a comprehensive framework designed to enable rigorous reasoning processes involving preview and self-checking, essential for satisfying strict instruction constraints. Specifically, we first generate instructions with complex constraints and apply a filtering process to obtain valid prompts, resulting in three distinct prompt datasets categorized as hard, easy, and pass. Then, we employ rejection sampling on the pass prompts to curate a small yet high-quality dataset, enabling a cold-start initialization of the model and facilitating its adaptation to effective reasoning patterns. Subsequently, we employ an entropy-preserving supervised fine-tuning (Entropy-SFT) strategy coupled with token-wise entropy-adaptive (TEA-RL) reinforcement learning guided by rule-based dense rewards. This approach encourages the model to transform its reasoning mechanism, ultimately fostering generalizable reasoning abilities that encompass preview and self-checking. Extensive experiments conducted on instruction-following benchmarks demonstrate remarkable performance improvements across various model scales. Notably, our Light-IF-32B model surpasses both larger open-source models such as DeepSeek-R1 and closed-source models like Doubao-1.6.

[19] Analyzing German Parliamentary Speeches: A Machine Learning Approach for Topic and Sentiment Classification

Lukas Pätz, Moritz Beyer, Jannik Späth, Lasse Bohlen, Patrick Zschech, Mathias Kraus, Julian Rosenberger

Main category: cs.CL

TL;DR: The study analyzes 28,000 German Bundestag speeches using ML models for topic and sentiment classification, revealing party-specific discourse trends and shifts when parties transition between government and opposition roles.

DetailsMotivation: To understand how political discourse evolves in the Bundestag, focusing on topic trends, sentiment dynamics, and party-specific strategies.

Method: Developed and trained two ML models (topic and sentiment classification) on a manually labeled dataset of parliamentary speeches.

Result: High model performance (AUROC: 0.94 for topics, 0.89 for sentiment). Found relationships between party roles (government/opposition) and discourse style, with governing responsibilities influencing rhetoric.

Conclusion: Governing roles and ideological positions shape political discourse, with observable shifts when parties transition between government and opposition.

Abstract: This study investigates political discourse in the German parliament, the Bundestag, by analyzing approximately 28,000 parliamentary speeches from the last five years. Two machine learning models for topic and sentiment classification were developed and trained on a manually labeled dataset. The models showed strong classification performance, achieving an area under the receiver operating characteristic curve (AUROC) of 0.94 for topic classification (average across topics) and 0.89 for sentiment classification. Both models were applied to assess topic trends and sentiment distributions across political parties and over time. The analysis reveals remarkable relationships between parties and their role in parliament. In particular, a change in style can be observed for parties moving from government to opposition. While ideological positions matter, governing responsibilities also shape discourse. The analysis directly addresses key questions about the evolution of topics, sentiment dynamics, and party-specific discourse strategies in the Bundestag.

[20] Beyond Content: How Grammatical Gender Shapes Visual Representation in Text-to-Image Models

Muhammed Saeed, Shaina Raza, Ashmal Vayani, Muhammad Abdul-Mageed, Ali Emami, Shady Shehata

Main category: cs.CL

TL;DR: The paper investigates how grammatical gender in languages influences visual representation in Text-to-Image (T2I) models, revealing significant biases in gender representation based on linguistic structure.

DetailsMotivation: To explore the overlooked impact of grammatical gender on visual outputs in T2I models, beyond demographic and stereotypical biases.

Method: A cross-linguistic benchmark was created with 800 prompts in five gendered and two gender-neutral languages, generating 28,800 images across three T2I models.

Result: Grammatical gender strongly affects image generation, with masculine markers increasing male representation to 73% and feminine markers increasing female representation to 38%. Effects vary by language resource and model architecture.

Conclusion: Language structure itself shapes AI-generated visuals, introducing a new dimension for bias and fairness in multilingual, multimodal systems.

Abstract: Research on bias in Text-to-Image (T2I) models has primarily focused on demographic representation and stereotypical attributes, overlooking a fundamental question: how does grammatical gender influence visual representation across languages? We introduce a cross-linguistic benchmark examining words where grammatical gender contradicts stereotypical gender associations (e.g., une sentinelle'' - grammatically feminine in French but referring to the stereotypically masculine concept guard’’). Our dataset spans five gendered languages (French, Spanish, German, Italian, Russian) and two gender-neutral control languages (English, Chinese), comprising 800 unique prompts that generated 28,800 images across three state-of-the-art T2I models. Our analysis reveals that grammatical gender dramatically influences image generation: masculine grammatical markers increase male representation to 73% on average (compared to 22% with gender-neutral English), while feminine grammatical markers increase female representation to 38% (compared to 28% in English). These effects vary systematically by language resource availability and model architecture, with high-resource languages showing stronger effects. Our findings establish that language structure itself, not just content, shapes AI-generated visual outputs, introducing a new dimension for understanding bias and fairness in multilingual, multimodal systems.

[21] GEMA-Score: Granular Explainable Multi-Agent Scoring Framework for Radiology Report Evaluation

Zhenxuan Zhang, Kinhei Lee, Peiyuan Jing, Weihang Deng, Huichi Zhou, Zihao Jin, Jiahao Huang, Zhifan Gao, Dominic C Marshall, Yingying Fang, Guang Yang

Main category: cs.CL

TL;DR: The paper introduces GEMA-Score, a multi-agent workflow for evaluating medical reports, addressing limitations of current metrics by combining objective quantification and subjective evaluation.

DetailsMotivation: Current metrics for medical report generation lack clinical reliability, missing fine-grained details and interpretability, posing risks for clinical use.

Method: GEMA-Score uses a multi-agent workflow with LLMs to assess disease diagnosis, location, severity, and uncertainty, and evaluates completeness, readability, and terminology.

Result: GEMA-Score achieves high correlation with human expert evaluations (Kendall coefficient = 0.69 for ReXVal, 0.45 for RadEvalX).

Conclusion: GEMA-Score effectively addresses the limitations of existing metrics, providing a reliable and interpretable evaluation for clinical use.

Abstract: Automatic medical report generation has the potential to support clinical diagnosis, reduce the workload of radiologists, and demonstrate potential for enhancing diagnostic consistency. However, current evaluation metrics often fail to reflect the clinical reliability of generated reports. Early overlap-based methods focus on textual matches between predicted and ground-truth entities but miss fine-grained clinical details (e.g., anatomical location, severity). Some diagnostic metrics are limited by fixed vocabularies or templates, reducing their ability to capture diverse clinical expressions. LLM-based approaches further lack interpretable reasoning steps, making it hard to assess or trust their behavior in safety-critical settings. These limitations hinder the comprehensive assessment of the reliability of generated reports and pose risks in their selection for clinical use. Therefore, we propose a Granular Explainable Multi-Agent Score (GEMA-Score) in this paper, which conducts both objective quantification and subjective evaluation through a large language model-based multi-agent workflow. Our GEMA-Score parses structured reports and employs stable calculations through interactive exchanges of information among agents to assess disease diagnosis, location, severity, and uncertainty. Additionally, an LLM-based scoring agent evaluates completeness, readability, and clinical terminology while providing explanatory feedback. Extensive experiments validate that GEMA-Score achieves the highest correlation with human expert evaluations on a public dataset, demonstrating its effectiveness in clinical scoring (Kendall coefficient = $0.69$ for ReXVal dataset and Kendall coefficient = $0.45$ for RadEvalX dataset). The anonymous project demo is available at: https://github.com/Zhenxuan-Zhang/GEMA_score.

[22] Current State in Privacy-Preserving Text Preprocessing for Domain-Agnostic NLP

Abhirup Sinha, Pritilata Saha, Tithi Saha

Main category: cs.CL

TL;DR: The paper discusses the importance of anonymizing private data in large language models due to privacy concerns and explores domain-agnostic NLP methods for masking or pseudonymizing such information.

DetailsMotivation: Privacy is a human right, and modern language models often use data containing private information, making anonymization crucial to prevent data extraction.

Method: The report examines various pre-processing approaches for masking or pseudonymizing private information in textual data for NLP tasks.

Result: The study highlights the feasibility of anonymization techniques, though complete anonymization may not be achievable.

Conclusion: Anonymizing private data in NLP tasks is essential, and domain-agnostic methods can help mitigate privacy risks.

Abstract: Privacy is a fundamental human right. Data privacy is protected by different regulations, such as GDPR. However, modern large language models require a huge amount of data to learn linguistic variations, and the data often contains private information. Research has shown that it is possible to extract private information from such language models. Thus, anonymizing such private and sensitive information is of utmost importance. While complete anonymization may not be possible, a number of different pre-processing approaches exist for masking or pseudonymizing private information in textual data. This report focuses on a few of such approaches for domain-agnostic NLP tasks.

[23] Probing Syntax in Large Language Models: Successes and Remaining Challenges

Pablo J. Diego-Simón, Emmanuel Chemla, Jean-Rémi King, Yair Lakretz

Main category: cs.CL

TL;DR: Structural probes in LLMs reveal syntactic structures but are biased by word proximity and challenged by deep syntax and linguistic interference, unaffected by word predictability.

DetailsMotivation: To understand if structural and statistical factors systematically affect syntactic representations in LLMs.

Method: Analyzed structural probes on three controlled benchmarks to evaluate their performance.

Result: Probes are biased by word proximity, struggle with deep syntax and linguistic interference, but are unaffected by word predictability.

Conclusion: Highlights limitations of structural probes and proposes controlled benchmarks for better evaluation.

Abstract: The syntactic structures of sentences can be readily read-out from the activations of large language models (LLMs). However, the ``structural probes’’ that have been developed to reveal this phenomenon are typically evaluated on an indiscriminate set of sentences. Consequently, it remains unclear whether structural and/or statistical factors systematically affect these syntactic representations. To address this issue, we conduct an in-depth analysis of structural probes on three controlled benchmarks. Our results are three-fold. First, structural probes are biased by a superficial property: the closer two words are in a sentence, the more likely structural probes will consider them as syntactically linked. Second, structural probes are challenged by linguistic properties: they poorly represent deep syntactic structures, and get interfered by interacting nouns or ungrammatical verb forms. Third, structural probes do not appear to be affected by the predictability of individual words. Overall, this work sheds light on the current challenges faced by structural probes. Providing a benchmark made of controlled stimuli to better evaluate their performance.

[24] Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study

Yuqi Zhu, Yi Zhong, Jintian Zhang, Ziheng Zhang, Shuofei Qiao, Yujie Luo, Lun Du, Da Zheng, Ningyu Zhang, Huajun Chen

Main category: cs.CL

TL;DR: The paper explores strategies to improve open-source LLMs’ data analysis capabilities, identifying strategic planning as key and proposing a data synthesis method for enhancement.

DetailsMotivation: Open-source LLMs lag in reasoning-intensive tasks like data analysis, prompting the need for improvement.

Method: A seed dataset of diverse scenarios was curated to evaluate LLMs on data understanding, code generation, and strategic planning. Insights led to a data synthesis methodology.

Result: Strategic planning quality is critical; interaction design and task complexity affect reasoning; data quality outweighs diversity for performance.

Conclusion: The proposed data synthesis method significantly boosts open-source LLMs’ analytical reasoning, with code made publicly available.

Abstract: Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate model behavior across three core dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs’ analytical reasoning capabilities. Code is available at https://github.com/zjunlp/DataMind.

[25] CardiffNLP at CLEARS-2025: Prompting Large Language Models for Plain Language and Easy-to-Read Text Rewriting

Mutaz Ayesh, Nicolás Gutiérrez-Rolón, Fernando Alva-Manchego

Main category: cs.CL

TL;DR: The CardiffNLP team used LLM-prompting with Gemma-3 for Spanish text adaptation, securing 3rd in Subtask 1 and 2nd in Subtask 2.

DetailsMotivation: To contribute to the CLEARS shared task on Spanish text adaptation by exploring LLM-prompting techniques.

Method: Adopted Gemma-3 after initial experiments with LLaMA-3.2, using various prompt variations and examples.

Result: Achieved 3rd place in Subtask 1 and 2nd place in Subtask 2.

Conclusion: The LLM-prompting approach with Gemma-3 proved effective for the task, as demonstrated by the team’s strong performance.

Abstract: This paper details the CardiffNLP team’s contribution to the CLEARS shared task on Spanish text adaptation, hosted by IberLEF 2025. The shared task contained two subtasks and the team submitted to both. Our team took an LLM-prompting approach with different prompt variations. While we initially experimented with LLaMA-3.2, we adopted Gemma-3 for our final submission, and landed third place in Subtask 1 and second place in Subtask 2. We detail our numerous prompt variations, examples, and experimental results.

[26] ReaGAN: Node-as-Agent-Reasoning Graph Agentic Network

Minghao Guo, Xi Zhu, Jingyuan Huang, Kai Mei, Yongfeng Zhang

Main category: cs.CL

TL;DR: ReaGAN introduces an agent-based framework for GNNs, enabling autonomous node-level decision-making and retrieval-augmented global relationships, outperforming fixed aggregation schemes.

DetailsMotivation: Fixed aggregation in GNNs struggles with node informativeness imbalance and ignores global semantic relationships, limiting performance.

Method: ReaGAN uses agent-based nodes with internal memory for adaptive message propagation and retrieval-augmented generation (RAG) for global semantic access.

Result: ReaGAN achieves competitive performance in few-shot settings using a frozen LLM backbone without fine-tuning.

Conclusion: ReaGAN demonstrates the effectiveness of agentic planning and local-global retrieval in enhancing graph learning.

Abstract: Graph Neural Networks (GNNs) have achieved remarkable success in graph-based learning by propagating information among neighbor nodes via predefined aggregation mechanisms. However, such fixed schemes often suffer from two key limitations. First, they cannot handle the imbalance in node informativeness – some nodes are rich in information, while others remain sparse. Second, predefined message passing primarily leverages local structural similarity while ignoring global semantic relationships across the graph, limiting the model’s ability to capture distant but relevant information. We propose Retrieval-augmented Graph Agentic Network (ReaGAN), an agent-based framework that empowers each node with autonomous, node-level decision-making. Each node acts as an agent that independently plans its next action based on its internal memory, enabling node-level planning and adaptive message propagation. Additionally, retrieval-augmented generation (RAG) allows nodes to access semantically relevant content and build global relationships in the graph. ReaGAN achieves competitive performance under few-shot in-context settings using a frozen LLM backbone without fine-tuning, showcasing the potential of agentic planning and local-global retrieval in graph learning.

[27] Somatic in the East, Psychological in the West?: Investigating Clinically-Grounded Cross-Cultural Depression Symptom Expression in LLMs

Shintaro Sakai, Jisun An, Migyeong Kang, Haewoon Kwak

Main category: cs.CL

TL;DR: LLMs fail to replicate cultural depression symptom patterns in English prompts but show improvement in Eastern languages, highlighting limitations in cultural sensitivity and symptom hierarchy.

DetailsMotivation: To investigate whether LLMs reproduce known cultural differences in depression symptom reporting (psychological in Western cultures, somatic in Eastern cultures).

Method: Prompted LLMs with Western or Eastern personas in English and major Eastern languages (Chinese, Japanese, Hindi) to observe symptom reporting patterns.

Result: LLMs largely failed to replicate cultural patterns in English but showed better alignment in Eastern languages. Key issues: low cultural persona sensitivity and a culturally invariant symptom hierarchy.

Conclusion: Current LLMs lack robust culture-aware capabilities, limiting their safety and effectiveness in mental health applications.

Abstract: Prior clinical psychology research shows that Western individuals with depression tend to report psychological symptoms, while Eastern individuals report somatic ones. We test whether Large Language Models (LLMs), which are increasingly used in mental health, reproduce these cultural patterns by prompting them with Western or Eastern personas. Results show that LLMs largely fail to replicate the patterns when prompted in English, though prompting in major Eastern languages (i.e., Chinese, Japanese, and Hindi) improves alignment in several configurations. Our analysis pinpoints two key reasons for this failure: the models’ low sensitivity to cultural personas and a strong, culturally invariant symptom hierarchy that overrides cultural cues. These findings reveal that while prompt language is important, current general-purpose LLMs lack the robust, culture-aware capabilities essential for safe and effective mental health applications.

[28] RooseBERT: A New Deal For Political Language Modelling

Deborah Dore, Elena Cabrio, Serena Villata

Main category: cs.CL

TL;DR: RooseBERT, a domain-specific pre-trained Language Model for political discourse, outperforms general-purpose models in tasks like sentiment analysis and argument detection, trained on large political debate corpora.

DetailsMotivation: The need for computational methods to analyze political debates, given their complexity and hidden communication strategies, challenges general-purpose Language Models.

Method: Developed RooseBERT, pre-trained on 8K political debates, and fine-tuned for four tasks: named entity recognition, sentiment analysis, argument component detection, and argument relation prediction.

Result: RooseBERT significantly outperforms general-purpose models in all four tasks, showing domain-specific pre-training’s effectiveness.

Conclusion: RooseBERT enhances political debate analysis and is released for research use, demonstrating the value of domain-specific models.

Abstract: The increasing amount of political debates and politics-related discussions calls for the definition of novel computational methods to automatically analyse such content with the final goal of lightening up political deliberation to citizens. However, the specificity of the political language and the argumentative form of these debates (employing hidden communication strategies and leveraging implicit arguments) make this task very challenging, even for current general-purpose pre-trained Language Models. To address this issue, we introduce a novel pre-trained Language Model for political discourse language called RooseBERT. Pre-training a language model on a specialised domain presents different technical and linguistic challenges, requiring extensive computational resources and large-scale data. RooseBERT has been trained on large political debate and speech corpora (8K debates, each composed of several sub-debates on different topics) in English. To evaluate its performances, we fine-tuned it on four downstream tasks related to political debate analysis, i.e., named entity recognition, sentiment analysis, argument component detection and classification, and argument relation prediction and classification. Our results demonstrate significant improvements over general-purpose Language Models on these four tasks, highlighting how domain-specific pre-training enhances performance in political debate analysis. We release the RooseBERT language model for the research community.

[29] Exploring Stability-Plasticity Trade-offs for Continual Named Entity Recognition

Duzhen Zhang, Chenxing Li, Jiahua Dong, Qi Liu, Dong Yu

Main category: cs.CL

TL;DR: The paper proposes a Stability-Plasticity Trade-off (SPT) method for Continual Named Entity Recognition (CNER) to balance retaining old knowledge and acquiring new knowledge, outperforming previous methods.

DetailsMotivation: Existing CNER methods overly prioritize stability (retaining old knowledge) via Knowledge Distillation, limiting plasticity (learning new knowledge). The paper aims to address this imbalance.

Method: The SPT method balances stability and plasticity by introducing a pooling operation in Knowledge Distillation (representation perspective) and dynamically merging old and new model weights (weight perspective). It also uses confidence-based pseudo-labeling for non-entity types.

Result: Experiments across ten CNER settings on three datasets show SPT outperforms previous methods, achieving a better stability-plasticity trade-off.

Conclusion: The SPT method effectively balances stability and plasticity in CNER, addressing limitations of prior approaches and demonstrating superior performance.

Abstract: Continual Named Entity Recognition (CNER) is an evolving field that focuses on sequentially updating an existing model to incorporate new entity types. Previous CNER methods primarily utilize Knowledge Distillation (KD) to preserve prior knowledge and overcome catastrophic forgetting, strictly ensuring that the representations of old and new models remain consistent. Consequently, they often impart the model with excessive stability (i.e., retention of old knowledge) but limited plasticity (i.e., acquisition of new knowledge). To address this issue, we propose a Stability-Plasticity Trade-off (SPT) method for CNER that balances these aspects from both representation and weight perspectives. From the representation perspective, we introduce a pooling operation into the original KD, permitting a level of plasticity by consolidating representation dimensions. From the weight perspective, we dynamically merge the weights of old and new models, strengthening old knowledge while maintaining new knowledge. During this fusion, we implement a weight-guided selective mechanism to prioritize significant weights. Moreover, we develop a confidence-based pseudo-labeling approach for the current non-entity type, which predicts entity types using the old model to handle the semantic shift of the non-entity type, a challenge specific to CNER that has largely been ignored by previous methods. Extensive experiments across ten CNER settings on three benchmark datasets demonstrate that our SPT method surpasses previous CNER approaches, highlighting its effectiveness in achieving a suitable stability-plasticity trade-off.

[30] Pay What LLM Wants: Can LLM Simulate Economics Experiment with 522 Real-human Persona?

Junhyuk Choi, Hyeonchu Park, Haemin Lee, Hyebeen Shin, Hyun Joung Jin, Bugeun Kim

Main category: cs.CL

TL;DR: The paper evaluates LLMs’ ability to predict human economic decisions using real data from 522 Korean participants in PWYW experiments, finding they perform better at group-level trends than individual predictions.

DetailsMotivation: To address the gap in using real human data (instead of fictional personas) to assess LLMs' simulation of economic behavior.

Method: Systematic comparison of three multimodal LLMs using detailed persona data from 522 participants in cultural consumption scenarios, testing prediction accuracy and the impact of persona injection methods.

Result: LLMs struggle with individual-level predictions but show reasonable group-level accuracy. Advanced prompting techniques (e.g., narrative reconstruction, retrieval-augmented generation) offer no significant improvement over simple prompting.

Conclusion: The study provides the first comprehensive evaluation of LLMs’ economic behavior simulation using real human data, offering empirical insights for computational social science.

Abstract: Recent advances in Large Language Models (LLMs) have generated significant interest in their capacity to simulate human-like behaviors, yet most studies rely on fictional personas rather than actual human data. We address this limitation by evaluating LLMs’ ability to predict individual economic decision-making using Pay-What-You-Want (PWYW) pricing experiments with real 522 human personas. Our study systematically compares three state-of-the-art multimodal LLMs using detailed persona information from 522 Korean participants in cultural consumption scenarios. We investigate whether LLMs can accurately replicate individual human choices and how persona injection methods affect prediction performance. Results reveal that while LLMs struggle with precise individual-level predictions, they demonstrate reasonable group-level behavioral tendencies. Also, we found that commonly adopted prompting techniques are not much better than naive prompting methods; reconstruction of personal narrative nor retrieval augmented generation have no significant gain against simple prompting method. We believe that these findings can provide the first comprehensive evaluation of LLMs’ capabilities on simulating economic behavior using real human data, offering empirical guidance for persona-based simulation in computational social science.

[31] LECTOR: LLM-Enhanced Concept-based Test-Oriented Repetition for Adaptive Spaced Learning

Jiahao Zhao

Main category: cs.CL

TL;DR: LECTOR is a new adaptive scheduling algorithm for test-oriented learning, using LLMs for semantic analysis and personalized profiles, outperforming baselines with a 90.2% success rate.

DetailsMotivation: Existing spaced repetition systems struggle with semantic interference and lack personalized adaptation, especially in language learning.

Method: LECTOR combines LLM-powered semantic similarity assessment with spaced repetition principles, tailored for test-oriented scenarios.

Result: LECTOR achieves a 90.2% success rate, a 2.0% improvement over the best baseline (SSP-MMC), and excels in handling semantic confusion.

Conclusion: LECTOR is a promising tool for intelligent tutoring and adaptive learning platforms, addressing key challenges in spaced repetition.

Abstract: Spaced repetition systems are fundamental to efficient learning and memory retention, but existing algorithms often struggle with semantic interference and personalized adaptation. We present LECTOR (\textbf{L}LM-\textbf{E}nhanced \textbf{C}oncept-based \textbf{T}est-\textbf{O}riented \textbf{R}epetition), a novel adaptive scheduling algorithm specifically designed for test-oriented learning scenarios, particularly language examinations where success rate is paramount. LECTOR leverages large language models for semantic analysis while incorporating personalized learning profiles, addressing the critical challenge of semantic confusion in vocabulary learning by utilizing LLM-powered semantic similarity assessment and integrating it with established spaced repetition principles. Our comprehensive evaluation against six baseline algorithms (SSP-MMC, SM2, HLR, FSRS, ANKI, THRESHOLD) across 100 simulated learners over 100 days demonstrates significant improvements: LECTOR achieves a 90.2% success rate compared to 88.4% for the best baseline (SSP-MMC), representing a 2.0% relative improvement. The algorithm shows particular strength in handling semantically similar concepts, reducing confusion-induced errors while maintaining computational efficiency. Our results establish LECTOR as a promising direction for intelligent tutoring systems and adaptive learning platforms.

[32] Do language models accommodate their users? A study of linguistic convergence

Terra Blevins, Susanne Schmalwieser, Benjamin Roth

Main category: cs.CL

TL;DR: The paper investigates whether LLMs exhibit linguistic convergence (adapting to users’ linguistic patterns) and finds they often overfit compared to humans, with differences based on model type and size.

DetailsMotivation: To understand how closely LLMs mimic human linguistic convergence in dialogue, a key aspect of human communication.

Method: Systematic comparison of model completions to human responses across 16 models, 3 dialogue corpora, and various stylometric features.

Result: Models strongly converge to dialogue style, often overfitting; instruction-tuned and larger models converge less than pretrained ones.

Conclusion: Model and human convergence mechanisms likely differ, highlighting distinct behavioral patterns in LLMs.

Abstract: While large language models (LLMs) are generally considered proficient in generating language, how similar their language usage is to that of humans remains understudied. In this paper, we test whether models exhibit linguistic convergence, a core pragmatic element of human language communication, asking: do models adapt, or converge, to the linguistic patterns of their user? To answer this, we systematically compare model completions of exisiting dialogues to the original human responses across sixteen language models, three dialogue corpora, and a variety of stylometric features. We find that models strongly converge to the conversation’s style, often significantly overfitting relative to the human baseline. While convergence patterns are often feature-specific, we observe consistent shifts in convergence across modeling settings, with instruction-tuned and larger models converging less than their pretrained counterparts. Given the differences between human and model convergence patterns, we hypothesize that the underlying mechanisms for these behaviors are very different.

[33] Investigating Gender Bias in LLM-Generated Stories via Psychological Stereotypes

Shahed Masoudian, Gustavo Escobedo, Hannah Strauss, Markus Schedl

Main category: cs.CL

TL;DR: The paper investigates gender bias in LLMs using psychological stereotypes in narrative generation, revealing biases and mitigation effects.

DetailsMotivation: Concerns about LLMs amplifying gender biases, especially in implicit forms during generative tasks, motivate this study.

Method: A novel dataset (StereoBias-Stories) with stories conditioned on psychological stereotypes is used to analyze gender bias in narrative generation.

Result: Findings include bias mitigation with non-gender attributes, intensified bias with multiple stereotype attributes, and alignment with psychological ground-truth.

Conclusion: Psychology-grounded evaluation is crucial for understanding and addressing gender bias in LLMs.

Abstract: As Large Language Models (LLMs) are increasingly used across different applications, concerns about their potential to amplify gender biases in various tasks are rising. Prior research has often probed gender bias using explicit gender cues as counterfactual, or studied them in sentence completion and short question answering tasks. These formats might overlook more implicit forms of bias embedded in generative behavior of longer content. In this work, we investigate gender bias in LLMs using gender stereotypes studied in psychology (e.g., aggressiveness or gossiping) in an open-ended task of narrative generation. We introduce a novel dataset called StereoBias-Stories containing short stories either unconditioned or conditioned on (one, two, or six) random attributes from 25 psychological stereotypes and three task-related story endings. We analyze how the gender contribution in the overall story changes in response to these attributes and present three key findings: (1) While models, on average, are highly biased towards male in unconditioned prompts, conditioning on attributes independent from gender stereotypes mitigates this bias. (2) Combining multiple attributes associated with the same gender stereotype intensifies model behavior, with male ones amplifying bias and female ones alleviating it. (3) Model biases align with psychological ground-truth used for categorization, and alignment strength increases with model size. Together, these insights highlight the importance of psychology-grounded evaluation of LLMs.

[34] NLP Methods May Actually Be Better Than Professors at Estimating Question Difficulty

Leonidas Zotos, Ivo Pascal de Jong, Matias Valdenegro-Toro, Andreea Ioana Sburlea, Malvina Nissim, Hedderik van Rijn

Main category: cs.CL

TL;DR: LLMs outperform professors in estimating exam question difficulty, with supervised learning using LLM uncertainty yielding the best results.

DetailsMotivation: Professors often struggle to accurately estimate the difficulty of exam questions, which is crucial for developing effective assessments.

Method: Compared professors with LLM-based methods (directly asking Gemini 2.5 and using LLM uncertainties in supervised learning) on True/False questions in Neural Networks and Machine Learning.

Result: Professors were outperformed by LLMs, especially when using LLM uncertainties in a supervised learning setting with minimal training data (42 samples).

Conclusion: Supervised learning with LLM uncertainty can enhance professors’ ability to estimate question difficulty, improving exam quality.

Abstract: Estimating the difficulty of exam questions is essential for developing good exams, but professors are not always good at this task. We compare various Large Language Model-based methods with three professors in their ability to estimate what percentage of students will give correct answers on True/False exam questions in the areas of Neural Networks and Machine Learning. Our results show that the professors have limited ability to distinguish between easy and difficult questions and that they are outperformed by directly asking Gemini 2.5 to solve this task. Yet, we obtained even better results using uncertainties of the LLMs solving the questions in a supervised learning setting, using only 42 training samples. We conclude that supervised learning using LLM uncertainty can help professors better estimate the difficulty of exam questions, improving the quality of assessment.

[35] Towards Trustworthy Multimodal Moderation via Policy-Aligned Reasoning and Hierarchical Labeling

Anqi Li, Wenwei Jin, Jintao Tong, Pengda Qin, Weijia Li, Guo Lu

Main category: cs.CL

TL;DR: Hi-Guard is a multimodal moderation framework designed for accurate, interpretable, and policy-aligned content moderation, using a hierarchical pipeline and taxonomy, optimized with GRPO.

DetailsMotivation: Current moderation systems lack accuracy, interpretability, and alignment with policies, hindering human review and compliance.

Method: Hi-Guard employs a hierarchical moderation pipeline (binary filtering + fine-grained classification) and taxonomy, integrates rule definitions into prompts, and uses GRPO with multi-level soft-margin rewards.

Result: Hi-Guard achieves superior accuracy, generalization, and interpretability in experiments and real-world deployment.

Conclusion: Hi-Guard enables scalable, transparent, and trustworthy content moderation systems.

Abstract: Social platforms have revolutionized information sharing, but also accelerated the dissemination of harmful and policy-violating content. To ensure safety and compliance at scale, moderation systems must go beyond efficiency and offer accuracy and interpretability. However, current approaches largely rely on noisy, label-driven learning, lacking alignment with moderation rules and producing opaque decisions that hinder human review. Therefore, we propose Hierarchical Guard (Hi-Guard), a multimodal moderation framework that introduces a new policy-aligned decision paradigm. The term “Hierarchical” reflects two key aspects of our system design: (1) a hierarchical moderation pipeline, where a lightweight binary model first filters safe content and a stronger model handles fine-grained risk classification; and (2) a hierarchical taxonomy in the second stage, where the model performs path-based classification over a hierarchical taxonomy ranging from coarse to fine-grained levels. To ensure alignment with evolving moderation policies, Hi-Guard directly incorporates rule definitions into the model prompt. To further enhance structured prediction and reasoning, we introduce a multi-level soft-margin reward and optimize with Group Relative Policy Optimization (GRPO), penalizing semantically adjacent misclassifications and improving explanation quality. Extensive experiments and real-world deployment demonstrate that Hi-Guard achieves superior classification accuracy, generalization, and interpretability, paving the way toward scalable, transparent, and trustworthy content safety systems. Code is available at: https://github.com/lianqi1008/Hi-Guard.

[36] CTTS: Collective Test-Time Scaling

Zhende Song, Shengji Tang, Peng Ye, Jiayuan Fan, Tao Chen

Main category: cs.CL

TL;DR: The paper introduces Collective Test-Time Scaling (CTTS) to enhance LLMs by leveraging multi-agent and multi-reward-model collaboration, proposing the CTTS-MM framework, which outperforms existing methods.

DetailsMotivation: Existing test-time scaling methods like Best-of-N and Self-Consistency are limited by single-agent paradigms, while collective-agent methods show potential to surpass these limits.

Method: Three CTTS paradigms (SA-MR, MA-SR, MA-MR) are explored, with MA-MR identified as optimal. The CTTS-MM framework incorporates Agent Collaboration Search (ACS) and Mixture of Reward Models (MoR) for enhanced inference.

Result: MA-MR consistently performs best. CTTS-MM achieves superior results across seven benchmarks.

Conclusion: CTTS-MM effectively leverages multi-agent and multi-reward-model collaboration, setting a new standard for test-time scaling in LLMs.

Abstract: Test-time scaling (TTS) has emerged as a promising research field for enhancing the effectiveness of large language models (LLMs) without extra training. However, most existing approaches, e.g., Best-of-N and Self-Consistency rely on a single agent interacting with a reward model (SA-SR), constrained by limited capabilities of a single test-time scaling (STTS) paradigm. On the other hand, recent works demonstrate that collective-agent methods can break through the upper bound of single-agent systems by orchestrating diverse models. Thus, in this paper, we take a first step towards exploring Collective Test-Time Scaling (CTTS). Consider the different interaction types of single and multiple models, we design three primary paradigms to investigate the optimal paradigm of CTTS: (1) single agent to multiple reward models (SA-MR); (2) multiple agents to single reward model (MA-SR); and (3) multiple agents to multiple reward models (MA-MR). Extensive experiments demonstrate that MA-MR consistently achieves the best performance. Based on this, we propose a novel framework named CTTS-MM that effectively leverages both multi-agent and multi-reward-model collaboration for enhanced inference. Specifically, for multi-agent collaboration, we propose an Agent Collaboration Search (ACS), which searches for the most effective combination of LLM agents from a large candidate pool; for multi-reward-model collaboration, we propose Mixture of Reword Models (MoR), which consists of a curated question pool and a Prior Reward model Ensemble Selection (PRES) to select the optimal combinations of reward models via Pair-wise Reward Ranking (PRR) metric. Experiments across seven mainstream benchmarks demonstrate that the proposed CTTS-MM consistently obtains superior performance. Code will be released at https://github.com/magent4aci/CTTS-MM.

[37] Taggus: An Automated Pipeline for the Extraction of Characters’ Social Networks from Portuguese Fiction Literature

Tiago G Canário, Catarina Duarte, Flávio L. Pinheiro, João L. M. Pereira

Main category: cs.CL

TL;DR: Taggus pipeline extracts social networks from Portuguese fiction, outperforming state-of-the-art tools with F1-scores of 94.1% (character identification) and 75.9% (interaction detection).

DetailsMotivation: Existing NLP methods underperform for constructing character social networks, especially in less-represented languages like Portuguese due to lack of annotated data.

Method: Taggus combines POS tagging and heuristics to identify characters and their interactions.

Result: Achieves 94.1% F1-score for character identification (50.7% improvement) and 75.9% for interaction detection (22.3% improvement).

Conclusion: Taggus is effective for Portuguese fiction, with future steps to improve relationship detection. Pipeline is publicly available.

Abstract: Automatically identifying characters and their interactions from fiction books is, arguably, a complex task that requires pipelines that leverage multiple Natural Language Processing (NLP) methods, such as Named Entity Recognition (NER) and Part-of-speech (POS) tagging. However, these methods are not optimized for the task that leads to the construction of Social Networks of Characters. Indeed, the currently available methods tend to underperform, especially in less-represented languages, due to a lack of manually annotated data for training. Here, we propose a pipeline, which we call Taggus, to extract social networks from literary fiction works in Portuguese. Our results show that compared to readily available State-of-the-Art tools – off-the-shelf NER tools and Large Language Models (ChatGPT) – the resulting pipeline, which uses POS tagging and a combination of heuristics, achieves satisfying results with an average F1-Score of $94.1%$ in the task of identifying characters and solving for co-reference and $75.9%$ in interaction detection. These represent, respectively, an increase of $50.7%$ and $22.3%$ on results achieved by the readily available State-of-the-Art tools. Further steps to improve results are outlined, such as solutions for detecting relationships between characters. Limitations on the size and scope of our testing samples are acknowledged. The Taggus pipeline is publicly available to encourage development in this field for the Portuguese language.2

[38] Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models

Haotian Wu, Bo Xu, Yao Shu, Menglin Yang, Chengwei Qin

Main category: cs.CL

TL;DR: The paper introduces JointThinking, a new in-context learning paradigm for reasoning large language models (RLLMs) that leverages the difference between Thinking and Nothinking modes to improve accuracy with minimal latency overhead.

DetailsMotivation: Prior research has focused on training and inference strategies for RLLMs, leaving their in-context learning (ICL) potential underexplored. This work aims to fill that gap.

Method: JointThinking prompts the model to generate two answers (Thinking and Nothinking modes) and triggers a second round of Thinking only if they disagree. This minimizes latency while improving robustness.

Result: JointThinking outperforms few-shot chain-of-thought and majority voting, achieves comparable in-distribution performance to SOTA methods, and excels in out-of-distribution tasks. Error rates decrease with structural thinking diversity.

Conclusion: The method shows strong scalability and highlights the value of diverse reasoning modes. Limitations are discussed, with directions for future ICL research in RLLMs.

Abstract: Reasoning large language models (RLLMs) have recently demonstrated remarkable capabilities through structured and multi-step reasoning. While prior research has primarily focused on improving their training and inference strategies, their potential for in-context learning (ICL) remains largely underexplored. To fill this gap, we propose Thinking with Nothinking Calibration (JointThinking), a new ICL paradigm that leverages the structured difference between two reasoning modes, i.e., Thinking and Nothinking, to improve reasoning accuracy. Specifically, our method prompts the model to generate two answers in parallel: one in Thinking mode and the other in Nothinking mode. A second round of Thinking is triggered only when the two initial responses are inconsistent, using a single prompt that incorporates the original question and both candidate answers. Since such disagreement occurs infrequently (e.g., only 6% in GSM8K), our method performs just one round of reasoning in most cases, resulting in minimal latency overhead. Extensive experiments across multiple reasoning benchmarks demonstrate that JointThinking significantly outperforms few-shot chain-of-thought (CoT) and majority voting with improved answer robustness. Moreover, It achieves comparable in-distribution performance to training-based SOTA method, while substantially outperforming on out-of-distribution tasks. We further conduct a systematic analysis of the calibration mechanism, showing that leveraging different reasoning modes consistently lowers the error rate and highlights the value of structural thinking diversity. Additionally, we observe that the performance gap between actual and ideal reasoning narrows as model size increases in the second round of thinking, indicating the strong scalability of our approach. Finally, we discuss current limitations and outline promising directions for future ICL research in RLLMs.

[39] ReDSM5: A Reddit Dataset for DSM-5 Depression Detection

Eliseo Bao, Anxo Pérez, Javier Parapar

Main category: cs.CL

TL;DR: ReDSM5 is a Reddit corpus with 1484 posts annotated for DSM-5 depression symptoms at sentence level, enabling interpretable depression detection models.

DetailsMotivation: Many depression cases go undiagnosed; existing methods lack clinical relevance by not linking language to DSM-5 criteria.

Method: Introduces ReDSM5, a corpus annotated by a psychologist for DSM-5 symptoms, with clinical rationales. Analyzes linguistic and emotional patterns.

Result: Baseline benchmarks for symptom classification and explanation generation are established.

Conclusion: ReDSM5 enhances interpretability and clinical relevance in depression detection from social media.

Abstract: Depression is a pervasive mental health condition that affects hundreds of millions of individuals worldwide, yet many cases remain undiagnosed due to barriers in traditional clinical access and pervasive stigma. Social media platforms, and Reddit in particular, offer rich, user-generated narratives that can reveal early signs of depressive symptomatology. However, existing computational approaches often label entire posts simply as depressed or not depressed, without linking language to specific criteria from the DSM-5, the standard clinical framework for diagnosing depression. This limits both clinical relevance and interpretability. To address this gap, we introduce ReDSM5, a novel Reddit corpus comprising 1484 long-form posts, each exhaustively annotated at the sentence level by a licensed psychologist for the nine DSM-5 depression symptoms. For each label, the annotator also provides a concise clinical rationale grounded in DSM-5 methodology. We conduct an exploratory analysis of the collection, examining lexical, syntactic, and emotional patterns that characterize symptom expression in social media narratives. Compared to prior resources, ReDSM5 uniquely combines symptom-specific supervision with expert explanations, facilitating the development of models that not only detect depression but also generate human-interpretable reasoning. We establish baseline benchmarks for both multi-label symptom classification and explanation generation, providing reference results for future research on detection and interpretability.

[40] Variety Is the Spice of Life: Detecting Misinformation with Dynamic Environmental Representations

Bing Wang, Ximing Li, Yiming Wang, Changchun Li, Jiaxu Cui, Renchu Guan, Bo Yang

Main category: cs.CL

TL;DR: The paper proposes MISDER, a dynamic framework for misinformation detection, addressing the limitations of static methods by incorporating temporal social environmental representations.

DetailsMotivation: Misinformation detection is crucial due to its harmful effects, but static methods fail to account for the dynamic nature of news veracity in evolving social environments.

Method: MISDER learns social environmental representations for each period and uses temporal models (LSTM, ODE, pre-trained dynamics) to predict future representations. Three variants are proposed: MISDER-LSTM, MISDER-ODE, and MISDER-PT.

Result: MISDER outperforms baseline methods on two datasets, demonstrating its effectiveness in dynamic misinformation detection.

Conclusion: MISDER provides a robust solution for detecting misinformation in dynamic social environments, with potential for further refinement and application.

Abstract: The proliferation of misinformation across diverse social media platforms has drawn significant attention from both academic and industrial communities due to its detrimental effects. Accordingly, automatically distinguishing misinformation, dubbed as Misinformation Detection (MD), has become an increasingly active research topic. The mainstream methods formulate MD as a static learning paradigm, which learns the mapping between the content, links, and propagation of news articles and the corresponding manual veracity labels. However, the static assumption is often violated, since in real-world scenarios, the veracity of news articles may vacillate within the dynamically evolving social environment. To tackle this problem, we propose a novel framework, namely Misinformation detection with Dynamic Environmental Representations (MISDER). The basic idea of MISDER lies in learning a social environmental representation for each period and employing a temporal model to predict the representation for future periods. In this work, we specify the temporal model as the LSTM model, continuous dynamics equation, and pre-trained dynamics system, suggesting three variants of MISDER, namely MISDER-LSTM, MISDER-ODE, and MISDER-PT, respectively. To evaluate the performance of MISDER, we compare it to various MD baselines across 2 prevalent datasets, and the experimental results can indicate the effectiveness of our proposed model.

[41] LLMs Have a Heart of Stone: Demystifying the Soft Thinking Ability of Large Reasoning Models

Junhong Wu, Jinliang Lu, Zixuan Ren, Ganqiang Hu, Zhi Wu, Dai Dai, Hua Wu

Main category: cs.CL

TL;DR: The paper investigates ‘Soft Thinking’ in LLMs, revealing that models often focus on dominant soft inputs, limiting reasoning path exploration. Introducing randomness via sampling strategies like Dirichlet resampling and Gumbel-Softmax improves performance.

DetailsMotivation: To address the limitation of discrete token generation in LLMs by enabling abstract, continuous reasoning (Soft Thinking) and explore its effectiveness.

Method: Probing techniques to analyze LLMs’ internal behavior, followed by introducing randomness via Dirichlet resampling and Gumbel-Softmax to enhance Soft Thinking.

Result: LLMs tend to rely on dominant soft inputs, reducing reasoning path exploration. Randomness improves performance, with Gumbel-Softmax showing superior results across benchmarks.

Conclusion: Incorporating randomness in Soft Thinking mitigates limitations and enhances reasoning capabilities, with Gumbel-Softmax being particularly effective.

Abstract: Human cognition naturally engages with abstract and fluid concepts, whereas existing reasoning models often rely on generating discrete tokens, potentially constraining their expressive capabilities. Recent advancements aim to address this limitation by enabling large language models (LLMs) to generate soft, abstract tokens, thus facilitating reasoning within a continuous concept space. This paper explores the `Soft Thinking’ capabilities of various LLMs by examining the models’ internal behavior using a suite of probing techniques. Contrary to the common belief that Soft Thinking enables the simultaneous exploration of diverse reasoning paths, our findings reveal that LLMs predominantly rely on the most influential component of the soft inputs during subsequent decoding steps. This reliance hinders the exploration of different reasoning paths and reduces vanilla Soft Thinking to a form of greedy decoding, obscuring the advantage of transmitting more information through Soft Tokens. To tackle this issue, we explore sampling strategies to introduce \emph{randomness}, employing methods such as Dirichlet resampling and the Gumbel-Softmax trick. Our experiments demonstrate that incorporating randomness can alleviate the limitations of vanilla approaches and unleash the potential of Soft Thinking. Notably, the Gumbel-Softmax trick provides adequate randomness with controlled smoothness, resulting in superior performance across eight reasoning benchmarks.

[42] Cropping outperforms dropout as an augmentation strategy for training self-supervised text embeddings

Rita González-Márquez, Philipp Berens, Dmitry Kobak

Main category: cs.CL

TL;DR: Self-supervised fine-tuning for text embeddings using cropping augmentation outperforms dropout-based methods, achieving near-SOTA quality for in-domain data with minimal fine-tuning.

DetailsMotivation: To explore self-supervised training for text embeddings, contrasting with the dominant supervised fine-tuning approach in NLP, inspired by success in computer vision.

Method: Systematic comparison of cropping and dropout-based augmentation strategies in contrastive learning, evaluated on MTEB and in-domain datasets.

Result: Cropping augmentation outperforms dropout, producing high-quality embeddings for in-domain data with minimal fine-tuning, though slightly below supervised SOTA.

Conclusion: Self-supervised fine-tuning, especially of last transformer layers, is efficient and effective for in-domain text embeddings.

Abstract: Text embeddings, i.e. vector representations of entire texts, play an important role in many NLP applications, such as retrieval-augmented generation, sentiment analysis, clustering, or visualizing collections of texts for data exploration. Currently, top-performing embedding models are derived from pre-trained language models via extensive supervised fine-tuning using curated text pairs. This contrasts with computer vision, where self-supervised training based on data augmentations has demonstrated remarkable success. Here we systematically compare the two most well-known augmentation strategies for positive pair generation in contrastive learning of text embeddings. We assess embedding quality on MTEB and additional in-domain evaluations and show that cropping augmentation strongly outperforms the dropout-based approach. We find that on out-of-domain data, the quality of resulting embeddings is below the supervised SOTA models, but for in-domain data, self-supervised fine-tuning produces high-quality text embeddings after very short fine-tuning, sometimes only marginally below the supervised SOTA. Finally, we show that representation quality increases towards the last transformer layers, which undergo the largest change during fine-tuning; and that fine-tuning only those last layers is sufficient to reach similar embedding quality.

[43] fact check AI at SemEval-2025 Task 7: Multilingual and Crosslingual Fact-checked Claim Retrieval

Pranshu Rastogi

Main category: cs.CL

TL;DR: A bi-encoder model fine-tuned from a pre-trained transformer achieved high performance in multilingual and cross-lingual fact-checked claim retrieval tasks.

DetailsMotivation: To improve multilingual and cross-lingual fact-checked claim retrieval using efficient models.

Method: Learning-to-Rank task with a bi-encoder model, fine-tuned from a pre-trained transformer, trained on source languages and English translations.

Result: 92% Success@10 in multilingual and 80% Success@10 in cross-lingual tasks.

Conclusion: The method is effective for multilingual and cross-lingual retrieval with lightweight models.

Abstract: SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval is approached as a Learning-to-Rank task using a bi-encoder model fine-tuned from a pre-trained transformer optimized for sentence similarity. Training used both the source languages and their English translations for multilingual retrieval and only English translations for cross-lingual retrieval. Using lightweight models with fewer than 500M parameters and training on Kaggle T4 GPUs, the method achieved 92% Success@10 in multilingual and 80% Success@10 in 5th in crosslingual and 10th in multilingual tracks.

[44] CF-RAG: A Dataset and Method for Carbon Footprint QA Using Retrieval-Augmented Generation

Kaiwen Zhao, Bharathan Balaji, Stephen Lee

Main category: cs.CL

TL;DR: The paper introduces CarbonPDF-QA, a dataset for analyzing carbon footprint questions in sustainability reports, and proposes CarbonPDF, a fine-tuned LLM technique that outperforms existing QA systems.

DetailsMotivation: The unstructured and inconsistent nature of text in PDF sustainability reports complicates carbon footprint analysis, necessitating a specialized solution.

Method: The authors develop CarbonPDF by fine-tuning Llama 3 on their CarbonPDF-QA dataset, which includes 1735 product reports and human-annotated QA pairs.

Result: CarbonPDF outperforms state-of-the-art QA systems, especially in handling data inconsistencies where GPT-4o struggles.

Conclusion: The proposed CarbonPDF technique effectively addresses challenges in analyzing unstructured PDF sustainability reports, offering a superior solution for carbon footprint questions.

Abstract: Product sustainability reports provide valuable insights into the environmental impacts of a product and are often distributed in PDF format. These reports often include a combination of tables and text, which complicates their analysis. The lack of standardization and the variability in reporting formats further exacerbate the difficulty of extracting and interpreting relevant information from large volumes of documents. In this paper, we tackle the challenge of answering questions related to carbon footprints within sustainability reports available in PDF format. Unlike previous approaches, our focus is on addressing the difficulties posed by the unstructured and inconsistent nature of text extracted from PDF parsing. To facilitate this analysis, we introduce CarbonPDF-QA, an open-source dataset containing question-answer pairs for 1735 product report documents, along with human-annotated answers. Our analysis shows that GPT-4o struggles to answer questions with data inconsistencies. To address this limitation, we propose CarbonPDF, an LLM-based technique specifically designed to answer carbon footprint questions on such datasets. We develop CarbonPDF by fine-tuning Llama 3 with our training data. Our results show that our technique outperforms current state-of-the-art techniques, including question-answering (QA) systems finetuned on table and text data.

[45] UPLME: Uncertainty-Aware Probabilistic Language Modelling for Robust Empathy Regression

Md Rakibul Hasan, Md Zakir Hossain, Aneesh Krishna, Shafin Rahman, Tom Gedeon

Main category: cs.CL

TL;DR: UPLME, an uncertainty-aware probabilistic language model, improves empathy regression by handling noisy labels and outperforms existing methods.

DetailsMotivation: Supervised learning for empathy regression suffers from noisy self-reported empathy scores, with limited solutions for regression settings.

Method: UPLME uses a probabilistic language model to predict empathy scores and uncertainty, trained with Bayesian concepts and novel loss components for better performance.

Result: UPLME achieves state-of-the-art results (Pearson Correlation: 0.580, 0.634) and better calibration (error: 0.376) on benchmarks with label noise.

Conclusion: UPLME effectively handles noisy labels in empathy regression, outperforming existing methods and improving uncertainty quantification.

Abstract: Supervised learning for empathy regression is challenged by noisy self-reported empathy scores. While many algorithms have been proposed for learning with noisy labels in textual classification problems, the regression counterpart is relatively under-explored. We propose UPLME, an uncertainty-aware probabilistic language modelling framework to capture label noise in the regression setting of empathy detection. UPLME includes a probabilistic language model that predicts both empathy score and heteroscedastic uncertainty and is trained using Bayesian concepts with variational model ensembling. We further introduce two novel loss components: one penalises degenerate Uncertainty Quantification (UQ), and another enforces the similarity between the input pairs on which we predict empathy. UPLME provides state-of-the-art performance (Pearson Correlation Coefficient: $0.558\rightarrow0.580$ and $0.629\rightarrow0.634$) in terms of the performance reported in the literature in two public benchmarks, having label noise. Through synthetic label noise injection, we show that UPLME is effective in separating noisy and clean samples based on the predicted uncertainty. UPLME further outperform (Calibration error: $0.571\rightarrow0.376$) a recent variational model ensembling-based UQ method designed for regression problems.

[46] FilBench: Can LLMs Understand and Generate Filipino?

Lester James V. Miranda, Elyanah Aco, Conner Manuel, Jan Christian Blaise Cruz, Joseph Marvin Imperial

Main category: cs.CL

TL;DR: FilBench is a Filipino-centric benchmark evaluating LLMs in Filipino, Tagalog, and Cebuano. Results show gaps in reading comprehension and translation, with GPT-4o scoring highest (72.23%). SEA-LION v3 70B, tailored for Southeast Asian languages, scored 61.07%. The study highlights the need for language-specific benchmarks to advance Filipino NLP.

DetailsMotivation: To assess LLM capabilities in Filipino, Tagalog, and Cebuano, addressing a gap in existing benchmarks and promoting inclusion of Philippine languages in NLP research.

Method: FilBench was created with tasks reflecting Philippine NLP priorities (Cultural Knowledge, Classical NLP, Reading Comprehension, Generation). 27 state-of-the-art LLMs were evaluated.

Result: LLMs struggle with reading comprehension and translation. GPT-4o scored highest (72.23%), while SEA-LION v3 70B (for Southeast Asian languages) scored 61.07%.

Conclusion: Language-specific benchmarks like FilBench are crucial for advancing Filipino NLP and ensuring Philippine languages are included in LLM development.

Abstract: Despite the impressive performance of LLMs on English-based tasks, little is known about their capabilities in specific languages such as Filipino. In this work, we address this gap by introducing FilBench, a Filipino-centric benchmark designed to evaluate LLMs across a diverse set of tasks and capabilities in Filipino, Tagalog, and Cebuano. We carefully curate the tasks in FilBench to reflect the priorities and trends of NLP research in the Philippines such as Cultural Knowledge, Classical NLP, Reading Comprehension, and Generation. By evaluating 27 state-of-the-art LLMs on FilBench, we find that several LLMs suffer from reading comprehension and translation capabilities. Our results indicate that FilBench is challenging, with the best model, GPT-4o, achieving only a score of 72.23%. Moreover, we also find that models trained specifically for Southeast Asian languages tend to underperform on FilBench, with the highest-performing model, SEA-LION v3 70B, achieving only a score of 61.07%. Our work demonstrates the value of curating language-specific LLM benchmarks to aid in driving progress on Filipino NLP and increasing the inclusion of Philippine languages in LLM development.

[47] Marito: Structuring and Building Open Multilingual Terminologies for South African NLP

Vukosi Marivate, Isheanesu Dzingirai, Fiskani Banda, Richard Lastrucci, Thapelo Sindane, Keabetswe Madumo, Kayode Olaleye, Abiodun Modupe, Unarine Netshifhefhe, Herkulaas Combrink, Mohlatlego Nakeng, Matome Ledwaba

Main category: cs.CL

TL;DR: The paper introduces ‘Marito,’ a system for aggregating and standardizing fragmented terminological data for South Africa’s languages into open datasets, improving multilingual NLP tasks like machine translation.

DetailsMotivation: The lack of structured, machine-readable terminological data for South Africa's official languages hinders multilingual NLP progress, despite existing resources.

Method: Marito systematically aggregates, cleans, and standardizes scattered terminology lists into interoperable datasets, integrated into a Retrieval-Augmented Generation (RAG) pipeline.

Result: Experiments show significant accuracy and consistency improvements in English-to-Tshivenda machine translation using large language models.

Conclusion: Marito offers a scalable, equitable foundation for NLP technologies, ensuring representation of South Africa’s linguistic diversity.

Abstract: The critical lack of structured terminological data for South Africa’s official languages hampers progress in multilingual NLP, despite the existence of numerous government and academic terminology lists. These valuable assets remain fragmented and locked in non-machine-readable formats, rendering them unusable for computational research and development. \emph{Marito} addresses this challenge by systematically aggregating, cleaning, and standardising these scattered resources into open, interoperable datasets. We introduce the foundational \emph{Marito} dataset, released under the equitable, Africa-centered NOODL framework. To demonstrate its immediate utility, we integrate the terminology into a Retrieval-Augmented Generation (RAG) pipeline. Experiments show substantial improvements in the accuracy and domain-specific consistency of English-to-Tshivenda machine translation for large language models. \emph{Marito} provides a scalable foundation for developing robust and equitable NLP technologies, ensuring South Africa’s rich linguistic diversity is represented in the digital age.

[48] EmbedGrad: Gradient-Based Prompt Optimization in Embedding Space for Large Language Models

Xiaoming Hou, Jiquan Zhang, Zibin Lin, DaCheng Tao, Shengli Zhang

Main category: cs.CL

TL;DR: EmbedGrad optimizes text prompt embeddings via gradient-based refinement, improving task adaptation without architectural changes.

DetailsMotivation: Addressing limitations of discrete prompt engineering and continuous parameter adaptation by refining embeddings for better precision and interpretability.

Method: Proposes EmbedGrad, a framework for gradient-based refinement of text prompt embeddings, decoupling training (guided by labeled examples) from deployment (using optimized embeddings).

Result: Significant accuracy improvements, e.g., from 14.74% to 58.96% on mathematical reasoning tasks, with consistent gains across model scales and tasks.

Conclusion: EmbedGrad establishes embedding refinement as a new paradigm for task adaptation, bridging prompt engineering and parameter efficiency.

Abstract: Effectively adapting powerful pretrained foundation models to diverse tasks remains a key challenge in AI deployment. Current approaches primarily follow two paradigms:discrete optimization of text prompts through prompt engineering, or continuous adaptation via additional trainable parameters. Both exhibit limitations-discrete methods lack refinement precision while parameter-based techniques increase complexity and reduce interpretability. To address these constraints, we propose EmbedGrad, a novel framework that optimizes text prompt embeddings through gradient-based refinement. Our approach uniquely decouples training from deployment:during optimization,labeled examples guide precise embedding adjustments while preserving semantic meaning; during inference, only optimized embeddings integrate with user queries. This enables fine-grained calibration impossible in text space, such as enhancing the reasoning capability of prompts like please reason step by step. Comprehensive evaluations across mathematical reasoning, sentiment analysis, and causal judgment tasks demonstrate EmbedGrad’s effectiveness:optimizing this reasoning prompt for Qwen2.5-Math-1.5B increased accuracy from 14.74% to 58.96% on mathematical problems. Consistent improvements were observed across model scales (0.5B-14B) and all tasks, with particularly significant gains for smaller models on complex problems like causal judgment. By bridging prompt engineering and parameter efficiency without architectural changes, our work establishes embedding refinement as a powerful new paradigm for task adaptation.

[49] Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations

Peng Lai, Jianjie Zheng, Sijie Cheng, Yun Chen, Peng Li, Yang Liu, Guanhua Chen

Main category: cs.CL

TL;DR: LAGER enhances LLM-as-a-Judge alignment with human preferences by aggregating cross-layer representations, outperforming baselines by up to 7.5%.

DetailsMotivation: Improving alignment of automated LLM evaluation with human preferences without complex prompts or fine-tuning.

Method: LAGER aggregates cross-layer score-token logits and computes expected scores from a softmax distribution, keeping the LLM backbone frozen.

Result: Achieves up to 7.5% improvement over baselines on benchmarks like Flask, HelpSteer, and BIGGen.

Conclusion: LAGER effectively leverages internal representations for better alignment, matching or outperforming reasoning-based methods without additional steps.

Abstract: The growing scale of evaluation tasks has led to the widespread adoption of automated evaluation using large language models, a paradigm known as “LLMas-a-judge.” However, improving its alignment with human preferences without complex prompts or fine-tuning remains challenging. In this work, motivated by preliminary findings that middle-to-upper layers encode semantically and taskrelevant representations that are often more aligned with human judgments than the final layer, we propose LAGER, a lightweight and efficient framework for enhancing LLM-as-a-Judge alignment with human scoring, via internal representations. LAGER produces fine-grained judgment scores by aggregating cross-layer scoretoken logits and computing the expected score from a softmax-based distribution, with the LLM backbone kept frozen. LAGER fully leverages the complementary information across different layers, overcoming the limitations of relying solely on the final layer. We evaluate our method on the standard alignment benchmarks Flask, HelpSteer, and BIGGen using Spearman correlation, and find that LAGER achieves improvements of up to 7.5% over the best baseline across these benchmarks. Without reasoning steps, LAGER matches or outperforms reasoning-based methods. Experiments on downstream applications, such as data selection and emotional understanding, further show the effectiveness of our method.

[50] Tackling Distribution Shift in LLM via KILO: Knowledge-Instructed Learning for Continual Adaptation

Iing Muttakhiroh, Thomas Fevens

Main category: cs.CL

TL;DR: KILO integrates dynamic knowledge graphs with instruction tuning to improve LLMs’ adaptability and knowledge retention across domain shifts.

DetailsMotivation: Address performance degradation in LLMs due to catastrophic forgetting during domain shifts.

Method: Proposes KILO, combining dynamic knowledge graphs and instruction tuning, pretrained on WikiText-103 and evaluated on BioASQ, SciQ, TweetEval, and MIND.

Result: Outperforms baselines (continual fine-tuning, ERNIE 2.0, CPT) in backward/forward transfer, F1 score, retention, and efficiency.

Conclusion: KILO effectively combines knowledge retrieval and instruction prompting to tackle domain shift challenges in continual learning.

Abstract: Large Language Models (LLMs) often suffer from performance degradation when faced with domain shifts, primarily due to catastrophic forgetting. In this work, we propose KILO (Knowledge-Instructed Learning for Continual Adaptation), a novel continual learning framework that integrates dynamic knowledge graphs with instruction tuning. By leveraging retrieved domain-specific knowledge as guidance during training, KILO enhances both adaptability to new domains and retention of previously acquired knowledge. We pretrain our model on WikiText-103 and evaluate sequential adaptation across four diverse target domains: BioASQ, SciQ, TweetEval, and MIND. Our experiments demonstrate that KILO consistently outperforms strong baselines, including continual fine-tuning, ERNIE 2.0, and CPT, in terms of backward transfer, forward transfer, F1 score, retention rate, and training efficiency. These results highlight the effectiveness of combining structured knowledge retrieval and instruction prompting to overcome domain shift challenges in continual learning scenarios.

[51] Are We on the Right Way for Assessing Document Retrieval-Augmented Generation?

Wenxuan Shen, Mingjia Wang, Yaochen Wang, Dongping Chen, Junjie Yang, Yao Wan, Weiwei Lin

Main category: cs.CL

TL;DR: Double-Bench is a new evaluation system for Retrieval-Augmented Generation (RAG) systems, addressing gaps in current benchmarks with a large-scale, multilingual, and multimodal approach.

DetailsMotivation: Current benchmarks for RAG systems are inadequate, focusing narrowly and using synthetic data, failing to reflect real-world challenges.

Method: Double-Bench includes 3,276 documents and 5,168 queries across 6 languages and 4 document types, with human-verified evidence. It evaluates 9 embedding models, 4 MLLMs, and 4 RAG frameworks.

Result: The gap between text and visual embedding models is narrowing, and current RAG frameworks often provide answers without evidence.

Conclusion: Double-Bench offers a rigorous foundation for future RAG research, with plans for annual updates to maintain relevance.

Abstract: Retrieval-Augmented Generation (RAG) systems using Multimodal Large Language Models (MLLMs) show great promise for complex document understanding, yet their development is critically hampered by inadequate evaluation. Current benchmarks often focus on specific part of document RAG system and use synthetic data with incomplete ground truth and evidence labels, therefore failing to reflect real-world bottlenecks and challenges. To overcome these limitations, we introduce Double-Bench: a new large-scale, multilingual, and multimodal evaluation system that is able to produce fine-grained assessment to each component within document RAG systems. It comprises 3,276 documents (72,880 pages) and 5,168 single- and multi-hop queries across 6 languages and 4 document types with streamlined dynamic update support for potential data contamination issues. Queries are grounded in exhaustively scanned evidence pages and verified by human experts to ensure maximum quality and completeness. Our comprehensive experiments across 9 state-of-the-art embedding models, 4 MLLMs and 4 end-to-end document RAG frameworks demonstrate the gap between text and visual embedding models is narrowing, highlighting the need in building stronger document retrieval models. Our findings also reveal the over-confidence dilemma within current document RAG frameworks that tend to provide answer even without evidence support. We hope our fully open-source Double-Bench provide a rigorous foundation for future research in advanced document RAG systems. We plan to retrieve timely corpus and release new benchmarks on an annual basis.

[52] Can Large Vision-Language Models Understand Multimodal Sarcasm?

Xinyu Wang, Yue Zhang, Liqiang Jing

Main category: cs.CL

TL;DR: The paper explores the use of Large Visual Language Models (LVLMs) in Multimodal Sarcasm Analysis (MSA), addressing limitations like poor visual understanding and lack of conceptual knowledge with a training-free framework.

DetailsMotivation: Sarcasm's complexity in sentiment analysis and the underexplored potential of LVLMs in MSA motivate this study.

Method: The authors propose a training-free framework integrating object extraction and external conceptual knowledge to enhance LVLMs’ sarcasm interpretation.

Result: Experiments demonstrate the framework’s effectiveness in improving sarcasm detection and explanation in multimodal contexts.

Conclusion: The proposed framework successfully addresses LVLMs’ limitations in MSA, offering a practical solution for sarcasm analysis.

Abstract: Sarcasm is a complex linguistic phenomenon that involves a disparity between literal and intended meanings, making it challenging for sentiment analysis and other emotion-sensitive tasks. While traditional sarcasm detection methods primarily focus on text, recent approaches have incorporated multimodal information. However, the application of Large Visual Language Models (LVLMs) in Multimodal Sarcasm Analysis (MSA) remains underexplored. In this paper, we evaluate LVLMs in MSA tasks, specifically focusing on Multimodal Sarcasm Detection and Multimodal Sarcasm Explanation. Through comprehensive experiments, we identify key limitations, such as insufficient visual understanding and a lack of conceptual knowledge. To address these issues, we propose a training-free framework that integrates in-depth object extraction and external conceptual knowledge to improve the model’s ability to interpret and explain sarcasm in multimodal contexts. The experimental results on multiple models show the effectiveness of our proposed framework. The code is available at https://github.com/cp-cp/LVLM-MSA.

[53] CTR-Sink: Attention Sink for Language Models in Click-Through Rate Prediction

Zixuan Li, Binzong Geng, Jing Xiong, Yong He, Yuxuan Hu, Jian Chen, Dingwei Chen, Xiyu Chang, Liang Zhang, Linjian Mo, Chengming Li, Chuan Yuan, Zhenan Sun

Main category: cs.CL

TL;DR: CTR-Sink addresses semantic fragmentation in CTR prediction by introducing behavior-level attention sinks to improve LM focus on meaningful behavior boundaries.

DetailsMotivation: User behavior sequences in CTR prediction differ from natural language, causing LM attention to scatter and degrade performance.

Method: Proposes CTR-Sink with behavior-level attention sinks, sink tokens, and a two-stage training strategy to regulate attention.

Result: Validated on industrial and open-source datasets, showing improved performance and better attention focus.

Conclusion: CTR-Sink effectively bridges the structural gap in CTR prediction, enhancing LM performance for recommendation systems.

Abstract: Click-Through Rate (CTR) prediction, a core task in recommendation systems, estimates user click likelihood using historical behavioral data. Modeling user behavior sequences as text to leverage Language Models (LMs) for this task has gained traction, owing to LMs’ strong semantic understanding and contextual modeling capabilities. However, a critical structural gap exists: user behavior sequences consist of discrete actions connected by semantically empty separators, differing fundamentally from the coherent natural language in LM pre-training. This mismatch causes semantic fragmentation, where LM attention scatters across irrelevant tokens instead of focusing on meaningful behavior boundaries and inter-behavior relationships, degrading prediction performance. To address this, we propose $\textit{CTR-Sink}$, a novel framework introducing behavior-level attention sinks tailored for recommendation scenarios. Inspired by attention sink theory, it constructs attention focus sinks and dynamically regulates attention aggregation via external information. Specifically, we insert sink tokens between consecutive behaviors, incorporating recommendation-specific signals such as temporal distance to serve as stable attention sinks. To enhance generality, we design a two-stage training strategy that explicitly guides LM attention toward sink tokens and a attention sink mechanism that amplifies inter-sink dependencies to better capture behavioral correlations. Experiments on one industrial dataset and two open-source datasets (MovieLens, Kuairec), alongside visualization results, validate the method’s effectiveness across scenarios.

[54] FairLangProc: A Python package for fairness in NLP

Arturo Pérez-Peralta, Sandra Benítez-Peña, Rosa E. Lillo

Main category: cs.CL

TL;DR: FairLangProc is a Python package for implementing fairness techniques in NLP, compatible with Hugging Face, to centralize bias mitigation efforts.

DetailsMotivation: Address societal concerns about fairness in LLMs for decision-making by providing a unified tool for bias mitigation in NLP.

Method: Develops FairLangProc, a Python package integrating recent fairness advances, compatible with Hugging Face transformers.

Result: A centralized, accessible tool for fairness in NLP, encouraging widespread adoption of bias mitigation techniques.

Conclusion: FairLangProc democratizes fairness in NLP by offering a common implementation platform for bias mitigation.

Abstract: The rise in usage of Large Language Models to near ubiquitousness in recent years has risen societal concern about their applications in decision-making contexts, such as organizational justice or healthcare. This, in turn, poses questions about the fairness of these models in critical settings, which leads to the developement of different procedures to address bias in Natural Language Processing. Although many datasets, metrics and algorithms have been proposed to measure and mitigate harmful prejudice in Natural Language Processing, their implementation is diverse and far from centralized. As a response, this paper presents FairLangProc, a comprehensive Python package providing a common implementation of some of the more recent advances in fairness in Natural Language Processing providing an interface compatible with the famous Hugging Face transformers library, aiming to encourage the widespread use and democratization of bias mitigation techniques. The implementation can be found on https://github.com/arturo-perez-peralta/FairLangProc.

[55] More Than a Score: Probing the Impact of Prompt Specificity on LLM Code Generation

Yangtian Zi, Harshitha Menon, Arjun Guha

Main category: cs.CL

TL;DR: The paper investigates why LLMs perform poorly on specialized benchmarks like ParEval compared to general ones like HumanEval, introducing PartialOrderEval to analyze prompt specificity’s impact.

DetailsMotivation: To determine if LLMs' underperformance on specialized benchmarks is due to missing domain knowledge or insufficient prompt detail.

Method: Introduces PartialOrderEval to augment benchmarks with prompts of varying specificity, tested on HumanEval and ParEval subsets using Llama-3.x and Qwen2.5-Coder.

Result: LLMs show varying prompt sensitivity; key improvements come from explicit I/O specs, edge-case handling, and stepwise breakdowns.

Conclusion: Prompt specificity significantly impacts LLM performance, with detailed prompts improving results on specialized tasks.

Abstract: State-of-the-art Large Language Models (LLMs) achieve high pass@1 on general benchmarks like HumanEval but underperform on specialized suites such as ParEval. Is this due to LLMs missing domain knowledge or insufficient prompt detail is given? To answer this, we introduce PartialOrderEval, which augments any code generation benchmark with a partial order of prompts from minimal to maximally detailed. Applying it to HumanEval and both serial and OpenMP subsets of ParEval, we measure how pass@1 scales with prompt specificity. Our experiments with Llama-3.x and Qwen2.5-Coder demonstrate varying degrees of prompt sensitivity across different tasks, and a qualitative analysis highlights explicit I/O specifications, edge-case handling, and stepwise breakdowns as the key drivers of prompt detail improvement.

[56] CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

Shudong Liu, Hongwei Liu, Junnan Liu, Linchen Xiao, Songyang Gao, Chengqi Lyu, Yuzhe Gu, Wenwei Zhang, Derek F. Wong, Songyang Zhang, Kai Chen

Main category: cs.CL

TL;DR: CompassVerifier is a lightweight verifier model for evaluating LLM outputs, addressing gaps in current methodologies with robustness and multi-domain competency.

DetailsMotivation: Current answer verification methods lack comprehensive benchmarks and robust verifiers, limiting their effectiveness across domains and edge cases.

Method: Developed CompassVerifier, a lightweight verifier model, and introduced VerifierBench, a benchmark with augmented data for training and evaluation.

Result: CompassVerifier shows multi-domain competency, handling diverse answer types and identifying invalid responses effectively.

Conclusion: CompassVerifier and VerifierBench aim to improve answer verification, evaluation protocols, and reinforcement learning research.

Abstract: Answer verification is crucial not only for evaluating large language models (LLMs) by matching their unstructured outputs against standard answers, but also serves as the reward model to guide LLM optimization. Most evaluation frameworks rely on regularized matching or employ general LLMs for answer verification, which demands extensive, repetitive customization for regex rules or evaluation prompts. Two fundamental limitations persist in current methodologies: 1) the absence of comprehensive benchmarks that systematically evaluate verification capabilities across different LLMs; and 2) the nascent stage of verifier development, where existing approaches lack both the robustness to handle complex edge cases and the generalizability across different domains. In this work, we develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types, including multi-subproblems, formulas, and sequence answers, while effectively identifying abnormal/invalid responses. We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier. We anticipate that CompassVerifier and VerifierBench will facilitate answer verification, evaluation protocols, and reinforcement learning research. Code and dataset are available at https://github.com/open-compass/CompassVerifier.

[57] Bridging LLMs and Symbolic Reasoning in Educational QA Systems: Insights from the XAI Challenge at IJCNN 2025

Long S. T. Nguyen, Khang H. N. Vo, Thu H. A. Nguyen, Tuan C. Bui, Duc Q. Nguyen, Thanh-Tung Tran, Anh D. Nguyen, Minh L. Nguyen, Fabien Baldacci, Thang H. Bui, Emanuel Di Nardo, Angelo Ciaramella, Son H. Le, Ihsan Ullah, Lorenzo Di Rocco, Tho T. Quan

Main category: cs.CL

TL;DR: The paper analyzes the XAI Challenge 2025, a hackathon focused on creating explainable AI (XAI) QA systems for education, using lightweight LLMs or hybrid systems. It highlights the challenge’s design, dataset, and broader implications for XAI in education.

DetailsMotivation: The need for transparency and interpretability in AI-driven education, particularly in real-world contexts, motivated the XAI Challenge 2025. The hackathon aimed to bridge the gap between LLMs and symbolic reasoning for explainability.

Method: The challenge involved building QA systems for university policy queries with logic-based explanations. Participants used lightweight LLMs or hybrid LLM-symbolic systems. A high-quality dataset was constructed using logic-based templates and expert validation.

Result: The challenge successfully demonstrated the feasibility of combining LLMs and symbolic reasoning for explainable AI in education. It provided insights into practical XAI solutions for real-world academic scenarios.

Conclusion: The XAI Challenge 2025 represents a pioneering effort in integrating LLMs and symbolic reasoning for explainability in education. It offers a model for future XAI-focused hackathons and educational AI systems.

Abstract: The growing integration of Artificial Intelligence (AI) into education has intensified the need for transparency and interpretability. While hackathons have long served as agile environments for rapid AI prototyping, few have directly addressed eXplainable AI (XAI) in real-world educational contexts. This paper presents a comprehensive analysis of the XAI Challenge 2025, a hackathon-style competition jointly organized by Ho Chi Minh City University of Technology (HCMUT) and the International Workshop on Trustworthiness and Reliability in Neurosymbolic AI (TRNS-AI), held as part of the International Joint Conference on Neural Networks (IJCNN 2025). The challenge tasked participants with building Question-Answering (QA) systems capable of answering student queries about university policies while generating clear, logic-based natural language explanations. To promote transparency and trustworthiness, solutions were required to use lightweight Large Language Models (LLMs) or hybrid LLM-symbolic systems. A high-quality dataset was provided, constructed via logic-based templates with Z3 validation and refined through expert student review to ensure alignment with real-world academic scenarios. We describe the challenge’s motivation, structure, dataset construction, and evaluation protocol. Situating the competition within the broader evolution of AI hackathons, we argue that it represents a novel effort to bridge LLMs and symbolic reasoning in service of explainability. Our findings offer actionable insights for future XAI-centered educational systems and competitive research initiatives.

[58] Pre-trained Transformer-Based Approach for Arabic Question Answering : A Comparative Study

Kholoud Alsubhi, Amani Jamal, Areej Alhothali

Main category: cs.CL

TL;DR: The paper evaluates pre-trained transformer models (AraBERTv2-base, AraBERTv0.2-large, AraELECTRA) for Arabic QA using four datasets, analyzing their performance and low-results causes.

DetailsMotivation: Despite QA's importance in NLP, Arabic QA lags due to limited research and datasets. Pre-trained models show promise, but their effectiveness in Arabic QA needs evaluation.

Method: Fine-tuned and compared three pre-trained models (AraBERTv2-base, AraBERTv0.2-large, AraELECTRA) on four Arabic QA datasets (Arabic-SQuAD, ARCD, AQAD, TyDiQA-GoldP).

Result: The models’ performance was evaluated, with some showing low results, prompting an analysis of the causes.

Conclusion: The study highlights the potential and challenges of pre-trained models in Arabic QA, emphasizing the need for further research and better datasets.

Abstract: Question answering(QA) is one of the most challenging yet widely investigated problems in Natural Language Processing (NLP). Question-answering (QA) systems try to produce answers for given questions. These answers can be generated from unstructured or structured text. Hence, QA is considered an important research area that can be used in evaluating text understanding systems. A large volume of QA studies was devoted to the English language, investigating the most advanced techniques and achieving state-of-the-art results. However, research efforts in the Arabic question-answering progress at a considerably slower pace due to the scarcity of research efforts in Arabic QA and the lack of large benchmark datasets. Recently many pre-trained language models provided high performance in many Arabic NLP problems. In this work, we evaluate the state-of-the-art pre-trained transformers models for Arabic QA using four reading comprehension datasets which are Arabic-SQuAD, ARCD, AQAD, and TyDiQA-GoldP datasets. We fine-tuned and compared the performance of the AraBERTv2-base model, AraBERTv0.2-large model, and AraELECTRA model. In the last, we provide an analysis to understand and interpret the low-performance results obtained by some models.

[59] Bridging LLMs and KGs without Fine-Tuning: Intermediate Probing Meets Subgraph-Aware Entity Descriptions

Bo Xue, Yi Xu, Yunchong Song, Jiaxin Ding, Luoyi Fu, Xinbing Wang

Main category: cs.CL

TL;DR: A novel framework combines LLMs and knowledge representation for efficient and effective KGC, achieving significant improvements in performance and computational efficiency.

DetailsMotivation: Traditional KGC methods struggle with KG sparsity, while LLMs offer rich knowledge but face computational challenges when fine-tuned. This work aims to bridge these gaps.

Method: Extracts context-aware hidden states from LLM intermediate layers, trains a data-efficient classifier, and uses subgraph sampling and SMI for semantic alignment.

Result: 47% relative improvement over non-fine-tuned LLM methods, matching fine-tuned LLM performance with 188x GPU memory efficiency and 26.11x speedup.

Conclusion: The framework successfully synergizes LLMs and KGs, offering a scalable and efficient solution for KGC.

Abstract: Traditional knowledge graph completion (KGC) methods rely solely on structural information, struggling with the inherent sparsity of knowledge graphs (KGs). By contrast, Large Language Models (LLMs) encapsulate extensive world knowledge and exhibit powerful context modeling capabilities, making them promising for mitigating the limitations of traditional methods. However, direct fine-tuning of LLMs for KGC, though effective, imposes substantial computational and memory overheads, while utilizing non-fine-tuned LLMs is efficient but yields suboptimal performance. In this work, we propose a novel framework that synergizes the strengths of LLMs with robust knowledge representation to enable effective and efficient KGC. We extract the context-aware hidden states of knowledge triples from the intermediate layers of LLMs, thereby capturing rich semantic and relational nuances. These representations are then utilized to train a data-efficient classifier tailored specifically for KGC tasks. To bridge the semantic gaps between LLMs and KGs, we employ subgraph sampling on KGs to generate model-friendly entity descriptions. We further adopt sliced mutual information (SMI) as a principled metric to quantify the task-specific information encoded in these representations. Extensive experiments on standard benchmarks validate the efficiency and effectiveness of our approach. We achieve a 47% relative improvement over previous methods based on non-fine-tuned LLMs and, to our knowledge, are the first to achieve classification performance comparable to fine-tuned LLMs while enhancing GPU memory efficiency by $188\times$ and accelerating training and inference by $26.11\times$.

[60] Domain-Independent Automatic Generation of Descriptive Texts for Time-Series Data

Kota Dohi, Aoi Ito, Harsh Purohit, Tomoya Nishida, Takashi Endo, Yohei Kawaguchi

Main category: cs.CL

TL;DR: Proposes a method to generate domain-independent descriptive texts for time-series data using a novel backward approach, creating the TACO dataset for training contrastive learning models.

DetailsMotivation: Addresses the challenge of scarce annotated time-series data for training models to generate descriptive texts.

Method: Introduces two approaches (forward and backward) for pairing time-series data with texts, focusing on the backward approach to create the TACO dataset.

Result: A contrastive learning model trained on TACO generates descriptive texts for time-series data in novel domains.

Conclusion: The backward approach and TACO dataset enable effective text generation for time-series data across domains.

Abstract: Due to scarcity of time-series data annotated with descriptive texts, training a model to generate descriptive texts for time-series data is challenging. In this study, we propose a method to systematically generate domain-independent descriptive texts from time-series data. We identify two distinct approaches for creating pairs of time-series data and descriptive texts: the forward approach and the backward approach. By implementing the novel backward approach, we create the Temporal Automated Captions for Observations (TACO) dataset. Experimental results demonstrate that a contrastive learning based model trained using the TACO dataset is capable of generating descriptive texts for time-series data in novel domains.

[61] A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans

Zhihan Cao, Hiroaki Yamada, Simone Teufel, Takenobu Tokunaga

Main category: cs.CL

TL;DR: The paper introduces a framework to evaluate PLMs’ knowledge of five semantic relations beyond hypernymy, comparing human and model performance using five metrics. Results show a significant gap between humans and models, with antonymy as the best-performing relation.

DetailsMotivation: To address the incomplete understanding of PLMs' semantic relation knowledge by expanding beyond hypernymy and comparing human and model performance.

Method: A comprehensive evaluation framework covering five semantic relations (hyponymy, holonymy, meronymy, antonymy, synonymy) and five metrics (soundness, completeness, symmetry, prototypicality, distinguishability). Six PLMs (four masked, two causal) were tested.

Result: Significant knowledge gap between humans and models across all relations. Causal models don’t always outperform masked models. Antonymy is the best-performing relation.

Conclusion: The study highlights the limitations of PLMs in semantic relation knowledge and provides a framework for future comparisons.

Abstract: Recently, much work has concerned itself with the enigma of what exactly pretrained language models~(PLMs) learn about different aspects of language, and how they learn it. One stream of this type of research investigates the knowledge that PLMs have about semantic relations. However, many aspects of semantic relations were left unexplored. Generally, only one relation has been considered, namely hypernymy. Furthermore, previous work did not measure humans’ performance on the same task as that performed by the PLMs. This means that at this point in time, there is only an incomplete view of the extent of these models’ semantic relation knowledge. To address this gap, we introduce a comprehensive evaluation framework covering five relations beyond hypernymy, namely hyponymy, holonymy, meronymy, antonymy, and synonymy. We use five metrics (two newly introduced here) for recently untreated aspects of semantic relation knowledge, namely soundness, completeness, symmetry, prototypicality, and distinguishability. Using these, we can fairly compare humans and models on the same task. Our extensive experiments involve six PLMs, four masked and two causal language models. The results reveal a significant knowledge gap between humans and models for all semantic relations. In general, causal language models, despite their wide use, do not always perform significantly better than masked language models. Antonymy is the outlier relation where all models perform reasonably well. The evaluation materials can be found at https://github.com/hancules/ProbeResponses.

[62] From Text to Trajectory: Exploring Complex Constraint Representation and Decomposition in Safe Reinforcement Learning

Pusen Dong, Tianchen Zhu, Yue Qiu, Haoyi Zhou, Jianxin Li

Main category: cs.CL

TL;DR: TTCT replaces manual cost functions in safe RL by using natural language constraints as training signals, achieving lower violation rates and zero-shot transfer capability.

DetailsMotivation: To eliminate the need for domain expertise and manual cost function design in safe RL with natural language constraints.

Method: Introduces TTCT, which translates textual constraints and trajectories into training signals, replacing manual cost functions.

Result: TTCT effectively understands constraints, reduces violation rates, and adapts to constraint-shift environments with zero-shot transfer.

Conclusion: TTCT offers a flexible, expert-free approach to safe RL with natural language constraints, demonstrating superior performance and adaptability.

Abstract: Safe reinforcement learning (RL) requires the agent to finish a given task while obeying specific constraints. Giving constraints in natural language form has great potential for practical scenarios due to its flexible transfer capability and accessibility. Previous safe RL methods with natural language constraints typically need to design cost functions manually for each constraint, which requires domain expertise and lacks flexibility. In this paper, we harness the dual role of text in this task, using it not only to provide constraint but also as a training signal. We introduce the Trajectory-level Textual Constraints Translator (TTCT) to replace the manually designed cost function. Our empirical results demonstrate that TTCT effectively comprehends textual constraint and trajectory, and the policies trained by TTCT can achieve a lower violation rate than the standard cost function. Extra studies are conducted to demonstrate that the TTCT has zero-shot transfer capability to adapt to constraint-shift environments.

[63] Think Outside the Data: Colonial Biases and Systemic Issues in Automated Moderation Pipelines for Low-Resource Languages

Farhana Shahid, Mona Elswah, Aditya Vashistha

Main category: cs.CL

TL;DR: The paper highlights challenges in AI-driven moderation for low-resource languages in the Global South, revealing systemic inequities beyond data scarcity, and proposes multi-stakeholder solutions.

DetailsMotivation: To address the struggles of AI moderation systems with low-resource languages and uncover socio-political factors exacerbating inequities.

Method: Semi-structured interviews with 22 AI experts working on harmful content detection in Tamil, Swahili, Maghrebi Arabic, and Quechua.

Result: Findings show systemic issues like data monopolies, lack of investment, and English-centric model designs, reflecting structural inequities.

Conclusion: The paper advocates for multi-stakeholder efforts to democratize data, strengthen local research, and develop language-aware solutions.

Abstract: Most social media users come from the Global South, where harmful content usually appears in local languages. Yet, AI-driven moderation systems struggle with low-resource languages spoken in these regions. Through semi-structured interviews with 22 AI experts working on harmful content detection in four low-resource languages: Tamil (South Asia), Swahili (East Africa), Maghrebi Arabic (North Africa), and Quechua (South America)–we examine systemic issues in building automated moderation tools for these languages. Our findings reveal that beyond data scarcity, socio-political factors such as tech companies’ monopoly on user data and lack of investment in moderation for low-profit Global South markets exacerbate historic inequities. Even if more data were available, the English-centric and data-intensive design of language models and preprocessing techniques overlooks the need to design for morphologically complex, linguistically diverse, and code-mixed languages. We argue these limitations are not just technical gaps caused by “data scarcity” but reflect structural inequities, rooted in colonial suppression of non-Western languages. We discuss multi-stakeholder approaches to strengthen local research capacity, democratize data access, and support language-aware solutions to improve automated moderation for low-resource languages.

[64] AdaMCoT: Rethinking Cross-Lingual Factual Reasoning through Adaptive Multilingual Chain-of-Thought

Weihua Zheng, Xin Huang, Zhengyuan Liu, Tarun Kumar Vangani, Bowei Zou, Xiyan Tao, Yuhao Wu, Ai Ti Aw, Nancy F. Chen, Roy Ka-Wei Lee

Main category: cs.CL

TL;DR: AdaMCOT improves multilingual reasoning by dynamically routing thought processes in intermediary languages, enhancing performance and consistency across languages, especially for low-resource ones.

DetailsMotivation: Address performance gaps in multilingual reasoning due to imbalanced training data and scalability issues in existing methods.

Method: Introduces AdaMCOT, a framework using adaptive, reward-based routing of thought processes in intermediary languages without additional pretraining.

Result: Substantial improvements in factual reasoning quality and cross-lingual consistency, particularly for low-resource languages.

Conclusion: Adaptive reasoning paths bridge performance gaps between languages while preserving linguistic and cultural nuances.

Abstract: Large language models (LLMs) have shown impressive multilingual capabilities through pretraining on diverse corpora. Although these models show strong reasoning abilities, their performance varies significantly between languages due to the imbalanced distribution of training data. Existing approaches using sample-level translation for extensive multilingual pretraining and cross-lingual tuning face scalability challenges and often fail to capture nuanced reasoning processes across languages. In this paper, we introduce AdaMCOT (Adaptive Multilingual Chain-of-Thought), a framework that enhances multilingual factual reasoning by dynamically routing thought processes in intermediary “thinking languages” before generating target-language responses. AdaMCOT leverages a language-agnostic core and incorporates an adaptive, reward-based mechanism for selecting optimal reasoning pathways without requiring additional pretraining. Our comprehensive evaluation across multiple benchmarks demonstrates substantial improvements in both factual reasoning quality and cross-lingual consistency, with particularly strong performance gains in low-resource language settings. An in-depth analysis of the model’s hidden states and semantic space further elucidates the underlying mechanism of our method. The results suggest that adaptive reasoning paths can effectively bridge the performance gap between high and low-resource languages while maintaining cultural and linguistic nuances.

[65] CLIPPER: Compression enables long-context synthetic data generation

Chau Minh Pham, Yapei Chang, Mohit Iyyer

Main category: cs.CL

TL;DR: CLIPPER is a compression-based method for generating high-quality synthetic data for narrative claim verification, improving model accuracy from 28% to 76%.

DetailsMotivation: Generating synthetic data for complex long-context reasoning tasks is challenging, especially for narrative claim verification.

Method: CLIPPER compresses books into chapter outlines and summaries, then uses these to generate claims and chain-of-thought reasoning.

Result: The method produces 19K synthetic claims, boosting model accuracy to 76% and setting a new state-of-the-art for sub-10B models.

Conclusion: CLIPPER enhances synthetic data quality, improving narrative claim verification and other understanding tasks.

Abstract: LLM developers are increasingly reliant on synthetic data, but generating high-quality data for complex long-context reasoning tasks remains challenging. We introduce CLIPPER, a compression-based approach for generating synthetic data tailored to narrative claim verification - a task that requires reasoning over a book to verify a given claim. Instead of generating claims directly from the raw text of the book, which results in artifact-riddled claims, CLIPPER first compresses the book into chapter outlines and book summaries and then uses these intermediate representations to generate complex claims and corresponding chain-of-thoughts. Compared to naive approaches, CLIPPER produces claims that are more valid, grounded, and complex. Using CLIPPER, we construct a dataset of 19K synthetic book claims paired with their source texts and chain-of-thought reasoning, and use it to fine-tune three open-weight models. Our best model achieves breakthrough results on narrative claim verification (from 28% to 76% accuracy on our test set) and sets a new state-of-the-art for sub-10B models on the NoCha leaderboard. Further analysis shows that our models generate more detailed and grounded chain-of-thought reasoning while also improving performance on other narrative understanding tasks (e.g., NarrativeQA).

[66] M2S: Multi-turn to Single-turn jailbreak in Red Teaming for LLMs

Junwoo Ha, Hyunjun Kim, Sangyoon Yu, Haon Park, Ashkan Yousefpour, Yuna Park, Suhyun Kim

Main category: cs.CL

TL;DR: The paper introduces M2S methods (Hyphenize, Numberize, Pythonize) to convert multi-turn adversarial prompts into single-turn queries, improving attack success rates and reducing manual effort in testing LLMs.

DetailsMotivation: Multi-turn human jailbreaks are effective but time-consuming; the goal is to streamline adversarial testing by simplifying prompts while maintaining or enhancing their potency.

Method: The M2S framework reformats multi-turn dialogues into structured single-turn prompts using Hyphenize, Numberize, and Pythonize techniques.

Result: M2S methods achieve attack success rates of 70.6% to 95.9% on the MHJ dataset, outperforming multi-turn attacks by up to 17.5 percentage points and reducing token usage by over half.

Conclusion: The M2S framework offers a scalable tool for red teaming and exposes vulnerabilities in current LLM defenses, particularly through contextual blindness exploitation.

Abstract: We introduce a novel framework for consolidating multi-turn adversarial jailbreak'' prompts into single-turn queries, significantly reducing the manual overhead required for adversarial testing of large language models (LLMs). While multi-turn human jailbreaks have been shown to yield high attack success rates, they demand considerable human effort and time. Our multi-turn-to-single-turn (M2S) methods -- Hyphenize, Numberize, and Pythonize -- systematically reformat multi-turn dialogues into structured single-turn prompts. Despite removing iterative back-and-forth interactions, these prompts preserve and often enhance adversarial potency: in extensive evaluations on the Multi-turn Human Jailbreak (MHJ) dataset, M2S methods achieve attack success rates from 70.6 percent to 95.9 percent across several state-of-the-art LLMs. Remarkably, the single-turn prompts outperform the original multi-turn attacks by as much as 17.5 percentage points while cutting token usage by more than half on average. Further analysis shows that embedding malicious requests in enumerated or code-like structures exploits contextual blindness’’, bypassing both native guardrails and external input-output filters. By converting multi-turn conversations into concise single-turn prompts, the M2S framework provides a scalable tool for large-scale red teaming and reveals critical weaknesses in contemporary LLM defenses.

[67] GPT is Devastated and LLaMA is Content: Emotion Representation Alignment in LLMs for Keyword-based Generation

Shadab Choudhury, Asha Kumar, Lara J. Martin

Main category: cs.CL

TL;DR: The paper introduces Representation Alignment to measure the gap between LLMs’ interpretation of emotions and human expectations, finding words like “angry” align better than VAD scales.

DetailsMotivation: To address the misalignment between how LLMs interpret emotions and human expectations in controlled text generation.

Method: Evaluated four emotion representations (Words, VAD dimensions, Emojis) using GPT-4 and LLaMA-3, measuring Representation Alignment, accuracy, and realism.

Result: People agreed more with LLM outputs conditioned on words than VAD scales, especially numeric VAD. Emotion perception varied by representation type and emotion.

Conclusion: Words outperform VAD scales in aligning LLM-generated emotions with human expectations, highlighting the importance of representation choice.

Abstract: In controlled text generation using large language models (LLMs), gaps arise between the language model’s interpretation of concepts and people’s expectations. We introduce the human evaluation task of Representation Alignment for measuring this gap. We selected four emotion representations: Words, Valence-Arousal-Dominance (VAD) dimensions expressed in both Lexical and Numeric forms, and Emojis and evaluate them in the context of keyword-guided sentence generation using both GPT-4 and LLaMA-3. In addition to Representation Alignment, we also measure people’s judgments of the accuracy and realism of the generated sentences. While representations like VAD break emotions into easy-to-compute components, our findings show that people agree more with how LLMs generate when conditioned on English words (e.g., ``angry’’) rather than VAD scales. This difference is especially visible when comparing Numeric VAD to words. Furthermore, we found that the perception of how much a generated sentence conveys an emotion is dependent on both the representation type and which emotion it is.

[68] Ensemble Learning for Large Language Models in Text and Code Generation: A Survey

Mari Ashiga, Wei Jie, Fan Wu, Vardan Voskanyan, Fateme Dinmohammadi, Paul Brookes, Jingzhi Gong, Zheng Wang

Main category: cs.CL

TL;DR: The paper reviews ensemble techniques for LLMs, categorizing them into seven methods to improve output diversity, quality, and flexibility, with potential applications in multimodal LLMs.

DetailsMotivation: Address inconsistencies, biases, and closed-source limitations of individual LLMs by exploring ensemble approaches for better text and code generation.

Method: Categorizes LLM ensembles into seven methods: weight merging, knowledge fusion, mixture-of-experts, reward ensemble, output ensemble, routing, and cascading.

Result: Ensemble techniques improve diversity representation, output quality, and application flexibility.

Conclusion: The findings support model selection for real-world tasks and pave the way for extending ensembles to multimodal LLMs.

Abstract: Generative Pretrained Transformers (GPTs) are foundational Large Language Models (LLMs) for text generation. However, individual LLMs often produce inconsistent outputs and exhibit biases, limiting their representation of diverse language patterns. The closed-source nature of many powerful LLMs further restricts industry applications due to data privacy concerns. Inspired by successes in text generation, LLM ensemble techniques are now increasingly explored for code generation. This article reviews these emerging ensemble approaches to enhance understanding, encourage further research, and promote practical implementation in both text and code generation. We categorize LLM ensembles into seven main methods - weight merging, knowledge fusion, mixture-of-experts, reward ensemble, output ensemble, routing, and cascading - analyzing capabilities of those approaches. Our findings highlight key benefits such as improved diversity representation, enhanced output quality, and greater application flexibility. These insights aid model selection for real-world tasks and crucially, lay groundwork for extending ensemble strategies to multimodal LLMs.

[69] Why do LLMs attend to the first token?

Federico Barbero, Álvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veličković, Razvan Pascanu

Main category: cs.CL

TL;DR: The paper investigates why LLMs develop attention sinks (focusing heavily on the first token) and how they use them, linking this behavior to avoiding over-mixing of information in Transformers.

DetailsMotivation: To understand why LLMs learn attention sinks and their functional role, addressing gaps in existing explanations.

Method: Theoretical analysis and empirical experiments exploring factors like context length, depth, and data packing.

Result: Attention sinks help LLMs avoid over-mixing, with their behavior influenced by architectural and training choices.

Conclusion: The study offers insights into attention sink utility, enhancing understanding of LLM attention patterns.

Abstract: Large Language Models (LLMs) tend to attend heavily to the first token in the sequence – creating a so-called attention sink. Many works have studied this phenomenon in detail, proposing various ways to either leverage or alleviate it. Attention sinks have been connected to quantisation difficulties, security issues, and streaming attention. Yet, while many works have provided conditions in which they occur or not, a critical question remains shallowly answered: Why do LLMs learn such patterns and how are they being used? In this work, we argue theoretically and empirically that this mechanism provides a method for LLMs to avoid over-mixing, connecting this to existing lines of work that study mathematically how information propagates in Transformers. We conduct experiments to validate our theoretical intuitions and show how choices such as context length, depth, and data packing influence the sink behaviour. We hope that this study provides a new practical perspective on why attention sinks are useful in LLMs, leading to a better understanding of the attention patterns that form during training.

[70] Can Performant LLMs Be Ethical? Quantifying the Impact of Web Crawling Opt-Outs

Dongyang Fan, Vinko Sabolčec, Matin Ansaripour, Ayush Kumar Tarun, Martin Jaggi, Antoine Bosselut, Imanol Schlag

Main category: cs.CL

TL;DR: The study introduces the ‘data compliance gap’ (DCG) to measure performance differences in LLMs trained with or without web crawling opt-outs. Findings show minimal impact on general knowledge but declines in specialized domains like biomedical research.

DetailsMotivation: To understand how web crawling opt-outs and dataset filtering affect LLM performance, addressing the trade-off between data compliance and model capabilities.

Method: Conceptualized DCG, tested in pretraining from scratch and continual pretraining scenarios using 1.5B models, focusing on general and specialized domains.

Result: Compliance with opt-outs shows negligible impact on general knowledge (0% DCG) but reduces performance in specialized domains like biomedical research.

Conclusion: General-purpose LLMs can perform well with open data, but specialized domains may need copyrighted sources later in training, highlighting the need for balanced AI training policies.

Abstract: The increasing adoption of web crawling opt-outs by copyright holders of online content raises critical questions about the impact of data compliance on large language model (LLM) performance. However, little is known about how these restrictions (and the resultant filtering of pretraining datasets) affect the capabilities of models trained using these corpora. In this work, we conceptualize this effect as the $\textit{data compliance gap}$ (DCG), which quantifies the performance difference between models trained on datasets that comply with web crawling opt-outs, and those that do not. We measure the data compliance gap in two settings: pretraining models from scratch and continual pretraining from existing compliant models (simulating a setting where copyrighted data could be integrated later in pretraining). Our experiments with 1.5B models show that, as of January 2025, compliance with web data opt-outs does not degrade general knowledge acquisition (close to 0% DCG). However, in specialized domains such as biomedical research, excluding major publishers leads to performance declines. These findings suggest that while general-purpose LLMs can be trained to perform equally well using fully open data, performance in specialized domains may benefit from access to high-quality copyrighted sources later in training. Our study provides empirical insights into the long-debated trade-off between data compliance and downstream model performance, informing future discussions on AI training practices and policy decisions. Our website is available at https://data-compliance.github.io/.

[71] The Multi-Round Diagnostic RAG Framework for Emulating Clinical Reasoning

Penglei Sun, Yixiang Chen, Xiang Li, Xiaowen Chu

Main category: cs.CL

TL;DR: The paper addresses the semantic gap in medical LLMs by introducing DiagnosGraph, a knowledge graph, and MRD-RAG, a multi-round dialogue framework, to improve diagnostic accuracy.

DetailsMotivation: The semantic gap between colloquial patient descriptions and professional medical terminology hinders the practical deployment of retrieval-augmented generation (RAG) for medical diagnosis.

Method: Constructed DiagnosGraph, a knowledge graph with 876 diseases, and introduced MRD-RAG, a multi-round dialogue framework, to refine diagnoses.

Result: MRD-RAG improved diagnostic performance on four benchmarks, validated by human physicians.

Conclusion: The approach enhances automated diagnosis accuracy and aligns it more closely with human clinical reasoning.

Abstract: In recent years, accurately and quickly deploying medical large language models (LLMs) has become a trend. Among these, retrieval-augmented generation (RAG) has garnered attention due to rapid deployment and privacy protection. However, the challenge hinder the practical deployment of RAG for medical diagnosis: the semantic gap between colloquial patient descriptions and the professional terminology within medical knowledge bases. We try to address the challenge from the data perspective and the method perspective. First, to address the semantic gap in existing knowledge bases, we construct DiagnosGraph, a generalist knowledge graph covering both modern medicine and Traditional Chinese Medicine. It contains 876 common diseases with the graph of 7,997 nodes and 37,201 triples. To bridge the gap between colloquial patient narratives and academic medical knowledge, DiagnosGraph also introduces $1,908$ medical record by formalizing the patient chief complaint and proposing a medical diagnosis. Second, we introduce the Multi-Round Diagnostic RAG (MRD-RAG) framework. It utilizes a multi-round dialogue to refine diagnostic possibilities, emulating the clinical reasoning of a physician. Experiments conducted on four medical benchmarks, with evaluations by human physicians, demonstrate that MRD-RAG enhances the diagnostic performance of LLMs, highlighting its potential to make automated diagnosis more accurate and human-aligned.

[72] Reconstructing Sepsis Trajectories from Clinical Case Reports using LLMs: the Textual Time Series Corpus for Sepsis

Shahriar Noroozizadeh, Jeremy C. Weiss

Main category: cs.CL

TL;DR: The paper presents a pipeline using LLMs to extract and annotate time-localized clinical findings from case reports, validated on sepsis data with high accuracy.

DetailsMotivation: To address the incompleteness of structured clinical data by leveraging more complete but delayed case reports for training models.

Method: Constructed a pipeline to phenotype, extract, and annotate findings in case reports using LLMs, validated on sepsis data from PMOA and I2B2/MIMIC-IV.

Result: High recovery rates of clinical findings (event match rates ~0.75) and strong temporal ordering (concordance ~0.93).

Conclusion: LLMs can effectively time-localize clinical findings but have limitations; multimodal integration is suggested for improvement.

Abstract: Clinical case reports and discharge summaries may be the most complete and accurate summarization of patient encounters, yet they are finalized, i.e., timestamped after the encounter. Complementary data structured streams become available sooner but suffer from incompleteness. To train models and algorithms on more complete and temporally fine-grained data, we construct a pipeline to phenotype, extract, and annotate time-localized findings within case reports using large language models. We apply our pipeline to generate an open-access textual time series corpus for Sepsis-3 comprising 2,139 case reports from the Pubmed-Open Access (PMOA) Subset. To validate our system, we apply it on PMOA and timeline annotations from I2B2/MIMIC-IV and compare the results to physician-expert annotations. We show high recovery rates of clinical findings (event match rates: O1-preview–0.755, Llama 3.3 70B Instruct–0.753) and strong temporal ordering (concordance: O1-preview–0.932, Llama 3.3 70B Instruct–0.932). Our work characterizes the ability of LLMs to time-localize clinical findings in text, illustrating the limitations of LLM use for temporal reconstruction and providing several potential avenues of improvement via multimodal integration.

[73] Energy-Based Reward Models for Robust Language Model Alignment

Anamika Lochab, Ruqi Zhang

Main category: cs.CL

TL;DR: EBRM is a lightweight post-hoc framework enhancing Reward Models (RMs) by explicitly modeling reward distribution, improving robustness and generalization without retraining.

DetailsMotivation: Standard RMs struggle with complex human preferences and unseen data, requiring a solution to enhance their robustness and generalization.

Method: EBRM uses conflict-aware data filtering, label-noise-aware contrastive training, and hybrid initialization to refine RMs.

Result: EBRM improves robustness and generalization, achieving up to 5.97% better performance in safety-critical tasks and delaying reward hacking.

Conclusion: EBRM is a scalable and effective enhancement for existing RMs, adaptable across models and tasks.

Abstract: Reward models (RMs) are essential for aligning Large Language Models (LLMs) with human preferences. However, they often struggle with capturing complex human preferences and generalizing to unseen data. To address these challenges, we introduce Energy-Based Reward Model (EBRM), a lightweight post-hoc refinement framework that enhances RM robustness and generalization. EBRM models the reward distribution explicitly, capturing uncertainty in human preferences and mitigating the impact of noisy or misaligned annotations. It achieves this through conflict-aware data filtering, label-noise-aware contrastive training, and hybrid initialization. Notably, EBRM enhances RMs without retraining, making it computationally efficient and adaptable across different models and tasks. Empirical evaluations on RM benchmarks demonstrate significant improvements in both robustness and generalization, achieving up to a 5.97% improvement in safety-critical alignment tasks compared to standard RMs. Furthermore, reinforcement learning experiments confirm that our refined rewards enhance alignment quality, effectively delaying reward hacking. These results demonstrate our approach as a scalable and effective enhancement for existing RMs and alignment pipelines. The code is available at EBRM.

[74] Science Hierarchography: Hierarchical Organization of Science Literature

Muhan Gao, Jash Shah, Weiqi Wang, Daniel Khashabi

Main category: cs.CL

TL;DR: The paper introduces SCIENCE HIERARCHOGRAPHY, a method to organize scientific literature into hierarchical structures for better insights into research density and gaps, using a hybrid approach of embedding-based clustering and LLM-based prompting.

DetailsMotivation: The rapid growth of scientific knowledge makes it challenging to track progress and conceptual links across disciplines. Existing tools lack the abstraction to represent the density and structure of research activity.

Method: A hybrid approach combining embedding-based clustering with LLM-based prompting, balancing scalability and semantic precision.

Result: The method achieves superior quality-speed trade-offs, captures interdisciplinary research dimensions, and improves interpretability for navigating literature.

Conclusion: SCIENCE HIERARCHOGRAPHY offers an effective alternative to traditional search methods, enhancing exploration of scientific literature.

Abstract: Scientific knowledge is growing rapidly, making it difficult to track progress and high-level conceptual links across broad disciplines. While tools like citation networks and search engines help retrieve related papers, they lack the abstraction needed to capture the needed to represent the density and structure of activity across subfields. We motivate SCIENCE HIERARCHOGRAPHY, the goal of organizing scientific literature into a high-quality hierarchical structure that spans multiple levels of abstraction – from broad domains to specific studies. Such a representation can provide insights into which fields are well-explored and which are under-explored. To achieve this goal, we develop a hybrid approach that combines efficient embedding-based clustering with LLM-based prompting, striking a balance between scalability and semantic precision. Compared to LLM-heavy methods like iterative tree construction, our approach achieves superior quality-speed trade-offs. Our hierarchies capture different dimensions of research contributions, reflecting the interdisciplinary and multifaceted nature of modern science. We evaluate its utility by measuring how effectively an LLM-based agent can navigate the hierarchy to locate target papers. Results show that our method improves interpretability and offers an alternative pathway for exploring scientific literature beyond traditional search methods. Code, data and demo are available: https://github.com/JHU-CLSP/science-hierarchography

[75] Multilingual Performance Biases of Large Language Models in Education

Vansh Gupta, Sankalan Pal Chowdhury, Vilém Zouhar, Donya Rooein, Mrinmaya Sachan

Main category: cs.CL

TL;DR: LLMs perform variably in non-English educational tasks, with performance linked to training data volume. Practitioners should verify model efficacy in target languages before deployment.

DetailsMotivation: To assess the suitability of LLMs for educational tasks in non-English languages, given their English-centric nature.

Method: Evaluated popular LLMs on four educational tasks (misconception identification, feedback, tutoring, grading) across eight non-English languages and English.

Result: Performance correlates with language representation in training data; lower-resource languages show poorer results. Significant drops from English performance noted.

Conclusion: Practitioners should verify LLM performance in target languages before educational deployment due to variability in non-English tasks.

Abstract: Large language models (LLMs) are increasingly being adopted in educational settings. These applications expand beyond English, though current LLMs remain primarily English-centric. In this work, we ascertain if their use in education settings in non-English languages is warranted. We evaluated the performance of popular LLMs on four educational tasks: identifying student misconceptions, providing targeted feedback, interactive tutoring, and grading translations in eight languages (Mandarin, Hindi, Arabic, German, Farsi, Telugu, Ukrainian, Czech) in addition to English. We find that the performance on these tasks somewhat corresponds to the amount of language represented in training data, with lower-resource languages having poorer task performance. Although the models perform reasonably well in most languages, the frequent performance drop from English is significant. Thus, we recommend that practitioners first verify that the LLM works well in the target language for their educational task before deployment.

[76] MetaGen Blended RAG: Unlocking Zero-Shot Precision for Specialized Domain Question-Answering

Kunal Sawarkar, Shivam R. Solanki, Abhilasha Mangal

Main category: cs.CL

TL;DR: MetaGen Blended RAG improves enterprise search by enhancing semantic retrievers with metadata and hybrid queries, achieving high accuracy without fine-tuning.

DetailsMotivation: RAG struggles with domain-specific datasets due to semantic variability and lack of generalization in fine-tuning solutions.

Method: Uses a metadata generation pipeline and hybrid query indexes (dense and sparse vectors) to enrich semantic retrievers.

Result: Achieves 82% retrieval accuracy and 77% RAG accuracy on PubMedQA, surpassing zero-shot benchmarks and rivaling fine-tuned models.

Conclusion: MetaGen Blended RAG redefines enterprise search with unmatched generalization across specialized domains.

Abstract: Retrieval-Augmented Generation (RAG) struggles with domain-specific enterprise datasets, often isolated behind firewalls and rich in complex, specialized terminology unseen by LLMs during pre-training. Semantic variability across domains like medicine, networking, or law hampers RAG’s context precision, while fine-tuning solutions are costly, slow, and lack generalization as new data emerges. Achieving zero-shot precision with retrievers without fine-tuning still remains a key challenge. We introduce ‘MetaGen Blended RAG’, a novel enterprise search approach that enhances semantic retrievers through a metadata generation pipeline and hybrid query indexes using dense and sparse vectors. By leveraging key concepts, topics, and acronyms, our method creates metadata-enriched semantic indexes and boosted hybrid queries, delivering robust, scalable performance without fine-tuning. On the biomedical PubMedQA dataset, MetaGen Blended RAG achieves 82% retrieval accuracy and 77% RAG accuracy, surpassing all prior zero-shot RAG benchmarks and even rivaling fine-tuned models on that dataset, while also excelling on datasets like SQuAD and NQ. This approach redefines enterprise search using a new approach to building semantic retrievers with unmatched generalization across specialized domains.

[77] RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation

Tianjiao Li, Mengran Yu, Chenyu Shi, Yanjun Zhao, Xiaojing Liu, Qiang Zhang, Qi Zhang, Xuanjing Huang, Jiayin Wang

Main category: cs.CL

TL;DR: The paper investigates poor performance of RLHF in colloquial subtitle translation, attributing it to divergence between the reward model and LLM. It proposes RIVAL, an adversarial training framework, to align the models and improve translation quality.

DetailsMotivation: The unexpected poor performance of RLHF in colloquial subtitle translation tasks, caused by divergence between the offline reward model and the online LLM due to distributional shift.

Method: Proposes RIVAL, an adversarial training framework where the RM and LLM iteratively update in a min-max game. The RM distinguishes strong/weak translations, while the LLM improves to close the gap. Quantitative preference rewards (e.g., BLEU) are also incorporated.

Result: RIVAL significantly improves translation performance over baselines, as demonstrated through extensive experiments.

Conclusion: The adversarial training framework RIVAL effectively addresses the divergence issue in RLHF for colloquial subtitle translation, enhancing translation quality.

Abstract: Large language models (LLMs) possess strong multilingual capabilities, and combining Reinforcement Learning from Human Feedback (RLHF) with translation tasks has shown great potential. However, we observe that this paradigm performs unexpectedly poorly when applied to colloquial subtitle translation tasks. In this work, we investigate this issue and find that the offline reward model (RM) gradually diverges from the online LLM due to distributional shift, ultimately leading to undesirable training outcomes. To address this, we propose RIVAL, an adversarial training framework that formulates the process as a min-max game between the RM and the LLM. RIVAL iteratively updates the both models, with the RM trained to distinguish strong from weak translations (qualitative preference reward), and the LLM trained to enhance its translation for closing this gap. To stabilize training and improve generalizability, we also incorporate quantitative preference reward (e.g., BLEU) into the RM, enabling reference-free quality modeling aligned with human evaluation. Through extensive experiments, we demonstrate that the proposed adversarial training framework significantly improves upon translation baselines.

[78] ProRefine: Inference-Time Prompt Refinement with Textual Feedback

Deepak Pandita, Tharindu Cyril Weerasooriya, Ankit Parag Shah, Isabelle Diana May-Xin Ng, Christopher M. Homan, Wei Wei

Main category: cs.CL

TL;DR: ProRefine is an inference-time prompt optimization method using LLMs to refine prompts dynamically, improving performance in multi-agent workflows without training or labels.

DetailsMotivation: Agentic workflows rely heavily on prompts, and sub-optimal prompts can degrade performance. ProRefine addresses this by optimizing prompts dynamically.

Method: ProRefine uses an agentic loop of LLMs to generate and apply textual feedback, refining prompts for multi-step reasoning tasks.

Result: ProRefine outperforms zero-shot Chain-of-Thought baselines by 3-37 percentage points on mathematical reasoning datasets.

Conclusion: ProRefine enhances accuracy and enables smaller models to match larger ones, democratizing access to high-performing AI.

Abstract: Agentic workflows, where multiple AI agents collaborate to accomplish complex tasks like reasoning or planning, play a substantial role in many cutting-edge commercial applications, and continue to fascinate researchers across nearly all fields for their potential to accomplish expensive, complex tasks that, until recently, only humans have been trusted to do. These workflows depend critically on the prompts used to provide the roles models play in such workflows. Poorly designed prompts that fail even slightly to guide individual agents can lead to sub-optimal performance that may snowball within a system of agents, limiting their reliability and scalability. To address this important problem of inference-time prompt optimization, we introduce ProRefine, an innovative inference-time optimization method that uses an agentic loop of LLMs to generate and apply textual feedback. ProRefine dynamically refines prompts for multi-step reasoning tasks without additional training or ground truth labels. Evaluated on five benchmark mathematical reasoning datasets, ProRefine significantly surpasses zero-shot Chain-of-Thought baselines by 3 to 37 percentage points. This approach not only boosts accuracy but also allows smaller models to approach the performance of their larger counterparts. This highlights its potential for building more cost-effective and powerful hybrid AI systems, thereby democratizing access to high-performing AI.

[79] ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark

Kangwei Liu, Siyuan Cheng, Bozhong Tian, Xiaozhuan Liang, Yuyang Yin, Meng Han, Ningyu Zhang, Bryan Hooi, Xi Chen, Shumin Deng

Main category: cs.CL

TL;DR: A new benchmark for Chinese harmful content detection is introduced, featuring real-world data and expert annotations, alongside a knowledge-augmented baseline method to enhance smaller models’ performance.

DetailsMotivation: Address the scarcity of Chinese datasets for harmful content detection and improve detection accuracy by leveraging expert knowledge and LLMs.

Method: Develop a professionally annotated Chinese benchmark with six harm categories, create a knowledge rule base, and propose a knowledge-augmented baseline integrating expert rules and LLM knowledge.

Result: The benchmark and baseline method enable smaller models to perform comparably to state-of-the-art LLMs in Chinese harmful content detection.

Conclusion: The work provides a valuable resource and method for improving Chinese harmful content detection, bridging gaps in non-English datasets and leveraging expert knowledge effectively.

Abstract: Large language models (LLMs) have been increasingly applied to automated harmful content detection tasks, assisting moderators in identifying policy violations and improving the overall efficiency and accuracy of content review. However, existing resources for harmful content detection are predominantly focused on English, with Chinese datasets remaining scarce and often limited in scope. We present a comprehensive, professionally annotated benchmark for Chinese content harm detection, which covers six representative categories and is constructed entirely from real-world data. Our annotation process further yields a knowledge rule base that provides explicit expert knowledge to assist LLMs in Chinese harmful content detection. In addition, we propose a knowledge-augmented baseline that integrates both human-annotated knowledge rules and implicit knowledge from large language models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs. Code and data are available at https://github.com/zjunlp/ChineseHarm-bench.

[80] AI4Research: A Survey of Artificial Intelligence for Scientific Research

Qiguang Chen, Mingda Yang, Libo Qin, Jinhao Liu, Zheng Yan, Jiannan Guan, Dengyun Peng, Yiyan Ji, Hanjing Li, Mengkang Hu, Yimeng Zhang, Yihao Liang, Yuhang Zhou, Jiaqi Wang, Zhi Chen, Wanxiang Che

Main category: cs.CL

TL;DR: A survey on AI for Research (AI4Research) addressing gaps in understanding and development, introducing a taxonomy, identifying research frontiers, and compiling resources.

DetailsMotivation: The rapid advancements in AI, especially LLMs, have shown potential in scientific research, but a lack of comprehensive surveys hinders progress in AI4Research.

Method: The paper presents a systematic taxonomy for AI4Research tasks, identifies research gaps, and compiles multidisciplinary resources.

Result: A unified perspective on AI4Research with a taxonomy, highlighted future directions, and a compilation of applications and tools.

Conclusion: The survey aims to provide quick access to resources and stimulate innovation in AI4Research.

Abstract: Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs) such as OpenAI-o1 and DeepSeek-R1, have demonstrated remarkable capabilities in complex domains such as logical reasoning and experimental coding. Motivated by these advancements, numerous studies have explored the application of AI in the innovation process, particularly in the context of scientific research. These AI technologies primarily aim to develop systems that can autonomously conduct research processes across a wide range of scientific disciplines. Despite these significant strides, a comprehensive survey on AI for Research (AI4Research) remains absent, which hampers our understanding and impedes further development in this field. To address this gap, we present a comprehensive survey and offer a unified perspective on AI4Research. Specifically, the main contributions of our work are as follows: (1) Systematic taxonomy: We first introduce a systematic taxonomy to classify five mainstream tasks in AI4Research. (2) New frontiers: Then, we identify key research gaps and highlight promising future directions, focusing on the rigor and scalability of automated experiments, as well as the societal impact. (3) Abundant applications and resources: Finally, we compile a wealth of resources, including relevant multidisciplinary applications, data corpora, and tools. We hope our work will provide the research community with quick access to these resources and stimulate innovative breakthroughs in AI4Research.

[81] STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking

Tek Raj Chhetri, Yibei Chen, Puja Trivedi, Dorota Jarecka, Saif Haobsh, Patrick Ray, Lydia Ng, Satrajit S. Ghosh

Main category: cs.CL

TL;DR: StructSense is a modular, task-agnostic framework for structured information extraction using LLMs, enhanced by domain-specific ontologies and iterative refinement via self-evaluative judges and human-in-the-loop mechanisms.

DetailsMotivation: The need to improve LLM-based structured information extraction in specialized domains, addressing domain sensitivity and cross-task generalizability limitations.

Method: StructSense integrates domain-specific symbolic knowledge (ontologies), self-evaluative judges for feedback loops, and human-in-the-loop validation.

Result: StructSense effectively overcomes domain sensitivity and cross-task generalizability issues, demonstrated in neuroscience tasks.

Conclusion: StructSense offers a scalable, adaptable solution for structured information extraction in specialized domains, leveraging LLMs with domain knowledge and iterative refinement.

Abstract: The ability to extract structured information from unstructured sources-such as free-text documents and scientific literature-is critical for accelerating scientific discovery and knowledge synthesis. Large Language Models (LLMs) have demonstrated remarkable capabilities in various natural language processing tasks, including structured information extraction. However, their effectiveness often diminishes in specialized, domain-specific contexts that require nuanced understanding and expert-level domain knowledge. In addition, existing LLM-based approaches frequently exhibit poor transferability across tasks and domains, limiting their scalability and adaptability. To address these challenges, we introduce StructSense, a modular, task-agnostic, open-source framework for structured information extraction built on LLMs. StructSense is guided by domain-specific symbolic knowledge encoded in ontologies, enabling it to navigate complex domain content more effectively. It further incorporates agentic capabilities through self-evaluative judges that form a feedback loop for iterative refinement, and includes human-in-the-loop mechanisms to ensure quality and validation. We demonstrate that StructSense can overcome both the limitations of domain sensitivity and the lack of cross-task generalizability, as shown through its application to diverse neuroscience information extraction tasks.

[82] MemOS: A Memory OS for AI System

Zhiyu Li, Shichao Song, Chenyang Xi, Hanyu Wang, Chen Tang, Simin Niu, Ding Chen, Jiawei Yang, Chunyu Li, Qingchen Yu, Jihao Zhao, Yezhaohui Wang, Peng Liu, Zehao Lin, Pengyuan Wang, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhen Tao, Huayi Lai, Hao Wu, Bo Tang, Zhenren Wang, Zhaoxin Fan, Ningyu Zhang, Linfeng Zhang, Junchi Yan, Mingchuan Yang, Tong Xu, Wei Xu, Huajun Chen, Haofen Wang, Hongkang Yang, Wentao Zhang, Zhi-Qin John Xu, Siheng Chen, Feiyu Xiong

Main category: cs.CL

TL;DR: The paper proposes MemOS, a memory operating system for LLMs to address challenges in memory management, enabling efficient storage, retrieval, and evolution of knowledge.

DetailsMotivation: LLMs lack well-defined memory systems, hindering long-context reasoning, personalization, and knowledge consistency. Existing solutions like RAG are stateless and inefficient.

Method: Introduces MemOS, a system treating memory as a manageable resource, using MemCubes to unify representation, scheduling, and evolution of memory types.

Result: MemOS enables cost-efficient storage, retrieval, and flexible transitions between memory types, enhancing LLM capabilities.

Conclusion: MemOS provides a memory-centric framework for LLMs, supporting continual learning and personalized modeling with improved controllability and evolvability.

Abstract: Large Language Models (LLMs) have become an essential infrastructure for Artificial General Intelligence (AGI), yet their lack of well-defined memory management systems hinders the development of long-context reasoning, continual personalization, and knowledge consistency.Existing models mainly rely on static parameters and short-lived contextual states, limiting their ability to track user preferences or update knowledge over extended periods.While Retrieval-Augmented Generation (RAG) introduces external knowledge in plain text, it remains a stateless workaround without lifecycle control or integration with persistent representations.Recent work has modeled the training and inference cost of LLMs from a memory hierarchy perspective, showing that introducing an explicit memory layer between parameter memory and external retrieval can substantially reduce these costs by externalizing specific knowledge. Beyond computational efficiency, LLMs face broader challenges arising from how information is distributed over time and context, requiring systems capable of managing heterogeneous knowledge spanning different temporal scales and sources. To address this challenge, we propose MemOS, a memory operating system that treats memory as a manageable system resource. It unifies the representation, scheduling, and evolution of plaintext, activation-based, and parameter-level memories, enabling cost-efficient storage and retrieval. As the basic unit, a MemCube encapsulates both memory content and metadata such as provenance and versioning. MemCubes can be composed, migrated, and fused over time, enabling flexible transitions between memory types and bridging retrieval with parameter-based learning. MemOS establishes a memory-centric system framework that brings controllability, plasticity, and evolvability to LLMs, laying the foundation for continual learning and personalized modeling.

[83] CLARIFID: Improving Radiology Report Generation by Reinforcing Clinically Accurate Impressions and Enforcing Detailed Findings

Kyeongkyu Lee, Seonghwan Yoon, Hongki Lim

Main category: cs.CL

TL;DR: CLARIFID is a framework for generating radiology reports that ensures diagnostic correctness by mimicking expert workflows, using multi-view images and reasoning-aware decoding.

DetailsMotivation: Current methods for radiology report generation lack clinical reliability and diagnostic comprehensiveness, often focusing on fluency over factual correctness.

Method: CLARIFID uses section-aware pretraining, Proximal Policy Optimization, reasoning-aware decoding, and a multi-view encoder. It enforces a two-step workflow (Findings to Impression) and uses re-ranking for coherence.

Result: CLARIFID outperforms baselines on the MIMIC-CXR dataset in both NLG metrics and clinically aware scores.

Conclusion: The framework enhances clinical efficacy by ensuring diagnostic correctness and coherent reasoning, addressing limitations of prior approaches.

Abstract: Automatic generation of radiology reports has the potential to alleviate radiologists’ significant workload, yet current methods struggle to deliver clinically reliable conclusions. In particular, most prior approaches focus on producing fluent text without effectively ensuring the factual correctness of the reports and often rely on single-view images, limiting diagnostic comprehensiveness. We propose CLARIFID, a novel framework that directly optimizes diagnostic correctness by mirroring the two-step workflow of experts. Specifically, CLARIFID (1) learns the logical flow from Findings to Impression through section-aware pretraining, (2) is fine-tuned with Proximal Policy Optimization in which the CheXbert F1 score of the Impression section serves as the reward, (3) enforces reasoning-aware decoding that completes “Findings” before synthesizing the “Impression”, and (4) fuses multiple chest X-ray views via a vision-transformer-based multi-view encoder. During inference, we apply a reasoning-aware next-token forcing strategy followed by report-level re-ranking, ensuring that the model first produces a comprehensive Findings section before synthesizing the Impression and thereby preserving coherent clinical reasoning. Experimental results on the MIMIC-CXR dataset demonstrate that our method achieves superior clinical efficacy and outperforms existing baselines on both standard NLG metrics and clinically aware scores.

[84] Large language models provide unsafe answers to patient-posed medical questions

Rachel L. Draelos, Samina Afreen, Barbara Blasko, Tiffany L. Brazile, Natasha Chase, Dimple Patel Desai, Jessica Evert, Heather L. Gardner, Lauren Herrmann, Aswathy Vaikom House, Stephanie Kass, Marianne Kavan, Kirshma Khemani, Amanda Koire, Lauren M. McDonald, Zahraa Rabeeah, Amy Shah

Main category: cs.CL

TL;DR: A study evaluates the safety of four LLM chatbots (Claude, Gemini, GPT-4o, Llama3-70B) for medical advice, finding significant differences in problematic and unsafe responses.

DetailsMotivation: Concerns about patient safety due to widespread use of LLM chatbots for medical advice.

Method: Physician-led red-teaming study using the HealthAdvice dataset (888 responses to 222 medical questions).

Result: Problematic responses ranged from 21.6% (Claude) to 43.2% (Llama); unsafe responses from 5% (Claude) to 13% (GPT-4o, Llama).

Conclusion: Millions may receive unsafe advice; improvements are needed for clinical safety of chatbots.

Abstract: Millions of patients are already using large language model (LLM) chatbots for medical advice on a regular basis, raising patient safety concerns. This physician-led red-teaming study compares the safety of four publicly available chatbots–Claude by Anthropic, Gemini by Google, GPT-4o by OpenAI, and Llama3-70B by Meta–on a new dataset, HealthAdvice, using an evaluation framework that enables quantitative and qualitative analysis. In total, 888 chatbot responses are evaluated for 222 patient-posed advice-seeking medical questions on primary care topics spanning internal medicine, women’s health, and pediatrics. We find statistically significant differences between chatbots. The rate of problematic responses varies from 21.6 percent (Claude) to 43.2 percent (Llama), with unsafe responses varying from 5 percent (Claude) to 13 percent (GPT-4o, Llama). Qualitative results reveal chatbot responses with the potential to lead to serious patient harm. This study suggests that millions of patients could be receiving unsafe medical advice from publicly available chatbots, and further work is needed to improve the clinical safety of these powerful tools.

[85] Post-Completion Learning for Language Models

Xiang Fei, Siqi Wang, Shu Wei, Yuxiang Nie, Wei Shi, Hao Feng, Chao Feng, Can Huang

Main category: cs.CL

TL;DR: Post-Completion Learning (PCL) extends training beyond the token, enhancing reasoning and self-evaluation by leveraging post-output space.

DetailsMotivation: Traditional training stops at , missing learning opportunities in post-completion space. PCL aims to improve reasoning and self-assessment.

Method: Uses white-box reinforcement learning for self-evaluation and reward alignment, combining dual-track SFT and RL for hybrid optimization.

Result: Consistent improvements over SFT and RL methods across datasets and models.

Conclusion: PCL offers a novel training approach, improving output quality without sacrificing deployment efficiency.

Abstract: Current language model training paradigms typically terminate learning upon reaching the end-of-sequence () token, overlooking the potential learning opportunities in the post-completion space. We propose Post-Completion Learning (PCL), a novel training framework that systematically utilizes the sequence space after model output completion, to enhance both the reasoning and self-evaluation abilities. PCL enables models to continue generating self-assessments and reward predictions during training, while maintaining efficient inference by stopping at the completion point. To fully utilize this post-completion space, we design a white-box reinforcement learning method: let the model evaluate the output content according to the reward rules, then calculate and align the score with the reward functions for supervision. We implement dual-track SFT to optimize both reasoning and evaluation capabilities, and mixed it with RL training to achieve multi-objective hybrid optimization. Experimental results on different datasets and models demonstrate consistent improvements over traditional SFT and RL methods. Our method provides a new technical path for language model training that enhances output quality while preserving deployment efficiency.

[86] Memorization in Fine-Tuned Large Language Models

Danil Savine

Main category: cs.CL

TL;DR: The study explores memorization in fine-tuned LLMs, focusing on the medical domain. It uses membership inference and generation tasks to analyze memorization, revealing key factors like weight matrices, perplexity, and LoRA ranks.

DetailsMotivation: To understand how fine-tuning affects memorization in LLMs, especially in privacy-sensitive domains like medicine, using the PHEE dataset.

Method: Employs membership inference attacks and generation tasks to assess memorization, analyzing weight matrices, perplexity, and LoRA ranks.

Result: Value and Output matrices contribute more to memorization; lower perplexity correlates with higher memorization; higher LoRA ranks increase memorization but with diminishing returns.

Conclusion: The findings highlight trade-offs between model performance and privacy, aiding in developing responsible fine-tuning strategies for LLMs.

Abstract: This study investigates the mechanisms and factors influencing memorization in fine-tuned large language models (LLMs), with a focus on the medical domain due to its privacy-sensitive nature. We examine how different aspects of the fine-tuning process affect a model’s propensity to memorize training data, using the PHEE dataset of pharmacovigilance events. Our research employs two main approaches: a membership inference attack to detect memorized data, and a generation task with prompted prefixes to assess verbatim reproduction. We analyze the impact of adapting different weight matrices in the transformer architecture, the relationship between perplexity and memorization, and the effect of increasing the rank in low-rank adaptation (LoRA) fine-tuning. Key findings include: (1) Value and Output matrices contribute more significantly to memorization compared to Query and Key matrices; (2) Lower perplexity in the fine-tuned model correlates with increased memorization; (3) Higher LoRA ranks lead to increased memorization, but with diminishing returns at higher ranks. These results provide insights into the trade-offs between model performance and privacy risks in fine-tuned LLMs. Our findings have implications for developing more effective and responsible strategies for adapting large language models while managing data privacy concerns.

[87] SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity

Ishani Mondal, Meera Bharadwaj, Ayush Roy, Aparna Garimella, Jordan Lee Boyd-Graber

Main category: cs.CL

TL;DR: SMART-Editor is a framework for global coherence in layout and content editing across domains, outperforming baselines with RewardDPO and Reward-Refine strategies.

DetailsMotivation: To address the lack of global coherence in prior models performing local edits.

Method: Uses Reward-Refine (inference-time refinement) and RewardDPO (training-time preference optimization).

Result: Outperforms baselines by up to 15% in structured settings and shows advantages in natural images.

Conclusion: Reward-guided planning ensures semantically consistent and visually aligned edits.

Abstract: We present SMART-Editor, a framework for compositional layout and content editing across structured (posters, websites) and unstructured (natural images) domains. Unlike prior models that perform local edits, SMART-Editor preserves global coherence through two strategies: Reward-Refine, an inference-time rewardguided refinement method, and RewardDPO, a training-time preference optimization approach using reward-aligned layout pairs. To evaluate model performance, we introduce SMARTEdit-Bench, a benchmark covering multi-domain, cascading edit scenarios. SMART-Editor outperforms strong baselines like InstructPix2Pix and HIVE, with RewardDPO achieving up to 15% gains in structured settings and Reward-Refine showing advantages on natural images. Automatic and human evaluations confirm the value of reward-guided planning in producing semantically consistent and visually aligned edits.

[88] Collaborative Chain-of-Agents for Parametric-Retrieved Knowledge Synergy

Yi Jiang, Sendong Zhao, Jianbo Li, Haochun Wang, Lizhe Zhang, Yan Liu, Bing Qin

Main category: cs.CL

TL;DR: The paper introduces Collaborative Chain-of-Agents (CoCoA), a framework to improve synergy between parametric and retrieved knowledge in RAG, enhancing LLM performance in knowledge-intensive tasks.

DetailsMotivation: Current RAG methods struggle to fully exploit knowledge during generation, limiting synergy between internal and external knowledge.

Method: Proposes CoCoA-zero (multi-agent RAG for conditional knowledge induction and reasoning) and CoCoA (long-chain training to fine-tune LLM for better knowledge integration).

Result: CoCoA-zero and CoCoA achieve superior performance on open-domain and multi-hop QA tasks.

Conclusion: The framework effectively enhances knowledge integration and reasoning in LLMs.

Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising framework for enhancing the capabilities of Large Language Models (LLMs), especially in knowledge-intensive tasks. Despite its advantages, current RAG methods often struggle to fully exploit knowledge during generation. In particular, the synergy between the model’s internal parametric knowledge and external retrieved knowledge remains limited. Retrieved contents may sometimes mislead generation, while certain generated content can guide the model toward more accurate outputs. In this work, we propose Collaborative Chain-of-Agents, a framework designed to enhance explicitly synergy over both parametric and retrieved knowledge. Specifically, we first introduce CoCoA-zero, a multi-agent RAG framework that first performs conditional knowledge induction and then reasons answers. Building on this, we develop CoCoA, a long-chain training strategy that synthesizes extended multi-agent reasoning trajectories from CoCoA-zero to fine-tune the LLM. This strategy enhances the model’s capability to explicitly integrate and jointly leverage parametric and retrieved knowledge. Experiments results show that CoCoA-zero and CoCoA achieve superior performance on open-domain and multi-hop QA tasks.

[89] SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents

Changhao Jiang, Jiajun Sun, Yifei Cao, Jiabao Zhuang, Hui Li, Xiaoran Fan, Ming Zhang, Junjie Ye, Shihan Dou, Zhiheng Xi, Jingqi Tong, Yilong Wu, Baoyu Fan, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang

Main category: cs.CL

TL;DR: The paper introduces SpeechRole-Data and SpeechRole-Eval to address the lack of systematic evaluation for Speech Role-Playing Agents (SRPAs), highlighting vocal characteristics and role-playing fidelity.

DetailsMotivation: Existing research on role-playing agents focuses on text, ignoring speech in interactive scenarios, creating a gap in SRPA evaluation.

Method: Constructed SpeechRole-Data (98 roles, 112k conversations) and proposed SpeechRole-Eval for multidimensional SRPA assessment.

Result: Experiments show challenges in vocal style consistency and role coherence for cascaded and end-to-end SRPAs.

Conclusion: The released data, code, and models aim to advance speech-driven multimodal role-playing research.

Abstract: Recently, role-playing agents have emerged as a promising paradigm for achieving personalized interaction and emotional resonance. Existing research primarily focuses on the textual modality, neglecting the critical dimension of speech in realistic interactive scenarios. In particular, there is a lack of systematic evaluation for Speech Role-Playing Agents (SRPAs). To address this gap, we construct SpeechRole-Data, a large-scale, high-quality dataset that comprises 98 diverse roles and 112k speech-based single-turn and multi-turn conversations. Each role demonstrates distinct vocal characteristics, including timbre and prosody, thereby enabling more sophisticated speech role-playing. Furthermore, we propose SpeechRole-Eval, a multidimensional evaluation benchmark that systematically assesses SRPAs performance in key aspects such as fundamental interaction ability, speech expressiveness, and role-playing fidelity. Experimental results reveal the advantages and challenges of both cascaded and end-to-end speech role-playing agents in maintaining vocal style consistency and role coherence. We release all data, code, and baseline models to provide a solid foundation for speech-driven multimodal role-playing research and to foster further developments in this field.

[90] When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models

Keyu Wang, Jin Li, Shu Yang, Zhuoran Zhang, Di Wang

Main category: cs.CL

TL;DR: The paper investigates the internal mechanisms behind sycophantic behavior in LLMs, identifying a two-stage process and highlighting the role of grammatical perspective in influencing this behavior.

DetailsMotivation: To understand why LLMs exhibit sycophantic behavior, agreeing with user opinions even when they contradict facts, and to explore the internal mechanisms driving this tendency.

Method: Systematic study of user opinions’ impact, logit-lens analysis, causal activation patching, and examination of grammatical perspective effects.

Result: Sycophancy arises from a late-layer output preference shift and deeper representational divergence, with first-person prompts inducing higher sycophancy rates than third-person ones.

Conclusion: Sycophancy in LLMs is a structural override of learned knowledge in deeper layers, with implications for AI alignment and truthful systems.

Abstract: Large Language Models (LLMs) often exhibit sycophantic behavior, agreeing with user-stated opinions even when those contradict factual knowledge. While prior work has documented this tendency, the internal mechanisms that enable such behavior remain poorly understood. In this paper, we provide a mechanistic account of how sycophancy arises within LLMs. We first systematically study how user opinions induce sycophancy across different model families. We find that simple opinion statements reliably induce sycophancy, whereas user expertise framing has a negligible impact. Through logit-lens analysis and causal activation patching, we identify a two-stage emergence of sycophancy: (1) a late-layer output preference shift and (2) deeper representational divergence. We also verify that user authority fails to influence behavior because models do not encode it internally. In addition, we examine how grammatical perspective affects sycophantic behavior, finding that first-person prompts (I believe...'') consistently induce higher sycophancy rates than third-person framings (They believe…’’) by creating stronger representational perturbations in deeper layers. These findings highlight that sycophancy is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers, with implications for alignment and truthful AI systems.

[91] Proof2Hybrid: Automatic Mathematical Benchmark Synthesis for Proof-Centric Problems

Yebo Peng, Zixiang Liu, Yaoming Li, Zhizhuo Yang, Xinye Xu, Bowen Ye, Weijun Yuan, Zihan Wang, Tong Yang

Main category: cs.CL

TL;DR: Proof2Hybrid is an automated framework for creating proof-centric benchmarks to evaluate LLMs’ mathematical abilities, demonstrated with AlgGeoTest, revealing significant gaps in LLMs’ comprehension.

DetailsMotivation: Existing benchmarks for evaluating LLMs' mathematical capabilities are limited, especially for proof-centric problems, due to scalability and cost issues.

Method: Proposes Proof2Hybrid, an automated framework using Proof2X to convert proofs into verifiable questions, including hybrid-formatted “$m$-out-of-$n$ multiple judge questions” for robust evaluation.

Result: AlgGeoTest, a 456-item benchmark for algebraic geometry, exposed major deficits in LLMs’ understanding, offering a precise assessment of their mathematical skills.

Conclusion: Proof2Hybrid and AlgGeoTest enable deeper research into AI’s mathematical intelligence, addressing current evaluation limitations.

Abstract: Evaluating the mathematical capability of Large Language Models (LLMs) is a critical yet challenging frontier. Existing benchmarks fall short, particularly for proof-centric problems, as manual creation is unscalable and costly, leaving the true mathematical abilities of LLMs largely unassessed. To overcome these barriers, we propose Proof2Hybrid, the first fully automated framework that synthesizes high-quality, proof-centric benchmarks from natural language mathematical corpora. The key novelty of our solution is Proof2X, a roadmap of converting mathematical proofs into various kinds of questions that are easy to verify. Instructed by this roadmap, we propose a new type of hybrid-formatted questions, named ``$m$-out-of-$n$ multiple judge questions’’, specifically designed to enable robust, automatic evaluation while being resilient to guessing and superficial pattern matching inherent in traditional formats. As a demonstration of our framework, we introduce AlgGeoTest, a benchmark for algebraic geometry–a frontier domain of modern mathematics–comprising 456 challenging items. Our extensive evaluations on state-of-the-art LLMs using AlgGeoTest reveal profound deficits in their comprehension of algebraic geometry, providing a more precise measure of their true mathematical capabilities. Our framework and benchmark pave the way for a new wave of in-depth research into the mathematical intelligence of AI systems.

[92] Dynaword: From One-shot to Continuously Developed Datasets

Kenneth Enevoldsen, Kristian Nørgaard Jensen, Jan Kostkan, Balázs Szabó, Márton Kardos, Kirten Vad, Johan Heinsen, Andrea Blasi Núñez, Gianluca Barmina, Jacob Nielsen, Rasmus Larsen, Peter Vahlstrup, Per Møldrup Dalum, Desmond Elliott, Lukas Galke, Peter Schneider-Kamp, Kristoffer Nielbo

Main category: cs.CL

TL;DR: The paper introduces Dynaword, a framework for open, community-updatable NLP datasets, and validates it with Danish Dynaword, showcasing scalability and open licensing.

DetailsMotivation: Address challenges in NLP datasets: restrictive licensing, static releases, and limited quality assurance.

Method: Propose Dynaword framework for community-driven dataset creation and updates, implemented in Danish Dynaword.

Result: Danish Dynaword has 4x more tokens than peers, open licensing, and community contributions, with quality tests.

Conclusion: Dynaword enables sustainable, scalable, and open NLP datasets through community collaboration.

Abstract: Large-scale datasets are foundational for research and development in natural language processing. However, current approaches face three key challenges: (1) reliance on ambiguously licensed sources restricting use, sharing, and derivative works; (2) static dataset releases that prevent community contributions and diminish longevity; and (3) quality assurance processes restricted to publishing teams rather than leveraging community expertise. To address these limitations, we introduce two contributions: the Dynaword approach and Danish Dynaword. The Dynaword approach is a framework for creating large-scale, open datasets that can be continuously updated through community collaboration. Danish Dynaword is a concrete implementation that validates this approach and demonstrates its potential. Danish Dynaword contains over four times as many tokens as comparable releases, is exclusively openly licensed, and has received multiple contributions across industry and research. The repository includes light-weight tests to ensure data formatting, quality, and documentation, establishing a sustainable framework for ongoing community contributions and dataset evolution.

[93] LaMPE: Length-aware Multi-grained Positional Encoding for Adaptive Long-context Scaling Without Training

Sikui Zhang, Guangze Gao, Ziyun Gan, Chunfeng Yuan, Zefeng Lin, Houwen Peng, Bing Li, Weiming Hu

Main category: cs.CL

TL;DR: LaMPE introduces a training-free method for adaptive long-context scaling in LLMs by dynamically mapping input lengths and using multi-grained attention.

DetailsMotivation: Address performance degradation in LLMs when input exceeds the pretraining context window due to OOD behavior of RoPE.

Method: Proposes LaMPE, which uses a parametric scaled sigmoid function for dynamic length mapping and a multi-grained attention mechanism.

Result: Achieves significant performance improvements on long-context benchmarks compared to existing methods.

Conclusion: LaMPE is effective for adaptive long-context scaling in RoPE-based LLMs without requiring training.

Abstract: Large language models (LLMs) experience significant performance degradation when the input exceeds the pretraining context window, primarily due to the out-of-distribution (OOD) behavior of Rotary Position Embedding (RoPE). Recent studies mitigate this problem by remapping OOD positions into the in-distribution range with fixed mapping strategies, ignoring the dynamic relationship between input length and the model’s effective context window. To this end, we propose Length-aware Multi-grained Positional Encoding (LaMPE), a training-free method that fully utilizes the model’s effective context window for adaptive long-context scaling in LLMs. Motivated by the left-skewed frequency distribution of relative positions, LaMPE establishes a dynamic relationship between mapping length and input length through a parametric scaled sigmoid function to adaptively allocate positional capacity across varying input lengths. Meanwhile, LaMPE devises a novel multi-grained attention mechanism that strategically allocates positional resolution across different sequence regions to capture both fine-grained locality and long-range dependencies. Our method can be seamlessly applied to a wide range of RoPE-based LLMs without training. Extensive experiments on three representative LLMs across five mainstream long-context benchmarks demonstrate that LaMPE achieves significant performance improvements compared to existing length extrapolation methods. The code will be released at https://github.com/scar-on/LaMPE.

[94] VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo

Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng, Zhi Zhang, Xin Liu

Main category: cs.CL

TL;DR: VeOmni is a modular framework for efficient training of omni-modal LLMs, decoupling computation and communication for scalability.

DetailsMotivation: Existing frameworks for omni-modal LLMs are limited by entangled model definitions and parallel logic, hindering scalability and efficiency.

Method: VeOmni introduces model-centric distributed recipes and a flexible configuration interface to decouple communication from computation, enabling 3D parallelism.

Result: VeOmni achieves 2,800 tokens/sec/GPU throughput and scales to 160K context lengths on 128 GPUs for a 30B parameter MoE model.

Conclusion: VeOmni demonstrates superior efficiency and scalability for training large omni-modal LLMs.

Abstract: Recent advances in large language models (LLMs) have driven impressive progress in omni-modal understanding and generation. However, training omni-modal LLMs remains a significant challenge due to the heterogeneous model architectures required to process diverse modalities, necessitating sophisticated system design for efficient large-scale training. Existing frameworks typically entangle model definition with parallel logic, incurring limited scalability and substantial engineering overhead for end-to-end omni-modal training. We present VeOmni, a modular and efficient training framework to accelerate the development of omni-modal LLMs. VeOmni introduces model-centric distributed recipes that decouples communication from computation, enabling efficient 3D parallelism on omni-modal LLMs. VeOmni also features a flexible configuration interface supporting seamless integration of new modalities with minimal code change. Using VeOmni, a omni-modal mixture-of-experts (MoE) model with 30B parameters can be trained with over 2,800 tokens/sec/GPU throughput and scale to 160K context lengths via 3D parallelism on 128 GPUs, showcasing its superior efficiency and scalability for training large omni-modal LLMs.

cs.CV

[95] PyCAT4: A Hierarchical Vision Transformer-based Framework for 3D Human Pose Estimation

Zongyou Yang, Jonathan Loo

Main category: cs.CV

TL;DR: The paper introduces PyCAT4, an improved version of Pymaf, by integrating Transformer-based feature extraction, temporal fusion, and spatial pyramid structures for better 3D human pose estimation.

DetailsMotivation: To enhance the accuracy of 3D human pose estimation by optimizing the Pymaf network with advanced techniques like Transformers and multi-scale feature fusion.

Method: 1. Transformer-based self-attention for low-level feature capture. 2. Feature temporal fusion for video sequences. 3. Spatial pyramid structures for multi-scale feature fusion.

Result: PyCAT4 shows significant improvement in detection capability on COCO and 3DPW datasets.

Conclusion: The proposed enhancements advance human pose estimation technology, demonstrating the effectiveness of the new PyCAT4 model.

Abstract: Recently, a significant improvement in the accuracy of 3D human pose estimation has been achieved by combining convolutional neural networks (CNNs) with pyramid grid alignment feedback loops. Additionally, innovative breakthroughs have been made in the field of computer vision through the adoption of Transformer-based temporal analysis architectures. Given these advancements, this study aims to deeply optimize and improve the existing Pymaf network architecture. The main innovations of this paper include: (1) Introducing a Transformer feature extraction network layer based on self-attention mechanisms to enhance the capture of low-level features; (2) Enhancing the understanding and capture of temporal signals in video sequences through feature temporal fusion techniques; (3) Implementing spatial pyramid structures to achieve multi-scale feature fusion, effectively balancing feature representations differences across different scales. The new PyCAT4 model obtained in this study is validated through experiments on the COCO and 3DPW datasets. The results demonstrate that the proposed improvement strategies significantly enhance the network’s detection capability in human pose estimation, further advancing the development of human pose estimation technology.

[96] DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework

Tongchun Zuo, Zaiyu Huang, Shuliang Ning, Ente Lin, Chao Liang, Zerong Zheng, Jianwen Jiang, Yuan Zhang, Mingyuan Gao, Xin Dong

Main category: cs.CV

TL;DR: DreamVVT is a two-stage framework using Diffusion Transformers (DiTs) to enhance video virtual try-on (VVT) by leveraging unpaired data and pretrained models for better garment detail preservation and temporal consistency.

DetailsMotivation: Existing VVT methods rely on scarce paired datasets and fail to utilize advanced visual priors, leading to poor detail preservation and temporal inconsistency.

Method: DreamVVT uses a two-stage approach: (1) keyframe synthesis via a multi-frame try-on model with a VLM, and (2) video generation using skeleton maps, motion descriptions, and LoRA adapters for temporal coherence.

Result: DreamVVT outperforms existing methods in preserving garment details and maintaining temporal stability in real-world scenarios.

Conclusion: DreamVVT effectively addresses limitations of current VVT methods by leveraging unpaired data and pretrained models, achieving superior performance.

Abstract: Video virtual try-on (VVT) technology has garnered considerable academic interest owing to its promising applications in e-commerce advertising and entertainment. However, most existing end-to-end methods rely heavily on scarce paired garment-centric datasets and fail to effectively leverage priors of advanced visual models and test-time inputs, making it challenging to accurately preserve fine-grained garment details and maintain temporal consistency in unconstrained scenarios. To address these challenges, we propose DreamVVT, a carefully designed two-stage framework built upon Diffusion Transformers (DiTs), which is inherently capable of leveraging diverse unpaired human-centric data to enhance adaptability in real-world scenarios. To further leverage prior knowledge from pretrained models and test-time inputs, in the first stage, we sample representative frames from the input video and utilize a multi-frame try-on model integrated with a vision-language model (VLM), to synthesize high-fidelity and semantically consistent keyframe try-on images. These images serve as complementary appearance guidance for subsequent video generation. \textbf{In the second stage}, skeleton maps together with fine-grained motion and appearance descriptions are extracted from the input content, and these along with the keyframe try-on images are then fed into a pretrained video generation model enhanced with LoRA adapters. This ensures long-term temporal coherence for unseen regions and enables highly plausible dynamic motions. Extensive quantitative and qualitative experiments demonstrate that DreamVVT surpasses existing methods in preserving detailed garment content and temporal stability in real-world scenarios. Our project page https://virtu-lab.github.io/

[97] VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering

Yiran Meng, Junhong Ye, Wei Zhou, Guanghui Yue, Xudong Mao, Ruomei Wang, Baoquan Zhao

Main category: cs.CV

TL;DR: VideoForest introduces a framework for cross-video QA using person-anchored hierarchical reasoning, outperforming existing methods in accuracy.

DetailsMotivation: Addressing challenges in cross-video understanding, such as connecting video streams and managing multi-source retrieval.

Method: Uses person-level features, ReID, tracking, a multi-granularity spanning tree, and multi-agent reasoning for hierarchical organization and query answering.

Result: Achieves 71.93% accuracy in person recognition, 83.75% in behavior analysis, and 51.67% in summarization/reasoning.

Conclusion: VideoForest sets a new paradigm for cross-video understanding by unifying video streams via person-level features, enabling efficient reasoning.

Abstract: Cross-video question answering presents significant challenges beyond traditional single-video understanding, particularly in establishing meaningful connections across video streams and managing the complexity of multi-source information retrieval. We introduce VideoForest, a novel framework that addresses these challenges through person-anchored hierarchical reasoning. Our approach leverages person-level features as natural bridge points between videos, enabling effective cross-video understanding without requiring end-to-end training. VideoForest integrates three key innovations: 1) a human-anchored feature extraction mechanism that employs ReID and tracking algorithms to establish robust spatiotemporal relationships across multiple video sources; 2) a multi-granularity spanning tree structure that hierarchically organizes visual content around person-level trajectories; and 3) a multi-agent reasoning framework that efficiently traverses this hierarchical structure to answer complex cross-video queries. To evaluate our approach, we develop CrossVideoQA, a comprehensive benchmark dataset specifically designed for person-centric cross-video analysis. Experimental results demonstrate VideoForest’s superior performance in cross-video reasoning tasks, achieving 71.93% accuracy in person recognition, 83.75% in behavior analysis, and 51.67% in summarization and reasoning, significantly outperforming existing methods. Our work establishes a new paradigm for cross-video understanding by unifying multiple video streams through person-level features, enabling sophisticated reasoning across distributed visual information while maintaining computational efficiency.

[98] Elucidating the Role of Feature Normalization in IJEPA

Adam Colton

Main category: cs.CV

TL;DR: Replacing layer normalization (LN) with DynTanh in IJEPA preserves token energy hierarchy, improving model performance and fixing artifacts.

DetailsMotivation: LN disrupts the natural energy hierarchy of visual tokens, masking semantically important regions and causing artifacts.

Method: Replace LN with DynTanh activation to preserve token energies and prioritize high-energy tokens.

Result: Improved ImageNet accuracy (38% to 42.7%) and reduced RMSE (by 0.08) on NYU Depth V2.

Conclusion: Preserving token energy hierarchy is key for effective self-supervised visual learning.

Abstract: In the standard image joint embedding predictive architecture (IJEPA), features at the output of the teacher encoder are layer normalized (LN) before serving as a distillation target for the student encoder and predictor. We propose that this feature normalization disrupts the natural energy hierarchy of visual tokens, where high-energy tokens (those with larger L2 norms) encode semantically important image regions. LN forces all features to have identical L2 norms, effectively equalizing their energies and preventing the model from prioritizing semantically rich regions. We find that IJEPA models trained with feature LN exhibit loss maps with significant checkerboard-like artifacts. We propose that feature LN be replaced with a DynTanh activation as the latter better preserves token energies and allows high-energy tokens to greater contribute to the prediction loss. We show that IJEPA trained with feature DynTanh exhibits a longer-tailed loss distribution and fixes the checkerboard artifacts in the loss map. Our empirical results show that our simple modification improves ImageNet linear probe accuracy from 38% to 42.7% for ViT-Small and reduces RMSE by 0.08 on NYU Depth V2 monocular depth estimation. These results suggest that preserving natural token energies is crucial for effective self-supervised visual representation learning.

[99] DepthGait: Multi-Scale Cross-Level Feature Fusion of RGB-Derived Depth and Silhouette Sequences for Robust Gait Recognition

Xinzhu Li, Juepeng Zheng, Yikun Chen, Xudong Mao, Guanghui Yue, Wei Zhou, Chenlei Lv, Ruomei Wang, Fan Zhou, Baoquan Zhao

Main category: cs.CV

TL;DR: DepthGait introduces RGB-derived depth maps and silhouettes for robust gait recognition, outperforming existing methods.

DetailsMotivation: Existing 2D representations like silhouettes and skeletons lack sufficient cues for viewpoint variations and finer gait details.

Method: DepthGait estimates depth maps from RGB sequences and uses them alongside silhouettes, employing a multi-scale cross-level fusion scheme.

Result: Achieves state-of-the-art performance with high rank-1 accuracy on challenging datasets.

Conclusion: DepthGait effectively enhances gait recognition by leveraging depth maps and silhouettes, setting a new benchmark.

Abstract: Robust gait recognition requires highly discriminative representations, which are closely tied to input modalities. While binary silhouettes and skeletons have dominated recent literature, these 2D representations fall short of capturing sufficient cues that can be exploited to handle viewpoint variations, and capture finer and meaningful details of gait. In this paper, we introduce a novel framework, termed DepthGait, that incorporates RGB-derived depth maps and silhouettes for enhanced gait recognition. Specifically, apart from the 2D silhouette representation of the human body, the proposed pipeline explicitly estimates depth maps from a given RGB image sequence and uses them as a new modality to capture discriminative features inherent in human locomotion. In addition, a novel multi-scale and cross-level fusion scheme has also been developed to bridge the modality gap between depth maps and silhouettes. Extensive experiments on standard benchmarks demonstrate that the proposed DepthGait achieves state-of-the-art performance compared to peer methods and attains an impressive mean rank-1 accuracy on the challenging datasets.

[100] GENIE: Gaussian Encoding for Neural Radiance Fields Interactive Editing

Mikołaj Zieliński, Krzysztof Byrski, Tomasz Szczepanik, Przemysław Spurek

Main category: cs.CV

TL;DR: GENIE combines NeRF’s photorealistic rendering with Gaussian Splatting’s editable structure, enabling real-time, interactive scene manipulation.

DetailsMotivation: To bridge the gap between NeRF's high-fidelity rendering and Gaussian Splatting's editable, explicit representation for better scene manipulation and interaction.

Method: Hybrid model using trainable feature embeddings for Gaussians, conditioned on a NeRF network via Ray-Traced Gaussian Proximity Search (RT-GPS) and a multi-resolution hash grid.

Result: Enables real-time, locality-aware editing and dynamic interaction while maintaining photorealistic quality.

Conclusion: GENIE successfully merges implicit and explicit representations, offering intuitive editing, dynamic interaction, and compatibility with physics-based simulation.

Abstract: Neural Radiance Fields (NeRF) and Gaussian Splatting (GS) have recently transformed 3D scene representation and rendering. NeRF achieves high-fidelity novel view synthesis by learning volumetric representations through neural networks, but its implicit encoding makes editing and physical interaction challenging. In contrast, GS represents scenes as explicit collections of Gaussian primitives, enabling real-time rendering, faster training, and more intuitive manipulation. This explicit structure has made GS particularly well-suited for interactive editing and integration with physics-based simulation. In this paper, we introduce GENIE (Gaussian Encoding for Neural Radiance Fields Interactive Editing), a hybrid model that combines the photorealistic rendering quality of NeRF with the editable and structured representation of GS. Instead of using spherical harmonics for appearance modeling, we assign each Gaussian a trainable feature embedding. These embeddings are used to condition a NeRF network based on the k nearest Gaussians to each query point. To make this conditioning efficient, we introduce Ray-Traced Gaussian Proximity Search (RT-GPS), a fast nearest Gaussian search based on a modified ray-tracing pipeline. We also integrate a multi-resolution hash grid to initialize and update Gaussian features. Together, these components enable real-time, locality-aware editing: as Gaussian primitives are repositioned or modified, their interpolated influence is immediately reflected in the rendered output. By combining the strengths of implicit and explicit representations, GENIE supports intuitive scene manipulation, dynamic interaction, and compatibility with physical simulation, bridging the gap between geometry-based editing and neural rendering. The code can be found under (https://github.com/MikolajZielinski/genie)

[101] RefineSeg: Dual Coarse-to-Fine Learning for Medical Image Segmentation

Anghong Du, Nay Aung, Theodoros N. Arvanitis, Stefan K. Piechnik, Joao A C Lima, Steffen E. Petersen, Le Zhang

Main category: cs.CV

TL;DR: A novel coarse-to-fine segmentation framework uses noisy coarse annotations to approximate precise labels, outperforming weakly supervised methods and nearing fully supervised performance.

DetailsMotivation: High-quality pixel-level medical image annotations are costly and require expertise, prompting the need for a method that works with coarse annotations.

Method: The framework models inaccurate regions in coarse annotations using transition matrices, jointly training on multiple sets to refine outputs and infer true segmentation.

Result: Validated on cardiac imaging datasets (ACDC, MSCMRseg, UK Biobank), the method surpasses weakly supervised approaches and approaches fully supervised performance.

Conclusion: The proposed framework effectively leverages coarse annotations to achieve robust segmentation, reducing reliance on costly precise labels.

Abstract: High-quality pixel-level annotations of medical images are essential for supervised segmentation tasks, but obtaining such annotations is costly and requires medical expertise. To address this challenge, we propose a novel coarse-to-fine segmentation framework that relies entirely on coarse-level annotations, encompassing both target and complementary drawings, despite their inherent noise. The framework works by introducing transition matrices in order to model the inaccurate and incomplete regions in the coarse annotations. By jointly training on multiple sets of coarse annotations, it progressively refines the network’s outputs and infers the true segmentation distribution, achieving a robust approximation of precise labels through matrix-based modeling. To validate the flexibility and effectiveness of the proposed method, we demonstrate the results on two public cardiac imaging datasets, ACDC and MSCMRseg, and further evaluate its performance on the UK Biobank dataset. Experimental results indicate that our approach surpasses the state-of-the-art weakly supervised methods and closely matches the fully supervised approach.

[102] How Would It Sound? Material-Controlled Multimodal Acoustic Profile Generation for Indoor Scenes

Mahnoor Fatima Saad, Ziad Al-Halah

Main category: cs.CV

TL;DR: The paper introduces a method for generating acoustic profiles in indoor scenes based on user-defined material configurations, using an encoder-decoder model to predict Room Impulse Responses (RIRs).

DetailsMotivation: To enable dynamic control over acoustic profiles in indoor scenes by allowing users to specify material configurations (e.g., carpeted floors, acoustic tiles) for tailored sound environments.

Method: A novel encoder-decoder approach encodes audio-visual scene properties and generates RIRs conditioned on user-provided material specifications.

Result: The model effectively encodes material information and generates high-fidelity RIRs, outperforming baselines and state-of-the-art methods.

Conclusion: The proposed method successfully addresses material-controlled acoustic profile generation, validated by a new benchmark dataset (Acoustic Wonderland Dataset).

Abstract: How would the sound in a studio change with a carpeted floor and acoustic tiles on the walls? We introduce the task of material-controlled acoustic profile generation, where, given an indoor scene with specific audio-visual characteristics, the goal is to generate a target acoustic profile based on a user-defined material configuration at inference time. We address this task with a novel encoder-decoder approach that encodes the scene’s key properties from an audio-visual observation and generates the target Room Impulse Response (RIR) conditioned on the material specifications provided by the user. Our model enables the generation of diverse RIRs based on various material configurations defined dynamically at inference time. To support this task, we create a new benchmark, the Acoustic Wonderland Dataset, designed for developing and evaluating material-aware RIR prediction methods under diverse and challenging settings. Our results demonstrate that the proposed model effectively encodes material information and generates high-fidelity RIRs, outperforming several baselines and state-of-the-art methods.

[103] MIDAR: Mimicking LiDAR Detection for Traffic Applications with a Lightweight Plug-and-Play Model

Tianheng Zhu, Yiheng Feng

Main category: cs.CV

TL;DR: MIDAR is a LiDAR detection mimicking model for traffic simulators, bridging the gap between high-fidelity but unscalable simulators and scalable but perception-limited ones.

DetailsMotivation: The need for realistic LiDAR detection modeling in scalable traffic simulators for cooperative perception applications.

Method: MIDAR uses vehicle-level features and a GRU-enhanced APPNP architecture to predict LiDAR detections, validated on the nuScenes dataset.

Result: Achieves an AUC of 0.909 in approximating CenterPoint’s detections, with practical validation in CP-based traffic applications.

Conclusion: MIDAR successfully integrates realistic detection into scalable simulators, enhancing applications requiring precise vehicle observations.

Abstract: As autonomous driving (AD) technology advances, increasing research has focused on leveraging cooperative perception (CP) data collected from multiple AVs to enhance traffic applications. Due to the impracticality of large-scale real-world AV deployments, simulation has become the primary approach in most studies. While game-engine-based simulators like CARLA generate high-fidelity raw sensor data (e.g., LiDAR point clouds) which can be used to produce realistic detection outputs, they face scalability challenges in multi-AV scenarios. In contrast, microscopic traffic simulators such as SUMO scale efficiently but lack perception modeling capabilities. To bridge this gap, we propose MIDAR, a LiDAR detection mimicking model that approximates realistic LiDAR detections using vehicle-level features readily available from microscopic traffic simulators. Specifically, MIDAR predicts true positives (TPs) and false negatives (FNs) from ideal LiDAR detection results based on the spatial layouts and dimensions of surrounding vehicles. A Refined Multi-hop Line-of-Sight (RM-LoS) graph is constructed to encode the occlusion relationships among vehicles, upon which MIDAR employs a GRU-enhanced APPNP architecture to propagate features from the ego AV and occluding vehicles to the prediction target. MIDAR achieves an AUC of 0.909 in approximating the detection results generated by CenterPoint, a mainstream 3D LiDAR detection model, on the nuScenes AD dataset. Two CP-based traffic applications further validate the necessity of such realistic detection modeling, particularly for tasks requiring accurate individual vehicle observations (e.g., position, speed, lane index). As demonstrated in the applications, MIDAR can be seamlessly integrated into traffic simulators and trajectory datasets and will be open-sourced upon publication.

[104] Evaluation and Analysis of Deep Neural Transformers and Convolutional Neural Networks on Modern Remote Sensing Datasets

J. Alex Hurt, Trevor M. Bajkowski, Grant J. Scott, Curt H. Davis

Main category: cs.CV

TL;DR: The paper compares transformer-based and convolutional neural networks for object detection in satellite imagery, showing transformers achieve state-of-the-art performance.

DetailsMotivation: To understand how transformer-based networks perform on remote sensing data compared to traditional CNNs, given their success in other CV tasks.

Method: Evaluates 11 detection algorithms (5 transformer-based, 6 CNNs) on 3 high-resolution remote sensing datasets, training and testing 33 models.

Result: Transformer-based architectures demonstrate superior performance on satellite imagery benchmarks.

Conclusion: Transformers are a promising alternative to CNNs for remote sensing object detection, achieving state-of-the-art results.

Abstract: In 2012, AlexNet established deep convolutional neural networks (DCNNs) as the state-of-the-art in CV, as these networks soon led in visual tasks for many domains, including remote sensing. With the publication of Visual Transformers, we are witnessing the second modern leap in computational vision, and as such, it is imperative to understand how various transformer-based neural networks perform on satellite imagery. While transformers have shown high levels of performance in natural language processing and CV applications, they have yet to be compared on a large scale to modern remote sensing data. In this paper, we explore the use of transformer-based neural networks for object detection in high-resolution electro-optical satellite imagery, demonstrating state-of-the-art performance on a variety of publicly available benchmark data sets. We compare eleven distinct bounding-box detection and localization algorithms in this study, of which seven were published since 2020, and all eleven since 2015. The performance of five transformer-based architectures is compared with six convolutional networks on three state-of-the-art opensource high-resolution remote sensing imagery datasets ranging in size and complexity. Following the training and evaluation of thirty-three deep neural models, we then discuss and analyze model performance across various feature extraction methodologies and detection algorithms.

[105] T-GVC: Trajectory-Guided Generative Video Coding at Ultra-Low Bitrates

Zhitao Wang, Hengyu Man, Wenrui Li, Xingtao Wang, Xiaopeng Fan, Debin Zhao

Main category: cs.CV

TL;DR: T-GVC is a novel video coding framework combining low-level motion tracking and high-level semantics for ultra-low bitrate scenarios, outperforming traditional and neural codecs.

DetailsMotivation: Existing methods are limited by domain specificity or reliance on text guidance, failing to capture fine-grained motion details, leading to unrealistic reconstructions.

Method: T-GVC uses semantic-aware sparse motion sampling and integrates trajectory-aligned loss constraints into diffusion processes for training-free guidance.

Result: T-GVC outperforms traditional and neural video codecs in ultra-low bitrate conditions and achieves more precise motion control than text-guided methods.

Conclusion: T-GVC introduces a new direction for generative video coding by leveraging geometric motion modeling, ensuring realistic and coherent reconstructions.

Abstract: Recent advances in video generation techniques have given rise to an emerging paradigm of generative video coding for Ultra-Low Bitrate (ULB) scenarios by leveraging powerful generative priors. However, most existing methods are limited by domain specificity (e.g., facial or human videos) or excessive dependence on high-level text guidance, which tend to inadequately capture fine-grained motion details, leading to unrealistic or incoherent reconstructions. To address these challenges, we propose Trajectory-Guided Generative Video Coding (dubbed T-GVC), a novel framework that bridges low-level motion tracking with high-level semantic understanding. T-GVC features a semantic-aware sparse motion sampling pipeline that extracts pixel-wise motion as sparse trajectory points based on their semantic importance, significantly reducing the bitrate while preserving critical temporal semantic information. In addition, by integrating trajectory-aligned loss constraints into diffusion processes, we introduce a training-free guidance mechanism in latent space to ensure physically plausible motion patterns without sacrificing the inherent capabilities of generative models. Experimental results demonstrate that T-GVC outperforms both traditional and neural video codecs under ULB conditions. Furthermore, additional experiments confirm that our framework achieves more precise motion control than existing text-guided methods, paving the way for a novel direction of generative video coding guided by geometric motion modeling.

[106] VisuCraft: Enhancing Large Vision-Language Models for Complex Visual-Guided Creative Content Generation via Structured Information Extraction

Rongxin Jiang, Robert Long, Chenghao Gu, Mingrui Yan

Main category: cs.CV

TL;DR: VisuCraft enhances LVLMs for creative content generation by integrating a multimodal extractor and dynamic prompt module, outperforming baselines in creativity and instruction adherence.

DetailsMotivation: Address limitations of LVLMs in visual fidelity, creativity, and adherence to nuanced user instructions in long-form text generation.

Method: Combines a multimodal structured information extractor (E) and dynamic prompt generation module (G) to optimize prompts for LVLMs.

Result: Outperforms baseline LVLMs in tasks like story generation and poetry, especially in creativity and instruction adherence.

Conclusion: VisuCraft unlocks new potential for LVLMs in advanced creative AI applications.

Abstract: This paper introduces VisuCraft, a novel framework designed to significantly enhance the capabilities of Large Vision-Language Models (LVLMs) in complex visual-guided creative content generation. Existing LVLMs often exhibit limitations in maintaining high visual fidelity, genuine creativity, and precise adherence to nuanced user instructions when generating long-form texts. VisuCraft addresses these challenges by integrating a multimodal structured information extractor (E) and a dynamic prompt generation module (G). The extractor distills fine-grained visual attributes from input images into a rich, structured representation, which the dynamic prompt module then combines with user instructions to create highly optimized prompts for underlying LVLMs (e.g., LLaVA, InstructBLIP). Evaluated on the self-constructed ImageStoryGen-500K dataset using VisuGen Metrics (Visual Grounding, Creativity, and Instruction Adherence), VisuCraft consistently outperforms baseline LVLMs across tasks like story generation and poetry composition. Our results demonstrate remarkable improvements, particularly in creativity and instruction adherence, validating VisuCraft’s effectiveness in producing imaginative, visually grounded, and user-aligned long-form creative text. This work unlocks new potential for LVLMs in sophisticated creative AI applications.

[107] RDDPM: Robust Denoising Diffusion Probabilistic Model for Unsupervised Anomaly Segmentation

Mehrdad Moradi, Kamran Paynabar

Main category: cs.CV

TL;DR: The paper introduces robust denoising diffusion models for unsupervised anomaly segmentation using contaminated data, outperforming existing methods.

DetailsMotivation: Traditional diffusion models require normal data for training, limiting real-world applicability. This work addresses scenarios with only contaminated (mixed normal/anomalous) unlabeled data.

Method: The authors reinterpret denoising diffusion probabilistic models as nonlinear regression, applying robust regression to derive a robust version.

Result: The proposed method achieves up to 8.08% higher AUROC and 10.37% higher AUPRC on MVTec datasets compared to existing approaches.

Conclusion: The robust diffusion model framework is flexible and effective for anomaly segmentation with contaminated data, advancing the field beyond traditional assumptions.

Abstract: Recent advancements in diffusion models have demonstrated significant success in unsupervised anomaly segmentation. For anomaly segmentation, these models are first trained on normal data; then, an anomalous image is noised to an intermediate step, and the normal image is reconstructed through backward diffusion. Unlike traditional statistical methods, diffusion models do not rely on specific assumptions about the data or target anomalies, making them versatile for use across different domains. However, diffusion models typically assume access to normal data for training, limiting their applicability in realistic settings. In this paper, we propose novel robust denoising diffusion models for scenarios where only contaminated (i.e., a mix of normal and anomalous) unlabeled data is available. By casting maximum likelihood estimation of the data as a nonlinear regression problem, we reinterpret the denoising diffusion probabilistic model through a regression lens. Using robust regression, we derive a robust version of denoising diffusion probabilistic models. Our novel framework offers flexibility in constructing various robust diffusion models. Our experiments show that our approach outperforms current state of the art diffusion models, for unsupervised anomaly segmentation when only contaminated data is available. Our method outperforms existing diffusion-based approaches, achieving up to 8.08% higher AUROC and 10.37% higher AUPRC on MVTec datasets. The implementation code is available at: https://github.com/mehrdadmoradi124/RDDPM

[108] UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation

Jinting Wang, Shan Yang, Chenxing Li, Dong Yu, Li Liu

Main category: cs.CV

TL;DR: UniCUE is a unified framework for directly generating speech from Cued Speech videos, avoiding intermediate text and leveraging visual-semantic cues for improved performance.

DetailsMotivation: Existing methods for CSV2S rely on intermediate text, leading to error propagation and misalignment. Direct methods struggle with multimodal complexity and data scarcity.

Method: UniCUE integrates CSR for fine-grained visual-semantic cues, using a pose-aware visual processor, semantic alignment pool, and VisioPhonetic adapter.

Result: UniCUE achieves state-of-the-art performance on the UniCUE-HI dataset, a large-scale Mandarin CS dataset.

Conclusion: UniCUE effectively addresses challenges in CSV2S by unifying understanding and generation tasks, demonstrating superior performance.

Abstract: Cued Speech (CS) enhances lipreading via hand coding, offering visual phonemic cues that support precise speech perception for the hearing-impaired. The task of CS Video-to-Speech generation (CSV2S) aims to convert CS videos into intelligible speech signals. Most existing research focuses on CS Recognition (CSR), which transcribes video content into text. Consequently, a common solution for CSV2S is to integrate CSR with a text-to-speech (TTS) system. However, this pipeline relies on text as an intermediate medium, which may lead to error propagation and temporal misalignment between speech and CS video dynamics. In contrast, directly generating audio speech from CS video (direct CSV2S) often suffers from the inherent multimodal complexity and the limited availability of CS data. To address these challenges, we propose UniCUE, the first unified framework for CSV2S that directly generates speech from CS videos without relying on intermediate text. The core innovation of UniCUE lies in integrating an understanding task (CSR) that provides fine-grained CS visual-semantic cues to guide speech generation. Specifically, UniCUE incorporates a pose-aware visual processor, a semantic alignment pool that enables precise visual-semantic mapping, and a VisioPhonetic adapter to bridge the understanding and generation tasks within a unified architecture. To support this framework, we construct UniCUE-HI, a large-scale Mandarin CS dataset containing 11282 videos from 14 cuers, including both hearing-impaired and normal-hearing individuals. Extensive experiments on this dataset demonstrate that UniCUE achieves state-of-the-art performance across multiple evaluation metrics.

[109] Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces

Vebjørn Haug Kåsene, Pierre Lison

Main category: cs.CV

TL;DR: Off-the-shelf LVLMs can perform VLN tasks but lag behind specialized models, achieving a 41% success rate on R2R.

DetailsMotivation: Explore the potential of off-the-shelf LVLMs for VLN tasks and their adaptability to low-level and panoramic action spaces.

Method: Fine-tune Qwen2.5-VL-3B-Instruct on the R2R dataset and evaluate performance in both action paradigms.

Result: 41% success rate on R2R, showing capability but inferiority to specialized models.

Conclusion: Off-the-shelf LVLMs can support VLN but are less effective than purpose-built models.

Abstract: Vision-and-Language Navigation (VLN) refers to the task of enabling autonomous robots to navigate unfamiliar environments by following natural language instructions. While recent Large Vision-Language Models (LVLMs) have shown promise in this task, most current VLM systems rely on models specifically designed and optimized for navigation, leaving the potential of off-the-shelf LVLMs underexplored. Furthermore, while older VLN approaches used low-level action spaces with egocentric views and atomic actions (such as “turn left” or “move forward”), newer models tend to favor panoramic action spaces with discrete navigable viewpoints. This paper investigates (1) whether off-the-shelf LVLMs (fine-tuned without architectural modifications or simulator-based training) can effectively support VLN tasks and (2) whether such models can support both low-level and panoramic action paradigms. To this end, we fine-tune the open-source model Qwen2.5-VL-3B-Instruct on the Room-to-Room (R2R) dataset and evaluate its empirical performance across both low-level and panoramic action spaces. The best resulting model achieves a 41% success rate on the R2R test set, demonstrating that while off-the-shelf LVLMs can learn to perform Vision-and-Language Navigation, they still lag behind models specifically designed for this task.

[110] How Diffusion Prior Landscapes Shape the Posterior in Blind Deconvolution

Minh-Hai Nguyen, Edouard Pauwels, Pierre Weiss

Main category: cs.CV

TL;DR: MAP estimation in blind deconvolution favors blurry solutions with sparsity-promoting priors. Diffusion-based priors reveal blurry images have higher likelihoods, but local minimizers correspond to sharp images. Gradient descent can find these solutions, suggesting good initialization is key.

DetailsMotivation: To address the limitation of MAP estimation favoring blurry solutions in blind deconvolution by exploring diffusion-based priors and their likelihood landscape.

Method: Empirical examination of the prior’s likelihood landscape and theoretical analysis of the blind deblurring posterior, validated with numerical experiments.

Result: MAP produces sharp filters and blurry images, but local minimizers correspond to sharp, natural images. Gradient descent can find these solutions.

Conclusion: Overcoming MAP’s limitations requires good initialization to local minima, with implications for designing better priors and optimization techniques.

Abstract: The Maximum A Posteriori (MAP) estimation is a widely used framework in blind deconvolution to recover sharp images from blurred observations. The estimated image and blur filter are defined as the maximizer of the posterior distribution. However, when paired with sparsity-promoting image priors, MAP estimation has been shown to favors blurry solutions, limiting its effectiveness. In this paper, we revisit this result using diffusion-based priors, a class of models that capture realistic image distributions. Through an empirical examination of the prior’s likelihood landscape, we uncover two key properties: first, blurry images tend to have higher likelihoods; second, the landscape contains numerous local minimizers that correspond to natural images. Building on these insights, we provide a theoretical analysis of the blind deblurring posterior. This reveals that the MAP estimator tends to produce sharp filters (close to the Dirac delta function) and blurry solutions. However local minimizers of the posterior, which can be obtained with gradient descent, correspond to realistic, natural images, effectively solving the blind deconvolution problem. Our findings suggest that overcoming MAP’s limitations requires good local initialization to local minima in the posterior landscape. We validate our analysis with numerical experiments, demonstrating the practical implications of our insights for designing improved priors and optimization techniques.

[111] Infrared Object Detection with Ultra Small ConvNets: Is ImageNet Pretraining Still Useful?

Srikanth Muralidharan, Heitor R. Medeiros, Masih Aminbeidokhti, Eric Granger, Marco Pedersoli

Main category: cs.CV

TL;DR: The paper examines the impact of ImageNet pre-training on ultra-small models (<1M parameters) for infrared object detection, finding diminishing robustness returns beyond a certain model size.

DetailsMotivation: To understand if pre-training benefits small models for embedded devices, especially in robustness for out-of-distribution tasks.

Method: Constructs ultra-small backbone families using scaling laws and evaluates them on three datasets for infrared object detection.

Result: ImageNet pre-training helps but offers diminishing robustness gains for very small models in out-of-distribution scenarios.

Conclusion: Practitioners should use pre-training and avoid overly small models for robustness in varied conditions.

Abstract: Many real-world applications require recognition models that are robust to different operational conditions and modalities, but at the same time run on small embedded devices, with limited hardware. While for normal size models, pre-training is known to be very beneficial in accuracy and robustness, for small models, that can be employed for embedded and edge devices, its effect is not clear. In this work, we investigate the effect of ImageNet pretraining on increasingly small backbone architectures (ultra-small models, with $<$1M parameters) with respect to robustness in downstream object detection tasks in the infrared visual modality. Using scaling laws derived from standard object recognition architectures, we construct two ultra-small backbone families and systematically study their performance. Our experiments on three different datasets reveal that while ImageNet pre-training is still useful, beyond a certain capacity threshold, it offers diminishing returns in terms of out-of-distribution detection robustness. Therefore, we advise practitioners to still use pre-training and, when possible avoid too small models as while they might work well for in-domain problems, they are brittle when working conditions are different.

[112] X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio

Chenxu Zhang, Zenan Li, Hongyi Xu, You Xie, Xiaochen Zhao, Tianpei Gu, Guoxian Song, Xin Chen, Chao Liang, Jianwen Jiang, Linjie Luo

Main category: cs.CV

TL;DR: X-Actor is an audio-driven framework for lifelike, emotionally expressive talking head videos, using a two-stage pipeline for long-form, high-fidelity animations.

DetailsMotivation: To overcome limitations of prior methods focusing on lip sync and short-range fidelity, enabling nuanced, long-form emotional performances.

Method: Two-stage decoupled pipeline: autoregressive diffusion model for facial motion prediction and diffusion-based video synthesis for high-fidelity animations.

Result: Produces cinematic-style performances, excelling in long-range, emotionally rich animations with state-of-the-art results.

Conclusion: X-Actor advances audio-driven portrait animation, enabling coherent, infinite-length emotional performances.

Abstract: We present X-Actor, a novel audio-driven portrait animation framework that generates lifelike, emotionally expressive talking head videos from a single reference image and an input audio clip. Unlike prior methods that emphasize lip synchronization and short-range visual fidelity in constrained speaking scenarios, X-Actor enables actor-quality, long-form portrait performance capturing nuanced, dynamically evolving emotions that flow coherently with the rhythm and content of speech. Central to our approach is a two-stage decoupled generation pipeline: an audio-conditioned autoregressive diffusion model that predicts expressive yet identity-agnostic facial motion latent tokens within a long temporal context window, followed by a diffusion-based video synthesis module that translates these motions into high-fidelity video animations. By operating in a compact facial motion latent space decoupled from visual and identity cues, our autoregressive diffusion model effectively captures long-range correlations between audio and facial dynamics through a diffusion-forcing training paradigm, enabling infinite-length emotionally-rich motion prediction without error accumulation. Extensive experiments demonstrate that X-Actor produces compelling, cinematic-style performances that go beyond standard talking head animations and achieves state-of-the-art results in long-range, audio-driven emotional portrait acting.

[113] Towards Robust Image Denoising with Scale Equivariance

Dawei Zhang, Xiaojie Guo

Main category: cs.CV

TL;DR: The paper proposes a scale-equivariant framework for robust blind image denoising, addressing generalization gaps in out-of-distribution noise conditions.

DetailsMotivation: Existing image denoising models struggle with generalization, especially under spatially variant noise (OOD conditions). Scale equivariance is explored as a solution.

Method: The framework includes a Heterogeneous Normalization Module (HNM) for feature stabilization and an Interactive Gating Module (IGM) for information modulation.

Result: The model outperforms state-of-the-art methods on synthetic and real-world benchmarks, particularly for spatially heterogeneous noise.

Conclusion: Scale-equivariant structures enhance OOD robustness in denoising, with HNM and IGM proving effective for handling varying noise patterns.

Abstract: Despite notable advances in image denoising, existing models often struggle to generalize beyond in-distribution noise patterns, particularly when confronted with out-of-distribution (OOD) conditions characterized by spatially variant noise. This generalization gap remains a fundamental yet underexplored challenge. In this work, we investigate \emph{scale equivariance} as a core inductive bias for improving OOD robustness. We argue that incorporating scale-equivariant structures enables models to better adapt from training on spatially uniform noise to inference on spatially non-uniform degradations. Building on this insight, we propose a robust blind denoising framework equipped with two key components: a Heterogeneous Normalization Module (HNM) and an Interactive Gating Module (IGM). HNM stabilizes feature distributions and dynamically corrects features under varying noise intensities, while IGM facilitates effective information modulation via gated interactions between signal and feature paths. Extensive evaluations demonstrate that our model consistently outperforms state-of-the-art methods on both synthetic and real-world benchmarks, especially under spatially heterogeneous noise. Code will be made publicly available.

[114] Sparsity and Total Variation Constrained Multilayer Linear Unmixing for Hyperspectral Imagery

Gang Yang

Main category: cs.CV

TL;DR: A novel hyperspectral unmixing method (STVMLU) combines sparsity and total variation constraints for improved accuracy, using ADMM for optimization.

DetailsMotivation: Hyperspectral unmixing is crucial for preprocessing in imagery applications, but existing methods lack accuracy in capturing spatial similarity and sparsity.

Method: STVMLU integrates TV for spatial similarity and L1/2-norm for sparsity, optimized via ADMM to extract endmembers and abundances simultaneously.

Result: STVMLU outperforms other algorithms in experimental evaluations.

Conclusion: The proposed STVMLU method enhances hyperspectral unmixing accuracy by addressing spatial and sparsity constraints effectively.

Abstract: Hyperspectral unmixing aims at estimating material signatures (known as endmembers) and the corresponding proportions (referred to abundances), which is a critical preprocessing step in various hyperspectral imagery applications. This study develops a novel approach called sparsity and total variation (TV) constrained multilayer linear unmixing (STVMLU) for hyperspectral imagery. Specifically, based on a multilayer matrix factorization model, to improve the accuracy of unmixing, a TV constraint is incorporated to consider adjacent spatial similarity. Additionally, a L1/2-norm sparse constraint is adopted to effectively characterize the sparsity of the abundance matrix. For optimizing the STVMLU model, the method of alternating direction method of multipliers (ADMM) is employed, which allows for the simultaneous extraction of endmembers and their corresponding abundance matrix. Experimental results illustrate the enhanced performance of the proposed STVMLU when compared to other algorithms.

[115] Diffusion Models with Adaptive Negative Sampling Without External Resources

Alakh Desai, Nuno Vasconcelos

Main category: cs.CV

TL;DR: ANSWER is a training-free technique for diffusion models that leverages classifier-free guidance to improve prompt adherence without explicit negative prompts, outperforming baselines.

DetailsMotivation: Diffusion models vary in prompt adherence and quality. Negative prompting improves compliance, but explicit negative prompts are lossy and incomplete.

Method: Develops ANSWER, a sampling procedure using classifier-free guidance to account for positive and negative conditions from a single prompt.

Result: Outperforms baselines on benchmarks and is preferred by humans 2x more.

Conclusion: ANSWER enhances prompt faithfulness in diffusion models without external resources or explicit negative prompts.

Abstract: Diffusion models (DMs) have demonstrated an unparalleled ability to create diverse and high-fidelity images from text prompts. However, they are also well-known to vary substantially regarding both prompt adherence and quality. Negative prompting was introduced to improve prompt compliance by specifying what an image must not contain. Previous works have shown the existence of an ideal negative prompt that can maximize the odds of the positive prompt. In this work, we explore relations between negative prompting and classifier-free guidance (CFG) to develop a sampling procedure, {\it Adaptive Negative Sampling Without External Resources} (ANSWER), that accounts for both positive and negative conditions from a single prompt. This leverages the internal understanding of negation by the diffusion model to increase the odds of generating images faithful to the prompt. ANSWER is a training-free technique, applicable to any model that supports CFG, and allows for negative grounding of image concepts without an explicit negative prompts, which are lossy and incomplete. Experiments show that adding ANSWER to existing DMs outperforms the baselines on multiple benchmarks and is preferred by humans 2x more over the other methods.

[116] CloudBreaker: Breaking the Cloud Covers of Sentinel-2 Images using Multi-Stage Trained Conditional Flow Matching on Sentinel-1

Saleh Sakib Ahmed, Sara Nowreen, M. Sohel Rahman

Main category: cs.CV

TL;DR: CloudBreaker generates high-quality multi-spectral Sentinel-2 signals from Sentinel-1 data, overcoming cloud and nighttime limitations in satellite imagery.

DetailsMotivation: Cloud cover and nighttime conditions limit satellite-based remote sensing, making multi-spectral imagery unreliable. Sentinel-1 radar data, unaffected by these issues, offers a consistent alternative.

Method: A multi-stage training approach using conditional latent flow matching, integrating cosine scheduling with flow matching for the first time.

Result: Achieved FID score of 0.7432 for optical imagery, and SSIM scores of 0.6156 (NDWI) and 0.6874 (NDVI), indicating high fidelity and structural similarity.

Conclusion: CloudBreaker is a promising solution for remote sensing applications where multi-spectral data is typically unavailable or unreliable.

Abstract: Cloud cover and nighttime conditions remain significant limitations in satellite-based remote sensing, often restricting the availability and usability of multi-spectral imagery. In contrast, Sentinel-1 radar images are unaffected by cloud cover and can provide consistent data regardless of weather or lighting conditions. To address the challenges of limited satellite imagery, we propose CloudBreaker, a novel framework that generates high-quality multi-spectral Sentinel-2 signals from Sentinel-1 data. This includes the reconstruction of optical (RGB) images as well as critical vegetation and water indices such as NDVI and NDWI.We employed a novel multi-stage training approach based on conditional latent flow matching and, to the best of our knowledge, are the first to integrate cosine scheduling with flow matching. CloudBreaker demonstrates strong performance, achieving a Frechet Inception Distance (FID) score of 0.7432, indicating high fidelity and realism in the generated optical imagery. The model also achieved Structural Similarity Index Measure (SSIM) of 0.6156 for NDWI and 0.6874 for NDVI, indicating a high degree of structural similarity. This establishes CloudBreaker as a promising solution for a wide range of remote sensing applications where multi-spectral data is typically unavailable or unreliable

[117] Separating Shared and Domain-Specific LoRAs for Multi-Domain Learning

Yusaku Takama, Ning Ding, Tatsuya Yokota, Toru Tamaki

Main category: cs.CV

TL;DR: Proposes a method to separate shared and domain-specific LoRAs into distinct subspaces for better multi-domain learning.

DetailsMotivation: Existing architectures may not effectively capture domain-specific information due to unclear separation of shared and domain-specific LoRAs.

Method: Ensures shared and domain-specific LoRAs exist in different subspaces (column and left null subspaces of pre-trained weights).

Result: Applied to action recognition with UCF101, Kinetics400, and HMDB51 datasets, showing effectiveness in some cases.

Conclusion: The method improves domain-specific information capture, with analysis of LoRA weight dimensions providing insights.

Abstract: Existing architectures of multi-domain learning have two types of adapters: shared LoRA for all domains and domain-specific LoRA for each particular domain. However, it remains unclear whether this structure effectively captures domain-specific information. In this paper, we propose a method that ensures that shared and domain-specific LoRAs exist in different subspaces; specifically, the column and left null subspaces of the pre-trained weights. We apply the proposed method to action recognition with three datasets (UCF101, Kinetics400, and HMDB51) and demonstrate its effectiveness in some cases along with the analysis of the dimensions of LoRA weights.

[118] MoExDA: Domain Adaptation for Edge-based Action Recognition

Takuya Sugimoto, Ning Ding, Toru Tamaki

Main category: cs.CV

TL;DR: MoExDA addresses static bias in action recognition by combining RGB and edge frames, improving generalization with lower computational cost.

DetailsMotivation: Static bias in action recognition models reduces generalization performance, prompting the need for a solution.

Method: Proposes MoExDA, a lightweight domain adaptation method using RGB and edge frames to mitigate static bias.

Result: Effectively suppresses static bias with lower computational cost, enhancing action recognition robustness.

Conclusion: MoExDA offers a practical solution for static bias, outperforming previous approaches in efficiency and robustness.

Abstract: Modern action recognition models suffer from static bias, leading to reduced generalization performance. In this paper, we propose MoExDA, a lightweight domain adaptation between RGB and edge information using edge frames in addition to RGB frames to counter the static bias issue. Experiments demonstrate that the proposed method effectively suppresses static bias with a lower computational cost, allowing for more robust action recognition than previous approaches.

[119] Adversarial Attention Perturbations for Large Object Detection Transformers

Zachary Yahn, Selim Furkan Tekin, Fatih Ilhan, Sihao Hu, Tiansheng Huang, Yichang Xu, Margaret Loper, Ling Liu

Main category: cs.CV

TL;DR: AFOG is a novel adversarial attack method targeting object detection transformers and CNNs, using learnable attention to focus perturbations, outperforming existing methods by up to 83%.

DetailsMotivation: Existing adversarial attacks are limited to CNN-based detectors or ineffective against transformer-based ones, highlighting the need for a unified, architecture-agnostic approach.

Method: AFOG employs a learnable attention mechanism to target vulnerable regions, integrates feature losses, and iteratively injects perturbations for stealth and efficiency.

Result: AFOG improves attack performance by up to 30.6% over baselines and outperforms existing methods by up to 83% on twelve large detection transformers.

Conclusion: AFOG is an effective, stealthy, and efficient adversarial attack method, demonstrating superior performance across both transformer and CNN-based detectors.

Abstract: Adversarial perturbations are useful tools for exposing vulnerabilities in neural networks. Existing adversarial perturbation methods for object detection are either limited to attacking CNN-based detectors or weak against transformer-based detectors. This paper presents an Attention-Focused Offensive Gradient (AFOG) attack against object detection transformers. By design, AFOG is neural-architecture agnostic and effective for attacking both large transformer-based object detectors and conventional CNN-based detectors with a unified adversarial attention framework. This paper makes three original contributions. First, AFOG utilizes a learnable attention mechanism that focuses perturbations on vulnerable image regions in multi-box detection tasks, increasing performance over non-attention baselines by up to 30.6%. Second, AFOG’s attack loss is formulated by integrating two types of feature loss through learnable attention updates with iterative injection of adversarial perturbations. Finally, AFOG is an efficient and stealthy adversarial perturbation method. It probes the weak spots of detection transformers by adding strategically generated and visually imperceptible perturbations which can cause well-trained object detection models to fail. Extensive experiments conducted with twelve large detection transformers on COCO demonstrate the efficacy of AFOG. Our empirical results also show that AFOG outperforms existing attacks on transformer-based and CNN-based object detectors by up to 83% with superior speed and imperceptibility. Code is available at https://github.com/zacharyyahn/AFOG.

[120] Seeing It Before It Happens: In-Generation NSFW Detection for Diffusion-Based Text-to-Image Models

Fan Yang, Yihao Huang, Jiayi Zhu, Ling Shi, Geguang Pu, Jin Song Dong, Kailong Wang

Main category: cs.CV

TL;DR: A method called In-Generation Detection (IGD) uses predicted noise in diffusion models to detect NSFW content during generation, achieving 91.32% accuracy.

DetailsMotivation: Existing methods focus on pre- or post-generation detection, leaving the in-generation phase unexplored for NSFW content.

Method: IGD leverages predicted noise during the diffusion process as an internal signal to identify NSFW content.

Result: IGD achieves 91.32% detection accuracy across seven NSFW categories, outperforming seven baselines.

Conclusion: IGD is a simple yet effective approach for detecting NSFW content during the diffusion process.

Abstract: Diffusion-based text-to-image (T2I) models enable high-quality image generation but also pose significant risks of misuse, particularly in producing not-safe-for-work (NSFW) content. While prior detection methods have focused on filtering prompts before generation or moderating images afterward, the in-generation phase of diffusion models remains largely unexplored for NSFW detection. In this paper, we introduce In-Generation Detection (IGD), a simple yet effective approach that leverages the predicted noise during the diffusion process as an internal signal to identify NSFW content. This approach is motivated by preliminary findings suggesting that the predicted noise may capture semantic cues that differentiate NSFW from benign prompts, even when the prompts are adversarially crafted. Experiments conducted on seven NSFW categories show that IGD achieves an average detection accuracy of 91.32% over naive and adversarial NSFW prompts, outperforming seven baseline methods.

[121] Multi-Granularity Feature Calibration via VFM for Domain Generalized Semantic Segmentation

Xinhui Li, Xiaojie Guo

Main category: cs.CV

TL;DR: MGFC enhances DGSS by hierarchical feature calibration, outperforming state-of-the-art methods.

DetailsMotivation: Improving generalization in semantic segmentation across unseen domains by leveraging VFMs with multi-granularity adaptation.

Method: Proposes MGFC, a framework for coarse-to-fine feature alignment: global context, category-level discriminability, and spatial detail enhancement.

Result: MGFC outperforms existing DGSS approaches in benchmark datasets.

Conclusion: Multi-granularity adaptation is effective for domain-generalized semantic segmentation.

Abstract: Domain Generalized Semantic Segmentation (DGSS) aims to improve the generalization ability of models across unseen domains without access to target data during training. Recent advances in DGSS have increasingly exploited vision foundation models (VFMs) via parameter-efficient fine-tuning strategies. However, most existing approaches concentrate on global feature fine-tuning, while overlooking hierarchical adaptation across feature levels, which is crucial for precise dense prediction. In this paper, we propose Multi-Granularity Feature Calibration (MGFC), a novel framework that performs coarse-to-fine alignment of VFM features to enhance robustness under domain shifts. Specifically, MGFC first calibrates coarse-grained features to capture global contextual semantics and scene-level structure. Then, it refines medium-grained features by promoting category-level feature discriminability. Finally, fine-grained features are calibrated through high-frequency spatial detail enhancement. By performing hierarchical and granularity-aware calibration, MGFC effectively transfers the generalization strengths of VFMs to the domain-specific task of DGSS. Extensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art DGSS approaches, highlighting the effectiveness of multi-granularity adaptation for the semantic segmentation task of domain generalization.

[122] Enhancing Long Video Question Answering with Scene-Localized Frame Grouping

Xuyi Yang, Wenhao Zhang, Hongbo Jin, Lin Liu, Hongbo Xu, Yongwei Nie, Fei Yu, Fei Ma

Main category: cs.CV

TL;DR: The paper introduces SceneQA and the LVSQA dataset to improve MLLMs’ long video understanding by focusing on scene-based detail perception. The SLFG method enhances performance without altering model architecture.

DetailsMotivation: Current MLLMs struggle with long video understanding due to resource constraints and inefficient frame processing. Existing methods don't align with real-world needs.

Method: Proposes SLFG, which combines frames into coherent scenes using scene localization and dynamic reassembly, improving MLLMs’ understanding without architectural changes.

Result: SLFG performs exceptionally in long video benchmarks, demonstrating enhanced scene perception and reasoning.

Conclusion: The SceneQA task and SLFG method address practical challenges in long video understanding, offering a plug-and-play solution with strong performance.

Abstract: Current Multimodal Large Language Models (MLLMs) often perform poorly in long video understanding, primarily due to resource limitations that prevent them from processing all video frames and their associated information. Efficiently extracting relevant information becomes a challenging task. Existing frameworks and evaluation tasks focus on identifying specific frames containing core objects from a large number of irrelevant frames, which does not align with the practical needs of real-world applications. To address this issue, we propose a new scenario under the video question-answering task, SceneQA, which emphasizes scene-based detail perception and reasoning abilities. And we develop the LVSQA dataset to support the SceneQA task, which is built upon carefully selected videos from LVBench and contains a new collection of question-answer pairs to promote a more fair evaluation of MLLMs’ scene perception abilities in long videos. Inspired by human cognition, we introduce a novel method called SLFG. The core idea of SLFG is to combine individual frames into semantically coherent scene frames. By leveraging scene localization methods and dynamic frame reassembly mechanisms, SLFG significantly enhances the understanding capabilities of existing MLLMs in long videos. SLFG requires no modification to the original model architecture and boasts excellent plug-and-play usability. Experimental results show that this method performs exceptionally well in several long video benchmark tests. Code and dataset will be released at http://www.slfg.pkuzwh.cn.

[123] Generalized Compressed Sensing for Image Reconstruction with Diffusion Probabilistic Models

Ling-Qi Zhang, Zahra Kadkhodaie, Eero P. Simoncelli, David H. Brainard

Main category: cs.CV

TL;DR: The paper introduces a method for optimizing linear measurements for high-dimensional signal reconstruction using neural network priors, outperforming PCA, ICA, and CS with lower reconstruction error and skewed measurement distributions.

DetailsMotivation: Existing methods like PCA, ICA, and CS rely on simplistic signal statistics, but natural signals (e.g., images) have richer structure. The goal is to leverage neural network priors for better measurement optimization.

Method: A neural network trained for denoising (diffusion model) is used to derive optimized linear measurements. The approach is tested on natural image datasets.

Result: The proposed method yields measurements differing from PCA, ICA, or CS, with lower mean squared error and skewed distributions. Perceptual loss (SSIM) optimization further alters measurements.

Conclusion: Incorporating natural signal statistics via neural network priors improves linear measurement design, highlighting the need for tailored approaches beyond traditional methods.

Abstract: We examine the problem of selecting a small set of linear measurements for reconstructing high-dimensional signals. Well-established methods for optimizing such measurements include principal component analysis (PCA), independent component analysis (ICA) and compressed sensing (CS) based on random projections, all of which rely on axis- or subspace-aligned statistical characterization of the signal source. However, many naturally occurring signals, including photographic images, contain richer statistical structure. To exploit such structure, we introduce a general method for obtaining an optimized set of linear measurements for efficient image reconstruction, where the signal statistics are expressed by the prior implicit in a neural network trained to perform denoising (known as a “diffusion model”). We demonstrate that the optimal measurements derived for two natural image datasets differ from those of PCA, ICA, or CS, and result in substantially lower mean squared reconstruction error. Interestingly, the marginal distributions of the measurement values are asymmetrical (skewed), substantially more so than those of previous methods. We also find that optimizing with respect to perceptual loss, as quantified by structural similarity (SSIM), leads to measurements different from those obtained when optimizing for MSE. Our results highlight the importance of incorporating the specific statistical regularities of natural signals when designing effective linear measurements.

[124] SA-3DGS: A Self-Adaptive Compression Method for 3D Gaussian Splatting

Liheng Zhang, Weihao Yu, Zubo Lu, Haozhi Gu, Jin Huang

Main category: cs.CV

TL;DR: SA-3DGS reduces storage costs in 3D Gaussian Splatting by identifying and pruning insignificant Gaussians, improving compression and rendering quality.

DetailsMotivation: Current methods for compressing Gaussian models in 3D scenes fail to accurately identify insignificant Gaussians, leading to poor pruning and rendering performance.

Method: SA-3DGS learns importance scores for Gaussians, uses importance-aware clustering for compression, and repairs the codebook to maintain quality.

Result: Achieves up to 66x compression while preserving or enhancing rendering quality, outperforming other pruning-based methods.

Conclusion: SA-3DGS offers an efficient solution for reducing storage demands in 3D Gaussian Splatting without compromising quality, with strong generalization.

Abstract: Recent advancements in 3D Gaussian Splatting have enhanced efficient and high-quality novel view synthesis. However, representing scenes requires a large number of Gaussian points, leading to high storage demands and limiting practical deployment. The latest methods facilitate the compression of Gaussian models but struggle to identify truly insignificant Gaussian points in the scene, leading to a decline in subsequent Gaussian pruning, compression quality, and rendering performance. To address this issue, we propose SA-3DGS, a method that significantly reduces storage costs while maintaining rendering quality. SA-3DGS learns an importance score to automatically identify the least significant Gaussians in scene reconstruction, thereby enabling effective pruning and redundancy reduction. Next, the importance-aware clustering module compresses Gaussians attributes more accurately into the codebook, improving the codebook’s expressive capability while reducing model size. Finally, the codebook repair module leverages contextual scene information to repair the codebook, thereby recovering the original Gaussian point attributes and mitigating the degradation in rendering quality caused by information loss. Experimental results on several benchmark datasets show that our method achieves up to 66x compression while maintaining or even improving rendering quality. The proposed Gaussian pruning approach is not only adaptable to but also improves other pruning-based methods (e.g., LightGaussian), showcasing excellent performance and strong generalization ability.

[125] Individual Content and Motion Dynamics Preserved Pruning for Video Diffusion Models

Yiming Wu, Zhenghao Chen, Huan Wang, Dong Xu

Main category: cs.CV

TL;DR: The paper introduces VDMini, a lightweight Video Diffusion Model (VDM) variant, using pruning and consistency loss to reduce computational cost and inference time while maintaining video quality.

DetailsMotivation: High computational cost and slow inference time hinder the deployment of VDMs, prompting the need for an efficient compression method.

Method: Prunes redundant blocks from shallower layers (focused on individual content) while preserving deeper layers (crucial for motion dynamics). Introduces Individual Content and Motion Dynamics (ICMD) Consistency Loss, combining Individual Content Distillation (ICD) Loss and Multi-frame Content Adversarial (MCA) Loss.

Result: Achieves significant speedups (2.5×, 1.4×, 1.25×) for I2V and T2V tasks while maintaining video quality on benchmarks like UCF101 and VBench.

Conclusion: VDMini effectively balances efficiency and performance, enabling faster inference without compromising video generation quality.

Abstract: The high computational cost and slow inference time are major obstacles to deploying Video Diffusion Models (VDMs). To overcome this, we introduce a new Video Diffusion Model Compression approach using individual content and motion dynamics preserved pruning and consistency loss. First, we empirically observe that deeper VDM layers are crucial for maintaining the quality of \textbf{motion dynamics} (\textit{e.g.,} coherence of the entire video), while shallower layers are more focused on \textbf{individual content} (\textit{e.g.,} individual frames). Therefore, we prune redundant blocks from the shallower layers while preserving more of the deeper layers, resulting in a lightweight VDM variant called VDMini. Moreover, we propose an \textbf{Individual Content and Motion Dynamics (ICMD)} Consistency Loss to gain comparable generation performance as larger VDM to VDMini. In particular, we first use the Individual Content Distillation (ICD) Loss to preserve the consistency in the features of each generated frame between the teacher and student models. Next, we introduce a Multi-frame Content Adversarial (MCA) Loss to enhance the motion dynamics across the generated video as a whole. This method significantly accelerates inference time while maintaining high-quality video generation. Extensive experiments demonstrate the effectiveness of our VDMini on two important video generation tasks, Text-to-Video (T2V) and Image-to-Video (I2V), where we respectively achieve an average 2.5 $\times$, 1.4 $\times$, and 1.25 $\times$ speed up for the I2V method SF-V, the T2V method T2V-Turbo-v2, and the T2V method HunyuanVideo, while maintaining the quality of the generated videos on several benchmarks including UCF101, VBench-T2V, and VBench-I2V.

[126] MoCA: Identity-Preserving Text-to-Video Generation via Mixture of Cross Attention

Qi Xie, Yongjia Ma, Donglin Di, Xuehao Gao, Xun Yang

Main category: cs.CV

TL;DR: MoCA is a novel Video Diffusion Model using a Diffusion Transformer with Mixture of Cross-Attention, improving identity consistency in text-to-video generation.

DetailsMotivation: Existing T2V methods struggle with fine-grained facial dynamics and temporal identity coherence.

Method: MoCA integrates Mixture of Cross-Attention layers into DiT blocks, uses Hierarchical Temporal Pooling, Temporal-Aware Cross-Attention Experts, and a Latent Video Perceptual Loss.

Result: MoCA outperforms existing T2V methods by over 5% in Face similarity on the CelebIPVid dataset.

Conclusion: MoCA effectively enhances identity preservation in T2V generation, demonstrating superior performance.

Abstract: Achieving ID-preserving text-to-video (T2V) generation remains challenging despite recent advances in diffusion-based models. Existing approaches often fail to capture fine-grained facial dynamics or maintain temporal identity coherence. To address these limitations, we propose MoCA, a novel Video Diffusion Model built on a Diffusion Transformer (DiT) backbone, incorporating a Mixture of Cross-Attention mechanism inspired by the Mixture-of-Experts paradigm. Our framework improves inter-frame identity consistency by embedding MoCA layers into each DiT block, where Hierarchical Temporal Pooling captures identity features over varying timescales, and Temporal-Aware Cross-Attention Experts dynamically model spatiotemporal relationships. We further incorporate a Latent Video Perceptual Loss to enhance identity coherence and fine-grained details across video frames. To train this model, we collect CelebIPVid, a dataset of 10,000 high-resolution videos from 1,000 diverse individuals, promoting cross-ethnicity generalization. Extensive experiments on CelebIPVid show that MoCA outperforms existing T2V methods by over 5% across Face similarity.

[127] IntroStyle: Training-Free Introspective Style Attribution using Diffusion Features

Anand Kumar, Jiteng Mu, Nuno Vasconcelos

Main category: cs.CV

TL;DR: A training-free framework, IntroStyle, uses diffusion model features for style attribution, outperforming existing methods without needing custom datasets or retraining.

DetailsMotivation: Address concerns about intellectual property and demand for preventing specific artistic style generation by offering a resource-efficient solution.

Method: Uses features from a diffusion model (IntroStyle) for style attribution, avoiding external modules or retraining. Introduces ArtSplit dataset for evaluation.

Result: Outperforms state-of-the-art methods on WikiArt and DomainNet datasets, handling dynamic artistic styles robustly.

Conclusion: IntroStyle provides an efficient, superior alternative for style attribution without the need for extensive resources or training.

Abstract: Text-to-image (T2I) models have recently gained widespread adoption. This has spurred concerns about safeguarding intellectual property rights and an increasing demand for mechanisms that prevent the generation of specific artistic styles. Existing methods for style extraction typically necessitate the collection of custom datasets and the training of specialized models. This, however, is resource-intensive, time-consuming, and often impractical for real-time applications. We present a novel, training-free framework to solve the style attribution problem, using the features produced by a diffusion model alone, without any external modules or retraining. This is denoted as Introspective Style attribution (IntroStyle) and is shown to have superior performance to state-of-the-art models for style attribution. We also introduce a synthetic dataset of Artistic Style Split (ArtSplit) to isolate artistic style and evaluate fine-grained style attribution performance. Our experimental results on WikiArt and DomainNet datasets show that \ours is robust to the dynamic nature of artistic styles, outperforming existing methods by a wide margin.

[128] Multi-human Interactive Talking Dataset

Zeyu Zhu, Weijia Wu, Mike Zheng Shou

Main category: cs.CV

TL;DR: The paper introduces MIT, a dataset for multi-human talking video generation, and CovOG, a baseline model for this task.

DetailsMotivation: Existing studies focus on single-person talking videos, lacking realism for multi-human interactions.

Method: An automatic pipeline collects and annotates multi-person conversational videos. CovOG integrates a Multi-Human Pose Encoder and Interactive Audio Driver.

Result: The MIT dataset includes 12 hours of footage with fine-grained annotations. CovOG demonstrates feasibility but highlights challenges.

Conclusion: MIT serves as a benchmark for future research in multi-human talking video generation.

Abstract: Existing studies on talking video generation have predominantly focused on single-person monologues or isolated facial animations, limiting their applicability to realistic multi-human interactions. To bridge this gap, we introduce MIT, a large-scale dataset specifically designed for multi-human talking video generation. To this end, we develop an automatic pipeline that collects and annotates multi-person conversational videos. The resulting dataset comprises 12 hours of high-resolution footage, each featuring two to four speakers, with fine-grained annotations of body poses and speech interactions. It captures natural conversational dynamics in multi-speaker scenario, offering a rich resource for studying interactive visual behaviors. To demonstrate the potential of MIT, we furthur propose CovOG, a baseline model for this novel task. It integrates a Multi-Human Pose Encoder (MPE) to handle varying numbers of speakers by aggregating individual pose embeddings, and an Interactive Audio Driver (IAD) to modulate head dynamics based on speaker-specific audio features. Together, these components showcase the feasibility and challenges of generating realistic multi-human talking videos, establishing MIT as a valuable benchmark for future research. The code is avalibale at: https://github.com/showlab/Multi-human-Talking-Video-Dataset.

[129] Uncertainty-Guided Face Matting for Occlusion-Aware Face Transformation

Hyebin Cho, Jaehyup Lee

Main category: cs.CV

TL;DR: FaceMat introduces a trimap-free, uncertainty-aware framework for occlusion-aware face matting, improving face filter robustness in occluded scenarios.

DetailsMotivation: Face filters degrade under occlusions (e.g., hands, hair). The paper addresses this by proposing face matting to separate occlusions from facial regions.

Method: FaceMat uses a two-stage training pipeline: a teacher model predicts alpha mattes and uncertainty with NLL loss, guiding a student model via adaptive knowledge distillation.

Result: FaceMat outperforms state-of-the-art methods, enhancing visual quality and robustness in real-world video scenarios.

Conclusion: The framework, supported by the CelebAMat dataset, advances occlusion-aware face matting without auxiliary inputs, suitable for real-time applications.

Abstract: Face filters have become a key element of short-form video content, enabling a wide array of visual effects such as stylization and face swapping. However, their performance often degrades in the presence of occlusions, where objects like hands, hair, or accessories obscure the face. To address this limitation, we introduce the novel task of face matting, which estimates fine-grained alpha mattes to separate occluding elements from facial regions. We further present FaceMat, a trimap-free, uncertainty-aware framework that predicts high-quality alpha mattes under complex occlusions. Our approach leverages a two-stage training pipeline: a teacher model is trained to jointly estimate alpha mattes and per-pixel uncertainty using a negative log-likelihood (NLL) loss, and this uncertainty is then used to guide the student model through spatially adaptive knowledge distillation. This formulation enables the student to focus on ambiguous or occluded regions, improving generalization and preserving semantic consistency. Unlike previous approaches that rely on trimaps or segmentation masks, our framework requires no auxiliary inputs making it well-suited for real-time applications. In addition, we reformulate the matting objective by explicitly treating skin as foreground and occlusions as background, enabling clearer compositing strategies. To support this task, we newly constructed CelebAMat, a large-scale synthetic dataset specifically designed for occlusion-aware face matting. Extensive experiments show that FaceMat outperforms state-of-the-art methods across multiple benchmarks, enhancing the visual quality and robustness of face filters in real-world, unconstrained video scenarios. The source code and CelebAMat dataset are available at https://github.com/hyebin-c/FaceMat.git

[130] CHARM: Collaborative Harmonization across Arbitrary Modalities for Modality-agnostic Semantic Segmentation

Lekang Wen, Jing Xiao, Liang Liao, Jiajun Chen, Mi Wang

Main category: cs.CV

TL;DR: CHARM is a framework for modality-agnostic semantic segmentation, focusing on harmonizing modalities while preserving their strengths, outperforming baselines.

DetailsMotivation: Existing methods homogenize modalities, diluting their strengths. CHARM aims to harmonize them for better complementarity.

Method: Uses Mutual Perception Unit (MPU) for implicit alignment and a dual-path optimization strategy (CoL and InE).

Result: CHARM outperforms baselines, especially on fragile modalities.

Conclusion: Shifts focus from homogenization to harmonization, achieving true cross-modal complementarity.

Abstract: Modality-agnostic Semantic Segmentation (MaSS) aims to achieve robust scene understanding across arbitrary combinations of input modality. Existing methods typically rely on explicit feature alignment to achieve modal homogenization, which dilutes the distinctive strengths of each modality and destroys their inherent complementarity. To achieve cooperative harmonization rather than homogenization, we propose CHARM, a novel complementary learning framework designed to implicitly align content while preserving modality-specific advantages through two components: (1) Mutual Perception Unit (MPU), enabling implicit alignment through window-based cross-modal interaction, where modalities serve as both queries and contexts for each other to discover modality-interactive correspondences; (2) A dual-path optimization strategy that decouples training into Collaborative Learning Strategy (CoL) for complementary fusion learning and Individual Enhancement Strategy (InE) for protected modality-specific optimization. Experiments across multiple datasets and backbones indicate that CHARM consistently outperform the baselines, with significant increment on the fragile modalities. This work shifts the focus from model homogenization to harmonization, enabling cross-modal complementarity for true harmony in diversity.

[131] CORE-ReID: Comprehensive Optimization and Refinement through Ensemble fusion in Domain Adaptation for person re-identification

Trinh Quoc Nguyen, Oky Dicky Ardiansyah Prima, Katsuyoshi Hotta

Main category: cs.CV

TL;DR: The paper introduces CORE-ReID, a framework for Unsupervised Domain Adaptation in Person Re-identification, using CycleGAN for data generation and ensemble fusion for feature learning, achieving superior performance.

DetailsMotivation: To address challenges in Unsupervised Domain Adaptation for Person Re-identification by harmonizing image differences and improving feature learning with pseudo-labels.

Method: Uses CycleGAN for diverse data generation, teacher-student networks for multi-level clustering, and an Ensemble Fusion component for fine-grained feature integration.

Result: Demonstrates significant performance gains in Mean Average Precision and Top-k accuracy on three UDA benchmarks.

Conclusion: CORE-ReID is an advanced, effective solution for UDA in Person ReID, with clear fusion features and high accuracy.

Abstract: This study introduces a novel framework, “Comprehensive Optimization and Refinement through Ensemble Fusion in Domain Adaptation for Person Re-identification (CORE-ReID)”, to address an Unsupervised Domain Adaptation (UDA) for Person Re-identification (ReID). The framework utilizes CycleGAN to generate diverse data that harmonizes differences in image characteristics from different camera sources in the pre-training stage. In the fine-tuning stage, based on a pair of teacher-student networks, the framework integrates multi-view features for multi-level clustering to derive diverse pseudo labels. A learnable Ensemble Fusion component that focuses on fine-grained local information within global features is introduced to enhance learning comprehensiveness and avoid ambiguity associated with multiple pseudo-labels. Experimental results on three common UDAs in Person ReID demonstrate significant performance gains over state-of-the-art approaches. Additional enhancements, such as Efficient Channel Attention Block and Bidirectional Mean Feature Normalization mitigate deviation effects and adaptive fusion of global and local features using the ResNet-based model, further strengthening the framework. The proposed framework ensures clarity in fusion features, avoids ambiguity, and achieves high ac-curacy in terms of Mean Average Precision, Top-1, Top-5, and Top-10, positioning it as an advanced and effective solution for the UDA in Person ReID. Our codes and models are available at https://github.com/TrinhQuocNguyen/CORE-ReID.

[132] SSFMamba: Symmetry-driven Spatial-Frequency Feature Fusion for 3D Medical Image Segmentation

Bo Zhang, Yifan Zhang, Shuo Yan, Yu Bai, Zheng Zhang, Wu Liu, Xiuzhuang Zhou, Wendong Wang

Main category: cs.CV

TL;DR: SSFMamba, a Mamba-based network, integrates spatial and frequency domain features for 3D medical image segmentation, outperforming state-of-the-art methods.

DetailsMotivation: Current methods overlook unique frequency domain properties and fail to leverage complementary strengths of spatial and frequency domains.

Method: SSFMamba uses a dual-branch architecture with Mamba blocks for feature fusion, a 3D multi-directional scanning mechanism, and frequency domain global context extraction.

Result: Outperforms state-of-the-art methods on BraTS2020 and BraTS2023 datasets across various metrics.

Conclusion: SSFMamba effectively combines spatial and frequency domain features, enhancing global context and local detail preservation in 3D medical image segmentation.

Abstract: In light of the spatial domain’s limited capacity for modeling global context in 3D medical image segmentation, emerging approaches have begun to incorporate frequency domain representations. However, straightforward feature extraction strategies often overlook the unique properties of frequency domain information, such as conjugate symmetry. They also fail to account for the fundamental differences in data distribution between the spatial and frequency domains, which can ultimately dilute or obscure the complementary strengths that frequency-based representations offer. In this paper, we propose SSFMamba, a Mamba based Symmetry-driven Spatial-Frequency feature fusion network for 3D medical image segmentation. SSFMamba employs a complementary dual-branch architecture that extracts features from both the spatial and frequency domains, and leverages a Mamba block to fuse these heterogeneous features to preserve global context while reinforcing local details. In the frequency domain branch, we harness Mamba’s exceptional capability to extract global contextual information in conjunction with the synergistic effect of frequency domain features to further enhance global modeling. Moreover, we design a 3D multi-directional scanning mechanism to strengthen the fusion of local and global cues. Extensive experiments on the BraTS2020 and BraTS2023 datasets demonstrate that our approach consistently outperforms state-of-the-art methods across various evaluation metrics.

[133] RobustGS: Unified Boosting of Feedforward 3D Gaussian Splatting under Low-Quality Conditions

Anran Wu, Long Peng, Xin Di, Xueyuan Dai, Chen Wu, Yang Wang, Xueyang Fu, Yang Cao, Zheng-Jun Zha

Main category: cs.CV

TL;DR: RobustGS enhances feedforward 3D Gaussian Splatting (3DGS) by improving robustness under adverse imaging conditions like noise or low light, using a multi-view feature enhancement module and semantic-aware state-space model.

DetailsMotivation: Existing feedforward 3DGS methods assume clean input images, but real-world conditions (e.g., noise, rain) degrade reconstruction quality. RobustGS addresses this gap.

Method: Proposes RobustGS with a Generalized Degradation Learner for degradation-awareness and a semantic-aware state-space model for feature enhancement and cross-view aggregation.

Result: RobustGS consistently achieves state-of-the-art reconstruction quality under various degradations when integrated into existing methods.

Conclusion: RobustGS is a plug-and-play solution that significantly improves the robustness and quality of feedforward 3DGS in challenging real-world conditions.

Abstract: Feedforward 3D Gaussian Splatting (3DGS) overcomes the limitations of optimization-based 3DGS by enabling fast and high-quality reconstruction without the need for per-scene optimization. However, existing feedforward approaches typically assume that input multi-view images are clean and high-quality. In real-world scenarios, images are often captured under challenging conditions such as noise, low light, or rain, resulting in inaccurate geometry and degraded 3D reconstruction. To address these challenges, we propose a general and efficient multi-view feature enhancement module, RobustGS, which substantially improves the robustness of feedforward 3DGS methods under various adverse imaging conditions, enabling high-quality 3D reconstruction. The RobustGS module can be seamlessly integrated into existing pretrained pipelines in a plug-and-play manner to enhance reconstruction robustness. Specifically, we introduce a novel component, Generalized Degradation Learner, designed to extract generic representations and distributions of multiple degradations from multi-view inputs, thereby enhancing degradation-awareness and improving the overall quality of 3D reconstruction. In addition, we propose a novel semantic-aware state-space model. It first leverages the extracted degradation representations to enhance corrupted inputs in the feature space. Then, it employs a semantic-aware strategy to aggregate semantically similar information across different views, enabling the extraction of fine-grained cross-view correspondences and further improving the quality of 3D representations. Extensive experiments demonstrate that our approach, when integrated into existing methods in a plug-and-play manner, consistently achieves state-of-the-art reconstruction quality across various types of degradations.

[134] Exploring Fairness across Fine-Grained Attributes in Large Vision-Language Models

Zaiying Zhao, Toshihiko Yamasaki

Main category: cs.CV

TL;DR: The study explores fairness in Large Vision-Language Models (LVLMs) beyond traditional demographic attributes, revealing biases influenced by cultural, environmental, and behavioral factors.

DetailsMotivation: Address the gap in fairness evaluation of LVLMs, which currently focuses narrowly on demographic attributes like race and gender.

Method: Construct an open-set knowledge base of bias attributes using LLMs and assess LVLM fairness across finer-grained attributes.

Result: LVLMs show biased outputs across diverse attributes, with cultural, environmental, and behavioral factors impacting decisions more than demographics.

Conclusion: Fairness in LVLMs requires broader evaluation beyond demographics, emphasizing the influence of non-traditional attributes.

Abstract: The rapid expansion of applications using Large Vision-Language Models (LVLMs), such as GPT-4o, has raised significant concerns about their fairness. While existing studies primarily focus on demographic attributes such as race and gender, fairness across a broader range of attributes remains largely unexplored. In this study, we construct an open-set knowledge base of bias attributes leveraging Large Language Models (LLMs) and evaluate the fairness of LVLMs across finer-grained attributes. Our experimental results reveal that LVLMs exhibit biased outputs across a diverse set of attributes and further demonstrate that cultural, environmental, and behavioral factors have a more pronounced impact on LVLM decision-making than traditional demographic attributes.

[135] Contrastive Cross-Bag Augmentation for Multiple Instance Learning-based Whole Slide Image Classification

Bo Zhang, Xu Xinan, Shuo Yan, Yu Bai, Zheng Zhang, Wufan Wang, Wendong Wang

Main category: cs.CV

TL;DR: Proposed $C^2Aug$ enhances pseudo-bag diversity in MIL-based WSI classification by sampling instances from all same-class bags, coupled with contrastive learning to improve feature discrimination.

DetailsMotivation: Existing pseudo-bag augmentation methods lack diversity due to limited sampling, and increased critical instances reduce model performance on slides with small tumor areas.

Method: $C^2Aug$ samples instances from all same-class bags for diversity and uses bag-level and group-level contrastive learning to improve feature discrimination.

Result: $C^2Aug$ outperforms state-of-the-art methods across multiple metrics.

Conclusion: The proposed method effectively addresses diversity and performance issues in MIL-based WSI classification.

Abstract: Recent pseudo-bag augmentation methods for Multiple Instance Learning (MIL)-based Whole Slide Image (WSI) classification sample instances from a limited number of bags, resulting in constrained diversity. To address this issue, we propose Contrastive Cross-Bag Augmentation ($C^2Aug$) to sample instances from all bags with the same class to increase the diversity of pseudo-bags. However, introducing new instances into the pseudo-bag increases the number of critical instances (e.g., tumor instances). This increase results in a reduced occurrence of pseudo-bags containing few critical instances, thereby limiting model performance, particularly on test slides with small tumor areas. To address this, we introduce a bag-level and group-level contrastive learning framework to enhance the discrimination of features with distinct semantic meanings, thereby improving model performance. Experimental results demonstrate that $C^2Aug$ consistently outperforms state-of-the-art approaches across multiple evaluation metrics.

[136] Augmenting Continual Learning of Diseases with LLM-Generated Visual Concepts

Jiantao Tan, Peixian Ma, Kanghao Chen, Zhiming Dai, Ruixuan Wang

Main category: cs.CV

TL;DR: A novel framework for continual learning in medical image classification uses LLM-generated visual concepts and cross-modal attention to improve performance.

DetailsMotivation: Existing methods for continual learning in medical image classification lack rich semantic information from multimodal data, relying only on simplistic templates.

Method: The proposed framework dynamically constructs a visual concept pool using LLMs, filters redundancy, and integrates concepts via a cross-modal image-concept attention module with an attention loss.

Result: Experiments show state-of-the-art performance on medical and natural image datasets.

Conclusion: The method effectively leverages semantic guidance from visual concepts, demonstrating superiority in continual learning tasks.

Abstract: Continual learning is essential for medical image classification systems to adapt to dynamically evolving clinical environments. The integration of multimodal information can significantly enhance continual learning of image classes. However, while existing approaches do utilize textual modality information, they solely rely on simplistic templates with a class name, thereby neglecting richer semantic information. To address these limitations, we propose a novel framework that harnesses visual concepts generated by large language models (LLMs) as discriminative semantic guidance. Our method dynamically constructs a visual concept pool with a similarity-based filtering mechanism to prevent redundancy. Then, to integrate the concepts into the continual learning process, we employ a cross-modal image-concept attention module, coupled with an attention loss. Through attention, the module can leverage the semantic knowledge from relevant visual concepts and produce class-representative fused features for classification. Experiments on medical and natural image datasets show our method achieves state-of-the-art performance, demonstrating the effectiveness and superiority of our method. We will release the code publicly.

[137] AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video

Yogesh Kulkarni, Pooyan Fazli

Main category: cs.CV

TL;DR: AVATAR improves multimodal video reasoning by addressing data inefficiency and vanishing advantages with off-policy training and Temporal Advantage Shaping, outperforming baselines.

DetailsMotivation: Current methods like GRPO face data inefficiency, vanishing advantages, and uniform credit assignment, limiting performance in multimodal video reasoning.

Method: AVATAR uses off-policy training for better sample efficiency and Temporal Advantage Shaping (TAS) for focused credit assignment.

Result: AVATAR outperforms Qwen2.5-Omni by +5.4 on MMVU, +4.9 on OmniBench, and +4.5 on Video-Holmes, with 35% higher sample efficiency.

Conclusion: AVATAR effectively addresses key limitations in multimodal reasoning, achieving superior performance and efficiency.

Abstract: Multimodal reasoning over long-horizon video is challenging due to the need for precise spatiotemporal fusion and alignment across modalities. While recent methods such as Group Relative Policy Optimization (GRPO) have shown promise in this domain, they suffer from three key limitations: (1) data inefficiency from their on-policy design, (2) a vanishing advantage problem, where identical or near-identical rewards within a group eliminate the learning signal by producing zero-valued advantages, and (3) uniform credit assignment that fails to emphasize critical reasoning steps. We introduce AVATAR (Audio-Video Agent for Alignment and Reasoning), a framework that addresses these limitations through two core components: (1) an off-policy training architecture that improves sample efficiency and resolves vanishing advantages by reusing past experiences with greater reward diversity, and (2) Temporal Advantage Shaping (TAS), a novel credit assignment strategy that upweights key reasoning phases during learning. AVATAR achieves strong performance across various benchmarks, outperforming the Qwen2.5-Omni baseline by +5.4on MMVU, +4.9 on OmniBench, and +4.5 on Video-Holmes, while demonstrating over 35% higher sample efficiency.

[138] Causal Disentanglement and Cross-Modal Alignment for Enhanced Few-Shot Learning

Tianjiao Jiang, Zhen Zhang, Yuhang Liu, Javen Qinfeng Shi

Main category: cs.CV

TL;DR: The paper introduces Causal CLIP Adapter (CCA), a framework for few-shot learning that disentangles CLIP’s visual features using ICA, enhancing cross-modal alignment for better performance.

DetailsMotivation: Existing few-shot learning methods rely on entangled representations, requiring implicit unmixing, which hinders adaptation. CCA aims to explicitly disentangle features for improved efficiency and accuracy.

Method: CCA uses unsupervised ICA to disentangle CLIP’s visual features, reducing trainable parameters. It enhances cross-modal alignment via fine-tuning and cross-attention mechanisms.

Result: CCA outperforms state-of-the-art methods on 11 benchmark datasets, showing robustness to distributional shifts while maintaining computational efficiency.

Conclusion: CCA effectively disentangles and aligns multimodal representations, improving few-shot learning performance and robustness.

Abstract: Few-shot learning (FSL) often requires effective adaptation of models using limited labeled data. However, most existing FSL methods rely on entangled representations, requiring the model to implicitly recover the unmixing process to obtain disentangled representations using only limited supervision, which hinders effective adaptation. Recent theoretical studies show that multimodal contrastive learning methods, such as CLIP, can disentangle latent representations up to linear transformations. In light of this, we propose the Causal CLIP Adapter (CCA), a novel framework that explicitly disentangles visual features extracted from CLIP using unsupervised Independent Component Analysis (ICA). This removes the need to learn the unmixing process from the labeled data, thereby reducing the number of trainable parameters and mitigating overfitting. Taking a step further, while ICA can obtain visual disentangled representations, it may also disrupt CLIP’s intra- and inter-modal alignment. To counteract this, CCA further leverages CLIP’s inherent cross-modal alignment by enhancing it in two ways: unidirectionally, through fine-tuning a CLIP-based text classifier, and bidirectionally, via a cross-attention mechanism that enriches visual and textual representations through mutual interaction. Both unimodal and cross-modal classification outputs can be effectively combined linearly to improve classification accuracy. Extensive experiments on 11 benchmark datasets demonstrate that our method consistently outperforms state-of-the-art approaches in terms of few-shot performance and robustness to distributional shifts, while maintaining computational efficiency. Code will be available at https://github.com/tianjiao-j/CCA.

[139] H3R: Hybrid Multi-view Correspondence for Generalizable 3D Reconstruction

Heng Jia, Linchao Zhu, Na Zhao

Main category: cs.CV

TL;DR: H3R is a hybrid framework combining volumetric latent fusion and attention-based feature aggregation for generalizable 3D reconstruction, outperforming existing methods in speed and accuracy.

DetailsMotivation: Addressing the trade-off between geometric precision (explicit methods) and robustness (implicit methods) in multi-view correspondence modeling.

Method: Integrates a latent volume for geometric consistency and a camera-aware Transformer for adaptive correspondence refinement, leveraging Plücker coordinates.

Result: Achieves state-of-the-art performance with significant PSNR improvements (0.59 dB, 1.06 dB, 0.22 dB on RealEstate10K, ACID, DTU datasets) and converges 2x faster.

Conclusion: H3R enhances generalization and efficiency, resolving semantic-spatial mismatches and supporting high-resolution, variable-number input views.

Abstract: Despite recent advances in feed-forward 3D Gaussian Splatting, generalizable 3D reconstruction remains challenging, particularly in multi-view correspondence modeling. Existing approaches face a fundamental trade-off: explicit methods achieve geometric precision but struggle with ambiguous regions, while implicit methods provide robustness but suffer from slow convergence. We present H3R, a hybrid framework that addresses this limitation by integrating volumetric latent fusion with attention-based feature aggregation. Our framework consists of two complementary components: an efficient latent volume that enforces geometric consistency through epipolar constraints, and a camera-aware Transformer that leverages Pl"ucker coordinates for adaptive correspondence refinement. By integrating both paradigms, our approach enhances generalization while converging 2$\times$ faster than existing methods. Furthermore, we show that spatial-aligned foundation models (e.g., SD-VAE) substantially outperform semantic-aligned models (e.g., DINOv2), resolving the mismatch between semantic representations and spatial reconstruction requirements. Our method supports variable-number and high-resolution input views while demonstrating robust cross-dataset generalization. Extensive experiments show that our method achieves state-of-the-art performance across multiple benchmarks, with significant PSNR improvements of 0.59 dB, 1.06 dB, and 0.22 dB on the RealEstate10K, ACID, and DTU datasets, respectively. Code is available at https://github.com/JiaHeng-DLUT/H3R.

[140] Landsat30-AU: A Vision-Language Dataset for Australian Landsat Imagery

Sai Ma, Zhuang Li, John A Taylor

Main category: cs.CV

TL;DR: A new dataset, Landsat30-AU, addresses gaps in satellite imagery datasets by focusing on long-term, multi-satellite, low-resolution archives. It includes image-caption pairs and VQA samples, improving VLM performance for satellite imagery.

DetailsMotivation: Existing datasets lack long-term, multi-satellite, low-resolution imagery, limiting affordable and bias-robust global monitoring. Landsat30-AU fills this gap.

Method: Landsat30-AU is built from 30-meter resolution imagery over 36 years, with two components: image-caption pairs and VQA samples. A bootstrapped pipeline ensures quality.

Result: Off-the-shelf VLMs perform poorly on satellite imagery, but fine-tuning Qwen2.5-VL-7B on Landsat30-AU significantly improves performance.

Conclusion: Landsat30-AU enables better VLMs for satellite imagery, demonstrating the need for specialized datasets and fine-tuning.

Abstract: Vision language models (VLMs) that enable natural language interaction with satellite imagery can democratize Earth observation by accelerating expert workflows, making data accessible to non-specialists, and enabling planet-scale automation. However, existing datasets focus mainly on short-term, high-resolution imagery from a limited number of satellites, overlooking low-resolution, multi-satellite, long-term archives, such as Landsat, that are essential for affordable and bias-robust global monitoring. We address this gap with Landsat30-AU, a large-scale vision-language dataset built from 30-meter resolution imagery collected by four Landsat satellites (5, 7, 8, and 9) over Australia, spanning more than 36 years. The dataset includes two components: Landsat30-AU-Cap, containing 196,262 image-caption pairs, and Landsat30-AU-VQA, comprising 17,725 human-verified visual question answering (VQA) samples across eight remote sensing domains. Both datasets are curated through a bootstrapped pipeline that leverages generic VLMs with iterative refinement and human verification to ensure quality. Our evaluation of eight VLMs on our benchmark reveals that off-the-shelf models struggle to understand satellite imagery. The open-source remote-sensing VLM EarthDial achieves only 0.07 SPIDEr in captioning and a VQA accuracy of 0.48, highlighting the limitations of current approaches. Encouragingly, lightweight fine-tuning of Qwen2.5-VL-7B on Landsat30-AU improves captioning performance from 0.11 to 0.31 SPIDEr and boosts VQA accuracy from \textbf{0.74} to 0.87. Code and data are available at https://github.com/papersubmit1/landsat30-au.

[141] COFFEE: A Shadow-Resilient Real-Time Pose Estimator for Unknown Tumbling Asteroids using Sparse Neural Networks

Arion Zimmermann, Soon-Jo Chung, Fred Hadaegh

Main category: cs.CV

TL;DR: COFFEE is a real-time pose estimation framework for asteroids, addressing biases from shadows and outperforming classical and deep learning methods in accuracy and speed.

DetailsMotivation: Accurate state estimation of unknown space bodies is crucial but challenged by shadows and computational constraints of existing methods.

Method: COFFEE uses prior sun phase angle data, detects shadow-invariant features, and employs a Sparse Neural Network with an attention-based Graph Neural Network for feature matching.

Result: The pipeline is bias-free, more accurate than classical methods, and significantly faster than deep learning alternatives.

Conclusion: COFFEE provides a robust, efficient solution for real-time pose estimation of asteroids, mitigating shadow-induced biases and computational inefficiencies.

Abstract: The accurate state estimation of unknown bodies in space is a critical challenge with applications ranging from the tracking of space debris to the shape estimation of small bodies. A necessary enabler to this capability is to find and track features on a continuous stream of images. Existing methods, such as SIFT, ORB and AKAZE, achieve real-time but inaccurate pose estimates, whereas modern deep learning methods yield higher quality features at the cost of more demanding computational resources which might not be available on space-qualified hardware. Additionally, both classical and data-driven methods are not robust to the highly opaque self-cast shadows on the object of interest. We show that, as the target body rotates, these shadows may lead to large biases in the resulting pose estimates. For these objects, a bias in the real-time pose estimation algorithm may mislead the spacecraft’s state estimator and cause a mission failure, especially if the body undergoes a chaotic tumbling motion. We present COFFEE, the Celestial Occlusion Fast FEature Extractor, a real-time pose estimation framework for asteroids designed to leverage prior information on the sun phase angle given by sun-tracking sensors commonly available onboard spacecraft. By associating salient contours to their projected shadows, a sparse set of features are detected, invariant to the motion of the shadows. A Sparse Neural Network followed by an attention-based Graph Neural Network feature matching model are then jointly trained to provide a set of correspondences between successive frames. The resulting pose estimation pipeline is found to be bias-free, more accurate than classical pose estimation pipelines and an order of magnitude faster than other state-of-the-art deep learning pipelines on synthetic data as well as on renderings of the tumbling asteroid Apophis.

[142] Uint: Building Uint Detection Dataset

Haozhou Zhai, Yanzhe Gao, Tianjiang Hu

Main category: cs.CV

TL;DR: A synthetic fire scene dataset for building units, created using drone-captured images and enhancement techniques, improves generalization in fire detection tasks.

DetailsMotivation: Addressing the shortage of annotated fire-related data for building units to enhance fire early warning and rescue operations.

Method: Combines real multi-story scenes, motion blur, brightness adjustment, and large models to generate fire effects, simulating drone conditions.

Result: A dataset of 1,978 images covering diverse building scenarios, reducing risks and costs of real data collection.

Conclusion: The dataset enhances fire unit detection and is scalable, available at https://github.com/boilermakerr/FireUnitData.

Abstract: Fire scene datasets are crucial for training robust computer vision models, particularly in tasks such as fire early warning and emergency rescue operations. However, among the currently available fire-related data, there is a significant shortage of annotated data specifically targeting building units.To tackle this issue, we introduce an annotated dataset of building units captured by drones, which incorporates multiple enhancement techniques. We construct backgrounds using real multi-story scenes, combine motion blur and brightness adjustment to enhance the authenticity of the captured images, simulate drone shooting conditions under various circumstances, and employ large models to generate fire effects at different locations.The synthetic dataset generated by this method encompasses a wide range of building scenarios, with a total of 1,978 images. This dataset can effectively improve the generalization ability of fire unit detection, providing multi-scenario and scalable data while reducing the risks and costs associated with collecting real fire data. The dataset is available at https://github.com/boilermakerr/FireUnitData.

[143] UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying

Chengyu Bai, Jintao Chen, Xiang Bai, Yilong Chen, Qi She, Ming Lu, Shanghang Zhang

Main category: cs.CV

TL;DR: The paper introduces UniEdit-I, a training-free framework for enabling image editing in unified vision-language models (VLMs) through iterative understanding, editing, and verifying steps.

DetailsMotivation: Existing unified VLMs lack easy image editing capabilities despite their strong understanding and generation abilities. The paper aims to bridge this gap.

Method: UniEdit-I uses three iterative steps: understanding (semantic analysis and prompt modification), editing (time-adaptive offset for coherent edits), and verifying (alignment checks and feedback).

Result: The method achieves SOTA performance on the GEdit-Bench benchmark using BLIP3-o.

Conclusion: UniEdit-I successfully enables high-fidelity image editing in unified VLMs without additional training, showcasing its potential for future VLM development.

Abstract: In recent years, unified vision-language models (VLMs) have rapidly advanced, effectively tackling both visual understanding and generation tasks within a single design. While many unified VLMs have explored various design choices, the recent hypothesis from OpenAI’s GPT-4o suggests a promising generation pipeline: Understanding VLM->Visual Feature->Projector->Diffusion Model->Image. The understanding VLM is frozen, and only the generation-related modules are trained. This pipeline maintains the strong capability of understanding VLM while enabling the image generation ability of the unified VLM. Although this pipeline has shown very promising potential for the future development of unified VLM, how to easily enable image editing capability is still unexplored. In this paper, we introduce a novel training-free framework named UniEdit-I to enable the unified VLM with image editing capability via three iterative steps: understanding, editing, and verifying. 1. The understanding step analyzes the source image to create a source prompt through structured semantic analysis and makes minimal word replacements to form the target prompt based on the editing instruction. 2. The editing step introduces a time-adaptive offset, allowing for coherent editing from coarse to fine throughout the denoising process. 3. The verification step checks the alignment between the target prompt and the intermediate edited image, provides automatic consistency scores and corrective feedback, and determines whether to stop early or continue the editing loop. This understanding, editing, and verifying loop iterates until convergence, delivering high-fidelity editing in a training-free manner. We implemented our method based on the latest BLIP3-o and achieved state-of-the-art (SOTA) performance on the GEdit-Bench benchmark.

[144] SARD: Segmentation-Aware Anomaly Synthesis via Region-Constrained Diffusion with Discriminative Mask Guidance

Yanshu Wang, Xichen Xu, Xiaoning Lei, Guoyang Xie

Main category: cs.CV

TL;DR: SARD is a diffusion-based framework for anomaly synthesis, improving spatial controllability and regional fidelity by freezing backgrounds and selectively updating anomaly regions, outperforming existing methods.

DetailsMotivation: Enhancing robustness of industrial anomaly detection systems by synthesizing realistic and spatially precise anomalies, addressing limitations of current diffusion-based methods.

Method: Proposes SARD with Region-Constrained Diffusion (RCD) to preserve backgrounds and update anomaly regions, and Discriminative Mask Guidance (DMG) for joint evaluation of global and local fidelity.

Result: Outperforms existing methods on MVTec-AD and BTAD datasets in segmentation accuracy and visual quality.

Conclusion: SARD sets a new state-of-the-art for pixel-level anomaly synthesis, improving spatial controllability and fidelity.

Abstract: Synthesizing realistic and spatially precise anomalies is essential for enhancing the robustness of industrial anomaly detection systems. While recent diffusion-based methods have demonstrated strong capabilities in modeling complex defect patterns, they often struggle with spatial controllability and fail to maintain fine-grained regional fidelity. To overcome these limitations, we propose SARD (Segmentation-Aware anomaly synthesis via Region-constrained Diffusion with discriminative mask Guidance), a novel diffusion-based framework specifically designed for anomaly generation. Our approach introduces a Region-Constrained Diffusion (RCD) process that preserves the background by freezing it and selectively updating only the foreground anomaly regions during the reverse denoising phase, thereby effectively reducing background artifacts. Additionally, we incorporate a Discriminative Mask Guidance (DMG) module into the discriminator, enabling joint evaluation of both global realism and local anomaly fidelity, guided by pixel-level masks. Extensive experiments on the MVTec-AD and BTAD datasets show that SARD surpasses existing methods in segmentation accuracy and visual quality, setting a new state-of-the-art for pixel-level anomaly synthesis.

[145] LORE: Latent Optimization for Precise Semantic Control in Rectified Flow-based Image Editing

Liangyang Ouyang, Jiafeng Mao

Main category: cs.CV

TL;DR: LORE is a training-free image editing method that optimizes inverted noise to address semantic bias in text-driven editing, outperforming baselines in quality and alignment.

DetailsMotivation: The structural limitation in inversion-based editing methods causes semantic bias toward the source concept, suppressing attention to the target concept, especially for dissimilar semantics.

Method: LORE directly optimizes the inverted noise without architectural changes or fine-tuning, improving generalization and controllability.

Result: LORE outperforms baselines on PIEBench, SmartEdit, and GapEdit in semantic alignment, image quality, and background fidelity.

Conclusion: Latent-space optimization via LORE enables stable, controllable, and general-purpose concept replacement in text-driven image editing.

Abstract: Text-driven image editing enables users to flexibly modify visual content through natural language instructions, and is widely applied to tasks such as semantic object replacement, insertion, and removal. While recent inversion-based editing methods using rectified flow models have achieved promising results in image quality, we identify a structural limitation in their editing behavior: the semantic bias toward the source concept encoded in the inverted noise tends to suppress attention to the target concept. This issue becomes particularly critical when the source and target semantics are dissimilar, where the attention mechanism inherently leads to editing failure or unintended modifications in non-target regions. In this paper, we systematically analyze and validate this structural flaw, and introduce LORE, a training-free and efficient image editing method. LORE directly optimizes the inverted noise, addressing the core limitations in generalization and controllability of existing approaches, enabling stable, controllable, and general-purpose concept replacement, without requiring architectural modification or model fine-tuning. We conduct comprehensive evaluations on three challenging benchmarks: PIEBench, SmartEdit, and GapEdit. Experimental results show that LORE significantly outperforms strong baselines in terms of semantic alignment, image quality, and background fidelity, demonstrating the effectiveness and scalability of latent-space optimization for general-purpose image editing.

[146] ChartCap: Mitigating Hallucination of Dense Chart Captioning

Junyoung Lim, Jaewoo Ahn, Gunhee Kim

Main category: cs.CV

TL;DR: ChartCap is a new dataset of 565K real-world chart images with dense captions, designed to improve caption accuracy and reduce hallucinations in vision language models.

DetailsMotivation: Existing datasets for chart captions suffer from extraneous information and lack structural detail, making accurate caption generation difficult.

Method: A four-stage pipeline generates captions from discernible chart data, with cycle consistency-based human verification for quality control. A new metric, Visual Consistency Score, evaluates caption quality.

Result: Models fine-tuned on ChartCap produce more accurate, informative captions with fewer hallucinations, outperforming other models and human annotations.

Conclusion: ChartCap addresses dataset limitations and improves caption quality, demonstrating its effectiveness through superior model performance.

Abstract: Generating accurate, informative, and hallucination-free captions for charts remains challenging for vision language models, primarily due to the lack of large-scale, high-quality datasets of real-world charts. However, existing real-world chart datasets suffer from the inclusion of extraneous information that cannot be inferred from the chart and failure to sufficiently capture structural elements and key insights. Therefore, we introduce ChartCap, a large-scale dataset of 565K real-world chart images paired with type-specific, dense captions that exclude extraneous information and highlight both structural elements and key insights in detail. To build ChartCap, we design a four-stage pipeline that generates captions using only the discernible data from the chart and employ a cycle consistency-based human verification, which accelerates quality control without sacrificing accuracy. Additionally, we propose a novel metric, the Visual Consistency Score, which evaluates caption quality by measuring the similarity between the chart regenerated from a caption and the original chart, independent of reference captions. Extensive experiments confirms that models fine-tuned on ChartCap consistently generate more accurate and informative captions with reduced hallucinations, surpassing both open-source and proprietary models and even human-annotated captions.

[147] SAVER: Mitigating Hallucinations in Large Vision-Language Models via Style-Aware Visual Early Revision

Zhaoxu Li, Chenqi Kong, Yi Yu, Qiangqiang Wu, Xinghao Jiang, Ngai-Man Cheung, Bihan Wen, Alex Kot, Xudong Jiang

Main category: cs.CV

TL;DR: The paper addresses hallucination issues in Large Vision-Language Models (LVLMs) caused by stylized images, proposing a novel mechanism (SAVER) to mitigate them.

DetailsMotivation: Hallucination in LVLMs limits real-world applicability, especially with stylized images in critical scenarios like game scenes, art education, and medical analysis.

Method: Constructed a dataset of photographic and stylized images, benchmarked 13 LVLMs, and proposed SAVER, a mechanism adjusting outputs based on visual attention patterns.

Result: Stylized images induce more hallucinations than photographic ones; SAVER achieves state-of-the-art hallucination mitigation.

Conclusion: SAVER effectively reduces hallucinations in LVLMs for stylized images, enhancing their reliability in diverse applications.

Abstract: Large Vision-Language Models (LVLMs) recently achieve significant breakthroughs in understanding complex visual-textual contexts. However, hallucination issues still limit their real-world applicability. Although previous mitigation methods effectively reduce hallucinations in photographic images, they largely overlook the potential risks posed by stylized images, which play crucial roles in critical scenarios such as game scene understanding, art education, and medical analysis. In this work, we first construct a dataset comprising photographic images and their corresponding stylized versions with carefully annotated caption labels. We then conduct head-to-head comparisons on both discriminative and generative tasks by benchmarking 13 advanced LVLMs on the collected datasets. Our findings reveal that stylized images tend to induce significantly more hallucinations than their photographic counterparts. To address this issue, we propose Style-Aware Visual Early Revision SAVER, a novel mechanism that dynamically adjusts LVLMs’ final outputs based on the token-level visual attention patterns, leveraging early-layer feedback to mitigate hallucinations caused by stylized images. Extensive experiments demonstrate that SAVER achieves state-of-the-art performance in hallucination mitigation across various models, datasets, and tasks.

[148] Advancing Precision in Multi-Point Cloud Fusion Environments

Ulugbek Alibekov, Vanessa Staderini, Philipp Schneider, Doris Antensteiner

Main category: cs.CV

TL;DR: A study on visual industrial inspection using point clouds, introducing a synthetic dataset, distance metrics, and a CloudCompare plugin for improved accuracy and efficiency.

DetailsMotivation: To enhance automated industrial inspection systems by improving point cloud evaluation and matching methods.

Method: Evaluates point clouds and multi-point cloud matching, introduces a synthetic dataset, and develops a CloudCompare plugin for merging and visualizing defects.

Result: Improved accuracy and efficiency in automated inspection systems through the proposed methods and tools.

Conclusion: The research advances industrial inspection by providing robust tools and datasets for point cloud analysis.

Abstract: This research focuses on visual industrial inspection by evaluating point clouds and multi-point cloud matching methods. We also introduce a synthetic dataset for quantitative evaluation of registration method and various distance metrics for point cloud comparison. Additionally, we present a novel CloudCompare plugin for merging multiple point clouds and visualizing surface defects, enhancing the accuracy and efficiency of automated inspection systems.

[149] Duplex-GS: Proxy-Guided Weighted Blending for Real-Time Order-Independent Gaussian Splatting

Weihang Liu, Yuke Li, Yuxuan Li, Jingyi Yu, Xin Lou

Main category: cs.CV

TL;DR: Duplex-GS introduces a dual-hierarchy framework for 3D Gaussian Splatting, combining proxy Gaussian representations with order-independent rendering to improve efficiency and quality.

DetailsMotivation: Current 3DGS methods rely on costly sequential alpha-blending, causing overhead on resource-limited platforms.

Method: Uses proxy Gaussians, cell proxies for local management, and cell search rasterization. Integrates with OIT for weighted sum rendering to eliminate artifacts.

Result: Achieves 1.5-4x speedup over OIT-based methods, reduces radix sort overhead by 52.2%-86.9%, and maintains high-quality rendering.

Conclusion: Duplex-GS validates the OIT paradigm in 3DGS, offering efficiency and quality improvements without degradation.

Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated remarkable rendering fidelity and efficiency. However, these methods still rely on computationally expensive sequential alpha-blending operations, resulting in significant overhead, particularly on resource-constrained platforms. In this paper, we propose Duplex-GS, a dual-hierarchy framework that integrates proxy Gaussian representations with order-independent rendering techniques to achieve photorealistic results while sustaining real-time performance. To mitigate the overhead caused by view-adaptive radix sort, we introduce cell proxies for local Gaussians management and propose cell search rasterization for further acceleration. By seamlessly combining our framework with Order-Independent Transparency (OIT), we develop a physically inspired weighted sum rendering technique that simultaneously eliminates “popping” and “transparency” artifacts, yielding substantial improvements in both accuracy and efficiency. Extensive experiments on a variety of real-world datasets demonstrate the robustness of our method across diverse scenarios, including multi-scale training views and large-scale environments. Our results validate the advantages of the OIT rendering paradigm in Gaussian Splatting, achieving high-quality rendering with an impressive 1.5 to 4 speedup over existing OIT based Gaussian Splatting approaches and 52.2% to 86.9% reduction of the radix sort overhead without quality degradation.

[150] Monocular Depth Estimation with Global-Aware Discretization and Local Context Modeling

Heng Wu, Qian Zhang, Guixu Zhang

Main category: cs.CV

TL;DR: A novel monocular depth estimation method using local and global cues with GLKAM and GBPM modules achieves competitive results on NYU-V2 and KITTI datasets.

DetailsMotivation: Addressing the ambiguity in monocular depth estimation due to the ill-posed nature of recovering 3D from 2D.

Method: Combines Gated Large Kernel Attention Module (GLKAM) for local multi-scale info and Global Bin Prediction Module (GBPM) for global depth distribution.

Result: Outperforms existing methods on NYU-V2 and KITTI datasets.

Conclusion: The proposed modules effectively improve depth estimation accuracy.

Abstract: Accurate monocular depth estimation remains a challenging problem due to the inherent ambiguity that stems from the ill-posed nature of recovering 3D structure from a single view, where multiple plausible depth configurations can produce identical 2D projections. In this paper, we present a novel depth estimation method that combines both local and global cues to improve prediction accuracy. Specifically, we propose the Gated Large Kernel Attention Module (GLKAM) to effectively capture multi-scale local structural information by leveraging large kernel convolutions with a gated mechanism. To further enhance the global perception of the network, we introduce the Global Bin Prediction Module (GBPM), which estimates the global distribution of depth bins and provides structural guidance for depth regression. Extensive experiments on the NYU-V2 and KITTI dataset demonstrate that our method achieves competitive performance and outperforms existing approaches, validating the effectiveness of each proposed component.

[151] Unifying Locality of KANs and Feature Drift Compensation for Data-free Continual Face Forgery Detection

Tianshuo Zhang, Siran Peng, Li Gao, Haoyuan Zhang, Xiangyu Zhu, Zhen Lei

Main category: cs.CV

TL;DR: The paper proposes a KAN-based framework (KAN-CFD) for continual face forgery detection, addressing catastrophic forgetting with a Domain-Group KAN Detector and a data-free replay strategy.

DetailsMotivation: Face forgery detectors degrade on old tasks when learning new ones (catastrophic forgetting). KANs' local plasticity suits continual learning but struggles with high-dimensional images and feature overlap.

Method: Introduces KAN-CFD with DG-KD (adapts KANs for images) and FS-KDCP (prevents input space overlap without prior task data).

Result: The method achieves superior performance and reduces forgetting.

Conclusion: KAN-CFD effectively addresses catastrophic forgetting in continual face forgery detection.

Abstract: The rapid advancements in face forgery techniques necessitate that detectors continuously adapt to new forgery methods, thus situating face forgery detection within a continual learning paradigm. However, when detectors learn new forgery types, their performance on previous types often degrades rapidly, a phenomenon known as catastrophic forgetting. Kolmogorov-Arnold Networks (KANs) utilize locally plastic splines as their activation functions, enabling them to learn new tasks by modifying only local regions of the functions while leaving other areas unaffected. Therefore, they are naturally suitable for addressing catastrophic forgetting. However, KANs have two significant limitations: 1) the splines are ineffective for modeling high-dimensional images, while alternative activation functions that are suitable for images lack the essential property of locality; 2) in continual learning, when features from different domains overlap, the mapping of different domains to distinct curve regions always collapses due to repeated modifications of the same regions. In this paper, we propose a KAN-based Continual Face Forgery Detection (KAN-CFD) framework, which includes a Domain-Group KAN Detector (DG-KD) and a data-free replay Feature Separation strategy via KAN Drift Compensation Projection (FS-KDCP). DG-KD enables KANs to fit high-dimensional image inputs while preserving locality and local plasticity. FS-KDCP avoids the overlap of the KAN input spaces without using data from prior tasks. Experimental results demonstrate that the proposed method achieves superior performance while notably reducing forgetting.

[152] Neovascularization Segmentation via a Multilateral Interaction-Enhanced Graph Convolutional Network

Tao Chen, Dan Zhang, Da Chen, Huazhu Fu, Kai Jin, Shanshan Wang, Laurent D. Cohen, Yitian Zhao, Quanyong Yi, Jiong Zhang

Main category: cs.CV

TL;DR: The paper introduces MTG-Net, a novel network for segmenting CNV regions and vessels in OCTA images, addressing challenges like irregular shapes and imaging artifacts. It also releases the first public CNV dataset (CNVSeg).

DetailsMotivation: Accurate CNV segmentation in OCTA images is crucial for wet AMD assessment, but existing methods face challenges due to irregular shapes, artifacts, and lack of public datasets.

Method: MTG-Net integrates region and vessel morphological information using a multi-task framework and graph-based modules (MIGR and MRGR) for cross-task reasoning, along with an uncertainty-weighted loss.

Result: MTG-Net achieves a Dice score of 87.21% for region segmentation and 88.12% for vessel segmentation, outperforming existing methods.

Conclusion: The proposed MTG-Net and CNVSeg dataset advance CNV segmentation, offering improved accuracy and addressing key challenges in wet AMD diagnosis.

Abstract: Choroidal neovascularization (CNV), a primary characteristic of wet age-related macular degeneration (wet AMD), represents a leading cause of blindness worldwide. In clinical practice, optical coherence tomography angiography (OCTA) is commonly used for studying CNV-related pathological changes, due to its micron-level resolution and non-invasive nature. Thus, accurate segmentation of CNV regions and vessels in OCTA images is crucial for clinical assessment of wet AMD. However, challenges existed due to irregular CNV shapes and imaging limitations like projection artifacts, noises and boundary blurring. Moreover, the lack of publicly available datasets constraints the CNV analysis. To address these challenges, this paper constructs the first publicly accessible CNV dataset (CNVSeg), and proposes a novel multilateral graph convolutional interaction-enhanced CNV segmentation network (MTG-Net). This network integrates both region and vessel morphological information, exploring semantic and geometric duality constraints within the graph domain. Specifically, MTG-Net consists of a multi-task framework and two graph-based cross-task modules: Multilateral Interaction Graph Reasoning (MIGR) and Multilateral Reinforcement Graph Reasoning (MRGR). The multi-task framework encodes rich geometric features of lesion shapes and surfaces, decoupling the image into three task-specific feature maps. MIGR and MRGR iteratively reason about higher-order relationships across tasks through a graph mechanism, enabling complementary optimization for task-specific objectives. Additionally, an uncertainty-weighted loss is proposed to mitigate the impact of artifacts and noise on segmentation accuracy. Experimental results demonstrate that MTG-Net outperforms existing methods, achieving a Dice socre of 87.21% for region segmentation and 88.12% for vessel segmentation.

[153] AlignCAT: Visual-Linguistic Alignment of Category and Attributefor Weakly Supervised Visual Grounding

Yidan Wang, Chenyi Zhuang, Wutao Liu, Pan Gao, Nicu Sebe

Main category: cs.CV

TL;DR: AlignCAT is a novel framework for weakly supervised visual grounding (VG) that improves cross-modal reasoning by combining coarse-grained and fine-grained alignment modules to address category and attribute ambiguities.

DetailsMotivation: Existing weakly supervised VG methods struggle with subtle semantic differences in text expressions due to category and attribute ambiguities.

Method: AlignCAT uses a coarse-grained alignment module for category consistency and a fine-grained alignment module for attribute consistency, leveraging linguistic cues to filter misaligned queries and enhance contrastive learning.

Result: AlignCAT outperforms existing methods on RefCOCO, RefCOCO+, and RefCOCOg benchmarks for weakly supervised VG tasks.

Conclusion: AlignCAT effectively addresses ambiguities in weakly supervised VG by improving visual-linguistic alignment and demonstrates superior performance on standard benchmarks.

Abstract: Weakly supervised visual grounding (VG) aims to locate objects in images based on text descriptions. Despite significant progress, existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions due to category-based and attribute-based ambiguity. To address these challenges, we introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG. To enhance visual-linguistic alignment, we propose a coarse-grained alignment module that utilizes category information and global context, effectively mitigating interference from category-inconsistent objects. Subsequently, a fine-grained alignment module leverages descriptive information and captures word-level text features to achieve attribute consistency. By exploiting linguistic cues to their fullest extent, our proposed AlignCAT progressively filters out misaligned visual queries and enhances contrastive learning efficiency. Extensive experiments on three VG benchmarks, namely RefCOCO, RefCOCO+, and RefCOCOg, verify the superiority of AlignCAT against existing weakly supervised methods on two VG tasks. Our code is available at: https://github.com/I2-Multimedia-Lab/AlignCAT.

[154] Open-Vocabulary HOI Detection with Interaction-aware Prompt and Concept Calibration

Ting Lei, Shaofeng Yin, Qingchao Chen, Yuxin Peng, Yang Liu

Main category: cs.CV

TL;DR: INP-CC improves open-vocabulary HOI detection by integrating interaction-aware prompts and concept calibration, outperforming state-of-the-art models.

DetailsMotivation: Current HOI detection methods struggle with fine-grained region-level interaction detection and encoding textual descriptions of visual appearances.

Method: Proposes INP-CC, featuring an interaction-aware prompt generator and language model-guided concept calibration with negative sampling.

Result: Significantly outperforms state-of-the-art models on SWIG-HOI and HICO-DET datasets.

Conclusion: INP-CC enhances HOI detection by focusing on key interaction patterns and refining concept representations.

Abstract: Open Vocabulary Human-Object Interaction (HOI) detection aims to detect interactions between humans and objects while generalizing to novel interaction classes beyond the training set. Current methods often rely on Vision and Language Models (VLMs) but face challenges due to suboptimal image encoders, as image-level pre-training does not align well with the fine-grained region-level interaction detection required for HOI. Additionally, effectively encoding textual descriptions of visual appearances remains difficult, limiting the model’s ability to capture detailed HOI relationships. To address these issues, we propose INteraction-aware Prompting with Concept Calibration (INP-CC), an end-to-end open-vocabulary HOI detector that integrates interaction-aware prompts and concept calibration. Specifically, we propose an interaction-aware prompt generator that dynamically generates a compact set of prompts based on the input scene, enabling selective sharing among similar interactions. This approach directs the model’s attention to key interaction patterns rather than generic image-level semantics, enhancing HOI detection. Furthermore, we refine HOI concept representations through language model-guided calibration, which helps distinguish diverse HOI concepts by investigating visual similarities across categories. A negative sampling strategy is also employed to improve inter-modal similarity modeling, enabling the model to better differentiate visually similar but semantically distinct actions. Extensive experimental results demonstrate that INP-CC significantly outperforms state-of-the-art models on the SWIG-HOI and HICO-DET datasets. Code is available at https://github.com/ltttpku/INP-CC.

[155] GeoShield: Safeguarding Geolocation Privacy from Vision-Language Models via Adversarial Perturbations

Xinwei Liu, Xiaojun Jia, Yuan Xun, Simeng Qin, Xiaochun Cao

Main category: cs.CV

TL;DR: GeoShield is a novel adversarial framework designed to protect geoprivacy by disrupting VLMs’ ability to infer locations from images, outperforming prior methods.

DetailsMotivation: The rise of VLMs like GPT-4o poses significant risks to geoprivacy by inferring locations from shared images, necessitating robust defense mechanisms.

Method: GeoShield uses feature disentanglement, exposure element identification, and scale-adaptive enhancement to optimize adversarial perturbations for privacy protection.

Result: GeoShield outperforms existing methods in black-box settings, providing strong privacy protection with minimal impact on image quality.

Conclusion: GeoShield is the first practical solution to defend against geolocation inference by VLMs, addressing critical privacy concerns.

Abstract: Vision-Language Models (VLMs) such as GPT-4o now demonstrate a remarkable ability to infer users’ locations from public shared images, posing a substantial risk to geoprivacy. Although adversarial perturbations offer a potential defense, current methods are ill-suited for this scenario: they often perform poorly on high-resolution images and low perturbation budgets, and may introduce irrelevant semantic content. To address these limitations, we propose GeoShield, a novel adversarial framework designed for robust geoprivacy protection in real-world scenarios. GeoShield comprises three key modules: a feature disentanglement module that separates geographical and non-geographical information, an exposure element identification module that pinpoints geo-revealing regions within an image, and a scale-adaptive enhancement module that jointly optimizes perturbations at both global and local levels to ensure effectiveness across resolutions. Extensive experiments on challenging benchmarks show that GeoShield consistently surpasses prior methods in black-box settings, achieving strong privacy protection with minimal impact on visual or semantic quality. To our knowledge, this work is the first to explore adversarial perturbations for defending against geolocation inference by advanced VLMs, providing a practical and effective solution to escalating privacy concerns.

[156] The Power of Many: Synergistic Unification of Diverse Augmentations for Efficient Adversarial Robustness

Wang Yu-Hang, Shiwei Li, Jianxiang Liao, Li Bohan, Jian Liu, Wenfei Yin

Main category: cs.CV

TL;DR: The paper proposes the Universal Adversarial Augmenter (UAA), a plug-and-play framework for efficient adversarial defense by leveraging diverse augmentation strategies without online adversarial example generation.

DetailsMotivation: Adversarial Training (AT) is costly and degrades standard performance, while existing data augmentation methods offer limited robustness or high overhead. A need exists for an efficient, robust defense mechanism.

Method: UAA pre-computes universal transformations offline, decoupling perturbation generation from training, and efficiently generates unique adversarial perturbations during training.

Result: UAA achieves state-of-the-art robustness on multiple benchmarks without online adversarial example generation, proving its efficiency and effectiveness.

Conclusion: UAA provides a practical, efficient solution for robust model training, advancing data-augmentation-based adversarial defenses.

Abstract: Adversarial perturbations pose a significant threat to deep learning models. Adversarial Training (AT), the predominant defense method, faces challenges of high computational costs and a degradation in standard performance. While data augmentation offers an alternative path, existing techniques either yield limited robustness gains or incur substantial training overhead. Therefore, developing a defense mechanism that is both highly efficient and strongly robust is of paramount importance.In this work, we first conduct a systematic analysis of existing augmentation techniques, revealing that the synergy among diverse strategies – rather than any single method – is crucial for enhancing robustness. Based on this insight, we propose the Universal Adversarial Augmenter (UAA) framework, which is characterized by its plug-and-play nature and training efficiency. UAA decouples the expensive perturbation generation process from model training by pre-computing a universal transformation offline, which is then used to efficiently generate unique adversarial perturbations for each sample during training.Extensive experiments conducted on multiple benchmarks validate the effectiveness of UAA. The results demonstrate that UAA establishes a new state-of-the-art (SOTA) for data-augmentation-based adversarial defense strategies , without requiring the online generation of adversarial examples during training. This framework provides a practical and efficient pathway for building robust models,Our code is available in the supplementary materials.

[157] ActionSink: Toward Precise Robot Manipulation with Dynamic Integration of Action Flow

Shanshan Guo, Xiwen Liang, Junfan Lin, Yuzheng Zhuang, Liang Lin, Xiaodan Liang

Main category: cs.CV

TL;DR: ActionSink, a novel robot manipulation framework, improves low-level action precision by reformulating actions as self-supervised ‘action flows’ from videos, outperforming prior methods by 7.9% on LIBERO.

DetailsMotivation: Low precision in low-level action estimation limits robot manipulation performance, despite progress in high-level perception and planning.

Method: ActionSink uses action-caused optical flows (‘action flows’) in a self-supervised manner, with a coarse-to-fine matcher and dynamic integrator for precise estimation.

Result: Achieved 7.9% higher success rate on LIBERO and 8% gain on LIBERO-Long.

Conclusion: ActionSink effectively enhances action estimation, advancing learning-based robot manipulation.

Abstract: Language-instructed robot manipulation has garnered significant interest due to the potential of learning from collected data. While the challenges in high-level perception and planning are continually addressed along the progress of general large pre-trained models, the low precision of low-level action estimation has emerged as the key limiting factor in manipulation performance. To this end, this paper introduces a novel robot manipulation framework, i.e., ActionSink, to pave the way toward precise action estimations in the field of learning-based robot manipulation. As the name suggests, ActionSink reformulates the actions of robots as action-caused optical flows from videos, called “action flow”, in a self-supervised manner, which are then used to be retrieved and integrated to enhance the action estimation. Specifically, ActionSink incorporates two primary modules. The first module is a coarse-to-fine action flow matcher, which continuously refines the accuracy of action flow via iterative retrieval and denoising process. The second module is a dynamic action flow integrator, which employs a working memory pool that dynamically and efficiently manages the historical action flows that should be used to integrate to enhance the current action estimation. In this module, a multi-layer fusion module is proposed to integrate direct estimation and action flows from both the current and the working memory, achieving highly accurate action estimation through a series of estimation-integration processes. Our ActionSink framework outperformed prior SOTA on the LIBERO benchmark by a 7.9% success rate, and obtained nearly an 8% accuracy gain on the challenging long-horizon visual task LIBERO-Long.

[158] FastInit: Fast Noise Initialization for Temporally Consistent Video Generation

Chengyu Bai, Yuming Li, Zhongyu Zhao, Jintao Chen, Peidong Jia, Qi She, Ming Lu, Shanghang Zhang

Main category: cs.CV

TL;DR: FastInit introduces a fast noise initialization method (VNPNet) to replace iterative refinement, improving video generation efficiency and temporal consistency.

DetailsMotivation: Addressing the computational cost and temporal inconsistency in video generation caused by iterative refinement methods like FreeInit.

Method: FastInit uses a Video Noise Prediction Network (VNPNet) to generate refined noise in one forward pass, trained on a dataset of text prompts, random noise, and refined noise pairs.

Result: FastInit enhances video quality and temporal consistency across frames while reducing computational costs.

Conclusion: FastInit offers a practical, efficient solution for video generation, with plans to release code and dataset.

Abstract: Video generation has made significant strides with the development of diffusion models; however, achieving high temporal consistency remains a challenging task. Recently, FreeInit identified a training-inference gap and introduced a method to iteratively refine the initial noise during inference. However, iterative refinement significantly increases the computational cost associated with video generation. In this paper, we introduce FastInit, a fast noise initialization method that eliminates the need for iterative refinement. FastInit learns a Video Noise Prediction Network (VNPNet) that takes random noise and a text prompt as input, generating refined noise in a single forward pass. Therefore, FastInit greatly enhances the efficiency of video generation while achieving high temporal consistency across frames. To train the VNPNet, we create a large-scale dataset consisting of pairs of text prompts, random noise, and refined noise. Extensive experiments with various text-to-video models show that our method consistently improves the quality and temporal consistency of the generated videos. FastInit not only provides a substantial improvement in video generation but also offers a practical solution that can be applied directly during inference. The code and dataset will be released.

[159] VLMQ: Efficient Post-Training Quantization for Large Vision-Language Models via Hessian Augmentation

Yufei Xue, Yushi Huang, Jiawei Shao, Jun Zhang

Main category: cs.CV

TL;DR: VLMQ is a novel importance-aware PTQ framework for VLMs, addressing modality discrepancy and achieving SOTA performance under low-bit quantization.

DetailsMotivation: Existing Hessian-based PTQ methods for LLMs treat all tokens equally, leading to severe performance drops in VLMs due to modality discrepancy (limited text tokens vs. redundant vision tokens).

Method: VLMQ optimizes an importance-aware objective for enhanced Hessian with token-level importance factors, uses a lightweight block-wise backward pass for efficiency, and retains parallelized weight updates.

Result: VLMQ achieves a 16.45% improvement on MME-RealWorld under 2-bit quantization and demonstrates SOTA performance across 8 benchmarks for VLMs ranging from 0.5B to 32B.

Conclusion: VLMQ effectively addresses the challenges of PTQ for VLMs, offering significant performance gains, especially in low-bit settings.

Abstract: Post-training quantization (PTQ) has emerged as an effective approach for compressing large models and accelerating their inference without retraining. While PTQ has been extensively studied in the context of large language models (LLMs), its applicability to vision-language models (VLMs) remains underexplored. In this paper, we identify a modality discrepancy (\emph{i.e.}, limited text tokens \emph{vs.} excessive and redundant vision tokens) of VLMs. However, existing Hessian-based LLM PTQ methods treat all tokens equally during quantization, resulting in severe performance drops when applied to VLMs. Motivated by this observation, we propose a novel importance-aware PTQ framework tailored for VLMs, dubbed VLMQ. Specifically, to address vision token redundancy, VLMQ 1) optimizes an importance-aware objective that yields an enhanced Hessian with token-level importance factors, while retaining compatibility with parallelized weight updates, and 2) ensures efficiency and effectiveness by computing these factors via a single lightweight block-wise backward pass, guided by a theoretical connection to token-level perturbations. Extensive evaluations on 8 benchmarks across 0.5B$\sim$32B VLMs demonstrate the state-of-the-art (SOTA) performance of our VLMQ, particularly under low-bit settings. For example, it achieves a substantial \textbf{16.45%} improvement on MME-RealWorld under 2-bit quantization.

[160] Trace3D: Consistent Segmentation Lifting via Gaussian Instance Tracing

Hongyu Shen, Junfeng Ni, Yixin Chen, Weishuo Li, Mingtao Pei, Siyuan Huang

Main category: cs.CV

TL;DR: Gaussian Instance Tracing (GIT) improves 3D segmentation in Gaussian Splatting by refining 2D masks and using adaptive density control for sharper boundaries.

DetailsMotivation: Existing methods for 2D-to-3D segmentation in Gaussian Splatting produce noisy results due to inconsistent 2D masks and lack of semantic refinement.

Method: GIT introduces an instance weight matrix to correct 2D inconsistencies and an adaptive density control mechanism to split/prune ambiguous Gaussians.

Result: The method achieves cleaner 3D assets and better segmentation in both online and offline settings.

Conclusion: GIT enables applications like hierarchical segmentation, object extraction, and scene editing by improving 3D segmentation coherence.

Abstract: We address the challenge of lifting 2D visual segmentation to 3D in Gaussian Splatting. Existing methods often suffer from inconsistent 2D masks across viewpoints and produce noisy segmentation boundaries as they neglect these semantic cues to refine the learned Gaussians. To overcome this, we introduce Gaussian Instance Tracing (GIT), which augments the standard Gaussian representation with an instance weight matrix across input views. Leveraging the inherent consistency of Gaussians in 3D, we use this matrix to identify and correct 2D segmentation inconsistencies. Furthermore, since each Gaussian ideally corresponds to a single object, we propose a GIT-guided adaptive density control mechanism to split and prune ambiguous Gaussians during training, resulting in sharper and more coherent 2D and 3D segmentation boundaries. Experimental results show that our method extracts clean 3D assets and consistently improves 3D segmentation in both online (e.g., self-prompting) and offline (e.g., contrastive lifting) settings, enabling applications such as hierarchical segmentation, object extraction, and scene editing.

[161] Draw Your Mind: Personalized Generation via Condition-Level Modeling in Text-to-Image Diffusion Models

Hyungjin Kim, Seokho Ahn, Young-Duk Seo

Main category: cs.CV

TL;DR: DrUM is a novel method for personalized T2I diffusion models, using condition-level modeling and a transformer-based adapter for better accuracy without fine-tuning.

DetailsMotivation: Existing methods rely on prompt-level modeling, leading to inaccurate personalization due to limited input token capacity.

Method: Integrates user profiling with a transformer-based adapter for condition-level modeling in latent space.

Result: Strong performance on large-scale datasets, compatible with open-source text encoders and foundation T2I models.

Conclusion: DrUM effectively addresses limitations of prompt-level modeling, enabling accurate personalized generation.

Abstract: Personalized generation in T2I diffusion models aims to naturally incorporate individual user preferences into the generation process with minimal user intervention. However, existing studies primarily rely on prompt-level modeling with large-scale models, often leading to inaccurate personalization due to the limited input token capacity of T2I diffusion models. To address these limitations, we propose DrUM, a novel method that integrates user profiling with a transformer-based adapter to enable personalized generation through condition-level modeling in the latent space. DrUM demonstrates strong performance on large-scale datasets and seamlessly integrates with open-source text encoders, making it compatible with widely used foundation T2I models without requiring additional fine-tuning.

[162] Zero-shot Shape Classification of Nanoparticles in SEM Images using Vision Foundation Models

Freida Barnatan, Emunah Goldstein, Einav Kalimian, Orchen Madar, Avi Huri, David Zitoun, Ya’akov Mandelbaum, Moshe Amitay

Main category: cs.CV

TL;DR: A zero-shot classification pipeline using SAM and DINOv2 achieves high-precision nanoparticle shape classification without extensive training or labeled data, outperforming traditional methods.

DetailsMotivation: To address the limitations of conventional deep learning methods, which require large labeled datasets and intensive training, for nanoparticle morphology analysis in SEM images.

Method: Combines Segment Anything Model (SAM) for segmentation and DINOv2 for feature embedding with a lightweight classifier, enabling zero-shot classification.

Result: Outperforms fine-tuned YOLOv11 and ChatGPT o4-mini-high baselines, showing robustness to small datasets and domain shifts.

Conclusion: Foundation models like SAM and DINOv2 offer an efficient, accessible alternative to traditional deep learning for nanoparticle image analysis.

Abstract: Accurate and efficient characterization of nanoparticle morphology in Scanning Electron Microscopy (SEM) images is critical for ensuring product quality in nanomaterial synthesis and accelerating development. However, conventional deep learning methods for shape classification require extensive labeled datasets and computationally demanding training, limiting their accessibility to the typical nanoparticle practitioner in research and industrial settings. In this study, we introduce a zero-shot classification pipeline that leverages two vision foundation models: the Segment Anything Model (SAM) for object segmentation and DINOv2 for feature embedding. By combining these models with a lightweight classifier, we achieve high-precision shape classification across three morphologically diverse nanoparticle datasets

  • without the need for extensive parameter fine-tuning. Our methodology outperforms a fine-tuned YOLOv11 and ChatGPT o4-mini-high baselines, demonstrating robustness to small datasets, subtle morphological variations, and domain shifts from natural to scientific imaging. Quantitative clustering metrics on PCA plots of the DINOv2 features are discussed as a means of assessing the progress of the chemical synthesis. This work highlights the potential of foundation models to advance automated microscopy image analysis, offering an alternative to traditional deep learning pipelines in nanoparticle research which is both more efficient and more accessible to the user.

[163] FFHQ-Makeup: Paired Synthetic Makeup Dataset with Facial Consistency Across Multiple Styles

Xingchao Yang, Shiori Ueda, Yuantian Huang, Tomoya Akiyama, Takafumi Taketomi

Main category: cs.CV

TL;DR: The paper introduces FFHQ-Makeup, a high-quality synthetic makeup dataset addressing the lack of paired bare-makeup images for beauty-related tasks.

DetailsMotivation: Existing paired makeup datasets are hard to collect or lack realism and consistency, limiting research in beauty-related applications.

Method: The authors propose an improved makeup transfer method to create 90K paired images (18K identities × 5 styles) from the FFHQ dataset, preserving identity and expression.

Result: FFHQ-Makeup offers a large-scale, high-quality dataset with consistent bare-makeup pairs, filling a gap in the field.

Conclusion: FFHQ-Makeup is a valuable resource for future research in beauty-related tasks, addressing the lack of realistic paired datasets.

Abstract: Paired bare-makeup facial images are essential for a wide range of beauty-related tasks, such as virtual try-on, facial privacy protection, and facial aesthetics analysis. However, collecting high-quality paired makeup datasets remains a significant challenge. Real-world data acquisition is constrained by the difficulty of collecting large-scale paired images, while existing synthetic approaches often suffer from limited realism or inconsistencies between bare and makeup images. Current synthetic methods typically fall into two categories: warping-based transformations, which often distort facial geometry and compromise the precision of makeup; and text-to-image generation, which tends to alter facial identity and expression, undermining consistency. In this work, we present FFHQ-Makeup, a high-quality synthetic makeup dataset that pairs each identity with multiple makeup styles while preserving facial consistency in both identity and expression. Built upon the diverse FFHQ dataset, our pipeline transfers real-world makeup styles from existing datasets onto 18K identities by introducing an improved makeup transfer method that disentangles identity and makeup. Each identity is paired with 5 different makeup styles, resulting in a total of 90K high-quality bare-makeup image pairs. To the best of our knowledge, this is the first work that focuses specifically on constructing a makeup dataset. We hope that FFHQ-Makeup fills the gap of lacking high-quality bare-makeup paired datasets and serves as a valuable resource for future research in beauty-related tasks.

[164] MVTOP: Multi-View Transformer-based Object Pose-Estimation

Lukas Ranftl, Felix Brendel, Bertram Drost, Carsten Steger

Main category: cs.CV

TL;DR: MVTOP is a transformer-based method for multi-view rigid object pose estimation, resolving ambiguities by fusing view-specific features early and modeling multi-view geometry via lines of sight. It outperforms single-view and existing multi-view methods.

DetailsMotivation: To address pose ambiguities in multi-view scenarios that single-view or post-processing methods cannot resolve, enabling versatile and accurate pose estimation.

Method: Uses early fusion of view-specific features and models multi-view geometry via lines of sight from camera centers. Assumes known camera parameters but allows variability per inference.

Result: Outperforms single-view and existing multi-view methods on a synthetic dataset and achieves competitive results on YCB-V.

Conclusion: MVTOP is a versatile, end-to-end trainable method that reliably resolves pose ambiguities without additional data like depth.

Abstract: We present MVTOP, a novel transformer-based method for multi-view rigid object pose estimation. Through an early fusion of the view-specific features, our method can resolve pose ambiguities that would be impossible to solve with a single view or with a post-processing of single-view poses. MVTOP models the multi-view geometry via lines of sight that emanate from the respective camera centers. While the method assumes the camera interior and relative orientations are known for a particular scene, they can vary for each inference. This makes the method versatile. The use of the lines of sight enables MVTOP to correctly predict the correct pose with the merged multi-view information. To show the model’s capabilities, we provide a synthetic data set that can only be solved with such holistic multi-view approaches since the poses in the dataset cannot be solved with just one view. Our method outperforms single-view and all existing multi-view approaches on our dataset and achieves competitive results on the YCB-V dataset. To the best of our knowledge, no holistic multi-view method exists that can resolve such pose ambiguities reliably. Our model is end-to-end trainable and does not require any additional data, e.g., depth.

[165] Ultralight Polarity-Split Neuromorphic SNN for Event-Stream Super-Resolution

Chuanzhi Xu, Haoxian Zhou, Langyi Chen, Yuk Ying Chung, Qiang Qu

Main category: cs.CV

TL;DR: Proposes an ultra-lightweight, stream-based event-to-event super-resolution method using Spiking Neural Networks (SNNs) for real-time deployment on resource-constrained devices, with novel encoding and loss strategies.

DetailsMotivation: Event cameras have high temporal resolution and low latency but limited spatial resolution, hindering fine-grained perception tasks.

Method: Uses SNNs with a Dual-Forward Polarity-Split Event Encoding and Learnable Spatio-temporal Polarity-aware Loss (LearnSTPLoss) to balance temporal, spatial, and polarity consistency.

Result: Achieves competitive super-resolution performance while reducing model size and inference time.

Conclusion: The lightweight design allows embedding into event cameras or use as preprocessing for downstream tasks.

Abstract: Event cameras offer unparalleled advantages such as high temporal resolution, low latency, and high dynamic range. However, their limited spatial resolution poses challenges for fine-grained perception tasks. In this work, we propose an ultra-lightweight, stream-based event-to-event super-resolution method based on Spiking Neural Networks (SNNs), designed for real-time deployment on resource-constrained devices. To further reduce model size, we introduce a novel Dual-Forward Polarity-Split Event Encoding strategy that decouples positive and negative events into separate forward paths through a shared SNN. Furthermore, we propose a Learnable Spatio-temporal Polarity-aware Loss (LearnSTPLoss) that adaptively balances temporal, spatial, and polarity consistency using learnable uncertainty-based weights. Experimental results demonstrate that our method achieves competitive super-resolution performance on multiple datasets while significantly reducing model size and inference time. The lightweight design enables embedding the module into event cameras or using it as an efficient front-end preprocessing for downstream vision tasks.

[166] Robust Single-Stage Fully Sparse 3D Object Detection via Detachable Latent Diffusion

Wentao Qu, Guofeng Mei, Jing Wang, Yujiao Wu, Xiaoshui Huang, Liang Xiao

Main category: cs.CV

TL;DR: RSDNet introduces a single-stage sparse 3D detection network with a detachable latent framework (DLF) for efficient and robust object detection using DDPMs.

DetailsMotivation: Existing methods rely on multi-step iterations, limiting efficiency. RSDNet aims to improve efficiency and robustness in 3D object detection.

Method: RSDNet uses lightweight denoising networks (DAEs) in latent feature spaces, reformulates DDPM mechanisms, and introduces semantic-geometric guidance for sparse detection.

Result: RSDNet achieves state-of-the-art detection performance on public benchmarks with single-step inference.

Conclusion: RSDNet offers a robust, efficient solution for 3D object detection, outperforming existing methods.

Abstract: Denoising Diffusion Probabilistic Models (DDPMs) have shown success in robust 3D object detection tasks. Existing methods often rely on the score matching from 3D boxes or pre-trained diffusion priors. However, they typically require multi-step iterations in inference, which limits efficiency. To address this, we propose a \textbf{R}obust single-stage fully \textbf{S}parse 3D object \textbf{D}etection \textbf{Net}work with a Detachable Latent Framework (DLF) of DDPMs, named RSDNet. Specifically, RSDNet learns the denoising process in latent feature spaces through lightweight denoising networks like multi-level denoising autoencoders (DAEs). This enables RSDNet to effectively understand scene distributions under multi-level perturbations, achieving robust and reliable detection. Meanwhile, we reformulate the noising and denoising mechanisms of DDPMs, enabling DLF to construct multi-type and multi-level noise samples and targets, enhancing RSDNet robustness to multiple perturbations. Furthermore, a semantic-geometric conditional guidance is introduced to perceive the object boundaries and shapes, alleviating the center feature missing problem in sparse representations, enabling RSDNet to perform in a fully sparse detection pipeline. Moreover, the detachable denoising network design of DLF enables RSDNet to perform single-step detection in inference, further enhancing detection efficiency. Extensive experiments on public benchmarks show that RSDNet can outperform existing methods, achieving state-of-the-art detection.

[167] Beyond Meme Templates: Limitations of Visual Similarity Measures in Meme Matching

Muzhaffar Hazman, Susan McKeever, Josephine Griffith

Main category: cs.CV

TL;DR: The paper addresses limitations in current meme-matching methods, which focus only on template-based memes, and proposes broader approaches, including segment-wise similarity and multimodal models.

DetailsMotivation: Existing meme-matching methods are limited to template-based memes, excluding non-template-based ones and hindering comprehensive meme analysis and linking to meme dictionaries.

Method: The study introduces a broader meme-matching formulation, evaluates conventional and segment-wise similarity measures, and explores a prompting-based approach using a multimodal large language model.

Result: Segment-wise similarity outperforms whole-image measures for non-template-based memes, but matching memes via shared visual elements remains challenging.

Conclusion: More sophisticated techniques are needed for accurate meme matching beyond template-based approaches.

Abstract: Internet memes, now a staple of digital communication, play a pivotal role in how users engage within online communities and allow researchers to gain insight into contemporary digital culture. These engaging user-generated content are characterised by their reuse of visual elements also found in other memes. Matching instances of memes via these shared visual elements, called Meme Matching, is the basis of a wealth of meme analysis approaches. However, most existing methods assume that every meme consists of a shared visual background, called a Template, with some overlaid text, thereby limiting meme matching to comparing the background image alone. Current approaches exclude the many memes that are not template-based and limit the effectiveness of automated meme analysis and would not be effective at linking memes to contemporary web-based meme dictionaries. In this work, we introduce a broader formulation of meme matching that extends beyond template matching. We show that conventional similarity measures, including a novel segment-wise computation of the similarity measures, excel at matching template-based memes but fall short when applied to non-template-based meme formats. However, the segment-wise approach was found to consistently outperform the whole-image measures on matching non-template-based memes. Finally, we explore a prompting-based approach using a pretrained Multimodal Large Language Model for meme matching. Our results highlight that accurately matching memes via shared visual elements, not just background templates, remains an open challenge that requires more sophisticated matching techniques.

[168] V.I.P. : Iterative Online Preference Distillation for Efficient Video Diffusion Models

Jisoo Kim, Wooseok Seo, Junwan Kim, Seungho Park, Sooyeon Park, Youngjae Yu

Main category: cs.CV

TL;DR: The paper proposes ReDPO, a distillation method combining DPO and SFT, and V.I.P., a dataset framework, to reduce computational costs in text-to-video models while maintaining performance.

DetailsMotivation: High computational costs of text-to-video models in resource-constrained environments necessitate efficient distillation methods without performance degradation.

Method: ReDPO integrates DPO and SFT for targeted property recovery, while V.I.P. filters high-quality datasets for calibrated training.

Result: Achieved 36.2% and 67.5% parameter reduction in VideoCrafter2 and AnimateDiff, respectively, with maintained or improved performance.

Conclusion: ReDPO and V.I.P. enable efficient, high-quality video generation, validated on leading T2V models.

Abstract: With growing interest in deploying text-to-video (T2V) models in resource-constrained environments, reducing their high computational cost has become crucial, leading to extensive research on pruning and knowledge distillation methods while maintaining performance. However, existing distillation methods primarily rely on supervised fine-tuning (SFT), which often leads to mode collapse as pruned models with reduced capacity fail to directly match the teacher’s outputs, ultimately resulting in degraded quality. To address this challenge, we propose an effective distillation method, ReDPO, that integrates DPO and SFT. Our approach leverages DPO to guide the student model to focus on recovering only the targeted properties, rather than passively imitating the teacher, while also utilizing SFT to enhance overall performance. We additionally propose V.I.P., a novel framework for filtering and curating high-quality pair datasets, along with a step-by-step online approach for calibrated training. We validate our method on two leading T2V models, VideoCrafter2 and AnimateDiff, achieving parameter reduction of 36.2% and 67.5% each, while maintaining or even surpassing the performance of full models. Further experiments demonstrate the effectiveness of both ReDPO and V.I.P. framework in enabling efficient and high-quality video generation. Our code and videos are available at https://jiiiisoo.github.io/VIP.github.io/.

[169] Beyond Isolated Words: Diffusion Brush for Handwritten Text-Line Generation

Gang Dai, Yifan Zhang, Yutao Qin, Qiangya Guo, Shuangping Huang, Shuicheng Yan

Main category: cs.CV

TL;DR: DiffBrush is a diffusion-based model for generating handwritten text lines, addressing style imitation and content accuracy through content-decoupled style learning and multi-scale content learning.

DetailsMotivation: Existing methods focus on isolated words, but realistic handwritten text requires modeling relationships between words (e.g., alignment, spacing). Generating entire text lines is more comprehensive but challenging.

Method: DiffBrush uses content-decoupled style learning (disentangling style from content via column- and row-wise masking) and multi-scale content learning (line and word discriminators for coherence and accuracy).

Result: DiffBrush outperforms in generating high-quality text lines, excelling in style reproduction and content preservation.

Conclusion: DiffBrush is a promising solution for handwritten text-line generation, balancing style and content effectively.

Abstract: Existing handwritten text generation methods primarily focus on isolated words. However, realistic handwritten text demands attention not only to individual words but also to the relationships between them, such as vertical alignment and horizontal spacing. Therefore, generating entire text lines emerges as a more promising and comprehensive task. However, this task poses significant challenges, including the accurate modeling of complex style patterns encompassing both intra- and inter-word relationships, and maintaining content accuracy across numerous characters. To address these challenges, we propose DiffBrush, a novel diffusion-based model for handwritten text-line generation. Unlike existing methods, DiffBrush excels in both style imitation and content accuracy through two key strategies: (1) content-decoupled style learning, which disentangles style from content to better capture intra-word and inter-word style patterns by using column- and row-wise masking; and (2) multi-scale content learning, which employs line and word discriminators to ensure global coherence and local accuracy of textual content. Extensive experiments show that DiffBrush excels in generating high-quality text lines, particularly in style reproduction and content preservation. Code is available at https://github.com/dailenson/DiffBrush.

[170] EgoPrompt: Prompt Pool Learning for Egocentric Action Recognition

Huaihai Lyu, Chaofan Chen, Yuheng Ji, Changsheng Xu

Main category: cs.CV

TL;DR: EgoPrompt is a prompt learning-based framework for egocentric action recognition, addressing the fragmentation of verb and noun components by unifying their representations through a prompt pool and attention-based fusion.

DetailsMotivation: Existing approaches treat verb and noun components independently, ignoring their semantic and contextual relationships, leading to fragmented representations and poor generalization.

Method: EgoPrompt uses a Unified Prompt Pool to interact verb and noun representations, decomposing them into fine-grained patterns and fusing them via attention. It also introduces Diverse Pool Criteria for training.

Result: EgoPrompt achieves state-of-the-art performance on Ego4D, EPIC-Kitchens, and EGTEA datasets in within-dataset, cross-dataset, and base-to-novel generalization benchmarks.

Conclusion: EgoPrompt effectively unifies verb and noun representations, improving egocentric action recognition performance through cross-component interaction and diverse prompt training.

Abstract: Driven by the increasing demand for applications in augmented and virtual reality, egocentric action recognition has emerged as a prominent research area. It is typically divided into two subtasks: recognizing the performed behavior (i.e., verb component) and identifying the objects being acted upon (i.e., noun component) from the first-person perspective. However, most existing approaches treat these two components as independent classification tasks, focusing on extracting component-specific knowledge while overlooking their inherent semantic and contextual relationships, leading to fragmented representations and sub-optimal generalization capability. To address these challenges, we propose a prompt learning-based framework, EgoPrompt, to conduct the egocentric action recognition task. Building on the existing prompting strategy to capture the component-specific knowledge, we construct a Unified Prompt Pool space to establish interaction between the two types of component representations. Specifically, the component representations (from verbs and nouns) are first decomposed into fine-grained patterns with the prompt pair form. Then, these pattern-level representations are fused through an attention-based mechanism to facilitate cross-component interaction. To ensure the prompt pool is informative, we further introduce a novel training objective, Diverse Pool Criteria. This objective realizes our goals from two perspectives: Prompt Selection Frequency Regularization and Prompt Knowledge Orthogonalization. Extensive experiments are conducted on the Ego4D, EPIC-Kitchens, and EGTEA datasets. The results consistently show that EgoPrompt achieves state-of-the-art performance across within-dataset, cross-dataset, and base-to-novel generalization benchmarks.

[171] Efficient Multi-Slide Visual-Language Feature Fusion for Placental Disease Classification

Hang Guo, Qing Zhang, Zixuan Gao, Siyuan Yang, Shulin Peng, Xiang Tao, Ting Yu, Yan Wang, Qingli Li

Main category: cs.CV

TL;DR: EmmPD is a multimodal framework for placental disease diagnosis using WSIs, addressing patch selection and global context loss with a two-stage patch selection and hybrid fusion module.

DetailsMotivation: Accurate placental disease prediction is crucial but hindered by computational challenges and loss of global context in existing WSI methods.

Method: Proposes EmmPD with a two-stage patch selection module and hybrid multimodal fusion using adaptive graph learning and textual reports.

Result: Achieves state-of-the-art performance on self-constructed and public datasets.

Conclusion: EmmPD effectively balances efficiency and feature preservation, improving placental disease diagnosis.

Abstract: Accurate prediction of placental diseases via whole slide images (WSIs) is critical for preventing severe maternal and fetal complications. However, WSI analysis presents significant computational challenges due to the massive data volume. Existing WSI classification methods encounter critical limitations: (1) inadequate patch selection strategies that either compromise performance or fail to sufficiently reduce computational demands, and (2) the loss of global histological context resulting from patch-level processing approaches. To address these challenges, we propose an Efficient multimodal framework for Patient-level placental disease Diagnosis, named EmmPD. Our approach introduces a two-stage patch selection module that combines parameter-free and learnable compression strategies, optimally balancing computational efficiency with critical feature preservation. Additionally, we develop a hybrid multimodal fusion module that leverages adaptive graph learning to enhance pathological feature representation and incorporates textual medical reports to enrich global contextual understanding. Extensive experiments conducted on both a self-constructed patient-level Placental dataset and two public datasets demonstrating that our method achieves state-of-the-art diagnostic performance. The code is available at https://github.com/ECNU-MultiDimLab/EmmPD.

[172] Zero Shot Domain Adaptive Semantic Segmentation by Synthetic Data Generation and Progressive Adaptation

Jun Luo, Zijing Zhao, Yang Liu

Main category: cs.CV

TL;DR: SDGPA introduces a method for zero-shot domain adaptive semantic segmentation using synthetic data generation and progressive adaptation to handle distribution shifts without target domain images.

DetailsMotivation: Addressing limitations of deep learning models in handling distribution shifts between training and test data, especially when no target images are available.

Method: Utilizes a text-to-image diffusion model to generate synthetic target-style images, crops and edits small patches for spatial precision, and employs progressive adaptation for stable learning.

Result: Achieves state-of-the-art performance in zero-shot semantic segmentation.

Conclusion: SDGPA effectively tackles domain adaptation challenges with synthetic data and progressive learning, demonstrating superior performance.

Abstract: Deep learning-based semantic segmentation models achieve impressive results yet remain limited in handling distribution shifts between training and test data. In this paper, we present SDGPA (Synthetic Data Generation and Progressive Adaptation), a novel method that tackles zero-shot domain adaptive semantic segmentation, in which no target images are available, but only a text description of the target domain’s style is provided. To compensate for the lack of target domain training data, we utilize a pretrained off-the-shelf text-to-image diffusion model, which generates training images by transferring source domain images to target style. Directly editing source domain images introduces noise that harms segmentation because the layout of source images cannot be precisely maintained. To address inaccurate layouts in synthetic data, we propose a method that crops the source image, edits small patches individually, and then merges them back together, which helps improve spatial precision. Recognizing the large domain gap, SDGPA constructs an augmented intermediate domain, leveraging easier adaptation subtasks to enable more stable model adaptation to the target domain. Additionally, to mitigate the impact of noise in synthetic data, we design a progressive adaptation strategy, ensuring robust learning throughout the training process. Extensive experiments demonstrate that our method achieves state-of-the-art performance in zero-shot semantic segmentation. The code is available at https://github.com/ROUJINN/SDGPA

[173] BaroPoser: Real-time Human Motion Tracking from IMUs and Barometers in Everyday Devices

Libo Zhang, Xinyu Yi, Feng Xu

Main category: cs.CV

TL;DR: BaroPoser combines IMU and barometric data from smartphones/smartwatches to improve human pose and global translation estimation on uneven terrain, outperforming IMU-only methods.

DetailsMotivation: Existing IMU-based methods struggle with pose accuracy and are limited to flat terrain due to sparse sensor data and lack of uneven terrain datasets.

Method: BaroPoser integrates IMU and barometric data to estimate height changes, uses a local thigh coordinate frame for better motion representation, and is evaluated on public benchmarks and real-world recordings.

Result: BaroPoser outperforms state-of-the-art IMU-only methods in pose estimation and global translation accuracy.

Conclusion: Combining IMU and barometric data enhances real-time human motion tracking on non-flat terrain, offering superior performance over existing methods.

Abstract: In recent years, tracking human motion using IMUs from everyday devices such as smartphones and smartwatches has gained increasing popularity. However, due to the sparsity of sensor measurements and the lack of datasets capturing human motion over uneven terrain, existing methods often struggle with pose estimation accuracy and are typically limited to recovering movements on flat terrain only. To this end, we present BaroPoser, the first method that combines IMU and barometric data recorded by a smartphone and a smartwatch to estimate human pose and global translation in real time. By leveraging barometric readings, we estimate sensor height changes, which provide valuable cues for both improving the accuracy of human pose estimation and predicting global translation on non-flat terrain. Furthermore, we propose a local thigh coordinate frame to disentangle local and global motion input for better pose representation learning. We evaluate our method on both public benchmark datasets and real-world recordings. Quantitative and qualitative results demonstrate that our approach outperforms the state-of-the-art (SOTA) methods that use IMUs only with the same hardware configuration.

[174] Architectural Insights into Knowledge Distillation for Object Detection: A Comprehensive Review

Mahdi Golizadeh, Nassibeh Golizadeh, Mohammad Ali Keyvanrad, Hossein Shirazi

Main category: cs.CV

TL;DR: A review of Knowledge Distillation (KD) methods for object detection, proposing a taxonomy for CNN and Transformer-based detectors, and evaluating their performance on MS COCO and PASCAL VOC datasets.

DetailsMotivation: Improving object detection efficiency for resource-constrained devices by adapting KD, addressing challenges like dual objectives, imbalance, and multi-scale features.

Method: Introduces a taxonomy for KD methods in object detection, categorizing CNN-based (backbone, neck, head, RPN/RoI levels) and Transformer-based (query, feature, logit levels) approaches. Evaluates methods using MS COCO and PASCAL VOC datasets.

Result: Comparative analysis of KD methods’ effectiveness, measured by mAP@0.5, highlighting their performance and challenges.

Conclusion: The taxonomy and analysis clarify KD’s role in object detection, identify challenges, and guide future research toward efficient detection systems.

Abstract: Object detection has achieved remarkable accuracy through deep learning, yet these improvements often come with increased computational cost, limiting deployment on resource-constrained devices. Knowledge Distillation (KD) provides an effective solution by enabling compact student models to learn from larger teacher models. However, adapting KD to object detection poses unique challenges due to its dual objectives-classification and localization-as well as foreground-background imbalance and multi-scale feature representation. This review introduces a novel architecture-centric taxonomy for KD methods, distinguishing between CNN-based detectors (covering backbone-level, neck-level, head-level, and RPN/RoI-level distillation) and Transformer-based detectors (including query-level, feature-level, and logit-level distillation). We further evaluate representative methods using the MS COCO and PASCAL VOC datasets with mAP@0.5 as performance metric, providing a comparative analysis of their effectiveness. The proposed taxonomy and analysis aim to clarify the evolving landscape of KD in object detection, highlight current challenges, and guide future research toward efficient and scalable detection systems.

[175] Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation

Peiyu Wang, Yi Peng, Yimeng Gan, Liang Hu, Tianyidan Xie, Xiaokun Wang, Yichen Wei, Chuanxin Tang, Bo Zhu, Changshi Li, Hongyang Wei, Eric Li, Xuchen Song, Yang Liu, Yahui Zhou

Main category: cs.CV

TL;DR: Skywork UniPic is a 1.5B-parameter autoregressive model unifying image understanding, text-to-image generation, and image editing in one architecture, achieving state-of-the-art performance with efficient resource use.

DetailsMotivation: The paper aims to eliminate the need for task-specific adapters or connectors in multimodal AI by introducing a unified model that performs multiple tasks efficiently.

Method: The model uses a decoupled encoding strategy, progressive training, and curated datasets with reward models.

Result: Skywork UniPic achieves high scores on benchmarks (GenEval: 0.86, DPG-Bench: 85.5) and efficient GPU memory usage (under 15 GB for 1024x1024 images).

Conclusion: Skywork UniPic proves that high-fidelity multimodal integration can be resource-efficient, setting a practical standard for deployable AI.

Abstract: We introduce Skywork UniPic, a 1.5 billion-parameter autoregressive model that unifies image understanding, text-to-image generation, and image editing within a single architecture-eliminating the need for task-specific adapters or inter-module connectors-and demonstrate that compact multimodal systems can achieve state-of-the-art performance on commodity hardware. Skywork UniPic achieves a GenEval score of 0.86, surpassing most existing unified models; sets a new DPG-Bench complex-generation record of 85.5; attains 5.83 on GEditBench-EN and 3.49 on ImgEdit-Bench for image editing; and generates 1024 x 1024 images with under 15 GB of GPU memory (e.g., RTX 4090). (1) a decoupled encoding strategy that leverages a masked autoregressive encoder for synthesis and a SigLIP2 encoder for understanding, all feeding a shared autoregressive decoder; (2) a progressive, resolution-aware training schedule scaling from 256 x 256 to 1024 x 1024 while dynamically unfreezing parameters to balance capacity and stability; and (3) meticulously curated, 100 million-scale datasets augmented with task-specific reward models to refine generation and editing objectives. By demonstrating that high-fidelity multimodal integration need not incur prohibitive resource demands, Skywork UniPic establishes a practical paradigm for deployable, high-fidelity multimodal AI. Code and weights are publicly available at https://huggingface.co/Skywork/Skywork-UniPic-1.5B.

[176] Live Demonstration: Neuromorphic Radar for Gesture Recognition

Satyapreet Singh Yadav, Chandra Sekhar Seelamantula, Chetan Singh Thakur

Main category: cs.CV

TL;DR: A neuromorphic radar framework for real-time, low-power hand gesture recognition using event-driven architecture and bio-inspired sensing.

DetailsMotivation: To enable efficient, low-power, and real-time hand gesture recognition by mimicking biological sensing and reducing computational overhead.

Method: Uses a 24 GHz Doppler radar and custom neuromorphic sampler for asynchronous sigma-delta encoding, processed by a lightweight neural network on a Cortex-M0 microcontroller.

Result: Achieves >85% accuracy on a dataset of five gestures from seven users, with reduced memory, power, and computation.

Conclusion: First bio-inspired, event-driven radar HGR system, demonstrating efficient real-time performance.

Abstract: We present a neuromorphic radar framework for real-time, low-power hand gesture recognition (HGR) using an event-driven architecture inspired by biological sensing. Our system comprises a 24 GHz Doppler radar front-end and a custom neuromorphic sampler that converts intermediate-frequency (IF) signals into sparse spike-based representations via asynchronous sigma-delta encoding. These events are directly processed by a lightweight neural network deployed on a Cortex-M0 microcontroller, enabling low-latency inference without requiring spectrogram reconstruction. Unlike conventional radar HGR pipelines that continuously sample and process data, our architecture activates only when meaningful motion is detected, significantly reducing memory, power, and computation overhead. Evaluated on a dataset of five gestures collected from seven users, our system achieves > 85% real-time accuracy. To the best of our knowledge, this is the first work that employs bio-inspired asynchronous sigma-delta encoding and an event-driven processing framework for radar-based HGR.

[177] LRDDv2: Enhanced Long-Range Drone Detection Dataset with Range Information and Comprehensive Real-World Challenges

Amirreza Rouhi, Sneh Patel, Noah McCarthy, Siddiqa Khan, Hadi Khorsand, Kaleb Lefkowitz, David K. Han

Main category: cs.CV

TL;DR: The paper introduces LRDDv2, an enhanced dataset for long-range drone detection, featuring 39,516 annotated images with range information for 8,000+ images.

DetailsMotivation: The increasing use of UAVs necessitates reliable long-range detection, but existing datasets lack diversity and range-specific data.

Method: The authors expand the LRDDv1 dataset by adding more diverse images and including target range information.

Result: LRDDv2 offers a comprehensive resource for drone detection, especially for small drones (≤50 pixels in 1080p).

Conclusion: LRDDv2 addresses gaps in drone detection research by providing a richer, range-inclusive dataset.

Abstract: The exponential growth in Unmanned Aerial Vehicles (UAVs) usage underscores the critical need of detecting them at extended distances to ensure safe operations, especially in densely populated areas. Despite the tremendous advances made in computer vision through deep learning, the detection of these small airborne objects remains a formidable challenge. While several datasets have been developed specifically for drone detection, the need for a more extensive and diverse collection of drone image data persists, particularly for long-range detection under varying environmental conditions. We introduce here the Long Range Drone Detection (LRDD) Version 2 dataset, comprising 39,516 meticulously annotated images, as a second release of the LRDD dataset released previously. The LRDDv2 dataset enhances the LRDDv1 by incorporating a greater variety of images, providing a more diverse and comprehensive resource for drone detection research. What sets LRDDv2 apart is its inclusion of target range information for over 8,000 images, making it possible to develop algorithms for drone range estimation. Tailored for long-range aerial object detection, the majority of LRDDv2’s dataset consists of images capturing drones with 50 or fewer pixels in 1080p resolution. For access to the complete Long-Range Drone Detection Dataset (LRDD)v2, please visit https://research.coe.drexel.edu/ece/imaple/lrddv2/ .

[178] Macro-from-Micro Planning for High-Quality and Parallelized Autoregressive Long Video Generation

Xunzhi Xiang, Yabo Chen, Guiyu Zhang, Zhongyu Wang, Zhe Gao, Quanming Xiang, Gonghu Shang, Junqi Liu, Haibin Huang, Yang Gao, Chi Zhang, Qi Fan, Xuelong Li

Main category: cs.CV

TL;DR: The paper proposes a planning-then-populating framework (MMPL) for long video generation to address temporal drift and parallelization issues in autoregressive models.

DetailsMotivation: Autoregressive diffusion models struggle with long-term consistency and parallelization in video generation due to error accumulation.

Method: MMPL uses hierarchical Micro and Macro Planning to sketch a global storyline, followed by parallel content populating and adaptive workload scheduling.

Result: The method outperforms existing models in quality and stability for long video generation.

Conclusion: MMPL effectively addresses long-term consistency and parallelization challenges in autoregressive video generation.

Abstract: Current autoregressive diffusion models excel at video generation but are generally limited to short temporal durations. Our theoretical analysis indicates that the autoregressive modeling typically suffers from temporal drift caused by error accumulation and hinders parallelization in long video synthesis. To address these limitations, we propose a novel planning-then-populating framework centered on Macro-from-Micro Planning (MMPL) for long video generation. MMPL sketches a global storyline for the entire video through two hierarchical stages: Micro Planning and Macro Planning. Specifically, Micro Planning predicts a sparse set of future keyframes within each short video segment, offering motion and appearance priors to guide high-quality video segment generation. Macro Planning extends the in-segment keyframes planning across the entire video through an autoregressive chain of micro plans, ensuring long-term consistency across video segments. Subsequently, MMPL-based Content Populating generates all intermediate frames in parallel across segments, enabling efficient parallelization of autoregressive generation. The parallelization is further optimized by Adaptive Workload Scheduling for balanced GPU execution and accelerated autoregressive video generation. Extensive experiments confirm that our method outperforms existing long video generation models in quality and stability. Generated videos and comparison results are in our project page.

[179] Beyond Illumination: Fine-Grained Detail Preservation in Extreme Dark Image Restoration

Tongshun Zhang, Pingping Liu, Zixuan Zhong, Zijian Zhang, Qiuzhan Zhou

Main category: cs.CV

TL;DR: A dual-stage method for enhancing dark images, combining a Residual Fourier-Guided Module for global illumination and Mamba modules for texture refinement, outperforms existing methods in detail recovery.

DetailsMotivation: Existing methods fail to preserve fine details and sharp edges in dark images, limiting their effectiveness in applications like text and edge detection.

Method: Proposes a dual-stage approach: (1) RFGM for global illumination in the frequency domain, and (2) Mamba modules (Patch Mamba and Grad Mamba) for texture refinement.

Result: Significantly improves detail recovery in dark images while maintaining efficiency, as shown in experiments on benchmark datasets.

Conclusion: The lightweight modules enhance detail recovery and can be integrated into existing Fourier-based frameworks with minimal overhead.

Abstract: Recovering fine-grained details in extremely dark images remains challenging due to severe structural information loss and noise corruption. Existing enhancement methods often fail to preserve intricate details and sharp edges, limiting their effectiveness in downstream applications like text and edge detection. To address these deficiencies, we propose an efficient dual-stage approach centered on detail recovery for dark images. In the first stage, we introduce a Residual Fourier-Guided Module (RFGM) that effectively restores global illumination in the frequency domain. RFGM captures inter-stage and inter-channel dependencies through residual connections, providing robust priors for high-fidelity frequency processing while mitigating error accumulation risks from unreliable priors. The second stage employs complementary Mamba modules specifically designed for textural structure refinement: (1) Patch Mamba operates on channel-concatenated non-downsampled patches, meticulously modeling pixel-level correlations to enhance fine-grained details without resolution loss. (2) Grad Mamba explicitly focuses on high-gradient regions, alleviating state decay in state space models and prioritizing reconstruction of sharp edges and boundaries. Extensive experiments on multiple benchmark datasets and downstream applications demonstrate that our method significantly improves detail recovery performance while maintaining efficiency. Crucially, the proposed modules are lightweight and can be seamlessly integrated into existing Fourier-based frameworks with minimal computational overhead. Code is available at https://github.com/bywlzts/RFGM.

[180] Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration

Shaoguang Wang, Jianxiang He, Yijie Xu, Ziyang Chen, Weiyu Guo, Hui Xiong

Main category: cs.CV

TL;DR: AFP reduces token cost and improves Video-QA performance by pruning redundant frames and using a semantic graph for context.

DetailsMotivation: High token costs and context dilution from excessive frames hinder MLLMs in Video-QA.

Method: Proposes Adaptive Frame-Pruning (AFP) with hierarchical clustering and a lightweight semantic graph.

Result: Reduces frames by 86.9% and tokens by 83.2%, often improving accuracy.

Conclusion: AFP enhances efficiency and performance in Video-QA for MLLMs.

Abstract: The practical application of Multimodal Large Language Models (MLLMs) to Video Question Answering (Video-QA) is severely hindered by the high token cost of processing numerous video frames. While increasing the number of sampled frames is a common strategy, we observe a “less is more” phenomenon where excessive frames can paradoxically degrade performance due to context dilution. Concurrently, state-of-the-art keyframe selection methods, while effective, still yield significant temporal redundancy, which we term ‘visual echoes’. To address these dual challenges, we propose Adaptive Frame-Pruning (AFP), a novel post-processing method that intelligently prunes the selected keyframes. AFP employs an adaptive hierarchical clustering algorithm on a fused ResNet-50 and CLIP feature space to identify and merge these echoes into single representatives. To compensate for information loss, we then introduce a lightweight, text-based semantic graph that provides critical context with minimal token overhead. Conducting extensive experiments on the LongVideoBench and VideoMME benchmarks across multiple leading MLLMs, our full approach demonstrates a drastic reduction in required frames by up to 86.9% and total input tokens by up to 83.2%. Crucially, by providing a concise, high-quality set of frames, our method not only enhances efficiency but often improves accuracy over baselines that use more frames. The code will be released upon publication.

[181] CIVQLLIE: Causal Intervention with Vector Quantization for Low-Light Image Enhancement

Tongshun Zhang, Pingping Liu, Zhe Zhang, Qiuzhan Zhou

Main category: cs.CV

TL;DR: CIVQLLIE is a novel framework for low-light image enhancement using discrete representation learning and causal reasoning to address challenges in current methods.

DetailsMotivation: Current LLIE methods lack interpretability or rely on unreliable priors, while physics-based methods fail in complex scenarios.

Method: Uses Vector Quantization (VQ) with a multi-level causal intervention approach, including Pixel-level Causal Intervention (PCI), Feature-aware Causal Intervention (FCI), and High-frequency Detail Reconstruction Module (HDRM).

Result: Aligns degraded inputs with learned codebook distributions, enhances generalization, and reconstructs fine details.

Conclusion: CIVQLLIE effectively improves low-light image visibility by combining discrete representation learning and causal interventions.

Abstract: Images captured in nighttime scenes suffer from severely reduced visibility, hindering effective content perception. Current low-light image enhancement (LLIE) methods face significant challenges: data-driven end-to-end mapping networks lack interpretability or rely on unreliable prior guidance, struggling under extremely dark conditions, while physics-based methods depend on simplified assumptions that often fail in complex real-world scenarios. To address these limitations, we propose CIVQLLIE, a novel framework that leverages the power of discrete representation learning through causal reasoning. We achieve this through Vector Quantization (VQ), which maps continuous image features to a discrete codebook of visual tokens learned from large-scale high-quality images. This codebook serves as a reliable prior, encoding standardized brightness and color patterns that are independent of degradation. However, direct application of VQ to low-light images fails due to distribution shifts between degraded inputs and the learned codebook. Therefore, we propose a multi-level causal intervention approach to systematically correct these shifts. First, during encoding, our Pixel-level Causal Intervention (PCI) module intervenes to align low-level features with the brightness and color distributions expected by the codebook. Second, a Feature-aware Causal Intervention (FCI) mechanism with Low-frequency Selective Attention Gating (LSAG) identifies and enhances channels most affected by illumination degradation, facilitating accurate codebook token matching while enhancing the encoder’s generalization performance through flexible feature-level intervention. Finally, during decoding, the High-frequency Detail Reconstruction Module (HDRM) leverages structural information preserved in the matched codebook representations to reconstruct fine details using deformable convolution techniques.

[182] WaMo: Wavelet-Enhanced Multi-Frequency Trajectory Analysis for Fine-Grained Text-Motion Retrieval

Junlong Ren, Gangjian Zhang, Honghao Fu, Pengcheng Wu, Hao Wang

Main category: cs.CV

TL;DR: WaMo is a wavelet-based framework for text-motion retrieval, improving alignment by capturing multi-frequency motion details and outperforming SOTA methods.

DetailsMotivation: Existing methods fail to address the complexities of 3D motion-text alignment, lacking part-specific and time-varying motion detail extraction.

Method: WaMo uses wavelet decomposition, reconstruction, and disordered sequence prediction to extract discriminative motion features for fine-grained alignment.

Result: WaMo achieves 17.0% and 18.2% improvements in Rsum on HumanML3D and KIT-ML datasets, outperforming SOTA methods.

Conclusion: WaMo effectively addresses motion-text alignment challenges, demonstrating superior performance through its wavelet-based approach.

Abstract: Text-Motion Retrieval (TMR) aims to retrieve 3D motion sequences semantically relevant to text descriptions. However, matching 3D motions with text remains highly challenging, primarily due to the intricate structure of human body and its spatial-temporal dynamics. Existing approaches often overlook these complexities, relying on general encoding methods that fail to distinguish different body parts and their dynamics, limiting precise semantic alignment. To address this, we propose WaMo, a novel wavelet-based multi-frequency feature extraction framework. It fully captures part-specific and time-varying motion details across multiple resolutions on body joints, extracting discriminative motion features to achieve fine-grained alignment with texts. WaMo has three key components: (1) Trajectory Wavelet Decomposition decomposes motion signals into frequency components that preserve both local kinematic details and global motion semantics. (2) Trajectory Wavelet Reconstruction uses learnable inverse wavelet transforms to reconstruct original joint trajectories from extracted features, ensuring the preservation of essential spatial-temporal information. (3) Disordered Motion Sequence Prediction reorders shuffled motion sequences to improve the learning of inherent temporal coherence, enhancing motion-text alignment. Extensive experiments demonstrate WaMo’s superiority, achieving 17.0% and 18.2% improvements in $Rsum$ on HumanML3D and KIT-ML datasets, respectively, outperforming existing state-of-the-art (SOTA) methods.

[183] FedPromo: Federated Lightweight Proxy Models at the Edge Bring New Domains to Foundation Models

Matteo Caligiuri, Francesco Barbato, Donald Shenaj, Umberto Michieli, Pietro Zanuttigh

Main category: cs.CV

TL;DR: FedPromo is a federated learning framework that efficiently adapts large foundation models to new domains using lightweight proxy models, reducing computational overhead and maintaining privacy.

DetailsMotivation: To address the computational inefficiency of conventional FL for large models on resource-limited client devices.

Method: A two-stage process: server-side knowledge distillation aligns representations of a large model with a compact one, followed by local classifier training on clients and aggregation.

Result: Outperforms existing methods on five image classification benchmarks with limited-resource clients.

Conclusion: FedPromo balances performance, privacy, and efficiency in decentralized multi-domain learning.

Abstract: Federated Learning (FL) is an established paradigm for training deep learning models on decentralized data. However, as the size of the models grows, conventional FL approaches often require significant computational resources on client devices, which may not be feasible. We introduce FedPromo, a novel framework that enables efficient adaptation of large-scale foundation models stored on a central server to new domains encountered only by remote clients. Instead of directly training the large model on client devices, FedPromo optimizes lightweight proxy models via FL, significantly reducing computational overhead while maintaining privacy. Our method follows a two-stage process: first, server-side knowledge distillation aligns the representations of a large-scale foundation model (e.g., a transformer) with those of a compact counterpart (e.g., a CNN). Then, the compact model encoder is deployed to client devices, where trainable classifiers are learned locally. These classifiers are subsequently aggregated and seamlessly transferred back to the foundation model, facilitating personalized adaptation without requiring direct access to user data. Through novel regularization strategies, our framework enables decentralized multi-domain learning, balancing performance, privacy, and resource efficiency. Extensive experiments on five image classification benchmarks demonstrate that FedPromo outperforms existing methods while assuming limited-resource clients.

[184] Diffusion Once and Done: Degradation-Aware LoRA for Efficient All-in-One Image Restoration

Ni Tang, Xiaotong Luo, Zihan Cheng, Liangtai Zhou, Dongxiao Zhang, Yanyun Qu

Main category: cs.CV

TL;DR: The paper introduces Diffusion Once and Done (DOD), an efficient all-in-one image restoration method using one-step sampling of Stable Diffusion models, outperforming existing approaches in quality and speed.

DetailsMotivation: Existing methods for all-in-one image restoration (AiOIR) using diffusion models are costly and lack adaptability to diverse degradation types.

Method: DOD uses multi-degradation feature modulation and parameter-efficient conditional low-rank adaptation to fine-tune Stable Diffusion models, plus a detail enhancement module.

Result: DOD achieves superior restoration performance with one-step sampling, excelling in visual quality and inference efficiency.

Conclusion: DOD is a highly efficient and adaptable solution for AiOIR, leveraging Stable Diffusion with minimal computational overhead.

Abstract: Diffusion models have revealed powerful potential in all-in-one image restoration (AiOIR), which is talented in generating abundant texture details. The existing AiOIR methods either retrain a diffusion model or fine-tune the pretrained diffusion model with extra conditional guidance. However, they often suffer from high inference costs and limited adaptability to diverse degradation types. In this paper, we propose an efficient AiOIR method, Diffusion Once and Done (DOD), which aims to achieve superior restoration performance with only one-step sampling of Stable Diffusion (SD) models. Specifically, multi-degradation feature modulation is first introduced to capture different degradation prompts with a pretrained diffusion model. Then, parameter-efficient conditional low-rank adaptation integrates the prompts to enable the fine-tuning of the SD model for adapting to different degradation types. Besides, a high-fidelity detail enhancement module is integrated into the decoder of SD to improve structural and textural details. Experiments demonstrate that our method outperforms existing diffusion-based restoration approaches in both visual quality and inference efficiency.

[185] GRASPing Anatomy to Improve Pathology Segmentation

Keyi Li, Alexander Jaus, Jens Kleesiek, Rainer Stiefelhagen

Main category: cs.CV

TL;DR: GRASP enhances pathology segmentation by integrating anatomical context without retraining, achieving top performance on PET/CT datasets.

DetailsMotivation: Current deep learning methods for pathology segmentation ignore anatomical context, limiting accuracy. GRASP aims to bridge this gap.

Method: GRASP leverages existing anatomy segmentation models via pseudolabel integration and feature alignment, integrating into standard pathology optimization without retraining.

Result: GRASP consistently ranks top across metrics and architectures, with its dual anatomy injection strategy proving effective.

Conclusion: GRASP successfully incorporates anatomical context into pathology segmentation, improving performance without additional training overhead.

Abstract: Radiologists rely on anatomical understanding to accurately delineate pathologies, yet most current deep learning approaches use pure pattern recognition and ignore the anatomical context in which pathologies develop. To narrow this gap, we introduce GRASP (Guided Representation Alignment for the Segmentation of Pathologies), a modular plug-and-play framework that enhances pathology segmentation models by leveraging existing anatomy segmentation models through pseudolabel integration and feature alignment. Unlike previous approaches that obtain anatomical knowledge via auxiliary training, GRASP integrates into standard pathology optimization regimes without retraining anatomical components. We evaluate GRASP on two PET/CT datasets, conduct systematic ablation studies, and investigate the framework’s inner workings. We find that GRASP consistently achieves top rankings across multiple evaluation metrics and diverse architectures. The framework’s dual anatomy injection strategy, combining anatomical pseudo-labels as input channels with transformer-guided anatomical feature fusion, effectively incorporates anatomical context.

[186] GaitAdapt: Continual Learning for Evolving Gait Recognition

Jingjie Wang, Shunli Zhang, Xiang Wei, Senmao Tian

Main category: cs.CV

TL;DR: GaitAdapter introduces a continual learning approach for gait recognition without retraining, using graph neural networks and a Euclidean Distance Stability Method to preserve knowledge across tasks.

DetailsMotivation: Current gait recognition methods require retraining for new datasets, leading to performance drops on earlier tasks. GaitAdapter aims to enhance gait recognition progressively without forgetting past knowledge.

Method: GaitAdapter employs the GaitPartition Adaptive Knowledge (GPAK) module with graph neural networks to aggregate gait patterns and the Euclidean Distance Stability Method (EDSN) to maintain feature distributions.

Result: GaitAdapter outperforms other methods in retaining gait knowledge across tasks, showing superior discriminative capability.

Conclusion: GaitAdapter effectively addresses continual gait recognition challenges, preserving knowledge and improving performance without retraining.

Abstract: Current gait recognition methodologies generally necessitate retraining when encountering new datasets. Nevertheless, retrained models frequently encounter difficulties in preserving knowledge from previous datasets, leading to a significant decline in performance on earlier test sets. To tackle these challenges, we present a continual gait recognition task, termed GaitAdapt, which supports the progressive enhancement of gait recognition capabilities over time and is systematically categorized according to various evaluation scenarios. Additionally, we propose GaitAdapter, a non-replay continual learning approach for gait recognition. This approach integrates the GaitPartition Adaptive Knowledge (GPAK) module, employing graph neural networks to aggregate common gait patterns from current data into a repository constructed from graph vectors. Subsequently, this repository is used to improve the discriminability of gait features in new tasks, thereby enhancing the model’s ability to effectively recognize gait patterns. We also introduce a Euclidean Distance Stability Method (EDSN) based on negative pairs, which ensures that newly added gait samples from different classes maintain similar relative spatial distributions across both previous and current gait tasks, thereby alleviating the impact of task changes on the distinguishability of original domain features. Extensive evaluations demonstrate that GaitAdapter effectively retains gait knowledge acquired from diverse tasks, exhibiting markedly superior discriminative capability compared to alternative methods.

[187] Neutralizing Token Aggregation via Information Augmentation for Efficient Test-Time Adaptation

Yizhe Xiong, Zihan Zhou, Yiwen Liang, Hui Chen, Zijia Lin, Tianxiang Hao, Fan Zhang, Jungong Han, Guiguang Ding

Main category: cs.CV

TL;DR: NAVIA improves Test-Time Adaptation (TTA) for Vision Transformers (ViTs) by neutralizing token aggregation via information augmentation, reducing latency by 20% while outperforming state-of-the-art methods by 2.5%.

DetailsMotivation: Existing TTA methods for ViTs are computationally expensive and suffer performance degradation when integrated with token aggregation, limiting real-world applicability.

Method: Proposes NAVIA, which augments the [CLS] token embedding and incorporates adaptive biases in shallow ViT layers to recover information lost from token aggregation.

Result: NAVIA achieves a 2.5% performance improvement and reduces inference latency by over 20% on out-of-distribution benchmarks.

Conclusion: NAVIA effectively addresses the Efficient Test-Time Adaptation (ETTA) challenge by balancing performance and computational efficiency.

Abstract: Test-Time Adaptation (TTA) has emerged as an effective solution for adapting Vision Transformers (ViT) to distribution shifts without additional training data. However, existing TTA methods often incur substantial computational overhead, limiting their applicability in resource-constrained real-world scenarios. To reduce inference cost, plug-and-play token aggregation methods merge redundant tokens in ViTs to reduce total processed tokens. Albeit efficient, it suffers from significant performance degradation when directly integrated with existing TTA methods. We formalize this problem as Efficient Test-Time Adaptation (ETTA), seeking to preserve the adaptation capability of TTA while reducing inference latency. In this paper, we first provide a theoretical analysis from a novel mutual information perspective, showing that token aggregation inherently leads to information loss, which cannot be fully mitigated by conventional norm-tuning-based TTA methods. Guided by this insight, we propose to \textbf{N}eutralize Token \textbf{A}ggregation \textbf{v}ia \textbf{I}nformation \textbf{A}ugmentation (\textbf{NAVIA}). Specifically, we directly augment the [CLS] token embedding and incorporate adaptive biases into the [CLS] token in shallow layers of ViTs. We theoretically demonstrate that these augmentations, when optimized via entropy minimization, recover the information lost due to token aggregation. Extensive experiments across various out-of-distribution benchmarks demonstrate that NAVIA significantly outperforms state-of-the-art methods by over 2.5%, while achieving an inference latency reduction of more than 20%, effectively addressing the ETTA challenge.

[188] SCFlow: Implicitly Learning Style and Content Disentanglement with Flow Models

Pingchuan Ma, Xiaopei Yang, Yusong Li, Ming Gui, Felix Krause, Johannes Schusterbauer, Björn Ommer

Main category: cs.CV

TL;DR: SCFlow proposes a flow-matching framework to merge style and content invertibly, enabling natural disentanglement without explicit supervision, and demonstrates strong zero-shot generalization.

DetailsMotivation: Existing methods struggle with disentangling style and content due to semantic overlap and subjectivity. SCFlow explores bypassing explicit disentanglement by learning invertible merging.

Method: SCFlow uses flow matching to learn bidirectional mappings between entangled and disentangled representations, avoiding restrictive priors and leveraging a synthetic dataset for training.

Result: SCFlow achieves competitive performance on ImageNet-1k and WikiArt in zero-shot settings, showing disentanglement emerges naturally from merging.

Conclusion: The invertible merging process in SCFlow effectively bypasses explicit disentanglement challenges, demonstrating practical and scalable disentanglement.

Abstract: Explicitly disentangling style and content in vision models remains challenging due to their semantic overlap and the subjectivity of human perception. Existing methods propose separation through generative or discriminative objectives, but they still face the inherent ambiguity of disentangling intertwined concepts. Instead, we ask: Can we bypass explicit disentanglement by learning to merge style and content invertibly, allowing separation to emerge naturally? We propose SCFlow, a flow-matching framework that learns bidirectional mappings between entangled and disentangled representations. Our approach is built upon three key insights: 1) Training solely to merge style and content, a well-defined task, enables invertible disentanglement without explicit supervision; 2) flow matching bridges on arbitrary distributions, avoiding the restrictive Gaussian priors of diffusion models and normalizing flows; and 3) a synthetic dataset of 510,000 samples (51 styles $\times$ 10,000 content samples) was curated to simulate disentanglement through systematic style-content pairing. Beyond controllable generation tasks, we demonstrate that SCFlow generalizes to ImageNet-1k and WikiArt in zero-shot settings and achieves competitive performance, highlighting that disentanglement naturally emerges from the invertible merging process.

[189] Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling

Xinlei Yu, Zhangquan Chen, Yudong Zhang, Shilin Lu, Ruolin Shen, Jiangning Zhang, Xiaobin Hu, Yanwei Fu, Shuicheng Yan

Main category: cs.CV

TL;DR: MACT is a Multi-Agent Collaboration framework for visual document understanding and VQA, outperforming existing VLMs with smaller parameters and better handling of long contexts and complex reasoning.

DetailsMotivation: Existing VLMs are limited by parameter scale, lack self-correction, and struggle with long visual contexts and complex reasoning, especially in document tasks.

Method: MACT uses four specialized agents (planning, execution, judgment, answer) with mixed reward modeling and hybrid test-time scaling for optimized collaboration.

Result: MACT leads in 13 of 15 benchmarks, excels in long-context and complex reasoning tasks, and maintains general and mathematical task performance.

Conclusion: MACT demonstrates superior performance with efficient parameter use, making it a promising solution for advanced visual-language tasks.

Abstract: Existing vision-language models (VLMs), whether generalists or specialists, remain constrained by their parameter scale, lack robust self-correction capabilities, and underperform in tasks involving long visual contexts and complex reasoning, resulting in suboptimal performance on document-based tasks. To address this, we propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling, tailored for visual document understanding and visual question answering (VQA). It comprises four distinct small-scale agents, i.e., planning, execution, judgment, and answer agents, with clearly defined roles and effective collaboration. Notably, the judgment agent exclusively verifies correctness and redirects to prior agents for revisions, outperforming conventional correction strategies. To further expand the capability boundaries of the framework, we propose mixed reward modeling that balances agent-specific abilities and global collaboration, as well as agent-wise hybrid test-time scaling, which customizes different scaling strategies for each agent based on their functions. Evaluated on benchmarks spanning both document-based and non-document-based settings, our MACT shows superior performance with a smaller parameter scale without sacrificing the ability of general and mathematical tasks. Especially, it stands out in benchmarks involving long visual contexts and complicated reasoning. The three variants of MACT consistently hold the top three positions in average scores, leading in 13 of the 15 benchmarks. Code will be available at: https://github.com/YU-deep/MACT.git.

[190] SlotMatch: Distilling Temporally Consistent Object-Centric Representations for Unsupervised Video Segmentation

Diana-Nicoleta Grigore, Neelu Madan, Andreas Mogelmose, Thomas B. Moeslund, Radu Tudor Ionescu

Main category: cs.CV

TL;DR: SlotMatch, a knowledge distillation framework, transfers object-centric representations from a teacher to a lightweight student for unsupervised video segmentation, outperforming the teacher with fewer parameters and faster speed.

DetailsMotivation: Unsupervised video segmentation lacks supervisory signals and often requires complex models. SlotMatch aims to simplify this by distilling knowledge efficiently.

Method: SlotMatch aligns teacher and student slots via cosine similarity without additional objectives or supervision.

Result: The distilled student matches or outperforms the teacher (SlotContrast) with 3.6x fewer parameters and 1.9x faster speed, surpassing prior models.

Conclusion: SlotMatch demonstrates efficient knowledge distillation for unsupervised video segmentation, achieving superior performance with simplicity.

Abstract: Unsupervised video segmentation is a challenging computer vision task, especially due to the lack of supervisory signals coupled with the complexity of visual scenes. To overcome this challenge, state-of-the-art models based on slot attention often have to rely on large and computationally expensive neural architectures. To this end, we propose a simple knowledge distillation framework that effectively transfers object-centric representations to a lightweight student. The proposed framework, called SlotMatch, aligns corresponding teacher and student slots via the cosine similarity, requiring no additional distillation objectives or auxiliary supervision. The simplicity of SlotMatch is confirmed via theoretical and empirical evidence, both indicating that integrating additional losses is redundant. We conduct experiments on two datasets to compare the state-of-the-art teacher model, SlotContrast, with our distilled student. The results show that our student based on SlotMatch matches and even outperforms its teacher, while using 3.6x less parameters and running 1.9x faster. Moreover, our student surpasses previous unsupervised video segmentation models.

[191] Learning Latent Representations for Image Translation using Frequency Distributed CycleGAN

Shivangi Nigam, Adarsh Prasad Behera, Shekhar Verma, P. Nagabhushan

Main category: cs.CV

TL;DR: Fd-CycleGAN enhances CycleGAN with Local Neighborhood Encoding and frequency-aware supervision for better image-to-image translation, outperforming baselines in quality and efficiency.

DetailsMotivation: To improve latent representation learning for approximating real data distributions in image translation tasks, addressing limitations in CycleGAN.

Method: Integrates Local Neighborhood Encoding (LNE) and frequency-aware supervision, using KL/JS divergence and log-based similarity for distribution alignment.

Result: Superior perceptual quality, faster convergence, and improved mode diversity, especially in low-data scenarios, validated on Horse2Zebra, Monet2Photo, and Strike-off datasets.

Conclusion: Fd-CycleGAN’s frequency-guided latent learning enhances generalization, with applications in document restoration, style transfer, and medical imaging, while being more efficient than diffusion models.

Abstract: This paper presents Fd-CycleGAN, an image-to-image (I2I) translation framework that enhances latent representation learning to approximate real data distributions. Building upon the foundation of CycleGAN, our approach integrates Local Neighborhood Encoding (LNE) and frequency-aware supervision to capture fine-grained local pixel semantics while preserving structural coherence from the source domain. We employ distribution-based loss metrics, including KL/JS divergence and log-based similarity measures, to explicitly quantify the alignment between real and generated image distributions in both spatial and frequency domains. To validate the efficacy of Fd-CycleGAN, we conduct experiments on diverse datasets – Horse2Zebra, Monet2Photo, and a synthetically augmented Strike-off dataset. Compared to baseline CycleGAN and other state-of-the-art methods, our approach demonstrates superior perceptual quality, faster convergence, and improved mode diversity, particularly in low-data regimes. By effectively capturing local and global distribution characteristics, Fd-CycleGAN achieves more visually coherent and semantically consistent translations. Our results suggest that frequency-guided latent learning significantly improves generalization in image translation tasks, with promising applications in document restoration, artistic style transfer, and medical image synthesis. We also provide comparative insights with diffusion-based generative models, highlighting the advantages of our lightweight adversarial approach in terms of training efficiency and qualitative output.

[192] R2GenKG: Hierarchical Multi-modal Knowledge Graph for LLM-based Radiology Report Generation

Futian Wang, Yuhan Qiao, Xiao Wang, Fuling Wang, Yuxiang Zhang, Dengdi Sun

Main category: cs.CV

TL;DR: The paper proposes a framework for X-ray medical report generation using a multi-modal knowledge graph (M3KG) and large foundation models to address challenges like hallucination and weak disease diagnosis.

DetailsMotivation: To improve the quality of medical report generation by leveraging structured knowledge and multi-modal data.

Method: Constructs M3KG using GPT-4o, samples it for multi-granularity graphs, and uses R-GCN, Swin-Transformer, and cross-attention for feature extraction and interaction. A large language model generates reports.

Result: Experiments validate the effectiveness of the proposed framework on multiple datasets.

Conclusion: The framework successfully integrates knowledge graphs and vision-language models for accurate and reliable X-ray report generation.

Abstract: X-ray medical report generation is one of the important applications of artificial intelligence in healthcare. With the support of large foundation models, the quality of medical report generation has significantly improved. However, challenges such as hallucination and weak disease diagnostic capability still persist. In this paper, we first construct a large-scale multi-modal medical knowledge graph (termed M3KG) based on the ground truth medical report using the GPT-4o. It contains 2477 entities, 3 kinds of relations, 37424 triples, and 6943 disease-aware vision tokens for the CheXpert Plus dataset. Then, we sample it to obtain multi-granularity semantic graphs and use an R-GCN encoder for feature extraction. For the input X-ray image, we adopt the Swin-Transformer to extract the vision features and interact with the knowledge using cross-attention. The vision tokens are fed into a Q-former and retrieved the disease-aware vision tokens using another cross-attention. Finally, we adopt the large language model to map the semantic knowledge graph, input X-ray image, and disease-aware vision tokens into language descriptions. Extensive experiments on multiple datasets fully validated the effectiveness of our proposed knowledge graph and X-ray report generation framework. The source code of this paper will be released on https://github.com/Event-AHU/Medical_Image_Analysis.

[193] Spatial Imputation Drives Cross-Domain Alignment for EEG Classification

Hongjun Liu, Chao Yao, Yalan Zhang, Xiaokun wang, Xiaojuan Ban

Main category: cs.CV

TL;DR: IMAC is a self-supervised framework for EEG signal classification that addresses cross-domain data shifts using channel-dependent masks and spatial imputation, achieving state-of-the-art performance.

DetailsMotivation: EEG signal classification is hindered by data distribution shifts from heterogeneous electrode configurations and hardware discrepancies.

Method: IMAC standardizes electrode layouts, introduces spatio-temporal signal alignment via channel-dependent masks, and uses disentangled temporal-spatial modeling.

Result: IMAC outperforms baselines by up to 35% in integrity scores and achieves top classification accuracy in cross-subject and cross-center scenarios.

Conclusion: IMAC effectively handles cross-domain EEG data shifts, offering robustness and superior performance in real-world applications.

Abstract: Electroencephalogram (EEG) signal classification faces significant challenges due to data distribution shifts caused by heterogeneous electrode configurations, acquisition protocols, and hardware discrepancies across domains. This paper introduces IMAC, a novel channel-dependent mask and imputation self-supervised framework that formulates the alignment of cross-domain EEG data shifts as a spatial time series imputation task. To address heterogeneous electrode configurations in cross-domain scenarios, IMAC first standardizes different electrode layouts using a 3D-to-2D positional unification mapping strategy, establishing unified spatial representations. Unlike previous mask-based self-supervised representation learning methods, IMAC introduces spatio-temporal signal alignment. This involves constructing a channel-dependent mask and reconstruction task framed as a low-to-high resolution EEG spatial imputation problem. Consequently, this approach simulates cross-domain variations such as channel omissions and temporal instabilities, thus enabling the model to leverage the proposed imputer for robust signal alignment during inference. Furthermore, IMAC incorporates a disentangled structure that separately models the temporal and spatial information of the EEG signals separately, reducing computational complexity while enhancing flexibility and adaptability. Comprehensive evaluations across 10 publicly available EEG datasets demonstrate IMAC’s superior performance, achieving state-of-the-art classification accuracy in both cross-subject and cross-center validation scenarios. Notably, IMAC shows strong robustness under both simulated and real-world distribution shifts, surpassing baseline methods by up to $35$% in integrity scores while maintaining consistent classification accuracy.

[194] MedCAL-Bench: A Comprehensive Benchmark on Cold-Start Active Learning with Foundation Models for Medical Image Analysis

Ning Zhu, Xiaochuan Ma, Shaoting Zhang, Guotai Wang

Main category: cs.CV

TL;DR: MedCAL-Bench is the first benchmark for evaluating Foundation Models (FMs) in Cold-Start Active Learning (CSAL) for medical image analysis, covering 14 FMs and 7 CSAL strategies across 7 datasets.

DetailsMotivation: To address the inefficiency and limitations of existing CSAL methods relying on Self-Supervised Learning (SSL) and explore the potential of pre-trained FMs for better feature extraction in CSAL tasks.

Method: Proposes MedCAL-Bench, evaluating FMs and CSAL strategies across diverse medical datasets for classification and segmentation tasks under varying annotation budgets.

Result: 1) Most FMs are effective for CSAL, with DINO family excelling in segmentation. 2) FM performance varies significantly in segmentation but not classification. 3) Different sample selection strategies perform best depending on the dataset (ALPS for segmentation, RepDiv for classification).

Conclusion: MedCAL-Bench provides a comprehensive evaluation of FMs in CSAL, highlighting their effectiveness and the need for tailored sample selection strategies in medical image analysis.

Abstract: Cold-Start Active Learning (CSAL) aims to select informative samples for annotation without prior knowledge, which is important for improving annotation efficiency and model performance under a limited annotation budget in medical image analysis. Most existing CSAL methods rely on Self-Supervised Learning (SSL) on the target dataset for feature extraction, which is inefficient and limited by insufficient feature representation. Recently, pre-trained Foundation Models (FMs) have shown powerful feature extraction ability with a potential for better CSAL. However, this paradigm has been rarely investigated, with a lack of benchmarks for comparison of FMs in CSAL tasks. To this end, we propose MedCAL-Bench, the first systematic FM-based CSAL benchmark for medical image analysis. We evaluate 14 FMs and 7 CSAL strategies across 7 datasets under different annotation budgets, covering classification and segmentation tasks from diverse medical modalities. It is also the first CSAL benchmark that evaluates both the feature extraction and sample selection stages. Our experimental results reveal that: 1) Most FMs are effective feature extractors for CSAL, with DINO family performing the best in segmentation; 2) The performance differences of these FMs are large in segmentation tasks, while small for classification; 3) Different sample selection strategies should be considered in CSAL on different datasets, with Active Learning by Processing Surprisal (ALPS) performing the best in segmentation while RepDiv leading for classification. The code is available at https://github.com/HiLab-git/MedCAL-Bench.

[195] RAAG: Ratio Aware Adaptive Guidance

Shangwen Zhu, Qianyu Peng, Yuting Hu, Zhantao Yang, Han Zhang, Zhao Pu, Ruili Feng, Fan Cheng

Main category: cs.CV

TL;DR: The paper identifies a sensitivity to guidance scale in early steps of flow-based generative models, proposes a RATIO-aware adaptive guidance schedule to stabilize and speed up sampling.

DetailsMotivation: To understand and mitigate the instability caused by guidance scale sensitivity in early sampling steps of flow-based models, which affects generation quality and speed.

Method: Theoretical analysis and empirical validation of the RATIO spike, followed by the introduction of a lightweight, adaptive guidance schedule using exponential decay.

Result: The proposed method enables up to 3x faster sampling while maintaining or improving quality, robustness, and semantic alignment across image and video models.

Conclusion: Stepwise guidance adaptation is crucial for optimizing fast flow-based generative models, with the proposed schedule offering a practical solution.

Abstract: Flow-based generative models have recently achieved remarkable progress in image and video synthesis, with classifier-free guidance (CFG) becoming the standard tool for high-fidelity, controllable generation. However, despite their practical success, little is known about how guidance interacts with different stages of the sampling process-especially in the fast, low-step regimes typical of modern flow-based pipelines. In this work, we uncover and analyze a fundamental instability: the earliest reverse steps are acutely sensitive to the guidance scale, owing to a pronounced spike in the relative strength (RATIO) of conditional to unconditional predictions. Through rigorous theoretical analysis and empirical validation, we show that this RATIO spike is intrinsic to the data distribution, independent of the model architecture, and causes exponential error amplification when paired with strong guidance. To address this, we propose a simple, theoretically grounded, RATIO-aware adaptive guidance schedule that automatically dampens the guidance scale at early steps based on the evolving RATIO, using a closed-form exponential decay. Our method is lightweight, requires no additional inference overhead, and is compatible with standard flow frameworks. Experiments across state-of-the-art image (SD3.5, Lumina) and video (WAN2.1) models demonstrate that our approach enables up to 3x faster sampling while maintaining or improving generation quality, robustness, and semantic alignment. Extensive ablation studies further confirm the generality and stability of our schedule across models, datasets, and hyperparameters. Our findings highlight the critical role of stepwise guidance adaptation in unlocking the full potential of fast flow-based generative models.

[196] CoPS: Conditional Prompt Synthesis for Zero-Shot Anomaly Detection

Qiyu Chen, Zhen Qu, Wei Luo, Haiming Yao, Yunkang Cao, Yuxin Jiang, Yinan Duan, Huiyuan Luo, Chengkan Lv, Zhengtao Zhang

Main category: cs.CV

TL;DR: The paper introduces Conditional Prompt Synthesis (CoPS), a framework for zero-shot anomaly detection (ZSAD) that dynamically generates prompts based on visual features to improve performance.

DetailsMotivation: Existing prompt learning methods in ZSAD face challenges like static tokens and sparse textual labels, limiting generalization and causing overfitting.

Method: CoPS synthesizes dynamic prompts by extracting normal/anomaly prototypes from patch features and using a variational autoencoder to model semantic features. It also includes a spatially-aware alignment mechanism.

Result: CoPS outperforms state-of-the-art methods by 2.5% AUROC in classification and segmentation across 13 datasets.

Conclusion: CoPS effectively addresses the limitations of static prompts and sparse labels, enhancing ZSAD performance through dynamic prompt synthesis and semantic feature modeling.

Abstract: Recently, large pre-trained vision-language models have shown remarkable performance in zero-shot anomaly detection (ZSAD). With fine-tuning on a single auxiliary dataset, the model enables cross-category anomaly detection on diverse datasets covering industrial defects and medical lesions. Compared to manually designed prompts, prompt learning eliminates the need for expert knowledge and trial-and-error. However, it still faces the following challenges: (i) static learnable tokens struggle to capture the continuous and diverse patterns of normal and anomalous states, limiting generalization to unseen categories; (ii) fixed textual labels provide overly sparse category information, making the model prone to overfitting to a specific semantic subspace. To address these issues, we propose Conditional Prompt Synthesis (CoPS), a novel framework that synthesizes dynamic prompts conditioned on visual features to enhance ZSAD performance. Specifically, we extract representative normal and anomaly prototypes from fine-grained patch features and explicitly inject them into prompts, enabling adaptive state modeling. Given the sparsity of class labels, we leverage a variational autoencoder to model semantic image features and implicitly fuse varied class tokens into prompts. Additionally, integrated with our spatially-aware alignment mechanism, extensive experiments demonstrate that CoPS surpasses state-of-the-art methods by 2.5% AUROC in both classification and segmentation across 13 industrial and medical datasets. Code will be available at https://github.com/cqylunlun/CoPS.

[197] Video Demoireing using Focused-Defocused Dual-Camera System

Xuan Dong, Xiangyuan Sun, Xia Wang, Jian Song, Ya Li, Weixin Li

Main category: cs.CV

TL;DR: A dual-camera framework is proposed to address moire patterns by capturing synchronized videos (one focused, one defocused) and using the defocused video to guide demoireing of the focused video, outperforming existing methods.

DetailsMotivation: Existing demoireing methods struggle to distinguish moire patterns from real textures and maintain tonal/temporal consistency.

Method: Uses a dual-camera setup (focused and defocused videos), optical flow alignment, multi-scale CNN, and joint bilateral filtering for demoireing.

Result: The proposed framework significantly outperforms state-of-the-art demoireing methods.

Conclusion: The dual-camera approach effectively addresses moire artifacts while preserving texture and consistency.

Abstract: Moire patterns, unwanted color artifacts in images and videos, arise from the interference between spatially high-frequency scene contents and the spatial discrete sampling of digital cameras. Existing demoireing methods primarily rely on single-camera image/video processing, which faces two critical challenges: 1) distinguishing moire patterns from visually similar real textures, and 2) preserving tonal consistency and temporal coherence while removing moire artifacts. To address these issues, we propose a dual-camera framework that captures synchronized videos of the same scene: one in focus (retaining high-quality textures but may exhibit moire patterns) and one defocused (with significantly reduced moire patterns but blurred textures). We use the defocused video to help distinguish moire patterns from real texture, so as to guide the demoireing of the focused video. We propose a frame-wise demoireing pipeline, which begins with an optical flow based alignment step to address any discrepancies in displacement and occlusion between the focused and defocused frames. Then, we leverage the aligned defocused frame to guide the demoireing of the focused frame using a multi-scale CNN and a multi-dimensional training loss. To maintain tonal and temporal consistency, our final step involves a joint bilateral filter to leverage the demoireing result from the CNN as the guide to filter the input focused frame to obtain the final output. Experimental results demonstrate that our proposed framework largely outperforms state-of-the-art image and video demoireing methods.

[198] AVPDN: Learning Motion-Robust and Scale-Adaptive Representations for Video-Based Polyp Detection

Zilin Chen, Shengnan Lu

Main category: cs.CV

TL;DR: AVPDN is a robust framework for polyp detection in colonoscopy videos, addressing challenges like rapid camera movement and noise with adaptive feature and multi-scale integration modules.

DetailsMotivation: Accurate polyp detection is crucial for early colorectal cancer diagnosis, but colonoscopy videos' rapid camera movement introduces noise and false positives.

Method: AVPDN includes the AFIA module (triple-branch architecture for feature enhancement) and the SACI module (dilated convolutions for multi-scale context integration).

Result: Experiments show AVPDN achieves competitive performance in video-based polyp detection on public benchmarks.

Conclusion: AVPDN effectively addresses challenges in colonoscopy video analysis, demonstrating strong generalization and performance.

Abstract: Accurate detection of polyps is of critical importance for the early and intermediate stages of colorectal cancer diagnosis. Compared to static images, dynamic colonoscopy videos provide more comprehensive visual information, which can facilitate the development of effective treatment plans. However, unlike fixed-camera recordings, colonoscopy videos often exhibit rapid camera movement, introducing substantial background noise that disrupts the structural integrity of the scene and increases the risk of false positives. To address these challenges, we propose the Adaptive Video Polyp Detection Network (AVPDN), a robust framework for multi-scale polyp detection in colonoscopy videos. AVPDN incorporates two key components: the Adaptive Feature Interaction and Augmentation (AFIA) module and the Scale-Aware Context Integration (SACI) module. The AFIA module adopts a triple-branch architecture to enhance feature representation. It employs dense self-attention for global context modeling, sparse self-attention to mitigate the influence of low query-key similarity in feature aggregation, and channel shuffle operations to facilitate inter-branch information exchange. In parallel, the SACI module is designed to strengthen multi-scale feature integration. It utilizes dilated convolutions with varying receptive fields to capture contextual information at multiple spatial scales, thereby improving the model’s denoising capability. Experiments conducted on several challenging public benchmarks demonstrate the effectiveness and generalization ability of the proposed method, achieving competitive performance in video-based polyp detection tasks.

[199] IKOD: Mitigating Visual Attention Degradation in Large Vision-Language Models

Jiabing Yang, Chenhang Cui, Yiyang Zhou, Yixiang Chen, Peng Xia, Ying Wei, Tao Yu, Yan Huang, Liang Wang

Main category: cs.CV

TL;DR: The paper addresses hallucinations in Large Vision-Language Models (LVLMs) by proposing IKOD, a lightweight decoding strategy that mitigates attention degradation and reduces hallucinations without extra training or cost.

DetailsMotivation: LVLMs struggle with integrating vision and language, leading to hallucinations, especially as sequence length grows. The cause is unclear, but diminishing visual attention is hypothesized as a key factor.

Method: Proposes IKOD, a collaborative decoding strategy that merges logits from shorter sequences (with higher image attention) to counteract attention degradation.

Result: IKOD effectively reduces hallucinations and improves LVLM performance without additional training or significant inference cost.

Conclusion: IKOD is a lightweight, efficient solution for mitigating hallucinations in LVLMs, applicable across various models.

Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated significant progress across multiple domains. However, these models still face the inherent challenge of integrating vision and language for collaborative inference, which often leads to “hallucinations”, outputs that are not grounded in the corresponding images. Many efforts have been made to address these issues, but each comes with its own limitations, such as high computational cost or expensive dataset annotation. Recent research shows that LVLMs exhibit a long-term bias where hallucinations increase as the sequence length grows, yet the underlying cause remains poorly understood. Building on extensive research into attention mechanisms in LVLMs, we analyze the relationship between this long-term bias and visual attention. In our research, we identify a consistent phenomenon in current LVLMs: the model’s attention to visual input diminishes as the generated sequence grows, which we hypothesize to be a key factor contributing to observed increasing hallucinations. Based on these insights, we propose Image attention-guided Key-value merging cOllaborative Decoding (IKOD), a collaborative decoding strategy generating more image-focused sequences. This method derives logits from shorter sequences with higher image attention through key-value merging and combines them with those from the original decoding, effectively mitigating attention degradation and suppressing hallucinations while not incurring too much inference cost. Extensive experiments on both hallucination and comprehensive benchmarks demonstrate IKOD’s superior effectiveness in mitigating hallucinations and improving comprehensive capacities for LVLMs. Importantly, IKOD requires no additional training or external tools, making it a lightweight and efficient framework applicable to various models.

[200] VideoGuard: Protecting Video Content from Unauthorized Editing

Junjie Cao, Kaizhou Li, Xinchun Yu, Hongxiang Li, Xiaoping Zhang

Main category: cs.CV

TL;DR: VideoGuard is a method to protect videos from unauthorized editing by introducing subtle perturbations that disrupt generative diffusion models, outperforming existing baselines.

DetailsMotivation: The rise of generative models poses risks of misuse for malicious editing, especially in videos, which lack robust protection compared to images.

Method: VideoGuard uses joint frame optimization and integrates motion information to create perturbations that disrupt generative models, ensuring implausible outputs.

Result: VideoGuard outperforms baseline methods in protecting videos from unauthorized editing, validated by objective and subjective metrics.

Conclusion: VideoGuard effectively bridges the gap in video protection, offering a robust solution against malicious generative editing.

Abstract: With the rapid development of generative technology, current generative models can generate high-fidelity digital content and edit it in a controlled manner. However, there is a risk that malicious individuals might misuse these capabilities for misleading activities. Although existing research has attempted to shield photographic images from being manipulated by generative models, there remains a significant disparity in the protection offered to video content editing. To bridge the gap, we propose a protection method named VideoGuard, which can effectively protect videos from unauthorized malicious editing. This protection is achieved through the subtle introduction of nearly unnoticeable perturbations that interfere with the functioning of the intended generative diffusion models. Due to the redundancy between video frames, and inter-frame attention mechanism in video diffusion models, simply applying image-based protection methods separately to every video frame can not shield video from unauthorized editing. To tackle the above challenge, we adopt joint frame optimization, treating all video frames as an optimization entity. Furthermore, we extract video motion information and fuse it into optimization objectives. Thus, these alterations can effectively force the models to produce outputs that are implausible and inconsistent. We provide a pipeline to optimize this perturbation. Finally, we use both objective metrics and subjective metrics to demonstrate the efficacy of our method, and the results show that the protection performance of VideoGuard is superior to all the baseline methods.

[201] When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models

Dasol Choi Jihwan Lee, Minjae Lee, Minsuk Kahng

Main category: cs.CV

TL;DR: The paper introduces SODA, a framework to measure demographic biases in generated objects (e.g., cars) by comparing visual attributes from demographic-cued vs. neutral prompts across three models. It reveals strong associations between demographic groups and visual attributes, including subtle biases, and highlights reduced diversity in some models.

DetailsMotivation: To investigate pervasive but subtle demographic biases in generated objects, beyond human depictions, and propose a systematic auditing method.

Method: Introduces SODA, a framework comparing visual attributes of objects generated with demographic cues vs. neutral prompts across 2,700 images from three models (GPT Image-1, Imagen 4, Stable Diffusion) in five categories.

Result: Uncovers strong associations between demographic groups and visual attributes, reflecting stereotypes and subtle biases. Some models produce less diverse outputs, amplifying disparities.

Conclusion: SODA provides a practical auditing tool to reveal embedded stereotypes in generative models, advocating for more systematic and responsible AI development.

Abstract: While prior research on text-to-image generation has predominantly focused on biases in human depictions, we investigate a more subtle yet pervasive phenomenon: demographic bias in generated objects (e.g., cars). We introduce SODA (Stereotyped Object Diagnostic Audit), a novel framework for systematically measuring such biases. Our approach compares visual attributes of objects generated with demographic cues (e.g., “for young people’’) to those from neutral prompts, across 2,700 images produced by three state-of-the-art models (GPT Image-1, Imagen 4, and Stable Diffusion) in five object categories. Through a comprehensive analysis, we uncover strong associations between specific demographic groups and visual attributes, such as recurring color patterns prompted by gender or ethnicity cues. These patterns reflect and reinforce not only well-known stereotypes but also more subtle and unintuitive biases. We also observe that some models generate less diverse outputs, which in turn amplifies the visual disparities compared to neutral prompts. Our proposed auditing framework offers a practical approach for testing, revealing how stereotypes still remain embedded in today’s generative models. We see this as an essential step toward more systematic and responsible AI development.

[202] LRQ-DiT: Log-Rotation Post-Training Quantization of Diffusion Transformers for Text-to-Image Generation

Lianwei Yang, Haokun Lin, Tianchen Zhao, Yichen Wu, Hongyu Zhu, Ruiqi Xie, Zhenan Sun, Yu Wang, Qingyi Gu

Main category: cs.CV

TL;DR: LRQ-DiT is a post-training quantization framework for Diffusion Transformers (DiTs) that addresses weight and activation outliers to enable efficient low-bit quantization without significant performance loss.

DetailsMotivation: DiTs face high computational costs and large parameter sizes, making them impractical for resource-constrained scenarios. Existing PTQ methods degrade performance under low-bit settings due to weight distribution and activation outliers.

Method: Proposes Twin-Log Quantization (TLQ) for weights and Adaptive Rotation Scheme (ARS) for activations to mitigate quantization errors and outliers.

Result: LRQ-DiT outperforms existing PTQ baselines, preserving image quality under low-bit settings on datasets like COCO, MJHQ, and sDCI.

Conclusion: LRQ-DiT effectively enables efficient low-bit quantization for DiTs, addressing key challenges and improving practicality for resource-limited applications.

Abstract: Diffusion Transformers (DiTs) have achieved impressive performance in text-to-image generation. However, their high computational cost and large parameter sizes pose significant challenges for usage in resource-constrained scenarios. Post-training quantization (PTQ) is a promising solution to reduce memory usage and accelerate inference, but existing PTQ methods suffer from severe performance degradation under extreme low-bit settings. We identify two key obstacles to low-bit post-training quantization for DiT models: (1) model weights follow a Gaussian-like distribution with long tails, causing uniform quantization to poorly allocate intervals and leading to significant errors; (2) two types of activation outliers: (i) Mild Outliers with slightly elevated values, and (ii) Salient Outliers with large magnitudes concentrated in specific channels, which disrupt activation quantization. To address these issues, we propose LRQ-DiT, an efficient and accurate PTQ framework. We introduce Twin-Log Quantization (TLQ), a log-based method that aligns well with the weight distribution and reduces quantization errors. We also propose an Adaptive Rotation Scheme (ARS) that dynamically applies Hadamard or outlier-aware rotations based on activation fluctuation, effectively mitigating the impact of both types of outliers. We evaluate LRQ-DiT on PixArt and FLUX under various bit-width settings, and validate the performance on COCO, MJHQ, and sDCI datasets. LRQ-DiT achieves low-bit quantization of DiT models while preserving image quality, outperforming existing PTQ baselines.

[203] ParticleSAM: Small Particle Segmentation for Material Quality Monitoring in Recycling Processes

Yu Zhou, Pelle Thielmann, Ayush Chamoli, Bruno Mirbach, Didier Stricker, Jason Rambach

Main category: cs.CV

TL;DR: ParticleSAM adapts a segmentation foundation model for small, dense objects in construction materials, validated by a new dataset and outperforming the original SAM method.

DetailsMotivation: Manual quality monitoring of recycled construction materials is inefficient; vision-based ML could improve this but lacks suitable methods for small-particle images.

Method: Proposes ParticleSAM, an adaptation of a segmentation foundation model, and creates a dense multi-particle dataset using automated data generation.

Result: ParticleSAM outperforms the original SAM method in quantitative and qualitative experiments.

Conclusion: The method and dataset advance visual material quality control and have broader applications for small-particle segmentation.

Abstract: The construction industry represents a major sector in terms of resource consumption. Recycled construction material has high reuse potential, but quality monitoring of the aggregates is typically still performed with manual methods. Vision-based machine learning methods could offer a faster and more efficient solution to this problem, but existing segmentation methods are by design not directly applicable to images with hundreds of small particles. In this paper, we propose ParticleSAM, an adaptation of the segmentation foundation model to images with small and dense objects such as the ones often encountered in construction material particles. Moreover, we create a new dense multi-particle dataset simulated from isolated particle images with the assistance of an automated data generation and labeling pipeline. This dataset serves as a benchmark for visual material quality control automation while our segmentation approach has the potential to be valuable in application areas beyond construction where small-particle segmentation is needed. Our experimental results validate the advantages of our method by comparing to the original SAM method both in quantitative and qualitative experiments.

[204] Quality Versus Sparsity in Image Recovery by Dictionary Learning Using Iterative Shrinkage

Mohammadsadegh Khoshghiaferezaee, Moritz Krauth, Shima Shabani, Michael Breuß

Main category: cs.CV

TL;DR: The paper explores sparsity in sparse dictionary learning (SDL) for image recovery, showing that high sparsity doesn’t compromise quality and varies by optimization method.

DetailsMotivation: To understand how enforcing sparsity in SDL affects recovery quality and to identify sparsity regimes across optimization methods.

Method: Analyzes sparsity of solutions from various optimization methods in SDL, focusing on iterative shrinkage algorithms.

Result: Different sparsity regimes exist depending on the method, and high sparsity doesn’t harm recovery quality, even with dissimilar training data.

Conclusion: High sparsity in SDL is viable without compromising recovery quality, with sparsity levels varying by optimization approach.

Abstract: Sparse dictionary learning (SDL) is a fundamental technique that is useful for many image processing tasks. As an example we consider here image recovery, where SDL can be cast as a nonsmooth optimization problem. For this kind of problems, iterative shrinkage methods represent a powerful class of algorithms that are subject of ongoing research. Sparsity is an important property of the learned solutions, as exactly the sparsity enables efficient further processing or storage. The sparsity implies that a recovered image is determined as a combination of a number of dictionary elements that is as low as possible. Therefore, the question arises, to which degree sparsity should be enforced in SDL in order to not compromise recovery quality. In this paper we focus on the sparsity of solutions that can be obtained using a variety of optimization methods. It turns out that there are different sparsity regimes depending on the method in use. Furthermore, we illustrate that high sparsity does in general not compromise recovery quality, even if the recovered image is quite different from the learning database.

[205] Prototype-Enhanced Confidence Modeling for Cross-Modal Medical Image-Report Retrieval

Shreyank N Gowda, Xiaobo Jin, Christian Wagner

Main category: cs.CV

TL;DR: The paper introduces the Prototype-Enhanced Confidence Modeling (PECM) framework to improve cross-modal retrieval in medical data by addressing ambiguity and variability.

DetailsMotivation: Existing models fail to capture nuanced semantic relationships in radiology data, leading to unreliable retrieval results.

Method: PECM uses multi-level prototypes and dual-stream confidence estimation with adaptive weighting to enhance retrieval robustness.

Result: The method achieves up to 10.17% improvement in retrieval precision and consistency, setting a new state-of-the-art.

Conclusion: PECM effectively handles data ambiguity and improves reliability in clinical retrieval tasks.

Abstract: In cross-modal retrieval tasks, such as image-to-report and report-to-image retrieval, accurately aligning medical images with relevant text reports is essential but challenging due to the inherent ambiguity and variability in medical data. Existing models often struggle to capture the nuanced, multi-level semantic relationships in radiology data, leading to unreliable retrieval results. To address these issues, we propose the Prototype-Enhanced Confidence Modeling (PECM) framework, which introduces multi-level prototypes for each modality to better capture semantic variability and enhance retrieval robustness. PECM employs a dual-stream confidence estimation that leverages prototype similarity distributions and an adaptive weighting mechanism to control the impact of high-uncertainty data on retrieval rankings. Applied to radiology image-report datasets, our method achieves significant improvements in retrieval precision and consistency, effectively handling data ambiguity and advancing reliability in complex clinical scenarios. We report results on multiple different datasets and tasks including fully supervised and zero-shot retrieval obtaining performance gains of up to 10.17%, establishing in new state-of-the-art.

[206] Retinal Lipidomics Associations as Candidate Biomarkers for Cardiovascular Health

Inamullah, Imran Razzak, Shoaib Jameel

Main category: cs.CV

TL;DR: The study explores links between serum lipid subclasses and retinal microvascular traits, finding specific associations that suggest retinal imaging can reflect systemic metabolic health.

DetailsMotivation: To understand the relationship between lipidomics and retinal vasculature, which is understudied, and to evaluate retinal imaging as a non-invasive marker for metabolic health.

Method: Spearman correlation analysis was used to examine connections between lipid subclasses (FA, DAG, TAG, CE) and retinal microvascular traits, with BH-FDR adjustment for significance.

Result: FA correlated with vessel twistiness, CE with vessel widths, while DAG and TAG negatively correlated with arteriole/venule width and complexity.

Conclusion: Retinal vascular traits reflect distinct lipid profiles, supporting retinal imaging as a non-invasive metabolic health marker, independent of disease or treatment.

Abstract: Retinal microvascular imaging is increasingly recognised as a non invasive method for evaluating systemic vascular and metabolic health. However, the association between lipidomics and retinal vasculature remains inadequate. This study investigates the relationships between serum lipid subclasses, free fatty acids (FA), diacylglycerols (DAG), triacylglycerols (TAG), and cholesteryl esters (CE), and retinal microvascular characteristics in a large population-based cohort. Using Spearman correlation analysis, we examined the interconnection between lipid subclasses and ten retinal microvascular traits, applying the Benjamini-Hochberg false discovery rate (BH-FDR) to adjust for statistical significance. Results indicated that FA were linked to retinal vessel twistiness, while CE correlated with the average widths of arteries and veins. Conversely, DAG and TAG showed negative correlations with the width and complexity of arterioles and venules. These findings suggest that retinal vascular architecture reflects distinct circulating lipid profiles, supporting its role as a non-invasive marker of systemic metabolic health. This study is the first to integrate deep learning (DL)derived retinal traits with lipidomic subclasses in a healthy cohort, thereby providing insights into microvascular structural changes independent of disease status or treatment effects.

[207] EditGarment: An Instruction-Based Garment Editing Dataset Constructed with Automated MLLM Synthesis and Semantic-Aware Evaluation

Deqiang Yin, Junyi Guo, Huanda Lu, Fangyu Wu, Dongming Lu

Main category: cs.CV

TL;DR: The paper introduces an automated pipeline to create a garment editing dataset, addressing the lack of high-quality instruction-image pairs. It defines six editing categories and a new evaluation metric, Fashion Edit Score, resulting in the EditGarment dataset.

DetailsMotivation: Progress in instruction-based garment editing is hindered by scarce high-quality data and imprecise modeling. The paper aims to automate dataset construction for better fashion-specific supervision.

Method: The authors propose a pipeline with six instruction categories and the Fashion Edit Score metric to generate and evaluate high-quality instruction-image triplets.

Result: They construct 52,257 candidate triplets, retaining 20,596 for the EditGarment dataset, the first tailored to standalone garment editing.

Conclusion: The pipeline and dataset address key challenges in garment editing, enabling more precise and scalable fashion design applications.

Abstract: Instruction-based garment editing enables precise image modifications via natural language, with broad applications in fashion design and customization. Unlike general editing tasks, it requires understanding garment-specific semantics and attribute dependencies. However, progress is limited by the scarcity of high-quality instruction-image pairs, as manual annotation is costly and hard to scale. While MLLMs have shown promise in automated data synthesis, their application to garment editing is constrained by imprecise instruction modeling and a lack of fashion-specific supervisory signals. To address these challenges, we present an automated pipeline for constructing a garment editing dataset. We first define six editing instruction categories aligned with real-world fashion workflows to guide the generation of balanced and diverse instruction-image triplets. Second, we introduce Fashion Edit Score, a semantic-aware evaluation metric that captures semantic dependencies between garment attributes and provides reliable supervision during construction. Using this pipeline, we construct a total of 52,257 candidate triplets and retain 20,596 high-quality triplets to build EditGarment, the first instruction-based dataset tailored to standalone garment editing. The project page is https://yindq99.github.io/EditGarment-project/.

[208] MAUP: Training-free Multi-center Adaptive Uncertainty-aware Prompting for Cross-domain Few-shot Medical Image Segmentation

Yazhou Zhu, Haofeng Zhang

Main category: cs.CV

TL;DR: A training-free CD-FSMIS model using MAUP strategy adapts SAM for medical image segmentation without additional training, outperforming conventional models.

DetailsMotivation: Current CD-FSMIS models require heavy training, limiting universality and deployment ease. Leveraging large natural image models like SAM offers a solution.

Method: MAUP strategy includes multi-center prompts generation, uncertainty-aware selection, and adaptive optimization to adapt SAM for medical segmentation.

Result: MAUP achieves precise segmentation across three medical datasets without training, outperforming conventional CD-FSMIS and training-free models.

Conclusion: MAUP provides an effective, training-free solution for CD-FSMIS, enhancing universality and deployment ease.

Abstract: Cross-domain Few-shot Medical Image Segmentation (CD-FSMIS) is a potential solution for segmenting medical images with limited annotation using knowledge from other domains. The significant performance of current CD-FSMIS models relies on the heavily training procedure over other source medical domains, which degrades the universality and ease of model deployment. With the development of large visual models of natural images, we propose a training-free CD-FSMIS model that introduces the Multi-center Adaptive Uncertainty-aware Prompting (MAUP) strategy for adapting the foundation model Segment Anything Model (SAM), which is trained with natural images, into the CD-FSMIS task. To be specific, MAUP consists of three key innovations: (1) K-means clustering based multi-center prompts generation for comprehensive spatial coverage, (2) uncertainty-aware prompts selection that focuses on the challenging regions, and (3) adaptive prompt optimization that can dynamically adjust according to the target region complexity. With the pre-trained DINOv2 feature encoder, MAUP achieves precise segmentation results across three medical datasets without any additional training compared with several conventional CD-FSMIS models and training-free FSMIS model. The source code is available at: https://github.com/YazhouZhu19/MAUP.

[209] Distribution-aware Knowledge Unification and Association for Non-exemplar Lifelong Person Re-identification

Shiben Liu, Mingyue Xu, Huijie Fan, Qiang Wang, Yandong Tang, Zhi Han

Main category: cs.CV

TL;DR: The paper proposes DKUA, a framework for Lifelong Person Re-Identification (LReID), addressing knowledge retention and adaptation challenges through domain-style modeling, adaptive knowledge consolidation, and unified knowledge association.

DetailsMotivation: Existing LReID methods lack specific distribution awareness and cross-domain unified knowledge learning, limiting their ability to balance old knowledge preservation and new information adaptation.

Method: The DKUA framework includes domain-style modeling, adaptive knowledge consolidation (AKC), unified knowledge association (UKA), and distribution-based knowledge transfer (DKT).

Result: DKUA achieves 7.6%/5.3% average mAP/R@1 improvement over existing methods in anti-forgetting and generalization.

Conclusion: DKUA effectively addresses LReID challenges by unifying domain-specific and cross-domain knowledge, enhancing both retention and adaptation.

Abstract: Lifelong person re-identification (LReID) encounters a key challenge: balancing the preservation of old knowledge with adaptation to new information. Existing LReID methods typically employ knowledge distillation to enforce representation alignment. However, these approaches ignore two crucial aspects: specific distribution awareness and cross-domain unified knowledge learning, both of which are essential for addressing this challenge. To overcome these limitations, we propose a novel distribution-aware knowledge unification and association (DKUA) framework where domain-style modeling is performed for each instance to propagate domain-specific representations, enhancing anti-forgetting and generalization capacity. Specifically, we design a distribution-aware model to transfer instance-level representations of the current domain into the domain-specific representations with the different domain styles, preserving learned knowledge without storing old samples. Next, we propose adaptive knowledge consolidation (AKC) to dynamically generate the unified representation as a cross-domain representation center. To further mitigate forgetting, we develop a unified knowledge association (UKA) mechanism, which explores the unified representation as a bridge to explicitly model inter-domain associations, reducing inter-domain gaps. Finally, distribution-based knowledge transfer (DKT) is proposed to prevent the current domain distribution from deviating from the cross-domain distribution center, improving adaptation capacity. Experimental results show our DKUA outperforms the existing methods by 7.6%/5.3% average mAP/R@1 improvement on anti-forgetting and generalization capacity, respectively. Our code will be publicly released.

[210] MetaScope: Optics-Driven Neural Network for Ultra-Micro Metalens Endoscopy

Wuyang Li, Wentao Pan, Xiaoyuan Liu, Zhendong Luo, Chenxin Li, Hengyu Liu, Din Ping Tsai, Mu Ku Chen, Yixuan Yuan

Main category: cs.CV

TL;DR: The paper introduces MetaScope, a neural network for metalens endoscopy, addressing optical issues like intensity decay and chromatic aberration, and outperforms existing methods.

DetailsMotivation: Existing endoscopy with convex lenses faces physical constraints; metalens offers a solution but lacks data and algorithm research.

Method: Establishes metalens datasets, proposes MetaScope with Optics-informed Intensity Adjustment (OIA) and Optics-informed Chromatic Correction (OCC), and uses gradient-guided distillation.

Result: MetaScope excels in segmentation and restoration, showing strong generalization in biomedical scenes.

Conclusion: MetaScope bridges the gap in metalens endoscopy research, offering a robust solution with superior performance.

Abstract: Miniaturized endoscopy has advanced accurate visual perception within the human body. Prevailing research remains limited to conventional cameras employing convex lenses, where the physical constraints with millimetre-scale thickness impose serious impediments on the micro-level clinical. Recently, with the emergence of meta-optics, ultra-micro imaging based on metalenses (micron-scale) has garnered great attention, serving as a promising solution. However, due to the physical difference of metalens, there is a large gap in data acquisition and algorithm research. In light of this, we aim to bridge this unexplored gap, advancing the novel metalens endoscopy. First, we establish datasets for metalens endoscopy and conduct preliminary optical simulation, identifying two derived optical issues that physically adhere to strong optical priors. Second, we propose MetaScope, a novel optics-driven neural network tailored for metalens endoscopy driven by physical optics. MetaScope comprises two novel designs: Optics-informed Intensity Adjustment (OIA), rectifying intensity decay by learning optical embeddings, and Optics-informed Chromatic Correction (OCC), mitigating chromatic aberration by learning spatial deformations informed by learned Point Spread Function (PSF) distributions. To enhance joint learning, we further deploy a gradient-guided distillation to transfer knowledge from the foundational model adaptively. Extensive experiments demonstrate that MetaScope not only outperforms state-of-the-art methods in both metalens segmentation and restoration but also achieves impressive generalized ability in real biomedical scenes.

[211] Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Rita Cucchiara

Main category: cs.CV

TL;DR: Talk2DINO combines DINOv2’s spatial accuracy with CLIP’s language understanding for improved open-vocabulary segmentation, achieving state-of-the-art results.

DetailsMotivation: Existing models like CLIP lack fine spatial localization, while DINO lacks language integration. Talk2DINO bridges this gap.

Method: Aligns CLIP’s textual embeddings with DINOv2’s patch-level features via a learned mapping, using DINOv2’s attention maps for selective alignment.

Result: Produces more natural, less noisy segmentations and effectively distinguishes foreground from background.

Conclusion: Talk2DINO outperforms existing methods in unsupervised OVS benchmarks, offering a hybrid solution for better segmentation.

Abstract: Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks. Source code and models are publicly available at: https://lorebianchi98.github.io/Talk2DINO/.

[212] Semantic Mosaicing of Histo-Pathology Image Fragments using Visual Foundation Models

Stefan Brandstätter, Maximilian Köller, Philipp Seeböck, Alissa Blessing, Felicitas Oberndorfer, Svitlana Pochepnia, Helmut Prosch, Georg Langs

Main category: cs.CV

TL;DR: SemanticStitcher uses a histopathology foundation model to automate tissue fragment stitching, outperforming existing methods in accuracy.

DetailsMotivation: Automated stitching of large tissue samples is challenging due to preparation artifacts and distortions, limiting current methods.

Method: Uses latent feature representations from a visual histopathology model for semantic matching and robust pose estimation to create whole mount slides.

Result: Outperforms state-of-the-art methods in boundary matches across three datasets.

Conclusion: SemanticStitcher provides a robust solution for automated tissue stitching in histopathology.

Abstract: In histopathology, tissue samples are often larger than a standard microscope slide, making stitching of multiple fragments necessary to process entire structures such as tumors. Automated stitching is a prerequisite for scaling analysis, but is challenging due to possible tissue loss during preparation, inhomogeneous morphological distortion, staining inconsistencies, missing regions due to misalignment on the slide, or frayed tissue edges. This limits state-of-the-art stitching methods using boundary shape matching algorithms to reconstruct artificial whole mount slides (WMS). Here, we introduce SemanticStitcher using latent feature representations derived from a visual histopathology foundation model to identify neighboring areas in different fragments. Robust pose estimation based on a large number of semantic matching candidates derives a mosaic of multiple fragments to form the WMS. Experiments on three different histopathology datasets demonstrate that SemanticStitcher yields robust WMS mosaicing and consistently outperforms the state of the art in correct boundary matches.

[213] WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image

Yuci Liang, Xinheng Lyu, Meidan Ding, Wenting Chen, Jipeng Zhang, Yuexiang Ren, Xiangjian He, Song Wu, Sen Yang, Xiyue Wang, Xiaohan Xing, Linlin Shen

Main category: cs.CV

TL;DR: WSI-LLaVA, a new framework for gigapixel WSI understanding, outperforms existing models by addressing limitations of patch-level MLLMs through a three-stage training approach and specialized metrics.

DetailsMotivation: Current MLLMs lack comprehensive WSI analysis and miss key morphological features critical for diagnosis, necessitating a more robust solution.

Method: Introduces WSI-Bench (a morphology-aware benchmark) and WSI-LLaVA (a three-stage training framework: WSI-text alignment, feature space alignment, task-specific tuning), with specialized metrics (WSI-Precision, WSI-Relevance).

Result: WSI-LLaVA surpasses existing models, showing significant improvement in morphological analysis and diagnostic accuracy.

Conclusion: The framework establishes a strong link between morphological understanding and diagnostic accuracy, setting a new standard for WSI analysis.

Abstract: Recent advancements in computational pathology have produced patch-level Multi-modal Large Language Models (MLLMs), but these models are limited by their inability to analyze whole slide images (WSIs) comprehensively and their tendency to bypass crucial morphological features that pathologists rely on for diagnosis. To address these challenges, we first introduce WSI-Bench, a large-scale morphology-aware benchmark containing 180k VQA pairs from 9,850 WSIs across 30 cancer types, designed to evaluate MLLMs’ understanding of morphological characteristics crucial for accurate diagnosis. Building upon this benchmark, we present WSI-LLaVA, a novel framework for gigapixel WSI understanding that employs a three-stage training approach: WSI-text alignment, feature space alignment, and task-specific instruction tuning. To better assess model performance in pathological contexts, we develop two specialized WSI metrics: WSI-Precision and WSI-Relevance. Experimental results demonstrate that WSI-LLaVA outperforms existing models across all capability dimensions, with a significant improvement in morphological analysis, establishing a clear correlation between morphological understanding and diagnostic accuracy.

[214] CoEmoGen: Towards Semantically-Coherent and Scalable Emotional Image Content Generation

Kaishen Yuan, Yuting Zhang, Shang Gao, Yijie Zhu, Wenshuo Chen, Yutao Yue

Main category: cs.CV

TL;DR: CoEmoGen is a pipeline for generating emotionally faithful and semantically coherent images using multimodal LLMs and a HiLoRA module, outperforming existing methods.

DetailsMotivation: Existing text-to-image models struggle with abstract emotions, and current EICG methods rely on flawed word-level labels. CoEmoGen addresses these issues.

Method: Uses multimodal LLMs for emotion-triggering captions and a HiLoRA module to model polarity-shared and emotion-specific features.

Result: Demonstrates superior emotional faithfulness and semantic coherence via experiments and user studies. Introduces EmoArt dataset.

Conclusion: CoEmoGen offers scalable, high-quality emotional image generation, supported by the EmoArt dataset.

Abstract: Emotional Image Content Generation (EICG) aims to generate semantically clear and emotionally faithful images based on given emotion categories, with broad application prospects. While recent text-to-image diffusion models excel at generating concrete concepts, they struggle with the complexity of abstract emotions. There have also emerged methods specifically designed for EICG, but they excessively rely on word-level attribute labels for guidance, which suffer from semantic incoherence, ambiguity, and limited scalability. To address these challenges, we propose CoEmoGen, a novel pipeline notable for its semantic coherence and high scalability. Specifically, leveraging multimodal large language models (MLLMs), we construct high-quality captions focused on emotion-triggering content for context-rich semantic guidance. Furthermore, inspired by psychological insights, we design a Hierarchical Low-Rank Adaptation (HiLoRA) module to cohesively model both polarity-shared low-level features and emotion-specific high-level semantics. Extensive experiments demonstrate CoEmoGen’s superiority in emotional faithfulness and semantic coherence from quantitative, qualitative, and user study perspectives. To intuitively showcase scalability, we curate EmoArt, a large-scale dataset of emotionally evocative artistic images, providing endless inspiration for emotion-driven artistic creation. The dataset and code are available at https://github.com/yuankaishen2001/CoEmoGen.

[215] AttZoom: Attention Zoom for Better Visual Features

Daniel DeAlcala, Aythami Morales, Julian Fierrez, Ruben Tolosana

Main category: cs.CV

TL;DR: Attention Zoom is a model-agnostic spatial attention mechanism for CNNs, improving feature extraction without architecture-specific integration.

DetailsMotivation: To enhance feature extraction in CNNs with a flexible, standalone attention layer.

Method: Introduces a modular spatial attention layer that emphasizes high-importance regions in inputs.

Result: Consistent accuracy improvements on CIFAR-100 and TinyImageNet; visual analysis shows fine-grained attention patterns.

Conclusion: Attention Zoom effectively improves CNNs with minimal overhead, demonstrating generality and performance gains.

Abstract: We present Attention Zoom, a modular and model-agnostic spatial attention mechanism designed to improve feature extraction in convolutional neural networks (CNNs). Unlike traditional attention approaches that require architecture-specific integration, our method introduces a standalone layer that spatially emphasizes high-importance regions in the input. We evaluated Attention Zoom on multiple CNN backbones using CIFAR-100 and TinyImageNet, showing consistent improvements in Top-1 and Top-5 classification accuracy. Visual analyses using Grad-CAM and spatial warping reveal that our method encourages fine-grained and diverse attention patterns. Our results confirm the effectiveness and generality of the proposed layer for improving CCNs with minimal architectural overhead.

[216] CutPaste&Find: Efficient Multimodal Hallucination Detector with Visual-aid Knowledge Base

Cong-Duy Nguyen, Xiaobao Wu, Duc Anh Vu, Shuai Zhao, Thong Nguyen, Anh Tuan Luu

Main category: cs.CV

TL;DR: CutPaste&Find is a lightweight, training-free framework for detecting hallucinations in LVLM-generated outputs, using off-the-shelf modules and a Visual-aid Knowledge Base for efficient verification.

DetailsMotivation: LVLMs suffer from hallucination issues (e.g., fabricating objects/attributes), and existing detection methods are impractical due to high costs and reliance on LVLM inference.

Method: Proposes CutPaste&Find, leveraging visual and linguistic modules for multi-step verification without LVLM inference, using a Visual-aid Knowledge Base and a scaling factor for refined similarity scores.

Result: Achieves competitive hallucination detection performance on benchmarks (POPE, R-Bench) while being more efficient and cost-effective than prior methods.

Conclusion: CutPaste&Find offers a practical solution for large-scale or offline hallucination detection in LVLM outputs, balancing performance and efficiency.

Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, but they remain susceptible to hallucination, particularly object hallucination where non-existent objects or incorrect attributes are fabricated in generated descriptions. Existing detection methods achieve strong performance but rely heavily on expensive API calls and iterative LVLM-based validation, making them impractical for large-scale or offline use. To address these limitations, we propose CutPaste&Find, a lightweight and training-free framework for detecting hallucinations in LVLM-generated outputs. Our approach leverages off-the-shelf visual and linguistic modules to perform multi-step verification efficiently without requiring LVLM inference. At the core of our framework is a Visual-aid Knowledge Base that encodes rich entity-attribute relationships and associated image representations. We introduce a scaling factor to refine similarity scores, mitigating the issue of suboptimal alignment values even for ground-truth image-text pairs. Comprehensive evaluations on benchmark datasets, including POPE and R-Bench, demonstrate that CutPaste&Find achieves competitive hallucination detection performance while being significantly more efficient and cost-effective than previous methods.

[217] Quality-Aware Language-Conditioned Local Auto-Regressive Anomaly Synthesis and Detection

Long Qian, Bingke Zhu, Yingying Chen, Ming Tang, Jinqiao Wang

Main category: cs.CV

TL;DR: ARAS introduces a language-conditioned, auto-regressive anomaly synthesis method to improve defect realism and semantic control, integrated with QARAD for enhanced anomaly detection.

DetailsMotivation: Existing anomaly synthesis methods suffer from structural deficiencies like discontinuities and limited controllability.

Method: ARAS uses token-anchored latent editing, a hard-gated auto-regressive operator, and a masked sampling kernel for precise defect injection. QARAD employs dynamic weighting based on image-text similarity.

Result: QARAD outperforms SOTA methods in anomaly detection, achieving better accuracy, robustness, and 5x faster synthesis.

Conclusion: ARAS and QARAD offer improved anomaly synthesis and detection, with public code and dataset availability.

Abstract: Despite substantial progress in anomaly synthesis methods, existing diffusion-based and coarse inpainting pipelines commonly suffer from structural deficiencies such as micro-structural discontinuities, limited semantic controllability, and inefficient generation. To overcome these limitations, we introduce ARAS, a language-conditioned, auto-regressive anomaly synthesis approach that precisely injects local, text-specified defects into normal images via token-anchored latent editing. Leveraging a hard-gated auto-regressive operator and a training-free, context-preserving masked sampling kernel, ARAS significantly enhances defect realism, preserves fine-grained material textures, and provides continuous semantic control over synthesized anomalies. Integrated within our Quality-Aware Re-weighted Anomaly Detection (QARAD) framework, we further propose a dynamic weighting strategy that emphasizes high-quality synthetic samples by computing an image-text similarity score with a dual-encoder model. Extensive experiments across three benchmark datasets-MVTec AD, VisA, and BTAD, demonstrate that our QARAD outperforms SOTA methods in both image- and pixel-level anomaly detection tasks, achieving improved accuracy, robustness, and a 5 times synthesis speedup compared to diffusion-based alternatives. Our complete code and synthesized dataset will be publicly available.

[218] Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences

Dmitrii Korzh, Dmitrii Tarasov, Artyom Iudin, Elvir Karimov, Matvey Skripkin, Nikita Kuzmin, Andrey Kuznetsov, Oleg Y. Rogov, Ivan Oseledets

Main category: cs.CV

TL;DR: The paper introduces a large-scale dataset for converting spoken math expressions to LaTeX, addressing gaps in prior work like limited data and multilingual coverage. It achieves competitive results on benchmarks.

DetailsMotivation: The task of converting spoken math to LaTeX is underexplored despite its educational and research applications. Prior work has limitations like small datasets and lack of multilingual support.

Method: The authors create a large open-source dataset (66,000 samples in English and Russian) and use ASR post-correction, few-shot prompting, and audio language models.

Result: Their models achieve a 28% CER on MathSpeech and outperform it by 40+ points on the S2L-equations benchmark. They also set a benchmark for math sentence recognition (40% CER).

Conclusion: This work advances multimodal AI for math content recognition and provides a foundation for future research.

Abstract: Conversion of spoken mathematical expressions is a challenging task that involves transcribing speech into a strictly structured symbolic representation while addressing the ambiguity inherent in the pronunciation of equations. Although significant progress has been achieved in automatic speech recognition (ASR) and language models (LM), the problem of converting spoken mathematics into LaTeX remains underexplored. This task directly applies to educational and research domains, such as lecture transcription or note creation. Based on ASR post-correction, prior work requires 2 transcriptions, focuses only on isolated equations, has a limited test set, and provides neither training data nor multilingual coverage. To address these issues, we present the first fully open-source large-scale dataset, comprising over 66,000 human-annotated audio samples of mathematical equations and sentences in both English and Russian, drawn from diverse scientific domains. In addition to the ASR post-correction models and few-shot prompting, we apply audio language models, demonstrating comparable character error rate (CER) results on the MathSpeech benchmark (28% vs. 30%) for the equations conversion. In contrast, on the proposed S2L-equations benchmark, our models outperform the MathSpeech model by a substantial margin of more than 40 percentage points, even after accounting for LaTeX formatting artifacts (27% vs. 64%). We establish the first benchmark for mathematical sentence recognition (S2L-sentences) and achieve an equation CER of 40%. This work lays the groundwork for future advances in multimodal AI, with a particular focus on mathematical content recognition.

[219] DyCAF-Net: Dynamic Class-Aware Fusion Network

Md Abrar Jahin, Shahriar Soudeep, M. F. Mridha, Nafiz Fahad, Md. Jakir Hossen

Main category: cs.CV

TL;DR: DyCAF-Net introduces dynamic class-aware fusion and attention to improve object detection in challenging scenes, outperforming baselines in precision and mAP.

DetailsMotivation: Static fusion and class-agnostic attention limit performance in dynamic scenes with occlusions, clutter, and class imbalance.

Method: DyCAF-Net uses input-conditioned equilibrium-based neck, dual dynamic attention, and class-aware feature adaptation.

Result: Achieves significant improvements in precision and mAP across 13 benchmarks, maintaining efficiency.

Conclusion: DyCAF-Net is a robust solution for real-world detection tasks due to its adaptability and efficiency.

Abstract: Recent advancements in object detection rely on modular architectures with multi-scale fusion and attention mechanisms. However, static fusion heuristics and class-agnostic attention limit performance in dynamic scenes with occlusions, clutter, and class imbalance. We introduce Dynamic Class-Aware Fusion Network (DyCAF-Net) that addresses these challenges through three innovations: (1) an input-conditioned equilibrium-based neck that iteratively refines multi-scale features via implicit fixed-point modeling, (2) a dual dynamic attention mechanism that adaptively recalibrates channel and spatial responses using input- and class-dependent cues, and (3) class-aware feature adaptation that modulates features to prioritize discriminative regions for rare classes. Through comprehensive ablation studies with YOLOv8 and related architectures, alongside benchmarking against nine state-of-the-art baselines, DyCAF-Net achieves significant improvements in precision, mAP@50, and mAP@50-95 across 13 diverse benchmarks, including occlusion-heavy and long-tailed datasets. The framework maintains computational efficiency ($\sim$11.1M parameters) and competitive inference speeds, while its adaptability to scale variance, semantic overlaps, and class imbalance positions it as a robust solution for real-world detection tasks in medical imaging, surveillance, and autonomous systems.

[220] Advancing Wildlife Monitoring: Drone-Based Sampling for Roe Deer Density Estimation

Stephanie Wohlfahrt, Christoph Praschl, Horst Leitner, Wolfram Jantsch, Julia Konic, Silvio Schueler, Andreas Stöckl, David C. Schedl

Main category: cs.CV

TL;DR: Drones with thermal and RGB imagery efficiently estimate wildlife density, yielding higher densities than camera traps, offering scalable and non-intrusive monitoring.

DetailsMotivation: Traditional wildlife density methods like capture-recapture or camera traps are labor-intensive or spatially limited. Drones provide a more efficient and scalable alternative.

Method: Drones surveyed areas using thermal and RGB imagery at 60 m altitude, with systematic randomized transects. Three extrapolation methods (naive, bootstrapping, zero-inflated negative binomial) were applied and compared to camera trap REM estimates.

Result: Drone-based estimates were generally higher than REM, except in one area, reflecting daytime activity in open and forested zones.

Conclusion: Drones are a promising, scalable tool for wildlife density estimation, complementing traditional methods with unique insights.

Abstract: We use unmanned aerial drones to estimate wildlife density in southeastern Austria and compare these estimates to camera trap data. Traditional methods like capture-recapture, distance sampling, or camera traps are well-established but labour-intensive or spatially constrained. Using thermal (IR) and RGB imagery, drones enable efficient, non-intrusive animal counting. Our surveys were conducted during the leafless period on single days in October and November 2024 in three areas of a sub-Illyrian hill and terrace landscape. Flight transects were based on predefined launch points using a 350 m grid and an algorithm that defined the direction of systematically randomized transects. This setup allowed surveying large areas in one day using multiple drones, minimizing double counts. Flight altitude was set at 60 m to avoid disturbing roe deer (Capreolus capreolus) while ensuring detection. Animals were manually annotated in the recorded imagery and extrapolated to densities per square kilometer. We applied three extrapolation methods with increasing complexity: naive area-based extrapolation, bootstrapping, and zero-inflated negative binomial modelling. For comparison, a Random Encounter Model (REM) estimate was calculated using camera trap data from the flight period. The drone-based methods yielded similar results, generally showing higher densities than REM, except in one area in October. We hypothesize that drone-based density reflects daytime activity in open and forested areas, while REM estimates average activity over longer periods within forested zones. Although both approaches estimate density, they offer different perspectives on wildlife presence. Our results show that drones offer a promising, scalable method for wildlife density estimation.

[221] A Scalable Machine Learning Pipeline for Building Footprint Detection in Historical Maps

Annemarie McCarthy

Main category: cs.CV

TL;DR: A scalable pipeline using hierarchical ML (CNN classifiers and segmentation) efficiently extracts building footprints from rural historical maps, validated on Irish Ordnance Survey maps, revealing abandoned settlements.

DetailsMotivation: Prior methods are computationally intensive and urban-focused, limiting rural analysis for tasks like verifying census data or locating abandoned settlements.

Method: Hierarchical ML: CNN classifiers filter out low-probability map sections, then CNN segmentation extracts building features from high-probability areas.

Result: Validated on Irish maps, the pipeline showed high performance and efficiency, identifying an abandoned settlement (22 buildings) in Tully, Co. Galway.

Conclusion: The pipeline enables efficient rural map analysis, aiding historical and archaeological discoveries, as demonstrated by uncovering a likely Famine-era abandoned settlement.

Abstract: Historical maps offer a valuable lens through which to study past landscapes and settlement patterns. While prior research has leveraged machine learning based techniques to extract building footprints from historical maps, such approaches have largely focused on urban areas and tend to be computationally intensive. This presents a challenge for research questions requiring analysis across extensive rural regions, such as verifying historical census data or locating abandoned settlements. In this paper, this limitation is addressed by proposing a scalable and efficient pipeline tailored to rural maps with sparse building distributions. The method described employs a hierarchical machine learning based approach: convolutional neural network (CNN) classifiers are first used to progressively filter out map sections unlikely to contain buildings, significantly reducing the area requiring detailed analysis. The remaining high probability sections are then processed using CNN segmentation algorithms to extract building features. The pipeline is validated using test sections from the Ordnance Survey Ireland historical 25 inch map series and 6 inch map series, demonstrating both high performance and improved efficiency compared to conventional segmentation-only approaches. Application of the technique to both map series, covering the same geographic region, highlights its potential for historical and archaeological discovery. Notably, the pipeline identified a settlement of approximately 22 buildings in Tully, Co. Galway, present in the 6 inch map, produced in 1839, but absent from the 25 inch map, produced in 1899, suggesting it may have been abandoned during the Great Famine period.

[222] SAM2-UNeXT: An Improved High-Resolution Baseline for Adapting Foundation Models to Downstream Segmentation Tasks

Xinyu Xiong, Zihuang Wu, Lei Zhang, Lei Lu, Ming Li, Guanbin Li

Main category: cs.CV

TL;DR: SAM2-UNeXT enhances SAM2-UNet by integrating a DINOv2 encoder and dual-resolution strategy for better segmentation performance.

DetailsMotivation: To improve the generalizability and performance of SAM2 for downstream tasks by addressing encoder limitations.

Method: Proposes SAM2-UNeXT, combining SAM2-UNet with a DINOv2 encoder, dual-resolution strategy, and dense glue layer for simpler, more accurate segmentation.

Result: Outperforms benchmarks in dichotomous image segmentation, camouflaged object detection, marine animal segmentation, and remote sensing saliency detection.

Conclusion: SAM2-UNeXT offers a powerful, generalizable solution for segmentation tasks with a simple architecture.

Abstract: Recent studies have highlighted the potential of adapting the Segment Anything Model (SAM) for various downstream tasks. However, constructing a more powerful and generalizable encoder to further enhance performance remains an open challenge. In this work, we propose SAM2-UNeXT, an advanced framework that builds upon the core principles of SAM2-UNet while extending the representational capacity of SAM2 through the integration of an auxiliary DINOv2 encoder. By incorporating a dual-resolution strategy and a dense glue layer, our approach enables more accurate segmentation with a simple architecture, relaxing the need for complex decoder designs. Extensive experiments conducted on four benchmarks, including dichotomous image segmentation, camouflaged object detection, marine animal segmentation, and remote sensing saliency detection, demonstrate the superior performance of our proposed method. The code is available at https://github.com/WZH0120/SAM2-UNeXT.

[223] RadProPoser: A Framework for Human Pose Estimation with Uncertainty Quantification from Raw Radar Data

Jonas Leo Mueller, Lukas Engel, Eva Dorschky, Daniel Krauss, Ingrid Ullmann, Martin Vossiek, Bjoern M. Eskofier

Main category: cs.CV

TL;DR: RadProPoser is a probabilistic encoder-decoder for radar-based human pose estimation, handling noisy data and predicting joint locations with uncertainty. It achieves 6.425 cm MPJPE and supports data augmentation.

DetailsMotivation: Radar-based HPE is privacy-preserving and illumination-invariant but suffers from noisy, multipath-affected measurements.

Method: Uses variational inference for keypoint regression, processing complex-valued radar tensors from a MIMO radar, and explores Gaussian/Laplace distributions.

Result: Achieves 6.425 cm MPJPE, strong uncertainty alignment, and 0.870 F1 score for activity classification.

Conclusion: First end-to-end radar HPE system modeling per-joint uncertainty, enabling reliable human motion analysis.

Abstract: Radar-based human pose estimation (HPE) provides a privacy-preserving, illumination-invariant sensing modality but is challenged by noisy, multipath-affected measurements. We introduce RadProPoser, a probabilistic encoder-decoder architecture that processes complex-valued radar tensors from a compact 3-transmitter, 4-receiver MIMO radar. By incorporating variational inference into keypoint regression, RadProPoser jointly predicts 26 three-dimensional joint locations alongside heteroscedastic aleatoric uncertainties and can be recalibrated to predict total uncertainty. We explore different probabilistic formulations using both Gaussian and Laplace distributions for latent priors and likelihoods. On our newly released dataset with optical motion-capture ground truth, RadProPoser achieves an overall mean per-joint position error (MPJPE) of 6.425 cm, with 5.678 cm at the 45 degree aspect angle. The learned uncertainties exhibit strong alignment with actual pose errors and can be calibrated to produce reliable prediction intervals, with our best configuration achieving an expected calibration error of 0.021. As an additional demonstration, sampling from these latent distributions enables effective data augmentation for downstream activity classification, resulting in an F1 score of 0.870. To our knowledge, this is the first end-to-end radar tensor-based HPE system to explicitly model and quantify per-joint uncertainty from raw radar tensor data, establishing a foundation for explainable and reliable human motion analysis in radar applications.

[224] evTransFER: A Transfer Learning Framework for Event-based Facial Expression Recognition

Rodrigo Verschae, Ignacio Bugueno-Cordova

Main category: cs.CV

TL;DR: The paper introduces evTransFER, a transfer learning framework for face expression recognition using event-based cameras, achieving a 93.6% recognition rate on the e-CK+ database.

DetailsMotivation: Event-based cameras capture high-resolution spatio-temporal data, but leveraging this for facial expression recognition requires innovative methods.

Method: Proposes a transfer learning approach using a pre-trained encoder from facial reconstruction, an LSTM for long-term dynamics, and a new event-based representation (TIE).

Result: Achieves 93.6% recognition rate, a 25.9% improvement over state-of-the-art methods.

Conclusion: evTransFER effectively leverages transfer learning and novel architectures to enhance facial expression recognition with event-based cameras.

Abstract: Event-based cameras are bio-inspired vision sensors that asynchronously capture per-pixel intensity changes with microsecond latency, high temporal resolution, and high dynamic range, providing valuable information about the spatio-temporal dynamics of the scene. In the present work, we propose evTransFER, a transfer learning-based framework and architecture for face expression recognition using event-based cameras. The main contribution is a feature extractor designed to encode the spatio-temporal dynamics of faces, built by training an adversarial generative method on a different problem (facial reconstruction) and then transferring the trained encoder weights to the face expression recognition system. We show that this proposed transfer learning method greatly improves the ability to recognize facial expressions compared to training a network from scratch. In addition, we propose an architecture that incorporates an LSTM to capture longer-term facial expression dynamics, and we introduce a new event-based representation, referred to as TIE, both of which further improve the results. We evaluate the proposed framework on the event-based facial expression database e-CK+ and compare it to state-of-the-art methods. The results show that the proposed framework evTransFER achieves a 93.6% recognition rate on the e-CK+ database, significantly improving the accuracy (25.9% points or more) when compared to state-of-the-art performance for similar problems.

[225] FPG-NAS: FLOPs-Aware Gated Differentiable Neural Architecture Search for Efficient 6DoF Pose Estimation

Nassim Ali Ousalah, Peyman Rostami, Anis Kacem, Enjie Ghorbel, Emmanuel Koumandakis, Djamila Aouada

Main category: cs.CV

TL;DR: FPG-NAS is a FLOPs-aware gated differentiable NAS framework for efficient 6DoF object pose estimation, balancing accuracy and computational efficiency.

DetailsMotivation: Existing 6DoF pose estimation methods are computationally demanding, limiting their use in resource-constrained scenarios. FPG-NAS aims to address this by optimizing both accuracy and efficiency.

Method: FPG-NAS uses a task-specific search space and a differentiable gating mechanism for discrete multi-candidate operator selection, along with a FLOPs regularization term.

Result: Experiments on LINEMOD and SPEED+ datasets show FPG-NAS outperforms previous methods under strict FLOPs constraints.

Conclusion: FPG-NAS is the first differentiable NAS framework tailored for 6DoF pose estimation, offering improved architectural diversity and efficiency.

Abstract: We introduce FPG-NAS, a FLOPs-aware Gated Differentiable Neural Architecture Search framework for efficient 6DoF object pose estimation. Estimating 3D rotation and translation from a single image has been widely investigated yet remains computationally demanding, limiting applicability in resource-constrained scenarios. FPG-NAS addresses this by proposing a specialized differentiable NAS approach for 6DoF pose estimation, featuring a task-specific search space and a differentiable gating mechanism that enables discrete multi-candidate operator selection, thus improving architectural diversity. Additionally, a FLOPs regularization term ensures a balanced trade-off between accuracy and efficiency. The framework explores a vast search space of approximately 10\textsuperscript{92} possible architectures. Experiments on the LINEMOD and SPEED+ datasets demonstrate that FPG-NAS-derived models outperform previous methods under strict FLOPs constraints. To the best of our knowledge, FPG-NAS is the first differentiable NAS framework specifically designed for 6DoF object pose estimation.

[226] Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images

Xiangyu Sun, Haoyi jiang, Liu Liu, Seungtae Nam, Gyeongjin Kang, Xinjie wang, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang, Eunbyung Park

Main category: cs.CV

TL;DR: Uni3R is a feed-forward framework for joint 3D scene reconstruction and open-vocabulary semantic interpretation from unposed multi-view images, achieving state-of-the-art performance.

DetailsMotivation: Overcome limitations of conventional methods that decouple semantic understanding from reconstruction or require costly per-scene optimization, aiming for scalability and generalizability.

Method: Leverages a Cross-View Transformer to integrate multi-view inputs and regresses 3D Gaussian primitives with semantic feature fields.

Result: Achieves 25.07 PSNR on RE10K and 55.84 mIoU on ScanNet, enabling high-fidelity novel view synthesis, semantic segmentation, and depth prediction.

Conclusion: Uni3R introduces a generalizable, unified paradigm for 3D scene reconstruction and understanding.

Abstract: Reconstructing and semantically interpreting 3D scenes from sparse 2D views remains a fundamental challenge in computer vision. Conventional methods often decouple semantic understanding from reconstruction or necessitate costly per-scene optimization, thereby restricting their scalability and generalizability. In this paper, we introduce Uni3R, a novel feed-forward framework that jointly reconstructs a unified 3D scene representation enriched with open-vocabulary semantics, directly from unposed multi-view images. Our approach leverages a Cross-View Transformer to robustly integrate information across arbitrary multi-view inputs, which then regresses a set of 3D Gaussian primitives endowed with semantic feature fields. This unified representation facilitates high-fidelity novel view synthesis, open-vocabulary 3D semantic segmentation, and depth prediction, all within a single, feed-forward pass. Extensive experiments demonstrate that Uni3R establishes a new state-of-the-art across multiple benchmarks, including 25.07 PSNR on RE10K and 55.84 mIoU on ScanNet. Our work signifies a novel paradigm towards generalizable, unified 3D scene reconstruction and understanding. The code is available at https://github.com/HorizonRobotics/Uni3R.

[227] Document Haystack: A Long Context Multimodal Image/Document Understanding Vision LLM Benchmark

Goeric Huybrechts, Srikanth Ronanki, Sai Muralidhar Jayanthi, Jack Fitzgerald, Srinivasan Veeravanallur

Main category: cs.CV

TL;DR: The paper introduces Document Haystack, a benchmark for evaluating Vision Language Models (VLMs) on long, visually complex documents, addressing the lack of suitable benchmarks in this area.

DetailsMotivation: The processing of long documents by multimodal Large Language Models is under-explored due to a lack of benchmarks.

Method: Document Haystack includes documents (5-200 pages) with inserted text or multimodal ’needles’ and 8,250 questions, supported by an automated evaluation framework.

Result: The benchmark evaluates VLMs’ retrieval capabilities and presents results from prominent models.

Conclusion: Document Haystack fills a gap in evaluating VLMs for long documents and suggests future research directions.

Abstract: The proliferation of multimodal Large Language Models has significantly advanced the ability to analyze and understand complex data inputs from different modalities. However, the processing of long documents remains under-explored, largely due to a lack of suitable benchmarks. To address this, we introduce Document Haystack, a comprehensive benchmark designed to evaluate the performance of Vision Language Models (VLMs) on long, visually complex documents. Document Haystack features documents ranging from 5 to 200 pages and strategically inserts pure text or multimodal text+image “needles” at various depths within the documents to challenge VLMs’ retrieval capabilities. Comprising 400 document variants and a total of 8,250 questions, it is supported by an objective, automated evaluation framework. We detail the construction and characteristics of the Document Haystack dataset, present results from prominent VLMs and discuss potential research avenues in this area.

[228] OmniShape: Zero-Shot Multi-Hypothesis Shape and Pose Estimation in the Real World

Katherine Liu, Sergey Zakharov, Dian Chen, Takuya Ikeda, Greg Shakhnarovich, Adrien Gaidon, Rares Ambrus

Main category: cs.CV

TL;DR: OmniShape is a method for probabilistic pose and shape estimation from a single observation, using decoupled multi-modal distributions and conditional diffusion models.

DetailsMotivation: To estimate object pose and full shape without known 3D models or categories, addressing the challenge of single-view reconstruction.

Method: Decouples shape completion into two multi-modal distributions: one for measurements in a normalized reference frame and another for object geometries as triplanar neural fields. Uses conditional diffusion models for each.

Result: Demonstrates strong performance on real-world datasets, enabling sampling of multiple hypotheses for pose and shape.

Conclusion: OmniShape is the first method to achieve probabilistic pose and shape estimation from a single observation, showing promise for practical applications.

Abstract: We would like to estimate the pose and full shape of an object from a single observation, without assuming known 3D model or category. In this work, we propose OmniShape, the first method of its kind to enable probabilistic pose and shape estimation. OmniShape is based on the key insight that shape completion can be decoupled into two multi-modal distributions: one capturing how measurements project into a normalized object reference frame defined by the dataset and the other modelling a prior over object geometries represented as triplanar neural fields. By training separate conditional diffusion models for these two distributions, we enable sampling multiple hypotheses from the joint pose and shape distribution. OmniShape demonstrates compelling performance on challenging real world datasets. Project website: https://tri-ml.github.io/omnishape

[229] Veila: Panoramic LiDAR Generation from a Monocular RGB Image

Youquan Liu, Lingdong Kong, Weidong Yang, Ao Liang, Jianxiong Gao, Yang Wu, Xiang Xu, Xin Li, Linfeng Li, Runnan Chen, Ben Fei

Main category: cs.CV

TL;DR: Veila is a conditional diffusion framework for generating controllable panoramic LiDAR data from monocular RGB images, addressing challenges like unreliable conditioning, modality gaps, and structural coherence.

DetailsMotivation: Existing methods lack fine-grained spatial control or rely on text-guided synthesis, which is insufficient for scalable 3D perception in autonomous driving and robotics.

Method: Veila integrates Confidence-Aware Conditioning Mechanism (CACM), Geometric Cross-Modal Alignment (GCMA), and Panoramic Feature Coherence (PFC) to enhance RGB-LiDAR alignment and structural consistency.

Result: Veila achieves state-of-the-art generation fidelity and cross-modal consistency on benchmarks like nuScenes and SemanticKITTI, improving downstream LiDAR semantic segmentation.

Conclusion: Veila offers a scalable and low-cost solution for realistic and controllable LiDAR data generation, advancing 3D perception in autonomous systems.

Abstract: Realistic and controllable panoramic LiDAR data generation is critical for scalable 3D perception in autonomous driving and robotics. Existing methods either perform unconditional generation with poor controllability or adopt text-guided synthesis, which lacks fine-grained spatial control. Leveraging a monocular RGB image as a spatial control signal offers a scalable and low-cost alternative, which remains an open problem. However, it faces three core challenges: (i) semantic and depth cues from RGB are vary spatially, complicating reliable conditioning generation; (ii) modality gaps between RGB appearance and LiDAR geometry amplify alignment errors under noisy diffusion; and (iii) maintaining structural coherence between monocular RGB and panoramic LiDAR is challenging, particularly in non-overlap regions between images and LiDAR. To address these challenges, we propose Veila, a novel conditional diffusion framework that integrates: a Confidence-Aware Conditioning Mechanism (CACM) that strengthens RGB conditioning by adaptively balancing semantic and depth cues according to their local reliability; a Geometric Cross-Modal Alignment (GCMA) for robust RGB-LiDAR alignment under noisy diffusion; and a Panoramic Feature Coherence (PFC) for enforcing global structural consistency across monocular RGB and panoramic LiDAR. Additionally, we introduce two metrics, Cross-Modal Semantic Consistency and Cross-Modal Depth Consistency, to evaluate alignment quality across modalities. Experiments on nuScenes, SemanticKITTI, and our proposed KITTI-Weather benchmark demonstrate that Veila achieves state-of-the-art generation fidelity and cross-modal consistency, while enabling generative data augmentation that improves downstream LiDAR semantic segmentation.

[230] La La LiDAR: Large-Scale Layout Generation from LiDAR Data

Youquan Liu, Lingdong Kong, Weidong Yang, Xin Li, Ao Liang, Runnan Chen, Ben Fei, Tongliang Liu

Main category: cs.CV

TL;DR: La La LiDAR is a layout-guided generative framework for controllable LiDAR scene generation, ensuring spatial and semantic consistency.

DetailsMotivation: Existing diffusion-based models lack explicit control over foreground objects and spatial relationships, limiting their utility for scenario simulation and safety validation.

Method: The framework uses semantic-enhanced scene graph diffusion with relation-aware contextual conditioning for structured layout generation, followed by foreground-aware control injection for complete scene generation.

Result: La La LiDAR achieves state-of-the-art performance in LiDAR generation and downstream perception tasks, supported by new datasets (Waymo-SG, nuScenes-SG) and metrics.

Conclusion: The model sets a new benchmark for controllable 3D scene generation, addressing limitations of prior methods.

Abstract: Controllable generation of realistic LiDAR scenes is crucial for applications such as autonomous driving and robotics. While recent diffusion-based models achieve high-fidelity LiDAR generation, they lack explicit control over foreground objects and spatial relationships, limiting their usefulness for scenario simulation and safety validation. To address these limitations, we propose Large-scale Layout-guided LiDAR generation model (“La La LiDAR”), a novel layout-guided generative framework that introduces semantic-enhanced scene graph diffusion with relation-aware contextual conditioning for structured LiDAR layout generation, followed by foreground-aware control injection for complete scene generation. This enables customizable control over object placement while ensuring spatial and semantic consistency. To support our structured LiDAR generation, we introduce Waymo-SG and nuScenes-SG, two large-scale LiDAR scene graph datasets, along with new evaluation metrics for layout synthesis. Extensive experiments demonstrate that La La LiDAR achieves state-of-the-art performance in both LiDAR generation and downstream perception tasks, establishing a new benchmark for controllable 3D scene generation.

[231] LiDARCrafter: Dynamic 4D World Modeling from LiDAR Sequences

Ao Liang, Youquan Liu, Yu Yang, Dongyue Lu, Linfeng Li, Lingdong Kong, Huaici Zhao, Wei Tsang Ooi

Main category: cs.CV

TL;DR: LiDARCrafter is a framework for 4D LiDAR generation and editing using natural language inputs, achieving state-of-the-art performance in fidelity, controllability, and temporal coherence.

DetailsMotivation: Existing generative world models for autonomous driving overlook LiDAR properties, lacking controllability, temporal coherence, and standardized evaluation.

Method: LiDARCrafter parses natural language into scene graphs, uses a tri-branch diffusion network for object structures, motion trajectories, and geometry, and includes an autoregressive module for temporal coherence.

Result: Experiments on nuScenes show LiDARCrafter excels in fidelity, controllability, and temporal consistency.

Conclusion: LiDARCrafter advances LiDAR generation for data augmentation and simulation, with released code and benchmark.

Abstract: Generative world models have become essential data engines for autonomous driving, yet most existing efforts focus on videos or occupancy grids, overlooking the unique LiDAR properties. Extending LiDAR generation to dynamic 4D world modeling presents challenges in controllability, temporal coherence, and evaluation standardization. To this end, we present LiDARCrafter, a unified framework for 4D LiDAR generation and editing. Given free-form natural language inputs, we parse instructions into ego-centric scene graphs, which condition a tri-branch diffusion network to generate object structures, motion trajectories, and geometry. These structured conditions enable diverse and fine-grained scene editing. Additionally, an autoregressive module generates temporally coherent 4D LiDAR sequences with smooth transitions. To support standardized evaluation, we establish a comprehensive benchmark with diverse metrics spanning scene-, object-, and sequence-level aspects. Experiments on the nuScenes dataset using this benchmark demonstrate that LiDARCrafter achieves state-of-the-art performance in fidelity, controllability, and temporal consistency across all levels, paving the way for data augmentation and simulation. The code and benchmark are released to the community.

[232] LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation

Jianxiong Gao, Zhaoxi Chen, Xian Liu, Jianfeng Feng, Chenyang Si, Yanwei Fu, Yu Qiao, Ziwei Liu

Main category: cs.CV

TL;DR: LongVie is an autoregressive framework for controllable ultra-long video generation, addressing temporal inconsistency and visual degradation with unified noise initialization, global control signal normalization, multi-modal control, and degradation-aware training.

DetailsMotivation: Existing methods struggle with scalability in ultra-long video generation due to temporal inconsistency and visual degradation.

Method: LongVie uses unified noise initialization, global control signal normalization, multi-modal control (dense and sparse signals), and degradation-aware training.

Result: LongVie achieves state-of-the-art performance in controllability, consistency, and quality, validated on the LongVGenBench benchmark.

Conclusion: LongVie effectively addresses key challenges in ultra-long video generation, offering scalable and high-quality results.

Abstract: Controllable ultra-long video generation is a fundamental yet challenging task. Although existing methods are effective for short clips, they struggle to scale due to issues such as temporal inconsistency and visual degradation. In this paper, we initially investigate and identify three key factors: separate noise initialization, independent control signal normalization, and the limitations of single-modality guidance. To address these issues, we propose LongVie, an end-to-end autoregressive framework for controllable long video generation. LongVie introduces two core designs to ensure temporal consistency:

  1. a unified noise initialization strategy that maintains consistent generation across clips, and 2) global control signal normalization that enforces alignment in the control space throughout the entire video. To mitigate visual degradation, LongVie employs 3) a multi-modal control framework that integrates both dense (e.g., depth maps) and sparse (e.g., keypoints) control signals, complemented by 4) a degradation-aware training strategy that adaptively balances modality contributions over time to preserve visual quality. We also introduce LongVGenBench, a comprehensive benchmark consisting of 100 high-resolution videos spanning diverse real-world and synthetic environments, each lasting over one minute. Extensive experiments show that LongVie achieves state-of-the-art performance in long-range controllability, consistency, and quality.

[233] Trokens: Semantic-Aware Relational Trajectory Tokens for Few-Shot Action Recognition

Pulkit Kumar, Shuaiyi Huang, Matthew Walmer, Sai Saketh Rambhatla, Abhinav Shrivastava

Main category: cs.CV

TL;DR: Trokens improves few-shot action recognition by transforming trajectory points into semantic-aware tokens, combining motion and appearance features for state-of-the-art results.

DetailsMotivation: Addressing challenges in selecting informative points and modeling their motion patterns for few-shot action recognition.

Method: Semantic-aware sampling for point tracking and a motion modeling framework using Histogram of Oriented Displacements (HoD) and inter-trajectory relationships.

Result: Achieves state-of-the-art performance on six benchmarks: Something-Something-V2, Kinetics, UCF101, HMDB51, and FineGym.

Conclusion: Trokens effectively integrates motion and appearance features, advancing few-shot action recognition.

Abstract: Video understanding requires effective modeling of both motion and appearance information, particularly for few-shot action recognition. While recent advances in point tracking have been shown to improve few-shot action recognition, two fundamental challenges persist: selecting informative points to track and effectively modeling their motion patterns. We present Trokens, a novel approach that transforms trajectory points into semantic-aware relational tokens for action recognition. First, we introduce a semantic-aware sampling strategy to adaptively distribute tracking points based on object scale and semantic relevance. Second, we develop a motion modeling framework that captures both intra-trajectory dynamics through the Histogram of Oriented Displacements (HoD) and inter-trajectory relationships to model complex action patterns. Our approach effectively combines these trajectory tokens with semantic features to enhance appearance features with motion information, achieving state-of-the-art performance across six diverse few-shot action recognition benchmarks: Something-Something-V2 (both full and small splits), Kinetics, UCF101, HMDB51, and FineGym. For project page see https://trokens-iccv25.github.io

[234] Beyond Images: Adaptive Fusion of Visual and Textual Data for Food Classification

Prateek Mittal, Puneet Goyal, Joohi Chauhan

Main category: cs.CV

TL;DR: A novel multimodal food recognition framework combines visual and textual data for improved accuracy, achieving 97.84% accuracy when fused, outperforming existing methods.

DetailsMotivation: To enhance food recognition accuracy and robustness by leveraging both visual and textual data, addressing challenges like missing or inconsistent modality data.

Method: Dynamic multimodal fusion strategy adaptively integrates visual and textual features, optimizing informative content while handling modality inconsistencies.

Result: Achieved 73.60% (images) and 88.84% (text) unimodal accuracies; 97.84% when fused, surpassing state-of-the-art methods.

Conclusion: The framework is robust, adaptable, and efficient, suitable for real-world multimodal food-recognition applications.

Abstract: This study introduces a novel multimodal food recognition framework that effectively combines visual and textual modalities to enhance classification accuracy and robustness. The proposed approach employs a dynamic multimodal fusion strategy that adaptively integrates features from unimodal visual inputs and complementary textual metadata. This fusion mechanism is designed to maximize the use of informative content, while mitigating the adverse impact of missing or inconsistent modality data. The framework was rigorously evaluated on the UPMC Food-101 dataset and achieved unimodal classification accuracies of 73.60% for images and 88.84% for text. When both modalities were fused, the model achieved an accuracy of 97.84%, outperforming several state-of-the-art methods. Extensive experimental analysis demonstrated the robustness, adaptability, and computational efficiency of the proposed settings, highlighting its practical applicability to real-world multimodal food-recognition scenarios.

[235] LumiNet: Perception-Driven Knowledge Distillation via Statistical Logit Calibration

Md. Ismail Hossain, M M Lutfe Elahi, Sameera Ramasinghe, Ali Cheraghian, Fuad Rahman, Nabeel Mohammed, Shafin Rahman

Main category: cs.CV

TL;DR: LumiNet is a novel logit-based knowledge distillation method that outperforms feature-based methods by addressing overconfidence and leveraging sample relationships.

DetailsMotivation: Feature-based distillation dominates, but logit-based methods underperform. LumiNet aims to bridge this gap by enhancing logit-based distillation.

Method: Introduces ‘perception’ to calibrate logits and reconstructs logits using batch sample relationships.

Result: Outperforms feature-based methods on CIFAR-100, ImageNet, and MSCOCO, with improvements of 1.5% and 2.05% on ImageNet.

Conclusion: LumiNet successfully enhances logit-based distillation, surpassing feature-based methods in performance.

Abstract: In the knowledge distillation literature, feature-based methods have dominated due to their ability to effectively tap into extensive teacher models. In contrast, logit-based approaches, which aim to distill “dark knowledge” from teachers, typically exhibit inferior performance compared to feature-based methods. To bridge this gap, we present LumiNet, a novel knowledge distillation algorithm designed to enhance logit-based distillation. We introduce the concept of “perception”, aiming to calibrate logits based on the model’s representation capability. This concept addresses overconfidence issues in the logit-based distillation method while also introducing a novel method to distill knowledge from the teacher. It reconstructs the logits of a sample/instances by considering relationships with other samples in the batch. LumiNet excels on benchmarks like CIFAR-100, ImageNet, and MSCOCO, outperforming the leading feature-based methods, e.g., compared to KD with ResNet18 and MobileNetV2 on ImageNet, it shows improvements of 1.5% and 2.05%, respectively. Codes are available at https://github.com/ismail31416/LumiNet.

[236] Learning Interpretable Queries for Explainable Image Classification with Information Pursuit

Stefan Kolek, Aditya Chattopadhyay, Kwan Ho Ryan Chan, Hector Andrade-Loarca, Gitta Kutyniok, Réne Vidal

Main category: cs.CV

TL;DR: The paper introduces a method to learn interpretable query dictionaries directly from datasets, improving upon hand-crafted dictionaries in Information Pursuit (IP).

DetailsMotivation: Hand-crafted dictionaries in IP are limited by expert knowledge and prompt engineering heuristics, prompting the need for a learned approach.

Method: The authors formulate query dictionary learning as an optimization problem, leveraging latent spaces of models like CLIP and proposing a sparse dictionary learning-inspired algorithm.

Result: Learned dictionaries outperform hand-crafted ones, especially those generated by large language models.

Conclusion: Learning query dictionaries directly from data enhances IP’s effectiveness, surpassing traditional methods.

Abstract: Information Pursuit (IP) is an explainable prediction algorithm that greedily selects a sequence of interpretable queries about the data in order of information gain, updating its posterior at each step based on observed query-answer pairs. The standard paradigm uses hand-crafted dictionaries of potential data queries curated by a domain expert or a large language model after a human prompt. However, in practice, hand-crafted dictionaries are limited by the expertise of the curator and the heuristics of prompt engineering. This paper introduces a novel approach: learning a dictionary of interpretable queries directly from the dataset. Our query dictionary learning problem is formulated as an optimization problem by augmenting IP’s variational formulation with learnable dictionary parameters. To formulate learnable and interpretable queries, we leverage the latent space of large vision and language models like CLIP. To solve the optimization problem, we propose a new query dictionary learning algorithm inspired by classical sparse dictionary learning. Our experiments demonstrate that learned dictionaries significantly outperform hand-crafted dictionaries generated with large language models.

[237] Uncertainty-aware Medical Diagnostic Phrase Identification and Grounding

Ke Zou, Yang Bai, Zhihao Chen, Yang Zhou, Yidi Chen, Kai Ren, Meng Wang, Xuedong Yuan, Xiaojing Shen, Xiaochun Cao, Yih Chung Tham, Huazhu Fu

Main category: cs.CV

TL;DR: The paper introduces Medical Report Grounding (MRG) and proposes uMedGround, a framework using a multimodal large language model for end-to-end diagnostic phrase and grounding box identification, with uncertainty-aware predictions to enhance reliability.

DetailsMotivation: Current methods for medical phrase grounding rely on manual key phrase extraction, reducing efficiency and lacking model confidence estimation, which limits clinical trust.

Method: The uMedGround framework embeds a unique token ($<$$\mathtt{BOX}$$>$) into a multimodal large language model to predict diagnostic phrases and uses a vision encoder-decoder to generate grounding boxes, incorporating uncertainty-aware predictions.

Result: uMedGround outperforms state-of-the-art methods and fine-tuned large visual-language models, demonstrating effectiveness and reliability in MRG and other tasks like medical visual question answering.

Conclusion: This study pioneers the MRG task, showcasing uMedGround’s applicability in clinical settings for interpreting diverse textual inputs and improving diagnostic accuracy.

Abstract: Medical phrase grounding is crucial for identifying relevant regions in medical images based on phrase queries, facilitating accurate image analysis and diagnosis. However, current methods rely on manual extraction of key phrases from medical reports, reducing efficiency and increasing the workload for clinicians. Additionally, the lack of model confidence estimation limits clinical trust and usability. In this paper, we introduce a novel task called Medical Report Grounding (MRG), which aims to directly identify diagnostic phrases and their corresponding grounding boxes from medical reports in an end-to-end manner. To address this challenge, we propose uMedGround, a robust and reliable framework that leverages a multimodal large language model to predict diagnostic phrases by embedding a unique token, $<$$\mathtt{BOX}$$>$, into the vocabulary to enhance detection capabilities. A vision encoder-decoder processes the embedded token and input image to generate grounding boxes. Critically, uMedGround incorporates an uncertainty-aware prediction model, significantly improving the robustness and reliability of grounding predictions. Experimental results demonstrate that uMedGround outperforms state-of-the-art medical phrase grounding methods and fine-tuned large visual-language models, validating its effectiveness and reliability. This study represents a pioneering exploration of the MRG task, marking the first-ever endeavor in this domain. Additionally, we demonstrate the applicability of uMedGround in medical visual question answering and class-based localization tasks, where it highlights visual evidence aligned with key diagnostic phrases, supporting clinicians in interpreting various types of textual inputs, including free-text reports, visual question answering queries, and class labels.

[238] Towards Optimal Aggregation of Varying Range Dependencies in Haze Removal

Xiaozhe Zhang, Fengying Xie, Haidong Ding, Linpeng Pan, Zhenwei Shi

Main category: cs.CV

TL;DR: DehazeMatic integrates short- and long-range dependencies for haze removal, using a dual-stream design and CLIP-enhanced aggregation guided by haze density and semantics, outperforming existing methods.

DetailsMotivation: Existing methods specialize in either short- or long-range dependencies, but their integration is underexplored despite complementary strengths.

Method: Proposes DehazeMatic with a dual-stream design to capture both dependencies, and a CLIP-enhanced Dual-path Aggregator for optimized aggregation based on haze density and semantic information.

Result: DehazeMatic outperforms state-of-the-art methods across benchmarks, generating fine-grained haze density and semantic maps.

Conclusion: Explicit integration of short- and long-range dependencies with guided aggregation significantly improves haze removal performance.

Abstract: Haze removal aims to restore a clear image from a hazy input. Existing methods achieve notable success by specializing in either short-range dependencies to preserve local details or long-range dependencies to capture global context. Given the complementary strengths of both, a natural progression is to explicitly integrate them within a unified framework and enable their reasonable aggregation. However, this integration remains underexplored. In this paper, we propose DehazeMatic, which simultaneously and explicitly captures both short- and long-range dependencies through a dual-stream design. To optimize the contribution of dependencies at varying ranges, we conduct extensive experiments to identify key influencing factors and find that an effective aggregation mechanism should be guided by the joint consideration of haze density and semantic information. Building on these insights, we introduce the CLIP-enhanced Dual-path Aggregator, which not only enables the generation of fine-grained haze density maps for the first time, but also produces semantic maps within a shared backbone, ultimately leveraging both to instruct the aggregation process. Extensive experiments demonstrate that DehazeMatic outperforms state-of-the-art methods across multiple benchmarks.

[239] Attack Anything: Blind DNNs via Universal Background Adversarial Attack

Jiawei Lian, Shaohui Mei, Xiaofei Wang, Yi Wang, Lefan Wang, Yingjie Lu, Mingyang Ma, Lap-Pui Chau

Main category: cs.CV

TL;DR: The paper proposes a background adversarial attack framework that disrupts DNNs by altering the background, not the target objects, showing its effectiveness across diverse scenarios.

DetailsMotivation: Existing attacks focus on corrupting target objects or images, but this work explores background perturbations to reveal vulnerabilities in DNNs.

Method: The attack is formulated as an iterative optimization problem, with a new ensemble strategy and smooth constraints for seamless perturbation integration.

Result: Experiments in digital and physical domains confirm the method’s efficacy, highlighting the underestimated role of background variations in DNN robustness.

Conclusion: The findings challenge the reliability of DNNs and emphasize the need to reassess their robustness against background adversarial attacks.

Abstract: It has been widely substantiated that deep neural networks (DNNs) are susceptible and vulnerable to adversarial perturbations. Existing studies mainly focus on performing attacks by corrupting targeted objects (physical attack) or images (digital attack), which is intuitively acceptable and understandable in terms of the attack’s effectiveness. In contrast, our focus lies in conducting background adversarial attacks in both digital and physical domains, without causing any disruptions to the targeted objects themselves. Specifically, an effective background adversarial attack framework is proposed to attack anything, by which the attack efficacy generalizes well between diverse objects, models, and tasks. Technically, we approach the background adversarial attack as an iterative optimization problem, analogous to the process of DNN learning. Besides, we offer a theoretical demonstration of its convergence under a set of mild but sufficient conditions. To strengthen the attack efficacy and transferability, we propose a new ensemble strategy tailored for adversarial perturbations and introduce an improved smooth constraint for the seamless connection of integrated perturbations. We conduct comprehensive and rigorous experiments in both digital and physical domains across various objects, models, and tasks, demonstrating the effectiveness of attacking anything of the proposed method. The findings of this research substantiate the significant discrepancy between human and machine vision on the value of background variations, which play a far more critical role than previously recognized, necessitating a reevaluation of the robustness and reliability of DNNs. The code will be publicly available at https://github.com/JiaweiLian/Attack_Anything

[240] Dynamic 2D Gaussians: Geometrically Accurate Radiance Fields for Dynamic Objects

Shuai Zhang, Guanjun Wu, Zhoufeng Xie, Xinggang Wang, Bin Feng, Wenyu Liu

Main category: cs.CV

TL;DR: The paper introduces Dynamic 2D Gaussians (D-2DGS), a novel representation for reconstructing high-quality meshes from sparse image inputs, addressing limitations of current 4D representations.

DetailsMotivation: Current 4D representations fail to reconstruct high-quality meshes due to implicit or geometrically inaccurate representations.

Method: The method uses 2D Gaussians for basic geometry and sparse-controlled points to capture deformation, leveraging object masks and depth maps to remove reconstruction artifacts.

Result: Experiments show D-2DGS excels in reconstructing detailed, smooth meshes from sparse inputs.

Conclusion: D-2DGS effectively reconstructs high-quality dynamic mesh sequences, with code publicly available.

Abstract: Reconstructing objects and extracting high-quality surfaces play a vital role in the real world. Current 4D representations show the ability to render high-quality novel views for dynamic objects, but cannot reconstruct high-quality meshes due to their implicit or geometrically inaccurate representations. In this paper, we propose a novel representation that can reconstruct accurate meshes from sparse image input, named Dynamic 2D Gaussians (D-2DGS). We adopt 2D Gaussians for basic geometry representation and use sparse-controlled points to capture the 2D Gaussian’s deformation. By extracting the object mask from the rendered high-quality image and masking the rendered depth map, we remove floaters that are prone to occur during reconstruction and can extract high-quality dynamic mesh sequences of dynamic objects. Experiments demonstrate that our D-2DGS is outstanding in reconstructing detailed and smooth high-quality meshes from sparse inputs. The code is available at https://github.com/hustvl/Dynamic-2DGS.

[241] TextMaster: A Unified Framework for Realistic Text Editing via Glyph-Style Dual-Control

Zhenyu Yan, Jian Wang, Aoqiang Wang, Yuhan Li, Wenxiang Shang, Ran Lin

Main category: cs.CV

TL;DR: TextMaster improves text editing in images with high accuracy, layout control, and style transfer using glyph info, perceptual loss, and attention mechanisms.

DetailsMotivation: Existing methods lack accuracy and controllability for complex text editing in images, leading to high costs.

Method: Incorporates high-resolution glyph info, perceptual loss, and attention for layout learning. Introduces style injection for controllable style transfer.

Result: Achieves state-of-the-art performance in text editing accuracy and style control.

Conclusion: TextMaster effectively addresses limitations in text editing, offering high accuracy and style controllability.

Abstract: In image editing tasks, high-quality text editing capabilities can significantly reduce both human and material resource costs. Existing methods, however, face significant limitations in terms of stroke accuracy for complex text and controllability of generated text styles. To address these challenges, we propose TextMaster, a solution capable of accurately editing text across various scenarios and image regions, while ensuring proper layout and controllable text style. Our method enhances the accuracy and fidelity of text rendering by incorporating high-resolution standard glyph information and applying perceptual loss within the text editing region. Additionally, we leverage an attention mechanism to compute intermediate layer bounding box regression loss for each character, enabling the model to learn text layout across varying contexts. Furthermore, we propose a novel style injection technique that enables controllable style transfer for the injected text. Through comprehensive experiments, we demonstrate the state-of-the-art performance of our method.

[242] IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves

Ruofan Wang, Juncheng Li, Yixu Wang, Bo Wang, Xiaosen Wang, Yan Teng, Yingchun Wang, Xingjun Ma, Yu-Gang Jiang

Main category: cs.CV

TL;DR: IDEATOR is a novel method for generating malicious image-text pairs to jailbreak Vision-Language Models (VLMs), achieving high success rates and transferability. It also introduces VLJailbreakBench, a safety benchmark revealing gaps in VLM defenses.

DetailsMotivation: Ensuring safe deployment of VLMs is critical, but current methods rely on limited, ineffective adversarial data. IDEATOR addresses this by autonomously generating diverse and effective jailbreak prompts.

Method: IDEATOR uses a VLM to create targeted jailbreak texts and pairs them with images from a diffusion model. It tests these pairs on VLMs, achieving high attack success rates with minimal queries.

Result: IDEATOR achieves a 94% attack success rate on MiniGPT-4 and high transferability to other VLMs. VLJailbreakBench reveals significant safety gaps, with ASRs of 46.31% on GPT-4o and 19.65% on Claude-3.5-Sonnet.

Conclusion: IDEATOR demonstrates the vulnerability of VLMs to automated jailbreak attacks and highlights the urgent need for stronger safety defenses, as evidenced by VLJailbreakBench results.

Abstract: As large Vision-Language Models (VLMs) gain prominence, ensuring their safe deployment has become critical. Recent studies have explored VLM robustness against jailbreak attacks-techniques that exploit model vulnerabilities to elicit harmful outputs. However, the limited availability of diverse multimodal data has constrained current approaches to rely heavily on adversarial or manually crafted images derived from harmful text datasets, which often lack effectiveness and diversity across different contexts. In this paper, we propose IDEATOR, a novel jailbreak method that autonomously generates malicious image-text pairs for black-box jailbreak attacks. IDEATOR is grounded in the insight that VLMs themselves could serve as powerful red team models for generating multimodal jailbreak prompts. Specifically, IDEATOR leverages a VLM to create targeted jailbreak texts and pairs them with jailbreak images generated by a state-of-the-art diffusion model. Extensive experiments demonstrate IDEATOR’s high effectiveness and transferability, achieving a 94% attack success rate (ASR) in jailbreaking MiniGPT-4 with an average of only 5.34 queries, and high ASRs of 82%, 88%, and 75% when transferred to LLaVA, InstructBLIP, and Chameleon, respectively. Building on IDEATOR’s strong transferability and automated process, we introduce the VLJailbreakBench, a safety benchmark comprising 3,654 multimodal jailbreak samples. Our benchmark results on 11 recently released VLMs reveal significant gaps in safety alignment. For instance, our challenge set achieves ASRs of 46.31% on GPT-4o and 19.65% on Claude-3.5-Sonnet, underscoring the urgent need for stronger defenses.

[243] Adaptive Augmentation Policy Optimization with LLM Feedback

Ant Duru, Alptekin Temizel

Main category: cs.CV

TL;DR: Proposes LLM-guided data augmentation optimization, refining policies iteratively to improve model performance with reduced computational costs.

DetailsMotivation: Traditional augmentation methods are computationally expensive and dataset-specific; LLMs offer dynamic, context-aware solutions.

Method: Two approaches: (1) Iterative refinement of LLM-selected policies, and (2) Adaptive policy adjustment based on performance metrics.

Result: Experiments show consistent accuracy improvements in domain-specific image classification tasks.

Conclusion: LLM-guided augmentation is efficient and effective, outperforming traditional methods.

Abstract: Data augmentation is a critical component of deep learning pipelines, enhancing model generalization by increasing dataset diversity. Traditional augmentation strategies rely on manually designed transformations, stochastic sampling, or automated search-based approaches. Although automated methods improve performance, they often require extensive computational resources and are specifically designed for certain datasets. In this work, we propose a Large Language Model (LLM)-guided augmentation optimization strategy that refines augmentation policies based on model performance feedback. We propose two approaches: (1) LLM-Guided Augmentation Policy Optimization, where augmentation policies selected by LLM are refined iteratively across training cycles, and (2) Adaptive LLM-Guided Augmentation Policy Optimization, which adjusts policies at each iteration based on performance metrics. This in-training approach eliminates the need for full model retraining before getting LLM feedback, reducing computational costs while increasing performance. Our methodology employs an LLM to dynamically select augmentation transformations based on dataset characteristics, model architecture, and prior training performance. Leveraging LLMs’ contextual knowledge, especially in domain-specific tasks like medical imaging, our method selects augmentations tailored to dataset characteristics and model performance. Experiments across domain-specific image classification datasets show consistent accuracy improvements over traditional methods.

[244] CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM

Jingwei Xu, Chenyu Wang, Zibo Zhao, Wen Liu, Yi Ma, Shenghua Gao

Main category: cs.CV

TL;DR: CAD-MLLM is a unified system for generating parametric CAD models from multimodal inputs (text, images, point clouds) using advanced LLMs, outperforming existing methods.

DetailsMotivation: To create a versatile CAD generation system that accommodates diverse input modalities (text, images, point clouds) for seamless model creation.

Method: Leverages CAD command sequences and LLMs to align multimodal data with CAD vectorized representations, supported by the Omni-CAD dataset (450K instances).

Result: CAD-MLLM outperforms existing methods, showing robustness to noise and missing data, with new evaluation metrics for topology and surface quality.

Conclusion: CAD-MLLM is a groundbreaking system for multimodal CAD generation, validated by extensive experiments and a comprehensive dataset.

Abstract: This paper aims to design a unified Computer-Aided Design (CAD) generation system that can easily generate CAD models based on the user’s inputs in the form of textual description, images, point clouds, or even a combination of them. Towards this goal, we introduce the CAD-MLLM, the first system capable of generating parametric CAD models conditioned on the multimodal input. Specifically, within the CAD-MLLM framework, we leverage the command sequences of CAD models and then employ advanced large language models (LLMs) to align the feature space across these diverse multi-modalities data and CAD models’ vectorized representations. To facilitate the model training, we design a comprehensive data construction and annotation pipeline that equips each CAD model with corresponding multimodal data. Our resulting dataset, named Omni-CAD, is the first multimodal CAD dataset that contains textual description, multi-view images, points, and command sequence for each CAD model. It contains approximately 450K instances and their CAD construction sequences. To thoroughly evaluate the quality of our generated CAD models, we go beyond current evaluation metrics that focus on reconstruction quality by introducing additional metrics that assess topology quality and surface enclosure extent. Extensive experimental results demonstrate that CAD-MLLM significantly outperforms existing conditional generative methods and remains highly robust to noises and missing points. The project page and more visualizations can be found at: https://cad-mllm.github.io/

Faraz Waseem, Muhammad Shahzad

Main category: cs.CV

TL;DR: The paper surveys long video generation challenges, current techniques (GANs, diffusion models), and future research directions, highlighting limitations like planning and consistency.

DetailsMotivation: Despite progress in multimodal large language models (MLLMs), generating extended videos remains difficult due to complexities like planning, story development, and consistency.

Method: The survey examines foundational techniques (GANs, diffusion models), video generation strategies, datasets, quality metrics, and scalability approaches like divide-and-conquer.

Result: Current systems like OpenAI’s Sora are limited to short videos (~1 minute), with long video generation requiring advancements in scalability and control.

Conclusion: The paper provides a comprehensive foundation for future research in long video generation, addressing current limitations and potential improvements.

Abstract: An image may convey a thousand words, but a video composed of hundreds or thousands of image frames tells a more intricate story. Despite significant progress in multimodal large language models (MLLMs), generating extended videos remains a formidable challenge. As of this writing, OpenAI’s Sora, the current state-of-the-art system, is still limited to producing videos that are up to one minute in length. This limitation stems from the complexity of long video generation, which requires more than generative AI techniques for approximating density functions essential aspects such as planning, story development, and maintaining spatial and temporal consistency present additional hurdles. Integrating generative AI with a divide-and-conquer approach could improve scalability for longer videos while offering greater control. In this survey, we examine the current landscape of long video generation, covering foundational techniques like GANs and diffusion models, video generation strategies, large-scale training datasets, quality metrics for evaluating long videos, and future research areas to address the limitations of the existing video generation capabilities. We believe it would serve as a comprehensive foundation, offering extensive information to guide future advancements and research in the field of long video generation.

[246] Neuro-3D: Towards 3D Visual Decoding from EEG Signals

Zhanqiang Guo, Jiamin Wu, Yonghao Song, Jiahui Bu, Weijian Mai, Qihao Zheng, Wanli Ouyang, Chunfeng Song

Main category: cs.CV

TL;DR: The paper introduces EEG-3D, a dataset for decoding 3D visual perception from EEG signals, and proposes Neuro-3D, a framework to reconstruct 3D objects from EEG data.

DetailsMotivation: To understand how the brain processes 3D visual stimuli and develop a method to decode this perception using EEG signals.

Method: Uses EEG-3D dataset and Neuro-3D framework to integrate EEG features from static/dynamic stimuli, employing a diffusion-based decoder for 3D object reconstruction.

Result: Neuro-3D successfully reconstructs colored 3D objects and provides insights into neural representations and brain regions.

Conclusion: The work pioneers EEG-based 3D visual decoding, offering a new tool for neuroscience research with publicly available data and code.

Abstract: Human’s perception of the visual world is shaped by the stereo processing of 3D information. Understanding how the brain perceives and processes 3D visual stimuli in the real world has been a longstanding endeavor in neuroscience. Towards this goal, we introduce a new neuroscience task: decoding 3D visual perception from EEG signals, a neuroimaging technique that enables real-time monitoring of neural dynamics enriched with complex visual cues. To provide the essential benchmark, we first present EEG-3D, a pioneering dataset featuring multimodal analysis data and extensive EEG recordings from 12 subjects viewing 72 categories of 3D objects rendered in both videos and images. Furthermore, we propose Neuro-3D, a 3D visual decoding framework based on EEG signals. This framework adaptively integrates EEG features derived from static and dynamic stimuli to learn complementary and robust neural representations, which are subsequently utilized to recover both the shape and color of 3D objects through the proposed diffusion-based colored point cloud decoder. To the best of our knowledge, we are the first to explore EEG-based 3D visual decoding. Experiments indicate that Neuro-3D not only reconstructs colored 3D objects with high fidelity, but also learns effective neural representations that enable insightful brain region analysis. The dataset and associated code will be made publicly available.

[247] CHIRP: A Fine-Grained Benchmark for Open-Ended Response Evaluation in Vision-Language Models

Alexis Roger, Prateek Humane, Daniel Z. Kaplan, Kshitij Gupta, Qi Sun, George Adamopoulos, Jonathan Siu Chi Lim, Quentin Anthony, Edwin Fennell, Irina Rish

Main category: cs.CV

TL;DR: The paper introduces Robin, a suite of Vision-Language Models (VLMs), and CHIRP, a benchmark for robust VLM evaluation, addressing limitations in current methods.

DetailsMotivation: Existing VLM evaluation techniques lack rigor and comprehensiveness, necessitating improved methods and benchmarks.

Method: Developed Robin by combining Large Language Models (LLMs) and Vision Encoders (VEs) at multiple scales, and created CHIRP for long-form response evaluation.

Result: Identified shortcomings in current VLM evaluation approaches and introduced new tools (Robin and CHIRP) for better assessment.

Conclusion: The work provides open-access tools to enhance reproducibility and advance VLM research.

Abstract: The proliferation of Vision-Language Models (VLMs) in the past several years calls for rigorous and comprehensive evaluation methods and benchmarks. This work analyzes existing VLM evaluation techniques, including automated metrics, AI-based assessments, and human evaluations across diverse tasks. We first introduce Robin - a novel suite of VLMs that we built by combining Large Language Models (LLMs) and Vision Encoders (VEs) at multiple scales, and use Robin to identify shortcomings of current evaluation approaches across scales. Next, to overcome the identified limitations, we introduce CHIRP - a new long form response benchmark we developed for more robust and complete VLM evaluation. We provide open access to the Robin training code, model suite, and CHIRP benchmark to promote reproducibility and advance VLM research.

[248] Differentially Private Adaptation of Diffusion Models via Noisy Aggregated Embeddings

Pura Peetathawatchai, Wei-Ning Chen, Berivan Isik, Sanmi Koyejo, Albert No

Main category: cs.CV

TL;DR: DPAgg-TI, a method using Textual Inversion with differential privacy, outperforms DP-SGD in utility and robustness for small, sensitive datasets.

DetailsMotivation: Addressing privacy risks in personalizing large-scale diffusion models for small, sensitive datasets, where DP-SGD fails due to high noise.

Method: Leverages Textual Inversion (TI) to learn embeddings, adding calibrated noise for differential privacy (DP) guarantees.

Result: DPAgg-TI matches non-private baseline performance in style adaptation tasks, unlike DP-SGD, which fails under the same privacy budget.

Conclusion: DPAgg-TI is a superior alternative to DP-SGD for private adaptation of diffusion models, balancing privacy and utility effectively.

Abstract: Personalizing large-scale diffusion models poses serious privacy risks, especially when adapting to small, sensitive datasets. A common approach is to fine-tune the model using differentially private stochastic gradient descent (DP-SGD), but this suffers from severe utility degradation due to the high noise needed for privacy, particularly in the small data regime. We propose an alternative that leverages Textual Inversion (TI), which learns an embedding vector for an image or set of images, to enable adaptation under differential privacy (DP) constraints. Our approach, Differentially Private Aggregation via Textual Inversion (DPAgg-TI), adds calibrated noise to the aggregation of per-image embeddings to ensure formal DP guarantees while preserving high output fidelity. We show that DPAgg-TI outperforms DP-SGD finetuning in both utility and robustness under the same privacy budget, achieving results closely matching the non-private baseline on style adaptation tasks using private artwork from a single artist and Paris 2024 Olympic pictograms. In contrast, DP-SGD fails to generate meaningful outputs in this setting.

[249] 4D Scaffold Gaussian Splatting with Dynamic-Aware Anchor Growing for Efficient and High-Fidelity Dynamic Scene Reconstruction

Woong Oh Cho, In Cho, Seoha Kim, Jeongmin Bae, Youngjung Uh, Seon Joo Kim

Main category: cs.CV

TL;DR: A novel 4D anchor-based framework reduces storage costs for dynamic scenes by compressing Gaussians into compact grid-aligned features, improving rendering quality without reducing Gaussian count.

DetailsMotivation: Addressing the storage overhead of 4D Gaussian models for dynamic scenes while maintaining high visual fidelity.

Method: Uses 4D anchor features processed by an MLP to spawn neural 4D Gaussians, with a dynamic-aware anchor growing strategy for better reconstruction in dynamic regions.

Result: Achieves state-of-the-art visual quality in dynamic regions with practical storage costs, outperforming baselines.

Conclusion: The method effectively balances storage efficiency and rendering quality for dynamic scenes.

Abstract: Modeling dynamic scenes through 4D Gaussians offers high visual fidelity and fast rendering speeds, but comes with significant storage overhead. Recent approaches mitigate this cost by aggressively reducing the number of Gaussians. However, this inevitably removes Gaussians essential for high-quality rendering, leading to severe degradation in dynamic regions. In this paper, we introduce a novel 4D anchor-based framework that tackles the storage cost in different perspective. Rather than reducing the number of Gaussians, our method retains a sufficient quantity to accurately model dynamic contents, while compressing them into compact, grid-aligned 4D anchor features. Each anchor is processed by an MLP to spawn a set of neural 4D Gaussians, which represent a local spatiotemporal region. We design these neural 4D Gaussians to capture temporal changes with minimal parameters, making them well-suited for the MLP-based spawning. Moreover, we introduce a dynamic-aware anchor growing strategy to effectively assign additional anchors to under-reconstructed dynamic regions. Our method adjusts the accumulated gradients with Gaussians’ temporal coverage, significantly improving reconstruction quality in dynamic regions. Experimental results highlight that our method achieves state-of-the-art visual quality in dynamic regions, outperforming all baselines by a large margin with practical storage costs.

[250] EvRT-DETR: Latent Space Adaptation of Image Detectors for Event-based Vision

Dmitrii Torbunov, Yihui Ren, Animesh Ghose, Odera Dim, Yonggang Cui

Main category: cs.CV

TL;DR: I2EvDet bridges mainstream object detection with event-based camera data, achieving state-of-the-art performance via minimal architectural adaptations.

DetailsMotivation: Event-based cameras (EBCs) offer advantages like power efficiency and high dynamic range, but their sparse, asynchronous data challenges traditional image analysis methods.

Method: The paper introduces I2EvDet, adapting RT-DETR (a natural image detector) to EBC data with minimal architectural changes, creating EvRT-DETR.

Result: EvRT-DETR outperforms specialized EBC methods on Gen1 (+2.3 mAP) and 1Mpx/Gen4 (+1.4 mAP) benchmarks.

Conclusion: The work presents a novel, efficient approach to EBC object detection by adapting mainstream architectures, with broader applications in temporal visual domains.

Abstract: Event-based cameras (EBCs) have emerged as a bio-inspired alternative to traditional cameras, offering advantages in power efficiency, temporal resolution, and high dynamic range. However, the development of image analysis methods for EBCs is challenging due to the sparse and asynchronous nature of the data. This work addresses the problem of object detection for EBC cameras. The current approaches to EBC object detection focus on constructing complex data representations and rely on specialized architectures. We introduce I2EvDet (Image-to-Event Detection), a novel adaptation framework that bridges mainstream object detection with temporal event data processing. First, we demonstrate that a Real-Time DEtection TRansformer, or RT-DETR, a state-of-the-art natural image detector, trained on a simple image-like representation of the EBC data achieves performance comparable to specialized EBC methods. Next, as part of our framework, we develop an efficient adaptation technique that transforms image-based detectors into event-based detection models by modifying their frozen latent representation space through minimal architectural additions. The resulting EvRT-DETR model reaches state-of-the-art performance on the standard benchmark datasets Gen1 (mAP $+2.3$) and 1Mpx/Gen4 (mAP $+1.4$). These results demonstrate a fundamentally new approach to EBC object detection through principled adaptation of mainstream architectures, offering an efficient alternative with potential applications to other temporal visual domains. The code is available at: https://github.com/realtime-intelligence/evrt-detr

[251] FedSemiDG: Domain Generalized Federated Semi-supervised Medical Image Segmentation

Zhipeng Deng, Zhe Xu, Tsuyoshi Isshiki, Yefeng Zheng

Main category: cs.CV

TL;DR: The paper introduces FGASL, a framework for domain generalized federated semi-supervised learning (FedSemiDG), addressing domain shift in medical image segmentation by combining global and local strategies for better generalization.

DetailsMotivation: The diversity of medical images and lack of labeled data motivate federated semi-supervised learning (FSSL), but domain shift remains under-explored, leading to poor performance in unseen domains.

Method: FGASL includes Generalization-Aware Aggregation (GAA) for global model weighting, Dual-Teacher Adaptive Pseudo Label Refinement (DR) for local knowledge integration, and Perturbation-Invariant Alignment (PIA) for domain-invariant features.

Result: Experiments on four medical segmentation tasks show FGASL outperforms state-of-the-art FSSL and domain generalization methods, achieving robust generalization.

Conclusion: FGASL effectively addresses domain shift in FedSemiDG, improving generalization in medical image segmentation with limited labeled data.

Abstract: Medical image segmentation is challenging due to the diversity of medical images and the lack of labeled data, which motivates recent developments in federated semi-supervised learning (FSSL) to leverage a large amount of unlabeled data from multiple centers for model training without sharing raw data. However, what remains under-explored in FSSL is the domain shift problem which may cause suboptimal model aggregation and low effectivity of the utilization of unlabeled data, eventually leading to unsatisfactory performance in unseen domains. In this paper, we explore this previously ignored scenario, namely domain generalized federated semi-supervised learning (FedSemiDG), which aims to learn a model in a distributed manner from multiple domains with limited labeled data and abundant unlabeled data such that the model can generalize well to unseen domains. We present a novel framework, Federated Generalization-Aware SemiSupervised Learning (FGASL), to address the challenges in FedSemiDG by effectively tackling critical issues at both global and local levels. Globally, we introduce Generalization-Aware Aggregation (GAA), assigning adaptive weights to local models based on their generalization performance. Locally, we use a Dual-Teacher Adaptive Pseudo Label Refinement (DR) strategy to combine global and domain-specific knowledge, generating more reliable pseudo labels. Additionally, Perturbation-Invariant Alignment (PIA) enforces feature consistency under perturbations, promoting domain-invariant learning. Extensive experiments on four medical segmentation tasks (cardiac MRI, spine MRI, bladder cancer MRI and colorectal polyp) demonstrate that our method significantly outperforms state-of-the-art FSSL and domain generalization approaches, achieving robust generalization on unseen domains.

[252] A Causal Framework for Aligning Image Quality Metrics and Deep Neural Network Robustness

Nathan Drenkow, Mathias Unberath

Main category: cs.CV

TL;DR: The paper investigates whether conventional image quality assessment (IQA) metrics predict deep neural network (DNN) performance, finds them weak predictors, and proposes new metrics aligned with DNN sensitivities.

DetailsMotivation: To address the gap between conventional IQA metrics (aligned with human perception) and DNN performance sensitivity to image quality.

Method: Theoretical and empirical analysis of conventional IQA metrics, followed by development of new metrics using a causal framework.

Result: Conventional IQA metrics are weak predictors of DNN performance; new metrics show strong correlation with DNN performance.

Conclusion: Proposed metrics effectively estimate image quality distribution for vision tasks, bridging the gap between IQA and DNN performance.

Abstract: Image quality plays an important role in the performance of deep neural networks (DNNs) that have been widely shown to exhibit sensitivity to changes in imaging conditions. Conventional image quality assessment (IQA) seeks to measure and align quality relative to human perceptual judgments, but we often need a metric that is not only sensitive to imaging conditions but also well-aligned with DNN sensitivities. We first ask whether conventional IQA metrics are also informative of DNN performance. We show theoretically and empirically that conventional IQA metrics are weak predictors of DNN performance for image classification. Using our causal framework, we then develop metrics that exhibit strong correlation with DNN performance, thus enabling us to effectively estimate the quality distribution of large image datasets relative to targeted vision tasks.

[253] RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm

Tiancheng Gu, Kaicheng Yang, Chaoyi Zhang, Yin Xie, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng

Main category: cs.CV

TL;DR: The paper introduces RealSyn, a dataset combining realistic and synthetic texts for contrastive vision-language learning, achieving state-of-the-art performance in various tasks.

DetailsMotivation: To leverage underutilized multimodal interleaved documents for contrastive vision-language representation learning.

Method: Establishes a data extraction pipeline, hierarchical retrieval, image semantic augmentation, and semantic balance sampling to create RealSyn.

Result: Models trained on RealSyn outperform others in linear probe, zero-shot transfer, robustness, and retrieval tasks.

Conclusion: RealSyn enhances vision-language learning and is scalable, with datasets and model weights publicly released.

Abstract: After pre-training on extensive image-text pairs, Contrastive Language-Image Pre-training (CLIP) demonstrates promising performance on a wide variety of benchmarks. However, a substantial volume of multimodal interleaved documents remains underutilized for contrastive vision-language representation learning. To fully leverage these unpaired documents, we initially establish a Real-World Data Extraction pipeline to extract high-quality images and texts. Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts. To further enhance fine-grained visual information, we propose an image semantic augmented generation module for synthetic text production. Furthermore, we employ a semantic balance sampling strategy to improve dataset diversity, enabling better learning of long-tail concepts. Based on these innovations, we construct RealSyn, a dataset combining realistic and synthetic texts, available in three scales: 15M, 30M, and 100M. We compare our dataset with other widely used datasets of equivalent scale for CLIP training. Models pre-trained on RealSyn consistently achieve state-of-the-art performance across various downstream tasks, including linear probe, zero-shot transfer, zero-shot robustness, and zero-shot retrieval. Furthermore, extensive experiments confirm that RealSyn significantly enhances contrastive vision-language representation learning and demonstrates robust scalability. To facilitate future research, the RealSyn dataset and pretrained model weights are released at https://github.com/deepglint/RealSyn.

[254] ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness

Boqian Li, Haiwen Feng, Zeyu Cai, Michael J. Black, Yuliang Xiu

Main category: cs.CV

TL;DR: ETCH proposes a novel pipeline for fitting 3D clothed human point clouds by leveraging SE(3) equivariance and tightness mapping, outperforming state-of-the-art methods in accuracy and generalization.

DetailsMotivation: Traditional methods are sensitive to pose initialization, and learning-based approaches struggle with diverse poses and garments. ETCH aims to address these limitations.

Method: ETCH uses locally approximate SE(3) equivariance to map cloth-to-body surfaces and regresses pose-invariant body markers for simplified fitting.

Result: ETCH outperforms existing methods in body fitting accuracy (16.7% ~ 69.5%) and shape accuracy (49.9%), with significant error reduction in one-shot settings.

Conclusion: ETCH demonstrates strong generalization across poses, shapes, and clothing, offering a robust solution for 3D clothed human fitting.

Abstract: Fitting a body to a 3D clothed human point cloud is a common yet challenging task. Traditional optimization-based approaches use multi-stage pipelines that are sensitive to pose initialization, while recent learning-based methods often struggle with generalization across diverse poses and garment types. We propose Equivariant Tightness Fitting for Clothed Humans, or ETCH, a novel pipeline that estimates cloth-to-body surface mapping through locally approximate SE(3) equivariance, encoding tightness as displacement vectors from the cloth surface to the underlying body. Following this mapping, pose-invariant body features regress sparse body markers, simplifying clothed human fitting into an inner-body marker fitting task. Extensive experiments on CAPE and 4D-Dress show that ETCH significantly outperforms state-of-the-art methods – both tightness-agnostic and tightness-aware – in body fitting accuracy on loose clothing (16.7% ~ 69.5%) and shape accuracy (average 49.9%). Our equivariant tightness design can even reduce directional errors by (67.2% ~ 89.8%) in one-shot (or out-of-distribution) settings (~ 1% data). Qualitative results demonstrate strong generalization of ETCH, regardless of challenging poses, unseen shapes, loose clothing, and non-rigid dynamics. We will release the code and models soon for research purposes at https://boqian-li.github.io/ETCH/.

[255] Information Bottleneck-Guided Heterogeneous Graph Learning for Interpretable Neurodevelopmental Disorder Diagnosis

Yueyang Li, Lei Chen, Wenhao Dong, Shengyu Gong, Zijian Kang, Boyang Wei, Weiming Zeng, Hongjie Yan, Lingbin Bian, Zhiguo Zhang, Wai Ting Siok, Nizhuan Wang

Main category: cs.CV

TL;DR: The paper proposes I2B-HGNN, an interpretable framework for diagnosing neurodevelopmental disorders (NDDs) by integrating multimodal neuroimaging and demographic data using information bottleneck principles.

DetailsMotivation: Existing methods lack interpretability in extracting biomarkers from fMRI data and fail to effectively integrate multimodal data, limiting their diagnostic utility for NDDs.

Method: I2B-HGNN combines an Information Bottleneck Graph Transformer (IBGraphFormer) for biomarker identification and an Information Bottleneck Heterogeneous Graph Attention Network (IB-HGAN) for interpretable multimodal data fusion.

Result: I2B-HGNN outperforms existing methods in NDD diagnosis, offering high accuracy and interpretable biomarker identification.

Conclusion: The proposed framework advances interpretable modeling for NDD diagnosis by effectively integrating and analyzing multimodal data.

Abstract: Developing interpretable models for neurodevelopmental disorders (NDDs) diagnosis presents significant challenges in effectively encoding, decoding, and integrating multimodal neuroimaging data. While many existing machine learning approaches have shown promise in brain network analysis, they typically suffer from limited interpretability, particularly in extracting meaningful biomarkers from functional magnetic resonance imaging (fMRI) data and establishing clear relationships between imaging features and demographic characteristics. Besides, current graph neural network methodologies face limitations in capturing both local and global functional connectivity patterns while simultaneously achieving theoretically principled multimodal data fusion. To address these challenges, we propose the Interpretable Information Bottleneck Heterogeneous Graph Neural Network (I2B-HGNN), a unified framework that applies information bottleneck principles to guide both brain connectivity modeling and cross-modal feature integration. This framework comprises two complementary components. The first is the Information Bottleneck Graph Transformer (IBGraphFormer), which combines transformer-based global attention mechanisms with graph neural networks through information bottleneck-guided pooling to identify sufficient biomarkers. The second is the Information Bottleneck Heterogeneous Graph Attention Network (IB-HGAN), which employs meta-path-based heterogeneous graph learning with structural consistency constraints to achieve interpretable fusion of neuroimaging and demographic data. The experimental results demonstrate that I2B-HGNN achieves superior performance in diagnosing NDDs, exhibiting both high classification accuracy and the ability to provide interpretable biomarker identification while effectively analyzing non-imaging data.

[256] Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model

Jannik Endres, Oliver Hahn, Charles Corbière, Simone Schaub-Meyer, Stefan Roth, Alexandre Alahi

Main category: cs.CV

TL;DR: DFI-OmniStereo is a novel omnidirectional stereo matching method using a pre-trained foundation model for monocular depth estimation, achieving state-of-the-art results with a 16% reduction in disparity MAE.

DetailsMotivation: Omnidirectional depth perception is crucial for robotics, but existing stereo matching methods lack accuracy due to limited real-world data.

Method: Leverages a pre-trained foundation model for monocular depth estimation within an iterative optimization-based stereo matching framework, with a two-stage training strategy.

Result: Achieves state-of-the-art performance on the Helvipad dataset, reducing disparity MAE by ~16%.

Conclusion: DFI-OmniStereo significantly improves omnidirectional stereo matching accuracy, addressing limitations of existing methods.

Abstract: Omnidirectional depth perception is essential for mobile robotics applications that require scene understanding across a full 360{\deg} field of view. Camera-based setups offer a cost-effective option by using stereo depth estimation to generate dense, high-resolution depth maps without relying on expensive active sensing. However, existing omnidirectional stereo matching approaches achieve only limited depth accuracy across diverse environments, depth ranges, and lighting conditions, due to the scarcity of real-world data. We present DFI-OmniStereo, a novel omnidirectional stereo matching method that leverages a large-scale pre-trained foundation model for relative monocular depth estimation within an iterative optimization-based stereo matching architecture. We introduce a dedicated two-stage training strategy to utilize the relative monocular depth features for our omnidirectional stereo matching before scale-invariant fine-tuning. DFI-OmniStereo achieves state-of-the-art results on the real-world Helvipad dataset, reducing disparity MAE by approximately 16% compared to the previous best omnidirectional stereo method.

[257] Long-tailed Adversarial Training with Self-Distillation

Seungju Cho, Hongsin Lee, Changick Kim

Main category: cs.CV

TL;DR: The paper addresses adversarial robustness in long-tailed datasets, proposing a self-distillation technique using a balanced self-teacher model to improve tail class performance.

DetailsMotivation: Adversarial training struggles with tail classes in long-tailed distributions due to data scarcity. Existing methods combine long-tailed natural training with adversarial robustness techniques, but performance remains limited.

Method: A novel self-distillation technique is introduced, leveraging a balanced self-teacher model trained on a balanced subset of the long-tailed dataset.

Result: The method achieves state-of-the-art performance in clean and robust accuracy, with significant tail class improvements (e.g., 20.3% on CIFAR-10).

Conclusion: The proposed self-distillation technique effectively enhances adversarial robustness for long-tailed distributions, particularly for tail classes.

Abstract: Adversarial training significantly enhances adversarial robustness, yet superior performance is predominantly achieved on balanced datasets. Addressing adversarial robustness in the context of unbalanced or long-tailed distributions is considerably more challenging, mainly due to the scarcity of tail data instances. Previous research on adversarial robustness within long-tailed distributions has primarily focused on combining traditional long-tailed natural training with existing adversarial robustness methods. In this study, we provide an in-depth analysis for the challenge that adversarial training struggles to achieve high performance on tail classes in long-tailed distributions. Furthermore, we propose a simple yet effective solution to advance adversarial robustness on long-tailed distributions through a novel self-distillation technique. Specifically, this approach leverages a balanced self-teacher model, which is trained using a balanced dataset sampled from the original long-tailed dataset. Our extensive experiments demonstrate state-of-the-art performance in both clean and robust accuracy for long-tailed adversarial robustness, with significant improvements in tail class performance on various datasets. We improve the accuracy against PGD attacks for tail classes by 20.3, 7.1, and 3.8 percentage points on CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively, while achieving the highest robust accuracy.

[258] Graph Attention-Driven Bayesian Deep Unrolling for Dual-Peak Single-Photon Lidar Imaging

Kyungmin Choi, JaKeoung Koo, Stephen McLaughlin, Abderrahim Halimi

Main category: cs.CV

TL;DR: A deep unrolling algorithm for dual-peak single-photon Lidar imaging is proposed, combining statistical and learning-based methods for accuracy and uncertainty quantification.

DetailsMotivation: Addressing the limitations of existing methods (statistical and deep learning) in noisy environments with multiple targets per pixel.

Method: Hierarchical Bayesian model for multiple targets, neural network unrolling, dual depth maps representation, and geometric deep learning.

Result: Competitive performance on synthetic and real data, with added uncertainty information.

Conclusion: The method successfully balances accuracy, robustness, and interpretability for dual-peak Lidar imaging.

Abstract: Single-photon Lidar imaging offers a significant advantage in 3D imaging due to its high resolution and long-range capabilities, however it is challenging to apply in noisy environments with multiple targets per pixel. To tackle these challenges, several methods have been proposed. Statistical methods demonstrate interpretability on the inferred parameters, but they are often limited in their ability to handle complex scenes. Deep learning-based methods have shown superior performance in terms of accuracy and robustness, but they lack interpretability or they are limited to a single-peak per pixel. In this paper, we propose a deep unrolling algorithm for dual-peak single-photon Lidar imaging. We introduce a hierarchical Bayesian model for multiple targets and propose a neural network that unrolls the underlying statistical method. To support multiple targets, we adopt a dual depth maps representation and exploit geometric deep learning to extract features from the point cloud. The proposed method takes advantages of statistical methods and learning-based methods in terms of accuracy and quantifying uncertainty. The experimental results on synthetic and real data demonstrate the competitive performance when compared to existing methods, while also providing uncertainty information.

[259] FEB-Cache: Frequency-Guided Exposure Bias Reduction for Enhancing Diffusion Transformer Caching

Zhen Zou, Feng Zhao

Main category: cs.CV

TL;DR: FEB-Cache addresses the issue of exposure bias in Diffusion Transformers (DiT) by introducing a frequency-guided caching strategy for Attention and MLP, improving generation quality and acceleration.

DetailsMotivation: The paper identifies that existing caching methods in DiT amplify exposure bias, degrading generation quality, and aims to mitigate this by analyzing the frequency response characteristics of Attention and MLP.

Method: The authors propose FEB-Cache, a joint caching strategy that aligns with non-exposed bias diffusion by separating and caching Attention and MLP based on a frequency-guided cache table.

Result: Empirical results show FEB-Cache enhances model performance and accelerates the diffusion process.

Conclusion: FEB-Cache provides a novel caching approach that reduces exposure bias and improves efficiency in DiT, offering a new perspective for diffusion acceleration.

Abstract: Diffusion Transformer (DiT) has exhibited impressive generation capabilities but faces great challenges due to its high computational complexity. To address this issue, various methods, notably feature caching, have been introduced. However, these approaches focus on aligning non-cache diffusion without analyzing why caching damage the generation processes. In this paper, we first confirm that the cache greatly amplifies the exposure bias, resulting in a decline in the generation quality. However, directly applying noise scaling is challenging for this issue due to the non-smoothness of exposure bias. We found that this phenomenon stems from the mismatch between its frequency response characteristics and the simple cache of Attention and MLP. Since these two components exhibit unique preferences for frequency signals, which provides us with a caching strategy to separate Attention and MLP to achieve an enhanced fit of exposure bias and reduce it. Based on this, we introduced FEB-Cache, a joint caching strategy that aligns with the non-exposed bias diffusion process (which gives us a higher performance cap) of caching Attention and MLP based on the frequency-guided cache table. Our approach combines a comprehensive understanding of the caching mechanism and offers a new perspective on leveraging caching to accelerate the diffusion process. Empirical results indicate that FEB-Cache optimizes model performance while concurrently facilitating acceleration.

[260] NAMI: Efficient Image Generation via Bridged Progressive Rectified Flow Transformers

Yuhang Ma, Bo Cheng, Shanyuan Liu, Hongyi Zhou, Liebucha Wu, Xiaoyu Wu, Dawei Leng, Yuhui Yin

Main category: cs.CV

TL;DR: NAMI introduces Bridged Progressive Rectified Flow Transformers for efficient high-resolution image generation, reducing inference time by 64% while maintaining quality.

DetailsMotivation: Address high inference latency and computational costs in flow-based Transformer models without sacrificing generation quality.

Method: Decomposes generation across temporal, spatial, and architectural dimensions using a BridgeFlow module and progressive Transformer layers.

Result: Achieves fast convergence, reduces inference time by 64% for 1024-resolution images, and maintains competitive generation quality.

Conclusion: NAMI offers an efficient solution for high-resolution image generation, balancing speed and quality with innovative architectural design.

Abstract: Flow-based Transformer models have achieved state-of-the-art image generation performance, but often suffer from high inference latency and computational cost due to their large parameter sizes. To improve inference efficiency without compromising quality, we propose Bridged Progressive Rectified Flow Transformers (NAMI), which decompose the generation process across temporal, spatial, and architectural demensions. We divide the rectified flow into different stages according to resolution, and use a BridgeFlow module to connect them. Fewer Transformer layers are used at low-resolution stages to generate image layouts and concept contours, and more layers are progressively added as the resolution increases. Experiments demonstrate that our approach achieves fast convergence and reduces inference time while ensuring generation quality. The main contributions of this paper are summarized as follows: (1) We introduce Bridged Progressive Rectified Flow Transformers that enable multi-resolution training, accelerating model convergence; (2) NAMI leverages piecewise flow and spatial cascading of Diffusion Transformer (DiT) to rapidly generate images, reducing inference time by 64% for generating 1024 resolution images; (3) We propose a BridgeFlow module to align flows between different stages; (4) We propose the NAMI-1K benchmark to evaluate human preference performance, aiming to mitigate distributional bias and comprehensively assess model effectiveness. The results show that our model is competitive with state-of-the-art models.

[261] JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers

Kwon Byung-Ki, Qi Dai, Lee Hyoseok, Chong Luo, Tae-Hyun Oh

Main category: cs.CV

TL;DR: JointDiT is a diffusion transformer for joint RGB and depth generation, using adaptive scheduling and unbalanced timestep sampling to achieve high-fidelity results.

DetailsMotivation: To model the joint distribution of RGB and depth for tasks like joint generation, depth estimation, and depth-conditioned image generation.

Method: Uses adaptive scheduling weights and unbalanced timestep sampling to train across noise levels for each modality.

Result: Produces high-fidelity images and accurate depth maps, with comparable performance in conditional tasks.

Conclusion: Joint distribution modeling can replace conditional generation, as demonstrated by JointDiT’s effectiveness.

Abstract: We present JointDiT, a diffusion transformer that models the joint distribution of RGB and depth. By leveraging the architectural benefit and outstanding image prior of the state-of-the-art diffusion transformer, JointDiT not only generates high-fidelity images but also produces geometrically plausible and accurate depth maps. This solid joint distribution modeling is achieved through two simple yet effective techniques that we propose, namely, adaptive scheduling weights, which depend on the noise levels of each modality, and the unbalanced timestep sampling strategy. With these techniques, we train our model across all noise levels for each modality, enabling JointDiT to naturally handle various combinatorial generation tasks, including joint generation, depth estimation, and depth-conditioned image generation by simply controlling the timesteps of each branch. JointDiT demonstrates outstanding joint generation performance. Furthermore, it achieves comparable results in depth estimation and depth-conditioned image generation, suggesting that joint distribution modeling can serve as a viable alternative to conditional generation. The project page is available at https://byungki-k.github.io/JointDiT/.

[262] Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer

Qingyu Shi, Jianzong Wu, Jinbin Bai, Jiangning Zhang, Lu Qi, Yunhai Tong, Xiangtai Li

Main category: cs.CV

TL;DR: DeT improves motion transfer in Diffusion Transformers (DiT) by introducing a temporal kernel and explicit supervision, outperforming previous methods on the new MTBench benchmark.

DetailsMotivation: Existing DiT models struggle with decoupling motion and appearance due to intertwined spatial-temporal attention. DeT aims to address this limitation.

Method: DeT uses a temporal kernel to smooth features and explicit supervision on dense trajectories, enhancing motion decoupling and consistency.

Result: DeT achieves the best trade-off between motion fidelity and edit fidelity on MTBench, a new challenging benchmark.

Conclusion: DeT effectively improves motion transfer in DiT models, offering a comprehensive evaluation framework with MTBench.

Abstract: The motion transfer task aims to transfer motion from a source video to newly generated videos, requiring the model to decouple motion from appearance. Previous diffusion-based methods primarily rely on separate spatial and temporal attention mechanisms within the 3D U-Net. In contrast, state-of-the-art video Diffusion Transformers (DiT) models use 3D full attention, which does not explicitly separate temporal and spatial information. Thus, the interaction between spatial and temporal dimensions makes decoupling motion and appearance more challenging for DiT models. In this paper, we propose DeT, a method that adapts DiT models to improve motion transfer ability. Our approach introduces a simple yet effective temporal kernel to smooth DiT features along the temporal dimension, facilitating the decoupling of foreground motion from background appearance. Meanwhile, the temporal kernel effectively captures temporal variations in DiT features, which are closely related to motion. Moreover, we introduce explicit supervision along dense trajectories in the latent feature space to further enhance motion consistency. Additionally, we present MTBench, a general and challenging benchmark for motion transfer. We also introduce a hybrid motion fidelity metric that considers both the global and local motion similarity. Therefore, our work provides a more comprehensive evaluation than previous works. Extensive experiments on MTBench demonstrate that DeT achieves the best trade-off between motion fidelity and edit fidelity.

[263] What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning

Chi-Hsi Kung, Frangil Ramirez, Juhyung Ha, Yi-Ting Chen, David Crandall, Yi-Hsuan Tsai

Main category: cs.CV

TL;DR: The paper introduces a method for procedure-aware video representation learning by using state-change descriptions from LLMs and counterfactual reasoning to improve understanding of procedural activities.

DetailsMotivation: Existing work lacks explicit learning of scene transformations in procedural activities, which are crucial for understanding cause and effect.

Method: Incorporates state-change descriptions from LLMs and generates counterfactuals to simulate failure outcomes, enhancing video encoders.

Result: Significant improvements on tasks like action segmentation, error detection, and action recognition.

Conclusion: State-change descriptions and counterfactual reasoning effectively enhance procedural activity understanding in videos.

Abstract: Understanding a procedural activity requires modeling both how action steps transform the scene, and how evolving scene transformations can influence the sequence of action steps, even those that are accidental or erroneous. Existing work has studied procedure-aware video representations by modeling the temporal order of actions, but has not explicitly learned the state changes (scene transformations). In this work, we study procedure-aware video representation learning by incorporating state-change descriptions generated by Large Language Models (LLMs) as supervision signals for video encoders. Moreover, we generate state-change counterfactuals that simulate hypothesized failure outcomes, allowing models to learn by imagining unseen “What if” scenarios. This counterfactual reasoning facilitates the model’s ability to understand the cause and effect of each step in an activity. We conduct extensive experiments on procedure-aware tasks, including temporal action segmentation, error detection, action phase classification, frame retrieval, multi-instance retrieval, and action recognition. Our results demonstrate the effectiveness of the proposed state-change descriptions and their counterfactuals, and achieve significant improvements on multiple tasks. Code is available at https://github.com/HCIS- Lab/counterfactual-video-pretrain.

[264] Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI

Benjamin Raphael Ernhofer, Daniil Prokhorov, Jannica Langner, Dominik Bollmann

Main category: cs.CV

TL;DR: A vision-language framework for automotive UI understanding and interaction is introduced, supported by the AutomotiveUI-Bench-4K dataset. A fine-tuned Molmo-7B model (ELAM) achieves strong performance and cross-domain generalization.

DetailsMotivation: To address the need for adaptive solutions in automotive infotainment systems due to frequent UI updates and diverse designs.

Method: A vision-language framework is developed, leveraging the AutomotiveUI-Bench-4K dataset. A Molmo-7B model is fine-tuned using LoRa, incorporating reasoning, visual grounding, and evaluation.

Result: ELAM achieves 80.8% accuracy on ScreenSpot, with +5.6% improvement over baseline, demonstrating strong cross-domain generalization.

Conclusion: The research highlights cost-efficient AI-driven advancements in automotive UI understanding, deployable on consumer-grade GPUs.

Abstract: Modern automotive infotainment systems necessitate intelligent and adaptive solutions to manage frequent User Interface (UI) updates and diverse design variations. This work introduces a vision-language framework to facilitate the understanding of and interaction with automotive UIs, enabling seamless adaptation across different UI designs. To support research in this field, AutomotiveUI-Bench-4K, an open-source dataset comprising 998 images with 4,208 annotations, is also released. Additionally, a data pipeline for generating training data is presented. A Molmo-7B-based model is fine-tuned using Low-Rank Adaptation (LoRa), incorporating generated reasoning along with visual grounding and evaluation capabilities. The fine-tuned Evaluative Large Action Model (ELAM) achieves strong performance on AutomotiveUI-Bench-4K (model and dataset are available on Hugging Face). The approach demonstrates strong cross-domain generalization, including a +5.6% improvement on ScreenSpot over the baseline model. An average accuracy of 80.8% is achieved on ScreenSpot, closely matching or surpassing specialized models for desktop, mobile, and web, despite being trained primarily on the automotive domain. This research investigates how data collection and subsequent fine-tuning can lead to AI-driven advancements in automotive UI understanding and interaction. The applied method is cost-efficient, and fine-tuned models can be deployed on consumer-grade GPUs.

[265] TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes

Nikai Du, Zhennan Chen, Shan Gao, Zhizhou Chen, Xi Chen, Zhengkai Jiang, Jian Yang, Ying Tai

Main category: cs.CV

TL;DR: TextCrafter, a novel method for Complex Visual Text Generation (CVTG), improves text rendering in images by decomposing text components and enhancing alignment, outperforming existing methods.

DetailsMotivation: Addressing challenges like distorted, blurred, or missing visual text in CVTG tasks.

Method: Progressive decomposition of text components, robust alignment, and token focus enhancement.

Result: TextCrafter outperforms state-of-the-art methods, solving issues like text confusion, omissions, and blurriness.

Conclusion: TextCrafter is effective for CVTG, supported by a new benchmark dataset (CVTG-2K) and superior experimental results.

Abstract: This paper explores the task of Complex Visual Text Generation (CVTG), which centers on generating intricate textual content distributed across diverse regions within visual images. In CVTG, image generation models often rendering distorted and blurred visual text or missing some visual text. To tackle these challenges, we propose TextCrafter, a novel multi-visual text rendering method. TextCrafter employs a progressive strategy to decompose complex visual text into distinct components while ensuring robust alignment between textual content and its visual carrier. Additionally, it incorporates a token focus enhancement mechanism to amplify the prominence of visual text during the generation process. TextCrafter effectively addresses key challenges in CVTG tasks, such as text confusion, omissions, and blurriness. Moreover, we present a new benchmark dataset, CVTG-2K, tailored to rigorously evaluate the performance of generative models on CVTG tasks. Extensive experiments demonstrate that our method surpasses state-of-the-art approaches.

[266] Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind

Qingmei Li, Yang Zhang, Zurong Mai, Yuhang Chen, Shuohong Lou, Henglian Huang, Jiarui Zhang, Zhiwei Zhang, Yibin Wen, Weijia Li, Haohuan Fu, Jianxi Huang, Juepeng Zheng

Main category: cs.CV

TL;DR: AgroMind is a new benchmark for evaluating Large Multimodal Models (LMMs) in agricultural remote sensing, covering diverse tasks and revealing performance gaps, especially in spatial reasoning and fine-grained recognition.

DetailsMotivation: Existing benchmarks for agricultural remote sensing lack scene diversity and task complexity, limiting the evaluation of LMMs in this domain.

Method: AgroMind integrates multiple datasets, pre-processes data, defines agriculturally relevant tasks, and evaluates LMMs across 13 task types.

Result: Performance gaps among LMMs are significant, with humans lagging behind top models in some tasks.

Conclusion: AgroMind highlights LMM limitations in domain knowledge and sets a standardized framework for future agricultural RS research.

Abstract: Large Multimodal Models (LMMs) has demonstrated capabilities across various domains, but comprehensive benchmarks for agricultural remote sensing (RS) remain scarce. Existing benchmarks designed for agricultural RS scenarios exhibit notable limitations, primarily in terms of insufficient scene diversity in the dataset and oversimplified task design. To bridge this gap, we introduce AgroMind, a comprehensive agricultural remote sensing benchmark covering four task dimensions: spatial perception, object understanding, scene understanding, and scene reasoning, with a total of 13 task types, ranging from crop identification and health monitoring to environmental analysis. We curate a high-quality evaluation set by integrating eight public datasets and one private farmland plot dataset, containing 27,247 QA pairs and 19,615 images. The pipeline begins with multi-source data pre-processing, including collection, format standardization, and annotation refinement. We then generate a diverse set of agriculturally relevant questions through the systematic definition of tasks. Finally, we employ LMMs for inference, generating responses, and performing detailed examinations. We evaluated 20 open-source LMMs and 4 closed-source models on AgroMind. Experiments reveal significant performance gaps, particularly in spatial reasoning and fine-grained recognition, it is notable that human performance lags behind several leading LMMs. By establishing a standardized evaluation framework for agricultural RS, AgroMind reveals the limitations of LMMs in domain knowledge and highlights critical challenges for future work. Data and code can be accessed at https://rssysu.github.io/AgroMind/.

[267] FlowR: Flowing from Sparse to Dense 3D Reconstructions

Tobias Fischer, Samuel Rota Bulò, Yung-Hsu Yang, Nikhil Keetha, Lorenzo Porzi, Norman Müller, Katja Schwarz, Jonathon Luiten, Marc Pollefeys, Peter Kontschieder

Main category: cs.CV

TL;DR: 3D Gaussian splatting achieves real-time NVS but struggles with sparse views. A multi-view flow matching model is proposed to enhance sparse reconstructions by generating consistent views, improving NVS quality.

DetailsMotivation: Dense captures for high-quality NVS are costly. Existing 2D generative models cause artifacts due to hallucinations and inconsistency.

Method: A multi-view flow matching model learns to connect sparse reconstructions to dense-like renderings, trained on 3.6M image pairs.

Result: The model processes 45 views at 540x960 resolution efficiently and improves NVS quality in sparse and dense scenarios.

Conclusion: The proposed method outperforms prior works, enhancing reconstruction quality across NVS benchmarks.

Abstract: 3D Gaussian splatting enables high-quality novel view synthesis (NVS) at real-time frame rates. However, its quality drops sharply as we depart from the training views. Thus, dense captures are needed to match the high-quality expectations of applications like Virtual Reality (VR). However, such dense captures are very laborious and expensive to obtain. Existing works have explored using 2D generative models to alleviate this requirement by distillation or generating additional training views. These models typically rely on a noise-to-data generative process conditioned only on a handful of reference input views, leading to hallucinations, inconsistent generation results, and subsequent reconstruction artifacts. Instead, we propose a multi-view, flow matching model that learns a flow to directly connect novel view renderings from possibly sparse reconstructions to renderings that we expect from dense reconstructions. This enables augmenting scene captures with consistent, generated views to improve reconstruction quality. Our model is trained on a novel dataset of 3.6M image pairs and can process up to 45 views at 540x960 resolution (91K tokens) on one H100 GPU in a single forward pass. Our pipeline consistently improves NVS in sparse- and dense-view scenarios, leading to higher-quality reconstructions than prior works across multiple, widely-used NVS benchmarks.

[268] Prior2Former – Evidential Modeling of Mask Transformers for Assumption-Free Open-World Panoptic Segmentation

Sebastian Schmidt, Julius Körner, Dominik Fuchsgruber, Stefano Gasperini, Federico Tombari, Stephan Günnemann

Main category: cs.CV

TL;DR: Prior2Former (P2F) introduces evidential learning into segmentation vision transformers for better uncertainty estimation, enabling state-of-the-art performance in anomaly and open-world panoptic segmentation without requiring OOD data.

DetailsMotivation: Current panoptic segmentation methods struggle with novel and out-of-distribution (OOD) data, limiting reliability in safety-critical applications like autonomous driving. P2F aims to bridge this gap.

Method: P2F extends mask vision transformers with a Beta prior for pixel-wise uncertainty estimation, avoiding the need for OOD data or contrastive training.

Result: P2F achieves state-of-the-art performance in anomaly instance and open-world panoptic segmentation on datasets like Cityscapes, COCO, and SegmentMeIfYouCan.

Conclusion: P2F offers a flexible, reliable solution for segmentation tasks, particularly in real-world scenarios with unknown or novel classes.

Abstract: In panoptic segmentation, individual instances must be separated within semantic classes. As state-of-the-art methods rely on a pre-defined set of classes, they struggle with novel categories and out-of-distribution (OOD) data. This is particularly problematic in safety-critical applications, such as autonomous driving, where reliability in unseen scenarios is essential. We address the gap between outstanding benchmark performance and reliability by proposing Prior2Former (P2F), the first approach for segmentation vision transformers rooted in evidential learning. P2F extends the mask vision transformer architecture by incorporating a Beta prior for computing model uncertainty in pixel-wise binary mask assignments. This design enables high-quality uncertainty estimation that effectively detects novel and OOD objects enabling state-of-the-art anomaly instance segmentation and open-world panoptic segmentation. Unlike most segmentation models addressing unknown classes, P2F operates without access to OOD data samples or contrastive training on void (i.e., unlabeled) classes, making it highly applicable in real-world scenarios where such prior information is unavailable. Additionally, P2F can be flexibly applied to anomaly instance and panoptic segmentation. Through comprehensive experiments on the Cityscapes, COCO, SegmentMeIfYouCan, and OoDIS datasets, P2F demonstrates state-of-the-art performance across the board.

[269] How Can Objects Help Video-Language Understanding?

Zitian Tang, Shijie Wang, Junho Cho, Jaewook Yoo, Chen Sun

Main category: cs.CV

TL;DR: The paper investigates whether explicit object representation is necessary in multimodal large language models (MLLMs) and introduces ObjectMLLM, a framework integrating structured visual representations. Results show explicit object-centric representation is still needed, with quantized object information as text performing best.

DetailsMotivation: To determine if explicit object representation is essential in MLLMs, given the success of implicit methods like visual tokens and captions.

Method: Introduces ObjectMLLM, a framework integrating structured visual representations from arbitrary CV algorithms, tested on six video QA benchmarks.

Result: Explicit object-centric representation is necessary; quantizing structured object info as text performs best.

Conclusion: Explicit object representation remains valuable, with text quantization offering a data-efficient integration method for MLLMs.

Abstract: Do we still need to represent objects explicitly in multimodal large language models (MLLMs)? To one extreme, pre-trained encoders convert images into visual tokens, with which objects and spatiotemporal relationships may be implicitly modeled. To the other extreme, image captions by themselves provide strong empirical performances for understanding tasks, despite missing fine-grained spatiotemporal information. To answer this question, we introduce ObjectMLLM, a framework capable of leveraging arbitrary computer vision algorithm to extract and integrate structured visual representation. Through extensive evaluations on six video question answering benchmarks, we confirm that explicit integration of object-centric representation remains necessary. Surprisingly, we observe that the simple approach of quantizing the continuous, structured object information and representing them as plain text performs the best, offering a data-efficient approach to integrate other visual perception modules into MLLM design. Our code and models are released at https://github.com/brown-palm/ObjectMLLM.

[270] Causally Steered Diffusion for Automated Video Counterfactual Generation

Nikos Spyrou, Athanasios Vlontzos, Paraskevas Pegios, Thomas Melistas, Nefeli Gkouti, Yannis Panagakis, Giorgos Papanastasiou, Sotirios A. Tsaftaris

Main category: cs.CV

TL;DR: The paper introduces CSVC, a framework for causally faithful counterfactual video generation using text-to-image latent diffusion models (LDMs) and causal graphs, achieving high causal effectiveness without fine-tuning.

DetailsMotivation: Current video editing with LDMs lacks causal fidelity, leading to unrealistic outcomes when editing causally dependent attributes.

Method: The framework encodes causal relationships into text prompts and optimizes them using a vision-language model (VLM)-based textual loss to guide LDM generation.

Result: CSVC generates causally faithful counterfactuals with high visual quality and temporal consistency, outperforming in causal effectiveness.

Conclusion: CSVC’s black-box compatibility makes it versatile for realistic ‘what if’ scenarios in fields like digital media and healthcare.

Abstract: Adapting text-to-image (T2I) latent diffusion models (LDMs) to video editing has shown strong visual fidelity and controllability, but challenges remain in maintaining causal relationships inherent to the video data generating process. Edits affecting causally dependent attributes often generate unrealistic or misleading outcomes if these relationships are ignored. In this work, we introduce a causally faithful framework for counterfactual video generation, formulated as an Out-of-Distribution (OOD) prediction problem. We embed prior causal knowledge by encoding the relationships specified in a causal graph into text prompts and guide the generation process by optimizing these prompts using a vision-language model (VLM)-based textual loss. This loss encourages the latent space of the LDMs to capture OOD variations in the form of counterfactuals, effectively steering generation toward causally meaningful alternatives. The proposed framework, dubbed CSVC, is agnostic to the underlying video editing system and does not require access to its internal mechanisms or fine-tuning. We evaluate our approach using standard video quality metrics and counterfactual-specific criteria, such as causal effectiveness and minimality. Experimental results show that CSVC generates causally faithful video counterfactuals within the LDM distribution via prompt-based causal steering, achieving state-of-the-art causal effectiveness without compromising temporal consistency or visual quality on real-world facial videos. Due to its compatibility with any black-box video editing system, our framework has significant potential to generate realistic ‘what if’ hypothetical video scenarios in diverse areas such as digital media and healthcare.

[271] Knowledge Distillation for Underwater Feature Extraction and Matching via GAN-synthesized Images

Jinghe Yang, Mingming Gong, Ye Pu

Main category: cs.CV

TL;DR: The paper proposes a cross-modal knowledge distillation method to improve feature extraction and matching in turbid underwater environments using synthetic underwater images.

DetailsMotivation: Underwater environments challenge vision-based methods due to image blurring and noise, necessitating robust solutions for localization and mapping.

Method: The method involves adaptive GAN-synthesis for generating synthetic underwater images and a knowledge distillation framework to transfer in-air models to underwater settings.

Result: The GAN-based synthesis and knowledge distillation improve feature extraction, validated by VSLAM on real underwater sequences.

Conclusion: The approach effectively enhances feature extraction and matching in turbid underwater environments, demonstrated by downstream applications.

Abstract: Autonomous Underwater Vehicles (AUVs) play a crucial role in underwater exploration. Vision-based methods offer cost-effective solutions for localization and mapping in the absence of conventional sensors like GPS and LiDAR. However, underwater environments present significant challenges for feature extraction and matching due to image blurring and noise caused by attenuation, scattering, and the interference of \textit{marine snow}. In this paper, we aim to improve the robustness of the feature extraction and matching in the turbid underwater environment using the cross-modal knowledge distillation method that transfers the in-air feature extraction and matching models to underwater settings using synthetic underwater images as the medium. We first propose a novel adaptive GAN-synthesis method to estimate water parameters and underwater noise distribution, to generate environment-specific synthetic underwater images. We then introduce a general knowledge distillation framework compatible with different teacher models. The evaluation of GAN-based synthesis highlights the significance of the new components, i.e. GAN-synthesized noise and forward scattering, in the proposed model. Additionally, VSLAM, as a representative downstream application of feature extraction and matching, is employed on real underwater sequences to validate the effectiveness of the transferred model. Project page: https://github.com/Jinghe-mel/UFEN-GAN.

[272] Gradient as Conditions: Rethinking HOG for All-in-one Image Restoration

Jiawei Wu, Zhifei Yang, Zhe Wang, Zhi Jin

Main category: cs.CV

TL;DR: HOGformer, a Transformer-based model, integrates HOG features for degradation-aware image restoration, outperforming existing methods.

DetailsMotivation: Existing methods rely on implicit priors, which may hinder performance in complex scenarios. HOG features offer a discriminative and interpretable prior.

Method: Proposes HOGformer with Dynamic HOG-aware Self-Attention (DHOGSA) and Dynamic Interaction Feed-Forward (DIFF) modules, plus a HOG loss for structural fidelity.

Result: Achieves state-of-the-art performance on benchmarks, generalizing well to real-world scenarios.

Conclusion: HOGformer effectively leverages HOG features for robust and interpretable image restoration.

Abstract: All-in-one image restoration (AIR) aims to address diverse degradations within a unified model by leveraging informative degradation conditions to guide the restoration process. However, existing methods often rely on implicitly learned priors, which may entangle feature representations and hinder performance in complex or unseen scenarios. Histogram of Oriented Gradients (HOG) as a classical gradient representation, we observe that it has strong discriminative capability across diverse degradations, making it a powerful and interpretable prior for AIR. Based on this insight, we propose HOGformer, a Transformer-based model that integrates learnable HOG features for degradation-aware restoration. The core of HOGformer is a Dynamic HOG-aware Self-Attention (DHOGSA) mechanism, which adaptively models long-range spatial dependencies conditioned on degradation-specific cues encoded by HOG descriptors. To further adapt the heterogeneity of degradations in AIR, we propose a Dynamic Interaction Feed-Forward (DIFF) module that facilitates channel-spatial interactions, enabling robust feature transformation under diverse degradations. Besides, we propose a HOG loss to explicitly enhance structural fidelity and edge sharpness. Extensive experiments on a variety of benchmarks, including adverse weather and natural degradations, demonstrate that HOGformer achieves state-of-the-art performance and generalizes well to complex real-world scenarios.Code is available at https://github.com/Fire-friend/HOGformer.

[273] Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation

Xiuyu Yang, Shuhan Tan, Philipp Krähenbühl

Main category: cs.CV

TL;DR: InfGen is a unified model for long-term traffic simulation, combining closed-loop motion simulation and scene generation, outperforming state-of-the-art methods.

DetailsMotivation: Existing traffic simulators focus on short-term closed-loop motion, lacking realism for long-term scenarios where agents enter/exit dynamically.

Method: InfGen uses next-token prediction to interleave closed-loop motion simulation and scene generation, switching modes automatically.

Result: InfGen excels in short-term (9s) simulation and significantly outperforms others in long-term (30s) simulation.

Conclusion: InfGen enables stable long-term traffic simulation and will be publicly released.

Abstract: An ideal traffic simulator replicates the realistic long-term point-to-point trip that a self-driving system experiences during deployment. Prior models and benchmarks focus on closed-loop motion simulation for initial agents in a scene. This is problematic for long-term simulation. Agents enter and exit the scene as the ego vehicle enters new regions. We propose InfGen, a unified next-token prediction model that performs interleaved closed-loop motion simulation and scene generation. InfGen automatically switches between closed-loop motion simulation and scene generation mode. It enables stable long-term rollout simulation. InfGen performs at the state-of-the-art in short-term (9s) traffic simulation, and significantly outperforms all other methods in long-term (30s) simulation. The code and model of InfGen will be released at https://orangesodahub.github.io/InfGen

[274] KAN or MLP? Point Cloud Shows the Way Forward

Yan Shi, Qingdong He, Yijun Liu, Xiaoyu Liu, Jingyong Su

Main category: cs.CV

TL;DR: PointKAN, a KANs-based architecture for point cloud analysis, outperforms MLPs by efficiently capturing geometric features with reduced parameters and computational cost.

DetailsMotivation: MLPs struggle with complex geometric structures in point clouds due to fixed activation functions, poor parameter efficiency, and redundancy.

Method: Introduces PointKAN with Geometric Affine Module (GAM) for local feature transformation, Local Feature Processing (LFP) for group-level and global context, and Global Feature Processing (GFP) for hierarchical feature representation. Efficient-KANs reduce parameters.

Result: PointKAN outperforms PointMLP on ModelNet40, ScanObjectNN, and ShapeNetPart, excels in Few-shot Learning, and reduces parameters and FLOPs.

Conclusion: KANs-based architectures like PointKAN show promise for 3D vision and point cloud understanding.

Abstract: Multi-Layer Perceptrons (MLPs) have become one of the fundamental architectural component in point cloud analysis due to its effective feature learning mechanism. However, when processing complex geometric structures in point clouds, MLPs’ fixed activation functions struggle to efficiently capture local geometric features, while suffering from poor parameter efficiency and high model redundancy. In this paper, we propose PointKAN, which applies Kolmogorov-Arnold Networks (KANs) to point cloud analysis tasks to investigate their efficacy in hierarchical feature representation. First, we introduce a Geometric Affine Module (GAM) to transform local features, improving the model’s robustness to geometric variations. Next, in the Local Feature Processing (LFP), a parallel structure extracts both group-level features and global context, providing a rich representation of both fine details and overall structure. Finally, these features are combined and processed in the Global Feature Processing (GFP). By repeating these operations, the receptive field gradually expands, enabling the model to capture complete geometric information of the point cloud. To overcome the high parameter counts and computational inefficiency of standard KANs, we develop Efficient-KANs in the PointKAN-elite variant, which significantly reduces parameters while maintaining accuracy. Experimental results demonstrate that PointKAN outperforms PointMLP on benchmark datasets such as ModelNet40, ScanObjectNN, and ShapeNetPart, with particularly strong performance in Few-shot Learning task. Additionally, PointKAN achieves substantial reductions in parameter counts and computational complexity (FLOPs). This work highlights the potential of KANs-based architectures in 3D vision and opens new avenues for research in point cloud understanding.

[275] Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng

Main category: cs.CV

TL;DR: UniME is a two-stage framework using MLLMs to improve multimodal representation learning, addressing CLIP’s limitations and outperforming benchmarks.

DetailsMotivation: CLIP's limitations in text token truncation, isolated encoding, and compositionality hinder its efficacy, while MLLMs' potential for transferable representations is underexplored.

Method: UniME employs textual discriminative knowledge distillation and hard negative enhanced instruction tuning to enhance representation learning.

Result: UniME achieves consistent performance gains in retrieval tasks, showing superior discriminative and compositional capabilities.

Conclusion: UniME effectively leverages MLLMs to overcome CLIP’s limitations and excels in multimodal representation learning.

Abstract: The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key limitations: (1) text token truncation, (2) isolated image-text encoding, and (3) deficient compositionality due to bag-of-words behavior. While recent Multimodal Large Language Models (MLLMs) have demonstrated significant advances in generalized vision-language understanding, their potential for learning transferable multimodal representations remains underexplored.In this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model to enhance the embedding capability of the MLLM's language component. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning. Specifically, we initially mitigate false negative contamination and then sample multiple hard negatives per instance within each batch, forcing the model to focus on challenging samples. This approach not only improves discriminative power but also enhances instruction-following ability in downstream tasks. We conduct extensive experiments on the MMEB benchmark and multiple retrieval tasks, including short and long caption retrieval and compositional retrieval. Results demonstrate that UniME achieves consistent performance improvement across all tasks, exhibiting superior discriminative and compositional capabilities.

[276] LMME3DHF: Benchmarking and Evaluating Multimodal 3D Human Face Generation with LMMs

Woo Yi Yang, Jiarui Wang, Sijing Wu, Huiyu Duan, Yuxin Zhu, Liu Yang, Kang Fu, Guangtao Zhai, Xiongkuo Min

Main category: cs.CV

TL;DR: The paper introduces Gen3DHF, a benchmark for assessing AI-generated 3D human faces, and LMME3DHF, a multimodal model for evaluating quality and authenticity, outperforming existing methods.

DetailsMotivation: Assessing the quality and realism of AI-generated 3D human faces is challenging due to subjective human perception.

Method: Created Gen3DHF (2,000 videos, 4,000 MOS scores, saliency maps) and proposed LMME3DHF, a multimodal model for evaluation.

Result: LMME3DHF achieves state-of-the-art performance in predicting quality scores and identifying distortions, aligning with human judgments.

Conclusion: The Gen3DHF database and LMME3DHF model will be released, advancing AI-generated 3D face assessment.

Abstract: The rapid advancement in generative artificial intelligence have enabled the creation of 3D human faces (HFs) for applications including media production, virtual reality, security, healthcare, and game development, etc. However, assessing the quality and realism of these AI-generated 3D human faces remains a significant challenge due to the subjective nature of human perception and innate perceptual sensitivity to facial features. To this end, we conduct a comprehensive study on the quality assessment of AI-generated 3D human faces. We first introduce Gen3DHF, a large-scale benchmark comprising 2,000 videos of AI-Generated 3D Human Faces along with 4,000 Mean Opinion Scores (MOS) collected across two dimensions, i.e., quality and authenticity, 2,000 distortion-aware saliency maps and distortion descriptions. Based on Gen3DHF, we propose LMME3DHF, a Large Multimodal Model (LMM)-based metric for Evaluating 3DHF capable of quality and authenticity score prediction, distortion-aware visual question answering, and distortion-aware saliency prediction. Experimental results show that LMME3DHF achieves state-of-the-art performance, surpassing existing methods in both accurately predicting quality scores for AI-generated 3D human faces and effectively identifying distortion-aware salient regions and distortion types, while maintaining strong alignment with human perceptual judgments. Both the Gen3DHF database and the LMME3DHF will be released upon the publication.

[277] FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing

Rui Lan, Yancheng Bai, Xu Duan, Mingxing Li, Dongyang Jin, Ryan Xu, Lei Sun, Xiangxiang Chu

Main category: cs.CV

TL;DR: FLUX-Text is a multilingual scene text editing method using DiT architecture, improving glyph generation with lightweight modules and achieving high-quality results with minimal training data.

DetailsMotivation: Existing UNet-based diffusion models struggle with complex glyph structures, especially in non-Latin scripts like Chinese, Korean, and Japanese.

Method: FLUX-Text employs Visual and Text Embedding Modules for glyph understanding, a Regional Text Perceptual Loss, and a two-stage training strategy.

Result: FLUX-Text achieves superior visual quality and text fidelity with only 0.1M training examples, a 97% reduction compared to other methods.

Conclusion: FLUX-Text outperforms existing methods in multilingual scene text editing, offering efficiency and high-quality results.

Abstract: Scene text editing aims to modify or add texts on images while ensuring text fidelity and overall visual quality consistent with the background. Recent methods are primarily built on UNet-based diffusion models, which have improved scene text editing results, but still struggle with complex glyph structures, especially for non-Latin ones (\eg, Chinese, Korean, Japanese). To address these issues, we present \textbf{FLUX-Text}, a simple and advanced multilingual scene text editing DiT method. Specifically, our FLUX-Text enhances glyph understanding and generation through lightweight Visual and Text Embedding Modules, while preserving the original generative capability of FLUX. We further propose a Regional Text Perceptual Loss tailored for text regions, along with a matching two-stage training strategy to better balance text editing and overall image quality. Benefiting from the DiT-based architecture and lightweight feature injection modules, FLUX-Text can be trained with only $0.1$M training examples, a \textbf{97%} reduction compared to $2.9$M required by popular methods. Extensive experiments on multiple public datasets, including English and Chinese benchmarks, demonstrate that our method surpasses other methods in visual quality and text fidelity. All the code is available at https://github.com/AMAP-ML/FluxText.

[278] PRE-Mamba: A 4D State Space Model for Ultra-High-Frequent Event Camera Deraining

Ciyu Ruan, Ruishan Guo, Zihang Gong, Jingao Xu, Wenhan Yang, Xinlei Chen

Main category: cs.CV

TL;DR: PRE-Mamba is a novel event camera deraining framework that preserves temporal precision, enhances deraining, and maintains computational efficiency using a 4D event cloud, STDF module, and MS3M.

DetailsMotivation: Event cameras struggle with noise in rainy conditions, and existing deraining methods compromise between temporal precision, effectiveness, and efficiency.

Method: Uses a 4D event cloud, STDF module for spatiotemporal decoupling and fusion, and MS3M for rain dynamics with linear complexity. Enhanced by frequency-domain regularization.

Result: Achieves 0.95 SR, 0.91 NR, and 0.4s/M events with 0.26M parameters on EventRain-27K. Generalizes across rain intensities, viewpoints, and snowy conditions.

Conclusion: PRE-Mamba outperforms existing methods in deraining while maintaining efficiency and generalizability.

Abstract: Event cameras excel in high temporal resolution and dynamic range but suffer from dense noise in rainy conditions. Existing event deraining methods face trade-offs between temporal precision, deraining effectiveness, and computational efficiency. In this paper, we propose PRE-Mamba, a novel point-based event camera deraining framework that fully exploits the spatiotemporal characteristics of raw event and rain. Our framework introduces a 4D event cloud representation that integrates dual temporal scales to preserve high temporal precision, a Spatio-Temporal Decoupling and Fusion module (STDF) that enhances deraining capability by enabling shallow decoupling and interaction of temporal and spatial information, and a Multi-Scale State Space Model (MS3M) that captures deeper rain dynamics across dual-temporal and multi-spatial scales with linear computational complexity. Enhanced by frequency-domain regularization, PRE-Mamba achieves superior performance (0.95 SR, 0.91 NR, and 0.4s/M events) with only 0.26M parameters on EventRain-27K, a comprehensive dataset with labeled synthetic and real-world sequences. Moreover, our method generalizes well across varying rain intensities, viewpoints, and even snowy conditions.

[279] Personalize Your Gaussian: Consistent 3D Scene Personalization from a Single Image

Yuxuan Wang, Xuanyu Yi, Qingshan Xu, Yuan Zhou, Long Chen, Hanwang Zhang

Main category: cs.CV

TL;DR: CP-GS is a framework for personalizing 3D scenes from a single image, addressing viewpoint bias by propagating reference appearance to novel views using pre-trained models and iterative fine-tuning.

DetailsMotivation: Existing methods struggle with viewpoint bias in single-image 3D scene personalization, leading to inconsistent results.

Method: CP-GS integrates pre-trained image-to-3D generation and iterative LoRA fine-tuning to extend reference appearance, guided by geometric cues.

Result: CP-GS outperforms existing methods, achieving high-quality, consistent 3D personalization.

Conclusion: CP-GS effectively mitigates viewpoint bias, offering superior performance in 3D scene personalization.

Abstract: Personalizing 3D scenes from a single reference image enables intuitive user-guided editing, which requires achieving both multi-view consistency across perspectives and referential consistency with the input image. However, these goals are particularly challenging due to the viewpoint bias caused by the limited perspective provided in a single image. Lacking the mechanisms to effectively expand reference information beyond the original view, existing methods of image-conditioned 3DGS personalization often suffer from this viewpoint bias and struggle to produce consistent results. Therefore, in this paper, we present Consistent Personalization for 3D Gaussian Splatting (CP-GS), a framework that progressively propagates the single-view reference appearance to novel perspectives. In particular, CP-GS integrates pre-trained image-to-3D generation and iterative LoRA fine-tuning to extract and extend the reference appearance, and finally produces faithful multi-view guidance images and the personalized 3DGS outputs through a view-consistent generation process guided by geometric cues. Extensive experiments on real-world scenes show that our CP-GS effectively mitigates the viewpoint bias, achieving high-quality personalization that significantly outperforms existing methods. The code will be released at https://github.com/Yuxuan-W/CP-GS.

[280] GOBench: Benchmarking Geometric Optics Generation and Understanding of MLLMs

Xiaorong Zhu, Ziheng Jia, Jiarui Wang, Xiangyu Zhao, Haodong Duan, Xiongkuo Min, Jia Wang, Zicheng Zhang, Guangtao Zhai

Main category: cs.CV

TL;DR: GOBench is introduced to evaluate MLLMs’ abilities in generating optically authentic imagery and understanding geometric optics, revealing significant challenges in current models.

DetailsMotivation: To address the lack of comprehensive assessment of MLLMs' capabilities in fine-grained physical principles, particularly geometric optics.

Method: GOBench evaluates MLLMs through two tasks: generating optically authentic imagery and understanding optical phenomena, using curated prompts and subjective experiments.

Result: Current models struggle; GPT-4o-Image fails in generation tasks, and Gemini-2.5Pro achieves only 37.35% accuracy in understanding.

Conclusion: MLLMs face notable limitations in optical generation and understanding, highlighting the need for further improvement.

Abstract: The rapid evolution of Multi-modality Large Language Models (MLLMs) is driving significant advancements in visual understanding and generation. Nevertheless, a comprehensive assessment of their capabilities, concerning the fine-grained physical principles especially in geometric optics, remains underexplored. To address this gap, we introduce GOBench, the first benchmark to systematically evaluate MLLMs’ ability across two tasks: 1) Generating Optically Authentic Imagery and 2) Understanding Underlying Optical Phenomena. We curates high-quality prompts of geometric optical scenarios and use MLLMs to construct GOBench-Gen-1k dataset.We then organize subjective experiments to assess the generated imagery based on Optical Authenticity, Aesthetic Quality, and Instruction Fidelity, revealing MLLMs’ generation flaws that violate optical principles. For the understanding task, we apply crafted evaluation instructions to test optical understanding ability of eleven prominent MLLMs. The experimental results demonstrate that current models face significant challenges in both optical generation and understanding. The top-performing generative model, GPT-4o-Image, cannot perfectly complete all generation tasks, and the best-performing MLLM model, Gemini-2.5Pro, attains a mere 37.35% accuracy in optical understanding. Database and codes are publicly available at https://github.com/aiben-ch/GOBench.

[281] SemiSegECG: A Multi-Dataset Benchmark for Semi-Supervised Semantic Segmentation in ECG Delineation

Minje Park, Jeonghwa Lim, Taehyung Yu, Sunghoon Joo

Main category: cs.CV

TL;DR: SemiSegECG is a benchmark for semi-supervised ECG delineation, evaluating transformer and convolutional networks, with transformers outperforming.

DetailsMotivation: Limited annotated ECG datasets hinder deep learning progress; semi-supervised learning leverages unlabeled data.

Method: Curated datasets, implemented five SemiSeg algorithms on two architectures (convolutional and transformer), evaluated in-domain and cross-domain settings, proposed ECG-specific configurations.

Result: Transformer outperforms convolutional network in semi-supervised ECG delineation.

Conclusion: SemiSegECG provides a foundation for advancing semi-supervised ECG delineation and future research.

Abstract: Electrocardiogram (ECG) delineation, the segmentation of meaningful waveform features, is critical for clinical diagnosis. Despite recent advances using deep learning, progress has been limited by the scarcity of publicly available annotated datasets. Semi-supervised learning presents a promising solution by leveraging abundant unlabeled ECG data. In this study, we present SemiSegECG, the first systematic benchmark for semi-supervised semantic segmentation (SemiSeg) in ECG delineation. We curated and unified multiple public datasets, including previously underused sources, to support robust and diverse evaluation. We adopted five representative SemiSeg algorithms from computer vision, implemented them on two different architectures: the convolutional network and the transformer, and evaluated them in two different settings: in-domain and cross-domain. Additionally, we propose ECG-specific training configurations and augmentation strategies and introduce a standardized evaluation framework. Our results show that the transformer outperforms the convolutional network in semi-supervised ECG delineation. We anticipate that SemiSegECG will serve as a foundation for advancing semi-supervised ECG delineation methods and will facilitate further research in this domain.

[282] Unifying Appearance Codes and Bilateral Grids for Driving Scene Gaussian Splatting

Nan Wang, Yuantao Chen, Lixing Xiao, Weiqing Xiao, Bohan Li, Zhaoxi Chen, Chongjie Ye, Shaocong Xu, Saining Zhang, Ziyang Yan, Pierre Merriaux, Lei Lei, Tianfan Xue, Hao Zhao

Main category: cs.CV

TL;DR: The paper introduces a multi-scale bilateral grid to unify appearance codes and bilateral grids, improving geometric accuracy in dynamic scene reconstruction for autonomous driving.

DetailsMotivation: Real-world scenarios often lack perfect photometric consistency, limiting the effectiveness of existing techniques like appearance codes and bilateral grids.

Method: Proposes a novel multi-scale bilateral grid to combine appearance codes and bilateral grids, addressing photometric inconsistency.

Result: Outperforms appearance codes and bilateral grids in geometric accuracy, validated on Waymo, NuScenes, Argoverse, and PandaSet datasets.

Conclusion: The multi-scale bilateral grid effectively reduces floaters caused by photometric inconsistency, enhancing autonomous driving scene reconstruction.

Abstract: Neural rendering techniques, including NeRF and Gaussian Splatting (GS), rely on photometric consistency to produce high-quality reconstructions. However, in real-world scenarios, it is challenging to guarantee perfect photometric consistency in acquired images. Appearance codes have been widely used to address this issue, but their modeling capability is limited, as a single code is applied to the entire image. Recently, the bilateral grid was introduced to perform pixel-wise color mapping, but it is difficult to optimize and constrain effectively. In this paper, we propose a novel multi-scale bilateral grid that unifies appearance codes and bilateral grids. We demonstrate that this approach significantly improves geometric accuracy in dynamic, decoupled autonomous driving scene reconstruction, outperforming both appearance codes and bilateral grids. This is crucial for autonomous driving, where accurate geometry is important for obstacle avoidance and control. Our method shows strong results across four datasets: Waymo, NuScenes, Argoverse, and PandaSet. We further demonstrate that the improvement in geometry is driven by the multi-scale bilateral grid, which effectively reduces floaters caused by photometric inconsistency.

[283] EASG-Bench: Video Q&A Benchmark with Egocentric Action Scene Graphs

Ivan Rodin, Tz-Ying Wu, Kyle Min, Sharath Nittur Sridhar, Antonino Furnari, Subarna Tripathi, Giovanni Maria Farinella

Main category: cs.CV

TL;DR: EASG-Bench is a QA benchmark for egocentric videos using dynamic scene graphs, revealing gaps in language-only and video-LLMs, especially in temporal understanding.

DetailsMotivation: To address the lack of benchmarks for spatio-temporally grounded QA in egocentric videos and evaluate model performance.

Method: Created QA pairs from dynamic scene graphs and evaluated language-only and video-LLMs systematically.

Result: Identified a performance gap in models, particularly for temporal ordering questions.

Conclusion: Highlights a research gap in long-context video understanding; benchmark and code are open for reproducibility.

Abstract: We introduce EASG-Bench, a question-answering benchmark for egocentric videos where the question-answering pairs are created from spatio-temporally grounded dynamic scene graphs capturing intricate relationships among actors, actions, and objects. We propose a systematic evaluation framework and evaluate several language-only and video large language models (video-LLMs) on this benchmark. We observe a performance gap in language-only and video-LLMs, especially on questions focusing on temporal ordering, thus identifying a research gap in the area of long-context video understanding. To promote the reproducibility of our findings and facilitate further research, the benchmark and accompanying code are available at the following GitHub page: https://github.com/fpv-iplab/EASG-bench.

[284] UniDet-D: A Unified Dynamic Spectral Attention Model for Object Detection under Adverse Weathers

Wei Zhang, Yuantao Wang, Haowei Yang, Yin Zhuang, Shijian Lu, Xuerui Mao

Main category: cs.CV

TL;DR: UniDet-D is a unified framework for object detection in adverse weather, combining detection and restoration with dynamic spectral attention for robust performance across various degradations.

DetailsMotivation: Existing methods for object detection in adverse weather are limited to specific conditions and lack generalization. UniDet-D addresses this by leveraging theoretical insights into visual detail loss.

Method: UniDet-D integrates a dynamic spectral attention mechanism to adaptively focus on informative spectral components, enabling robust feature representation for diverse degradations.

Result: UniDet-D achieves superior detection accuracy across adverse weather conditions and generalizes well to unseen scenarios like sandstorms and rain-fog mixtures.

Conclusion: UniDet-D is a promising solution for real-world object detection in adverse weather, offering strong generalization and performance.

Abstract: Real-world object detection is a challenging task where the captured images/videos often suffer from complex degradations due to various adverse weather conditions such as rain, fog, snow, low-light, etc. Despite extensive prior efforts, most existing methods are designed for one specific type of adverse weather with constraints of poor generalization, under-utilization of visual features while handling various image degradations. Leveraging a theoretical analysis on how critical visual details are lost in adverse-weather images, we design UniDet-D, a unified framework that tackles the challenge of object detection under various adverse weather conditions, and achieves object detection and image restoration within a single network. Specifically, the proposed UniDet-D incorporates a dynamic spectral attention mechanism that adaptively emphasizes informative spectral components while suppressing irrelevant ones, enabling more robust and discriminative feature representation across various degradation types. Extensive experiments show that UniDet-D achieves superior detection accuracy across different types of adverse-weather degradation. Furthermore, UniDet-D demonstrates superior generalization towards unseen adverse weather conditions such as sandstorms and rain-fog mixtures, highlighting its great potential for real-world deployment.

[285] Understanding and Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding

Youze Wang, Zijun Chen, Ruoyu Chen, Shishen Gu, Wenbo Hu, Jiayang Liu, Yinpeng Dong, Hang Su, Jun Zhu, Meng Wang, Richang Hong

Main category: cs.CV

TL;DR: Trust-videoLLMs evaluates 23 videoLLMs across truthfulness, robustness, safety, fairness, and privacy, revealing limitations in dynamic scene comprehension and real-world risk mitigation.

DetailsMotivation: Address challenges like factual inaccuracies, harmful content, biases, hallucinations, and privacy risks in videoLLMs.

Method: Introduces Trust-videoLLMs, a benchmark with 30 tasks using adapted, synthetic, and annotated videos to assess spatiotemporal risks and cross-modal impact.

Result: Open-source models sometimes outperform, but proprietary models generally show better credibility; scaling doesn’t consistently improve performance.

Conclusion: Highlights the need for diverse training data and robust multimodal alignment; provides a toolkit for standardized trustworthiness assessments.

Abstract: Recent advancements in multimodal large language models for video understanding (videoLLMs) have enhanced their capacity to process complex spatiotemporal data. However, challenges such as factual inaccuracies, harmful content, biases, hallucinations, and privacy risks compromise their reliability. This study introduces Trust-videoLLMs, a first comprehensive benchmark evaluating 23 state-of-the-art videoLLMs (5 commercial, 18 open-source) across five critical dimensions: truthfulness, robustness, safety, fairness, and privacy. Comprising 30 tasks with adapted, synthetic, and annotated videos, the framework assesses spatiotemporal risks, temporal consistency and cross-modal impact. Results reveal significant limitations in dynamic scene comprehension, cross-modal perturbation resilience and real-world risk mitigation. While open-source models occasionally outperform, proprietary models generally exhibit superior credibility, though scaling does not consistently improve performance. These findings underscore the need for enhanced training datat diversity and robust multimodal alignment. Trust-videoLLMs provides a publicly available, extensible toolkit for standardized trustworthiness assessments, addressing the critical gap between accuracy-focused benchmarks and demands for robustness, safety, fairness, and privacy.

[286] Boost Self-Supervised Dataset Distillation via Parameterization, Predefined Augmentation, and Approximation

Sheng-Feng Yu, Jia-Jiun Yao, Wei-Chen Chiu

Main category: cs.CV

TL;DR: The paper introduces Self-Supervised Dataset Distillation (SSDD), a method to distill large datasets into compact sets while preserving performance. It addresses challenges like cross-architecture generalization and data augmentation instability with novel techniques.

DetailsMotivation: Large datasets increase training costs, making dataset distillation essential. Existing methods focus on supervised datasets, but this work targets self-supervised learning to enhance efficiency and generalizability.

Method: Proposes SSDD with three techniques: 1) low-dimensional bases for parameterization, 2) predetermined augmentations to reduce instability, and 3) a lightweight network to model representation connections.

Result: SSDD outperforms existing methods in distillation efficiency, cross-architecture generalization, and transfer learning, validated on multiple datasets.

Conclusion: SSDD effectively distills self-supervised datasets, offering compact, high-performance distilled sets with improved generalizability and stability.

Abstract: Although larger datasets are crucial for training large deep models, the rapid growth of dataset size has brought a significant challenge in terms of considerable training costs, which even results in prohibitive computational expenses. Dataset Distillation becomes a popular technique recently to reduce the dataset size via learning a highly compact set of representative exemplars, where the model trained with these exemplars ideally should have comparable performance with respect to the one trained with the full dataset. While most of existing works upon dataset distillation focus on supervised datasets, we instead aim to distill images and their self-supervisedly trained representations into a distilled set. This procedure, named as Self-Supervised Dataset Distillation, effectively extracts rich information from real datasets, yielding the distilled sets with enhanced cross-architecture generalizability. Particularly, in order to preserve the key characteristics of original dataset more faithfully and compactly, several novel techniques are proposed: 1) we introduce an innovative parameterization upon images and representations via distinct low-dimensional bases, where the base selection for parameterization is experimentally shown to play a crucial role; 2) we tackle the instability induced by the randomness of data augmentation – a key component in self-supervised learning but being underestimated in the prior work of self-supervised dataset distillation – by utilizing predetermined augmentations; 3) we further leverage a lightweight network to model the connections among the representations of augmented views from the same image, leading to more compact pairs of distillation. Extensive experiments conducted on various datasets validate the superiority of our approach in terms of distillation efficiency, cross-architecture generalization, and transfer learning performance.

[287] BSMamba: Brightness and Semantic Modeling for Long-Range Interaction in Low-Light Image Enhancement

Tongshun Zhang, Pingping Liu, Mengen Cai, Zijian Zhang, Yubing Lu, Qiuzhan Zhou

Main category: cs.CV

TL;DR: BSMamba introduces a novel visual Mamba architecture for low-light image enhancement, combining Brightness Mamba and Semantic Mamba to improve brightness and preserve semantics.

DetailsMotivation: Existing methods struggle with brightness enhancement and semantic consistency in low-light images, and current visual Mamba approaches limit token interactions.

Method: BSMamba uses Brightness Mamba for brightness-guided token interactions and Semantic Mamba for semantic consistency, avoiding fixed scanning patterns.

Result: BSMamba achieves state-of-the-art performance in low-light image enhancement while maintaining semantic consistency.

Conclusion: BSMamba effectively addresses limitations in brightness and semantic preservation, outperforming existing methods.

Abstract: Current low-light image enhancement (LLIE) methods face significant limitations in simultaneously improving brightness while preserving semantic consistency, fine details, and computational efficiency. With the emergence of state-space models, particularly Mamba, image restoration has achieved remarkable performance, yet existing visual Mamba approaches flatten 2D images into 1D token sequences using fixed scanning rules, critically limiting interactions between distant tokens with causal relationships and constraining their ability to capture meaningful long-range dependencies. To address these fundamental limitations, we propose BSMamba, a novel visual Mamba architecture comprising two specially designed components: Brightness Mamba and Semantic Mamba. The Brightness Mamba revolutionizes token interaction patterns by prioritizing connections between distant tokens with similar brightness levels, effectively addressing the challenge of brightness restoration in LLIE tasks through brightness-guided selective attention. Complementing this, the Semantic Mamba establishes priority interactions between tokens sharing similar semantic meanings, allowing the model to maintain contextual consistency by connecting semantically related regions across the image, thus preserving the hierarchical nature of image semantics during enhancement. By intelligently modeling tokens based on brightness and semantic similarity rather than arbitrary scanning patterns, BSMamba transcends the constraints of conventional token sequencing while adhering to the principles of causal modeling. Extensive experiments demonstrate that BSMamba achieves state-of-the-art performance in LLIE while preserving semantic consistency. Code is available at https://github.com/bywlzts/BSMamba.

[288] MoCHA: Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention

Yuqi Pang, Bowen Yang, Yun Cao, Rong Fan, Xiaoyu Li, Chen He

Main category: cs.CV

TL;DR: MoCHA is a novel visual framework integrating multiple vision backbones and a sparse Mixture of Experts Connectors (MoECs) to enhance visual feature extraction and modality bridging, outperforming state-of-the-art models.

DetailsMotivation: Address high training/inference costs and challenges in extracting visual details and bridging modalities in Vision Large Language Models (VLLMs).

Method: Integrates CLIP, SigLIP, DINOv2, and ConvNeXt backbones with MoECs for dynamic expert selection and Hierarchical Group Attention (HGA) for adaptive feature gating.

Result: Outperforms open-weight models, e.g., 3.25% improvement in POPE and 153-point rise on MME.

Conclusion: MoCHA’s MoECs and HGA are effective and robust, enhancing performance in visual tasks.

Abstract: Vision large language models (VLLMs) are focusing primarily on handling complex and fine-grained visual information by incorporating advanced vision encoders and scaling up visual models. However, these approaches face high training and inference costs, as well as challenges in extracting visual details, effectively bridging across modalities. In this work, we propose a novel visual framework, MoCHA, to address these issues. Our framework integrates four vision backbones (i.e., CLIP, SigLIP, DINOv2 and ConvNeXt) to extract complementary visual features and is equipped with a sparse Mixture of Experts Connectors (MoECs) module to dynamically select experts tailored to different visual dimensions. To mitigate redundant or insufficient use of the visual information encoded by the MoECs module, we further design a Hierarchical Group Attention (HGA) with intra- and inter-group operations and an adaptive gating strategy for encoded visual features. We train MoCHA on two mainstream LLMs (e.g., Phi2-2.7B and Vicuna-7B) and evaluate their performance across various benchmarks. Notably, MoCHA outperforms state-of-the-art open-weight models on various tasks. For example, compared to CuMo (Mistral-7B), our MoCHA (Phi2-2.7B) presents outstanding abilities to mitigate hallucination by showing improvements of 3.25% in POPE and to follow visual instructions by raising 153 points on MME. Finally, ablation studies further confirm the effectiveness and robustness of the proposed MoECs and HGA in improving the overall performance of MoCHA.

[289] MedFormer: Hierarchical Medical Vision Transformer with Content-Aware Dual Sparse Selection Attention

Zunhui Xia, Hongxing Li, Libin Lan

Main category: cs.CV

TL;DR: MedFormer is an efficient medical vision transformer addressing task-specific limitations and high computational costs in medical image recognition. It uses a pyramid scaling structure and Dual Sparse Selection Attention (DSSA) for versatility and efficiency.

DetailsMotivation: Current vision transformer methods in medical image recognition are task-specific and computationally expensive, limiting their general applicability and performance.

Method: MedFormer employs a pyramid scaling structure for hierarchical feature representation and introduces DSSA for efficient, content-aware attention.

Result: MedFormer outperforms existing methods in generality and efficiency, enhancing performance in classification, segmentation, and lesion detection tasks.

Conclusion: MedFormer offers an efficient, versatile solution for medical image recognition with strong clinical potential.

Abstract: Medical image recognition serves as a key way to aid in clinical diagnosis, enabling more accurate and timely identification of diseases and abnormalities. Vision transformer-based approaches have proven effective in handling various medical recognition tasks. However, these methods encounter two primary challenges. First, they are often task-specific and architecture-tailored, limiting their general applicability. Second, they usually either adopt full attention to model long-range dependencies, resulting in high computational costs, or rely on handcrafted sparse attention, potentially leading to suboptimal performance. To tackle these issues, we present MedFormer, an efficient medical vision transformer with two key ideas. First, it employs a pyramid scaling structure as a versatile backbone for various medical image recognition tasks, including image classification and dense prediction tasks such as semantic segmentation and lesion detection. This structure facilitates hierarchical feature representation while reducing the computation load of feature maps, highly beneficial for boosting performance. Second, it introduces a novel Dual Sparse Selection Attention (DSSA) with content awareness to improve computational efficiency and robustness against noise while maintaining high performance. As the core building technique of MedFormer, DSSA is designed to explicitly attend to the most relevant content. Theoretical analysis demonstrates that MedFormer outperforms existing medical vision transformers in terms of generality and efficiency. Extensive experiments across various imaging modality datasets show that MedFormer consistently enhances performance in all three medical image recognition tasks mentioned above. MedFormer provides an efficient and versatile solution for medical image recognition, with strong potential for clinical application.

[290] Vision-Language Fusion for Real-Time Autonomous Driving: Goal-Centered Cross-Attention of Camera, HD-Map, & Waypoints

Santosh Patapati, Trisanth Srinivasan, Murari Ambati

Main category: cs.CV

TL;DR: XYZ-Drive integrates vision, maps, and waypoints into a single model for autonomous driving, achieving high success rates and efficiency.

DetailsMotivation: Autonomous cars require both geometric accuracy and semantic understanding, typically handled separately. XYZ-Drive unifies these aspects for better performance.

Method: Uses a vision-language model with goal-centered cross-attention to fuse front-camera frames, overhead maps, and waypoints. A fine-tuned LLaMA-3.2 11B model processes fused tokens.

Result: Achieves 95% success and 0.80 SPL on MD-NEX benchmark, outperforming PhysNav-DG by 15% and reducing collisions. Ablations confirm the importance of each modality and fusion method.

Conclusion: Early token-level fusion of intent and map layout enables accurate, transparent, and real-time autonomous driving.

Abstract: Autonomous cars need geometric accuracy and semantic understanding to navigate complex environments, yet most stacks handle them separately. We present XYZ-Drive, a single vision-language model that reads a front-camera frame, a 25m $\times$ 25m overhead map, and the next waypoint, then outputs steering and speed. A lightweight goal-centered cross-attention layer lets waypoint tokens highlight relevant image and map patches, supporting both action and textual explanations, before the fused tokens enter a partially fine-tuned LLaMA-3.2 11B model. On the MD-NEX Outdoor-Driving benchmark XYZ-Drive attains 95% success and 0.80 Success weighted by Path Length (SPL), surpassing PhysNav-DG by 15%. and halving collisions, all while significantly improving efficiency by using only a single branch. Sixteen ablations explain the gains. Removing any modality (vision, waypoint, map) drops success by up to 11%, confirming their complementary roles and rich connections. Replacing goal-centered attention with simple concatenation cuts 3% in performance, showing query-based fusion injects map knowledge more effectively. Keeping the transformer frozen loses 5%, showing the importance of fine-tuning when applying VLMs for specific tasks such as autonomous driving. Coarsening map resolution from 10 cm to 40 cm blurs lane edges and raises crash rate. Overall, these results demonstrate that early, token-level fusion of intent and map layout enables accurate, transparent, real-time driving.

[291] Breaking Imitation Bottlenecks: Reinforced Diffusion Powers Diverse Trajectory Generation

Ziying Song, Lin Liu, Hongyu Pan, Bencheng Liao, Mingzhe Guo, Lei Yang, Yongchang Zhang, Shaoqing Xu, Caiyan Jia, Yadan Luo

Main category: cs.CV

TL;DR: DIVER is an end-to-end driving framework combining reinforcement learning and diffusion-based generation to create diverse, feasible trajectories, overcoming limitations of imitation learning.

DetailsMotivation: Imitation learning from single expert demonstrations leads to conservative behaviors and poor generalization in complex scenarios. DIVER aims to enhance diversity and practicality.

Method: DIVER uses a reinforced diffusion-based mechanism to generate multiple trajectories from one ground-truth, guided by reinforcement learning for safety and diversity. A novel Diversity metric evaluates trajectory diversity.

Result: Experiments on NAVSIM, Bench2Drive, and nuScenes show DIVER improves trajectory diversity and addresses mode collapse in imitation learning.

Conclusion: DIVER effectively enhances trajectory diversity and generalization, outperforming traditional imitation learning methods.

Abstract: Most end-to-end autonomous driving methods rely on imitation learning from single expert demonstrations, often leading to conservative and homogeneous behaviors that limit generalization in complex real-world scenarios. In this work, we propose DIVER, an end-to-end driving framework that integrates reinforcement learning with diffusion-based generation to produce diverse and feasible trajectories. At the core of DIVER lies a reinforced diffusion-based generation mechanism. First, the model conditions on map elements and surrounding agents to generate multiple reference trajectories from a single ground-truth trajectory, alleviating the limitations of imitation learning that arise from relying solely on single expert demonstrations. Second, reinforcement learning is employed to guide the diffusion process, where reward-based supervision enforces safety and diversity constraints on the generated trajectories, thereby enhancing their practicality and generalization capability. Furthermore, to address the limitations of L2-based open-loop metrics in capturing trajectory diversity, we propose a novel Diversity metric to evaluate the diversity of multi-mode predictions.Extensive experiments on the closed-loop NAVSIM and Bench2Drive benchmarks, as well as the open-loop nuScenes dataset, demonstrate that DIVER significantly improves trajectory diversity, effectively addressing the mode collapse problem inherent in imitation learning.

[292] D3: Training-Free AI-Generated Video Detection Using Second-Order Features

Chende Zheng, Ruiqi suo, Chenhao Lin, Zhengyu Zhao, Le Yang, Shuai Liu, Minghui Yang, Cong Wang, Chao Shen

Main category: cs.CV

TL;DR: The paper introduces D3, a training-free method for detecting AI-generated videos by analyzing second-order temporal artifacts, outperforming existing methods.

DetailsMotivation: Public concern over synthetic video content necessitates better detection methods, as current approaches lack exploration of temporal artifacts.

Method: The authors propose a theoretical framework using second-order dynamical analysis and introduce D3, leveraging second-order temporal discrepancies for detection.

Result: D3 outperforms prior methods by 10.39% mean Average Precision on Gen-Video and shows computational efficiency and robustness.

Conclusion: D3 provides an effective, efficient solution for detecting AI-generated videos, validated across multiple datasets.

Abstract: The evolution of video generation techniques, such as Sora, has made it increasingly easy to produce high-fidelity AI-generated videos, raising public concern over the dissemination of synthetic content. However, existing detection methodologies remain limited by their insufficient exploration of temporal artifacts in synthetic videos. To bridge this gap, we establish a theoretical framework through second-order dynamical analysis under Newtonian mechanics, subsequently extending the Second-order Central Difference features tailored for temporal artifact detection. Building on this theoretical foundation, we reveal a fundamental divergence in second-order feature distributions between real and AI-generated videos. Concretely, we propose Detection by Difference of Differences (D3), a novel training-free detection method that leverages the above second-order temporal discrepancies. We validate the superiority of our D3 on 4 open-source datasets (Gen-Video, VideoPhy, EvalCrafter, VidProM), 40 subsets in total. For example, on GenVideo, D3 outperforms the previous best method by 10.39% (absolute) mean Average Precision. Additional experiments on time cost and post-processing operations demonstrate D3’s exceptional computational efficiency and strong robust performance. Our code is available at https://github.com/Zig-HS/D3.

[293] Towards Imperceptible JPEG Image Hiding: Multi-range Representations-driven Adversarial Stego Generation

Junxue Yang, Xin Liao, Weixuan Tang, Jianhua Yang, Zheng Qin

Main category: cs.CV

TL;DR: MRAG is a framework for JPEG image hiding using multi-range representations and adversarial attacks to improve imperceptibility and evade detection.

DetailsMotivation: Existing image hiding schemes are easily detected due to limitations in cover type, feature extraction, and loss constraints.

Method: MRAG combines local-range (convolution) and global-range (transformer) modeling, with a features angle-norm disentanglement loss for adversarial attacks.

Result: MRAG achieves visual and steganalysis imperceptibility, outperforming existing methods.

Conclusion: MRAG advances image hiding by integrating multi-range representations and adversarial techniques, demonstrating state-of-the-art performance.

Abstract: Image hiding fully explores the hidden potential of deep learning-based models, aiming to conceal image-level messages within cover images and reveal them from stego images to achieve covert communication. Existing hiding schemes are easily detected by the naked eyes or steganalyzers due to the cover type confined to the spatial domain, single-range feature extraction and attacks, and insufficient loss constraints. To address these issues, we propose a multi-range representations-driven adversarial stego generation framework called MRAG for JPEG image hiding. This design stems from the fact that steganalyzers typically combine local-range and global-range information to better capture hidden traces. Specifically, MRAG integrates the local-range characteristic of the convolution and the global-range modeling of the transformer. Meanwhile, a features angle-norm disentanglement loss is designed to launch multi-range representations-driven feature-level adversarial attacks. It computes the adversarial loss between covers and stegos based on the surrogate steganalyzer’s classified features, i.e., the features before the last fully connected layer. Under the dual constraints of features angle and norm, MRAG can delicately encode the concatenation of cover and secret into subtle adversarial perturbations from local and global ranges relevant to steganalysis. Therefore, the resulting stego can achieve visual and steganalysis imperceptibility. Moreover, coarse-grained and fine-grained frequency decomposition operations are devised to transform the input, introducing multi-grained information. Extensive experiments demonstrate that MRAG can achieve state-of-the-art performance.

[294] MM-Gesture: Towards Precise Micro-Gesture Recognition through Multimodal Fusion

Jihao Gu, Fei Wang, Kun Li, Yanyan Wei, Zhiliang Wu, Dan Guo

Main category: cs.CV

TL;DR: MM-Gesture is a multimodal fusion framework for micro-gesture recognition, achieving top performance in the MiGA Challenge with 73.213% accuracy.

DetailsMotivation: To address the challenge of recognizing subtle and short-duration micro-gestures by integrating multiple modalities.

Method: Uses PoseConv3D and Video Swin Transformer with a modality-weighted ensemble strategy, enhanced by transfer learning from the MA-52 dataset.

Result: Achieved 73.213% top-1 accuracy on the iMiGUE benchmark, outperforming previous methods.

Conclusion: MM-Gesture is effective for micro-gesture recognition, validated by ablation studies and benchmark performance.

Abstract: In this paper, we present MM-Gesture, the solution developed by our team HFUT-VUT, which ranked 1st in the micro-gesture classification track of the 3rd MiGA Challenge at IJCAI 2025, achieving superior performance compared to previous state-of-the-art methods. MM-Gesture is a multimodal fusion framework designed specifically for recognizing subtle and short-duration micro-gestures (MGs), integrating complementary cues from joint, limb, RGB video, Taylor-series video, optical-flow video, and depth video modalities. Utilizing PoseConv3D and Video Swin Transformer architectures with a novel modality-weighted ensemble strategy, our method further enhances RGB modality performance through transfer learning pre-trained on the larger MA-52 dataset. Extensive experiments on the iMiGUE benchmark, including ablation studies across different modalities, validate the effectiveness of our proposed approach, achieving a top-1 accuracy of 73.213%. Code is available at: https://github.com/momiji-bit/MM-Gesture.

[295] Style Composition within Distinct LoRA modules for Traditional Art

Jaehyun Lee, Wonhark Park, Wonsik Shin, Hyunho Lee, Hyoung Min Na, Nojun Kwak

Main category: cs.CV

TL;DR: A zero-shot diffusion pipeline for blending multiple artistic styles in text-to-image models by fusing denoised latents from separately trained models, using spatial masks for precise control.

DetailsMotivation: Existing diffusion models struggle with controlled, region-specific style blending due to entangled latent spaces and lack of smooth interpolation.

Method: Proposes a pipeline that performs style composition on denoised latents from specialized models, leveraging spatial masks and depth-map conditioning for coherence.

Result: Achieves precise, region-specific style mixing while preserving individual style fidelity, validated by qualitative and quantitative experiments.

Conclusion: The method enables user-guided, controlled blending of multiple styles in text-to-image synthesis, addressing limitations of current approaches.

Abstract: Diffusion-based text-to-image models have achieved remarkable results in synthesizing diverse images from text prompts and can capture specific artistic styles via style personalization. However, their entangled latent space and lack of smooth interpolation make it difficult to apply distinct painting techniques in a controlled, regional manner, often causing one style to dominate. To overcome this, we propose a zero-shot diffusion pipeline that naturally blends multiple styles by performing style composition on the denoised latents predicted during the flow-matching denoising process of separately trained, style-specialized models. We leverage the fact that lower-noise latents carry stronger stylistic information and fuse them across heterogeneous diffusion pipelines using spatial masks, enabling precise, region-specific style control. This mechanism preserves the fidelity of each individual style while allowing user-guided mixing. Furthermore, to ensure structural coherence across different models, we incorporate depth-map conditioning via ControlNet into the diffusion framework. Qualitative and quantitative experiments demonstrate that our method successfully achieves region-specific style mixing according to the given masks.

[296] AD-GS: Object-Aware B-Spline Gaussian Splatting for Self-Supervised Autonomous Driving

Jiawei Xu, Kai Deng, Zexin Fan, Shenlong Wang, Jin Xie, Jian Yang

Main category: cs.CV

TL;DR: AD-GS is a self-supervised framework for high-quality free-viewpoint rendering of driving scenes, eliminating the need for costly manual annotations while outperforming current methods.

DetailsMotivation: Current methods for dynamic urban driving scenes rely on expensive manual annotations or suffer from inaccuracies in motion capture and scene decomposition, leading to rendering artifacts.

Method: AD-GS uses a learnable motion model combining B-spline curves and trigonometric functions, pseudo 2D segmentation for automatic scene decomposition, dynamic Gaussians, and bidirectional temporal visibility masks. It also includes visibility reasoning and rigid regularization.

Result: AD-GS outperforms state-of-the-art annotation-free methods and competes with annotation-dependent approaches in rendering quality.

Conclusion: AD-GS provides a robust, annotation-free solution for high-quality dynamic scene rendering, addressing limitations of current methods.

Abstract: Modeling and rendering dynamic urban driving scenes is crucial for self-driving simulation. Current high-quality methods typically rely on costly manual object tracklet annotations, while self-supervised approaches fail to capture dynamic object motions accurately and decompose scenes properly, resulting in rendering artifacts. We introduce AD-GS, a novel self-supervised framework for high-quality free-viewpoint rendering of driving scenes from a single log. At its core is a novel learnable motion model that integrates locality-aware B-spline curves with global-aware trigonometric functions, enabling flexible yet precise dynamic object modeling. Rather than requiring comprehensive semantic labeling, AD-GS automatically segments scenes into objects and background with the simplified pseudo 2D segmentation, representing objects using dynamic Gaussians and bidirectional temporal visibility masks. Further, our model incorporates visibility reasoning and physically rigid regularization to enhance robustness. Extensive evaluations demonstrate that our annotation-free model significantly outperforms current state-of-the-art annotation-free methods and is competitive with annotation-dependent approaches.

[297] Forecasting When to Forecast: Accelerating Diffusion Models with Confidence-Gated Taylor

Xiaoliu Guan, Lielin Jiang, Hanqi Chen, Xu Zhang, Jiaxing Yan, Guanzhong Wang, Yi Liu, Zetao Zhang, Yu Wu

Main category: cs.CV

TL;DR: The paper proposes a dynamic caching mechanism for Taylor-based acceleration in Diffusion Transformers (DiTs) to improve inference speed without significant quality loss.

DetailsMotivation: Current training-free approaches for accelerating DiTs suffer from memory overhead and fixed caching schedules, leading to degraded outputs when predictions fail.

Method: The method shifts Taylor prediction to the last block level, reducing cached features, and introduces a dynamic caching mechanism based on prediction reliability.

Result: The approach achieves significant speedups (3.17x on FLUX, 2.36x on DiT, 4.14x on Wan Video) with negligible quality drop.

Conclusion: The proposed method effectively balances speed and quality in DiTs, addressing limitations of prior work.

Abstract: Diffusion Transformers (DiTs) have demonstrated remarkable performance in visual generation tasks. However, their low inference speed limits their deployment in low-resource applications. Recent training-free approaches exploit the redundancy of features across timesteps by caching and reusing past representations to accelerate inference. Building on this idea, TaylorSeer instead uses cached features to predict future ones via Taylor expansion. However, its module-level prediction across all transformer blocks (e.g., attention or feedforward modules) requires storing fine-grained intermediate features, leading to notable memory and computation overhead. Moreover, it adopts a fixed caching schedule without considering the varying accuracy of predictions across timesteps, which can lead to degraded outputs when prediction fails. To address these limitations, we propose a novel approach to better leverage Taylor-based acceleration. First, we shift the Taylor prediction target from the module level to the last block level, significantly reducing the number of cached features. Furthermore, observing strong sequential dependencies among Transformer blocks, we propose to use the error between the Taylor-estimated and actual outputs of the first block as an indicator of prediction reliability. If the error is small, we trust the Taylor prediction for the last block; otherwise, we fall back to full computation, thereby enabling a dynamic caching mechanism. Empirical results show that our method achieves a better balance between speed and quality, achieving a 3.17x acceleration on FLUX, 2.36x on DiT, and 4.14x on Wan Video with negligible quality drop. The Project Page is \href{https://cg-taylor-acce.github.io/CG-Taylor/}{here.}

[298] PositionIC: Unified Position and Identity Consistency for Image Customization

Junjie Hu, Tianyang Han, Kai Ma, Jialin Gao, Hao Dou, Song Yang, Xianhua He, Jianhui Zhang, Junfeng Luo, Xiaoming Wei, Wenqiang Zhang

Main category: cs.CV

TL;DR: PositionIC introduces a framework for precise spatial control in image customization, addressing the lack of scalable datasets for multi-subject scenarios.

DetailsMotivation: Current image customization lacks fine-grained spatial control due to missing datasets with identity-position binding.

Method: Uses a bidirectional generation pipeline for subject consistency and a positional modulation operation for spatial decoupling.

Result: Achieves precise spatial control and high fidelity in multi-subject customization.

Conclusion: PositionIC enables controllable, high-fidelity customization in open-world scenarios and will be released for research.

Abstract: Recent subject-driven image customization has achieved significant advancements in fidelity, yet fine-grained instance-level spatial control remains elusive, hindering broader real-world application. This limitation is mainly attributed to the absence of scalable datasets that bind identity with precise positional cues. To this end, we introduce PositionIC, a unified framework that enforces position and identity consistency for multi-subject customization. We construct a scalable synthesis pipeline that employs a bidirectional generation paradigm to eliminate subject drift and maintain semantic coherence. On top of these data, we design a lightweight positional modulation operation that decouples spatial embeddings among subjects, enabling independent, accurate placement while preserving visual fidelity. Extensive experiments demonstrate that our approach can achieve precise spatial control while maintaining high consistency in image customization tasks. PositionIC paves the way for controllable, high-fidelity image customization in open-world, multi-entity scenarios and will be released to foster further research.

[299] Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning

Shashanka Venkataramanan, Valentinos Pariza, Mohammadreza Salehi, Lukas Knobel, Spyros Gidaris, Elias Ramzi, Andrei Bursuc, Yuki M. Asano

Main category: cs.CV

TL;DR: Franca is the first fully open-source vision foundation model, outperforming proprietary models like DINOv2 and CLIP. It uses transparent training with public data and introduces innovations in SSL clustering and positional disentanglement.

DetailsMotivation: To create a high-performance, transparent, and open-source vision foundation model that addresses limitations in SSL clustering and positional biases.

Method: Uses a transparent training pipeline with public datasets (ImageNet-21K, ReLAION-2B), introduces a multi-head clustering projector with nested Matryoshka representations, and employs positional disentanglement to remove biases.

Result: Franca matches or surpasses proprietary models, improves feature space clarity, and achieves consistent gains on downstream benchmarks.

Conclusion: Franca sets a new standard for transparent, high-performance vision models, promoting reproducibility and generalizability in AI.

Abstract: We present Franca (pronounced Fran-ka): free one; the first fully open-source (data, code, weights) vision foundation model that matches and in many cases surpasses the performance of state-of-the-art proprietary models, e.g., DINOv2, CLIP, SigLIPv2, etc. Our approach is grounded in a transparent training pipeline inspired by Web-SSL and uses publicly available data: ImageNet-21K and a subset of ReLAION-2B. Beyond model release, we tackle critical limitations in SSL clustering methods. While modern models rely on assigning image features to large codebooks via clustering algorithms like Sinkhorn-Knopp, they fail to account for the inherent ambiguity in clustering semantics. To address this, we introduce a parameter-efficient, multi-head clustering projector based on nested Matryoshka representations. This design progressively refines features into increasingly fine-grained clusters without increasing the model size, enabling both performance and memory efficiency. Additionally, we propose a novel positional disentanglement strategy that explicitly removes positional biases from dense representations, thereby improving the encoding of semantic content. This leads to consistent gains on several downstream benchmarks, demonstrating the utility of cleaner feature spaces. Our contributions establish a new standard for transparent, high-performance vision models and open a path toward more reproducible and generalizable foundation models for the broader AI community. The code and model checkpoints are available at https://github.com/valeoai/Franca.

[300] Negation-Aware Test-Time Adaptation for Vision-Language Models

Haochen Han, Alex Jinpeng Wang, Fangming Liu, Jun Zhu

Main category: cs.CV

TL;DR: The paper addresses the challenge of negation understanding in Vision-Language Models (VLMs), proposing a low-resource method called NEAT to adapt models during inference without extensive training.

DetailsMotivation: Real-world applications, like medical imaging, require models to identify false or non-existent elements, but VLMs struggle with negation due to data scarcity.

Method: The proposed NEAT method adjusts distribution-related parameters during inference to handle negation by addressing dual-concept shifts between affirmation and negation distributions.

Result: NEAT achieves comparable or superior performance to state-of-the-art methods with minimal trainable parameters (less than 0.01%).

Conclusion: NEAT offers a sustainable, efficient solution for negation understanding in VLMs, validated across various tasks.

Abstract: In this paper, we study a practical but less-touched problem in Vision-Language Models (VLMs), \ie, negation understanding. Specifically, many real-world applications require models to explicitly identify what is false or non-existent, \eg, radiologists may search for images that exclude specific conditions. Despite the impressive transferability of VLMs through large-scale training, they suffer from a critical limitation that fails to handle negation. To address this challenge, existing methods attribute its root cause to the scarcity of negation training data and propose to fine-tune VLMs on massive data containing explicit negation. Undoubtedly, such data-centric solutions demand substantial data and computational resources, limiting their sustainable widespread adoption. To tackle negation in a low-carbon manner, we empirically observe that the key obstacle lies in the dual-concept shifts between the affirmation and negation distributions. Therefore, we propose a Negation-Aware Test-Time Adaptation (NEAT) method to efficiently adjust distribution-related parameters during inference. In brief, NEAT can reduce distribution shift in consistent semantics while eliminating false distributional consistency in unrelated semantics. Extensive experiments on the various negation understanding tasks verify the effectiveness of the proposed method. Remarkably, with less than 0.01% of trainable parameters, NEAT achieves comparable or superior performance to state-of-the-art post-training approaches. Our code is available at https://github.com/hhc1997/NEAT.

[301] Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback

Yang Chen, Yufan Shen, Wenxuan Huang, Sheng Zhou, Qunshu Lin, Xinyu Cai, Zhi Yu, Jiajun Bu, Botian Shi, Yu Qiao

Main category: cs.CV

TL;DR: RRVF framework reduces reliance on curated image-text supervision for MLLMs by using raw images and reinforcement learning, outperforming existing models in visual reasoning tasks.

DetailsMotivation: The heavy reliance on curated image-text supervision limits MLLMs' deep visual reasoning capabilities. RRVF aims to overcome this bottleneck.

Method: RRVF uses a closed-loop process of reasoning, rendering, and visual feedback, optimized via GRPO algorithm, to learn from raw images.

Result: RRVF-trained models outperform similarly sized open-source MLLMs and supervised baselines, showing superior generalization.

Conclusion: RRVF effectively enhances MLLMs’ visual reasoning with minimal supervision, demonstrating strong performance and generalization.

Abstract: Multimodal Large Language Models (MLLMs) exhibit impressive performance across various visual tasks. Subsequent investigations into enhancing their visual reasoning abilities have significantly expanded their performance envelope. However, a critical bottleneck in the advancement of MLLMs toward deep visual reasoning is their heavy reliance on curated image-text supervision. To solve this problem, we introduce a novel framework, Reasoning-Rendering-Visual-Feedback'' (RRVF), that enables MLLMs to learn complex visual reasoning from only raw images. This framework builds on the Asymmetry of Verification’’ principle, i.e., verifying the rendered output against the source image is substantially easier than performing deep visual reasoning to generate a faithful, structured representation such as code. We demonstrate that this relative ease provides an ideal reward signal for optimization via Reinforcement Learning (RL), thereby reducing reliance on image-text supervision. RRVF implements a closed-loop iterative process encompassing reasoning, rendering, and visual feedback components, enabling the model to perform complex reasoning, including self-correction through multi-turn interactions. This process is optimized end-to-end using the GRPO algorithm. Extensive evaluations are conducted on image-to-code generation across two diverse domains: data charts and web interfaces. The RRVF-trained model not only outperforms existing similarly sized open-source MLLMs and supervised fine-tuning baselines but also exhibits superior generalization. Notably, the model outperforms the more advanced MLLM used to generate visual feedback during training. %only in arxiv Code is available at https://github.com/L-O-I/RRVF.

[302] Aether Weaver: Multimodal Affective Narrative Co-Generation with Dynamic Scene Graphs

Saeed Ghorbani

Main category: cs.CV

TL;DR: Aether Weaver is a multimodal narrative co-generation framework that integrates text, visuals, and sound for immersive storytelling, outperforming sequential pipelines.

DetailsMotivation: To overcome limitations of sequential text-to-visual pipelines by enabling concurrent generation of narratives, visuals, and soundscapes for richer storytelling.

Method: Uses a Narrator (LLM) for text and prompts, a Director for scene graph management, a Narrative Arc Controller for story structure, and an Affective Tone Mapper for emotional consistency.

Result: Qualitative evaluations show enhanced narrative depth, visual fidelity, and emotional resonance compared to baselines.

Conclusion: Aether Weaver offers a robust platform for creative prototyping and immersive storytelling.

Abstract: We introduce Aether Weaver, a novel, integrated framework for multimodal narrative co-generation that overcomes limitations of sequential text-to-visual pipelines. Our system concurrently synthesizes textual narratives, dynamic scene graph representations, visual scenes, and affective soundscapes, driven by a tightly integrated, co-generation mechanism. At its core, the Narrator, a large language model, generates narrative text and multimodal prompts, while the Director acts as a dynamic scene graph manager, and analyzes the text to build and maintain a structured representation of the story’s world, ensuring spatio-temporal and relational consistency for visual rendering and subsequent narrative generation. Additionally, a Narrative Arc Controller guides the high-level story structure, influencing multimodal affective consistency, further complemented by an Affective Tone Mapper that ensures congruent emotional expression across all modalities. Through qualitative evaluations on a diverse set of narrative prompts encompassing various genres, we demonstrate that Aether Weaver significantly enhances narrative depth, visual fidelity, and emotional resonance compared to cascaded baseline approaches. This integrated framework provides a robust platform for rapid creative prototyping and immersive storytelling experiences.

[303] Motion Matters: Motion-guided Modulation Network for Skeleton-based Micro-Action Recognition

Jihao Gu, Kun Li, Fei Wang, Yanyan Wei, Zhiliang Wu, Hehe Fan, Meng Wang

Main category: cs.CV

TL;DR: The paper introduces a Motion-guided Modulation Network (MMN) to improve Micro-Action Recognition by capturing subtle motion cues, achieving state-of-the-art results.

DetailsMotivation: Existing methods overlook subtle changes in Micro-Actions (MAs), limiting accuracy in distinguishing them.

Method: Proposes MMN with Motion-guided Skeletal Modulation (MSM) and Temporal Modulation (MTM) modules to enhance spatial-temporal representation learning.

Result: MMN outperforms on Micro-Action 52 and iMiGUE datasets, proving the value of modeling subtle motion cues.

Conclusion: Explicitly modeling subtle motion cues significantly improves micro-action recognition performance.

Abstract: Micro-Actions (MAs) are an important form of non-verbal communication in social interactions, with potential applications in human emotional analysis. However, existing methods in Micro-Action Recognition often overlook the inherent subtle changes in MAs, which limits the accuracy of distinguishing MAs with subtle changes. To address this issue, we present a novel Motion-guided Modulation Network (MMN) that implicitly captures and modulates subtle motion cues to enhance spatial-temporal representation learning. Specifically, we introduce a Motion-guided Skeletal Modulation module (MSM) to inject motion cues at the skeletal level, acting as a control signal to guide spatial representation modeling. In parallel, we design a Motion-guided Temporal Modulation module (MTM) to incorporate motion information at the frame level, facilitating the modeling of holistic motion patterns in micro-actions. Finally, we propose a motion consistency learning strategy to aggregate the motion cues from multi-scale features for micro-action classification. Experimental results on the Micro-Action 52 and iMiGUE datasets demonstrate that MMN achieves state-of-the-art performance in skeleton-based micro-action recognition, underscoring the importance of explicitly modeling subtle motion cues. The code will be available at https://github.com/momiji-bit/MMN.

[304] Multimodal Referring Segmentation: A Survey

Henghui Ding, Song Tang, Shuting He, Chang Liu, Zuxuan Wu, Yu-Gang Jiang

Main category: cs.CV

TL;DR: A survey on multimodal referring segmentation, covering problem definitions, datasets, methods, and applications across images, videos, and 3D scenes.

DetailsMotivation: To advance accurate object perception in visual scenes based on user instructions, leveraging modern techniques like CNNs, transformers, and large language models.

Method: Introduces a unified meta architecture, reviews methods for images, videos, and 3D scenes, and discusses Generalized Referring Expression (GREx) methods.

Result: Extensive performance comparisons on benchmarks and a continuously updated GitHub repository.

Conclusion: The survey highlights progress and challenges in multimodal referring segmentation, with ongoing tracking of advancements.

Abstract: Multimodal referring segmentation aims to segment target objects in visual scenes, such as images, videos, and 3D scenes, based on referring expressions in text or audio format. This task plays a crucial role in practical applications requiring accurate object perception based on user instructions. Over the past decade, it has gained significant attention in the multimodal community, driven by advances in convolutional neural networks, transformers, and large language models, all of which have substantially improved multimodal perception capabilities. This paper provides a comprehensive survey of multimodal referring segmentation. We begin by introducing this field’s background, including problem definitions and commonly used datasets. Next, we summarize a unified meta architecture for referring segmentation and review representative methods across three primary visual scenes, including images, videos, and 3D scenes. We further discuss Generalized Referring Expression (GREx) methods to address the challenges of real-world complexity, along with related tasks and practical applications. Extensive performance comparisons on standard benchmarks are also provided. We continually track related works at https://github.com/henghuiding/Awesome-Multimodal-Referring-Segmentation.

[305] The Promise of RL for Autoregressive Image Editing

Saba Ahmadi, Rabiul Awal, Ankur Sikarwar, Amirhossein Kazemnejad, Ge Ya Luo, Juan A. Rodriguez, Sai Rajeswar, Siva Reddy, Christopher Pal, Benno Krojer, Aishwarya Agrawal

Main category: cs.CV

TL;DR: The paper introduces EARL, an RL-based image editing model, comparing SFT, RL, and CoT strategies, with RL proving most effective.

DetailsMotivation: To enhance performance in image editing tasks by exploring and comparing three strategies: SFT, RL, and CoT.

Method: Adopts an autoregressive multimodal model to process text and visual tokens uniformly, combining RL with a large multi-modal LLM verifier.

Result: RL combined with a multi-modal LLM verifier is most effective, leading to EARL, a competitive model with less training data.

Conclusion: EARL advances autoregressive multimodal models in image editing, with code and models publicly released.

Abstract: We explore three strategies to enhance performance on a wide range of image editing tasks: supervised fine-tuning (SFT), reinforcement learning (RL), and Chain-of-Thought (CoT) reasoning. In order to study all these components in one consistent framework, we adopt an autoregressive multimodal model that processes textual and visual tokens in a unified manner. We find RL combined with a large multi-modal LLM verifier to be the most effective of these strategies. As a result, we release EARL: Editing with Autoregression and RL, a strong RL-based image editing model that performs competitively on a diverse range of edits compared to strong baselines, despite using much less training data. Thus, EARL pushes the frontier of autoregressive multimodal models on image editing. We release our code, training data, and trained models at https://github.com/mair-lab/EARL.

[306] Zero-shot Segmentation of Skin Conditions: Erythema with Edit-Friendly Inversion

Konstantinos Moutselos, Ilias Maglogiannis

Main category: cs.CV

TL;DR: A zero-shot image segmentation framework using diffusion models detects erythema by synthesizing reference images and aligning them with originals, reducing reliance on labeled datasets.

DetailsMotivation: To address the challenge of detecting erythema without requiring extensive labeled dermatological datasets, leveraging generative models for scalable and flexible diagnostic support.

Method: Uses edit-friendly inversion in diffusion models to synthesize erythema-free reference images, aligns them with originals, and performs color-space analysis with minimal user intervention.

Result: Successfully isolated facial erythema in diverse cases, outperforming baseline threshold-based techniques, demonstrating the framework’s effectiveness.

Conclusion: Combining generative diffusion models with statistical color segmentation enables efficient erythema detection without prior training data, offering a promising tool for computer-aided dermatology.

Abstract: This study proposes a zero-shot image segmentation framework for detecting erythema (redness of the skin) using edit-friendly inversion in diffusion models. The method synthesizes reference images of the same patient that are free from erythema via generative editing and then accurately aligns these references with the original images. Color-space analysis is performed with minimal user intervention to identify erythematous regions. This approach significantly reduces the reliance on labeled dermatological datasets while providing a scalable and flexible diagnostic support tool by avoiding the need for any annotated training masks. In our initial qualitative experiments, the pipeline successfully isolated facial erythema in diverse cases, demonstrating performance improvements over baseline threshold-based techniques. These results highlight the potential of combining generative diffusion models and statistical color segmentation for computer-aided dermatology, enabling efficient erythema detection without prior training data.

[307] Open-Attribute Recognition for Person Retrieval: Finding People Through Distinctive and Novel Attributes

Minjeong Park, Hongbeen Park, Sangwon Lee, Yoonha Jang, Jinkyu Kim

Main category: cs.CV

TL;DR: The paper introduces Open-Attribute Recognition for Person Retrieval (OAPR) to handle novel attributes in real-world scenarios, proposing a framework for generalizable body part representations and reconstructing datasets for validation.

DetailsMotivation: Existing PAR methods assume closed-set attributes, limiting real-world applicability where novel attributes emerge. Predefined attributes are also less discriminative for person retrieval.

Method: Proposes OAPR task and a framework to learn generalizable body part representations for diverse attributes. Reconstructs four datasets for validation.

Result: Experiments show the necessity of OAPR and the framework’s effectiveness.

Conclusion: The OAPR task addresses real-world challenges, and the proposed framework demonstrates promising results.

Abstract: Pedestrian Attribute Recognition (PAR) plays a crucial role in various vision tasks such as person retrieval and identification. Most existing attribute-based retrieval methods operate under the closed-set assumption that all attribute classes are consistently available during both training and inference. However, this assumption limits their applicability in real-world scenarios where novel attributes may emerge. Moreover, predefined attributes in benchmark datasets are often generic and shared across individuals, making them less discriminative for retrieving the target person. To address these challenges, we propose the Open-Attribute Recognition for Person Retrieval (OAPR) task, which aims to retrieve individuals based on attribute cues, regardless of whether those attributes were seen during training. To support this task, we introduce a novel framework designed to learn generalizable body part representations that cover a broad range of attribute categories. Furthermore, we reconstruct four widely used datasets for open-attribute recognition. Comprehensive experiments on these datasets demonstrate the necessity of the OAPR task and the effectiveness of our framework. The source code and pre-trained models will be publicly available upon publication.

[308] 3DRot: 3D Rotation Augmentation for RGB-Based 3D Tasks

Shitian Yang, Deyu Li, Xiaoke Jiang, Lei Zhang

Main category: cs.CV

TL;DR: 3DRot is a plug-and-play augmentation method for RGB-based 3D tasks that preserves geometric consistency by rotating and mirroring images about the camera’s optical center, improving performance in monocular 3D detection.

DetailsMotivation: RGB-based 3D tasks face challenges due to scarce annotations and limited augmentation options that disrupt geometric consistency.

Method: 3DRot rotates and mirrors images about the camera’s optical center while updating RGB images, camera intrinsics, object poses, and 3D annotations to maintain projective geometry.

Result: On SUN RGB-D, 3DRot improves IoU3D (43.21 to 44.51), reduces rotation error (22.91° to 20.93°), and boosts mAP0.5 (35.70 to 38.11).

Conclusion: 3DRot is effective for monocular 3D detection and is transferable to other 3D tasks due to its camera-space transform approach.

Abstract: RGB-based 3D tasks, e.g., 3D detection, depth estimation, 3D keypoint estimation, still suffer from scarce, expensive annotations and a thin augmentation toolbox, since most image transforms, including resize and rotation, disrupt geometric consistency. In this paper, we introduce 3DRot, a plug-and-play augmentation that rotates and mirrors images about the camera’s optical center while synchronously updating RGB images, camera intrinsics, object poses, and 3D annotations to preserve projective geometry-achieving geometry-consistent rotations and reflections without relying on any scene depth. We validate 3DRot with a classical 3D task, monocular 3D detection. On SUN RGB-D dataset, 3DRot raises $IoU_{3D}$ from 43.21 to 44.51, cuts rotation error (ROT) from 22.91$^\circ$ to 20.93$^\circ$, and boosts $mAP_{0.5}$ from 35.70 to 38.11. As a comparison, Cube R-CNN adds 3 other datasets together with SUN RGB-D for monocular 3D estimation, with a similar mechanism and test dataset, increases $IoU_{3D}$ from 36.2 to 37.8, boosts $mAP_{0.5}$ from 34.7 to 35.4. Because it operates purely through camera-space transforms, 3DRot is readily transferable to other 3D tasks.

[309] VPN: Visual Prompt Navigation

Shuo Feng, Zihan Wang, Yuchen Li, Rui Kong, Hengyi Cai, Shuaiqiang Wang, Gim Hee Lee, Piji Li, Shuqiang Jiang

Main category: cs.CV

TL;DR: The paper introduces Visual Prompt Navigation (VPN), a language-free method using visual prompts for guiding agents in navigation tasks, reducing ambiguity and improving usability for non-experts.

DetailsMotivation: Language ambiguity and verbosity hinder effective navigation guidance; visual prompts offer intuitive, spatially grounded alternatives.

Method: Proposes VPN with visual prompts on 2D top-view maps, introduces VPNet for handling VPN tasks, and employs data augmentation strategies.

Result: Constructs datasets R2R-VP and R2R-CE-VP, evaluates visual prompt forms, map formats, and augmentation strategies.

Conclusion: VPN is effective, user-friendly, and reduces ambiguity, with potential for broader applications in embodied navigation.

Abstract: While natural language is commonly used to guide embodied agents, the inherent ambiguity and verbosity of language often hinder the effectiveness of language-guided navigation in complex environments. To this end, we propose Visual Prompt Navigation (VPN), a novel paradigm that guides agents to navigate using only user-provided visual prompts within 2D top-view maps. This visual prompt primarily focuses on marking the visual navigation trajectory on a top-down view of a scene, offering intuitive and spatially grounded guidance without relying on language instructions. It is more friendly for non-expert users and reduces interpretive ambiguity. We build VPN tasks in both discrete and continuous navigation settings, constructing two new datasets, R2R-VP and R2R-CE-VP, by extending existing R2R and R2R-CE episodes with corresponding visual prompts. Furthermore, we introduce VPNet, a dedicated baseline network to handle the VPN tasks, with two data augmentation strategies: view-level augmentation (altering initial headings and prompt orientations) and trajectory-level augmentation (incorporating diverse trajectories from large-scale 3D scenes), to enhance navigation performance. Extensive experiments evaluate how visual prompt forms, top-view map formats, and data augmentation strategies affect the performance of visual prompt navigation. The code is available at https://github.com/farlit/VPN.

[310] After the Party: Navigating the Mapping From Color to Ambient Lighting

Florin-Alexandru Vasluianu, Tim Seizinger, Zongwei Wu, Radu Timofte

Main category: cs.CV

TL;DR: CL3AN introduces a dataset and learning framework for restoring images under colored light sources, addressing limitations of existing methods.

DetailsMotivation: Existing methods oversimplify illumination complexities, leading to artifacts. CL3AN aims to disentangle illumination from reflectance accurately.

Method: A novel learning framework leverages chromaticity-luminance guidance inspired by the Retinex model.

Result: The approach shows robustness under non-homogeneous lighting and material variations while maintaining computational efficiency.

Conclusion: CL3AN provides a scalable solution for illumination restoration, with benchmarks and models publicly available.

Abstract: Illumination in practical scenarios is inherently complex, involving colored light sources, occlusions, and diverse material interactions that produce intricate reflectance and shading effects. However, existing methods often oversimplify this challenge by assuming a single light source or uniform, white-balanced lighting, leaving many of these complexities unaddressed. In this paper, we introduce CL3AN, the first large-scale, high-resolution dataset of its kind designed to facilitate the restoration of images captured under multiple Colored Light sources to their Ambient-Normalized counterparts. Through benchmarking, we find that leading approaches often produce artifacts, such as illumination inconsistencies, texture leakage, and color distortion, primarily due to their limited ability to precisely disentangle illumination from reflectance. Motivated by this insight, we achieve such a desired decomposition through a novel learning framework that leverages explicit chromaticity-luminance components guidance, drawing inspiration from the principles of the Retinex model. Extensive evaluations on existing benchmarks and our dataset demonstrate the effectiveness of our approach, showcasing enhanced robustness under non-homogeneous color lighting and material-specific reflectance variations, all while maintaining a highly competitive computational cost. The benchmark, codes, and models are available at www.github.com/fvasluianu97/RLN2.

[311] CMIC: Content-Adaptive Mamba for Learned Image Compression

Yunuo Chen, Zezheng Lyu, Bing He, Hongwei Hu, Qi Wang, Yuan Tian, Li Song, Wenjun Zhang, Guo Lu

Main category: cs.CV

TL;DR: CAM enhances Mamba-style SSMs for LIC by introducing content-aware token reorganization and global priors, achieving superior rate-distortion performance.

DetailsMotivation: Vanilla Mamba's content-agnostic nature limits its ability to exploit content dependencies dynamically.

Method: CAM employs content-aware token clustering/reordering and integrates global priors via a prompt dictionary.

Result: CMIC outperforms VTM-21.0 by significant BD-rate reductions on multiple benchmarks.

Conclusion: CAM effectively captures global dependencies while maintaining computational efficiency, advancing LIC performance.

Abstract: Recent Learned image compression (LIC) leverages Mamba-style state-space models (SSMs) for global receptive fields with linear complexity. However, vanilla Mamba is content-agnostic, relying on fixed and predefined selective scans, which restricts its ability to dynamically and fully exploit content dependencies. We introduce Content-Adaptive Mamba (CAM), a dynamic SSM that addresses two critical limitations. First, it employs content-aware token reorganization, clustering and reordering tokens based on content similarity to prioritize proximity in feature space over Euclidean space. Second, it integrates global priors into SSM via a prompt dictionary, effectively mitigating the strict causality and long-range decay in the token interactions of Mamba. These innovations enable CAM to better capture global dependencies while preserving computational efficiency. Leveraging CAM, our Content-Adaptive Mamba-based LIC model (CMIC) achieves state-of-the-art rate-distortion performance, surpassing VTM-21.0 by -15.91%, -21.34%, and -17.58% BD-rate on Kodak, Tecnick, and CLIC benchmarks, respectively.

[312] Glioblastoma Overall Survival Prediction With Vision Transformers

Yin Lin, Riccardo Barbieri, Domenico Aquino, Giuseppe Lauria, Marina Grisoli, Elena De Momi, Alberto Redaelli, Simona Ferrante

Main category: cs.CV

TL;DR: A novel AI approach using Vision Transformers (ViTs) for predicting glioblastoma overall survival (OS) from MRI images, eliminating tumor segmentation, achieves 62.5% accuracy on the BRATS dataset.

DetailsMotivation: Predicting OS in glioblastoma is crucial for personalized treatment, but traditional methods rely on tumor segmentation, which complicates workflows.

Method: The study uses ViTs to extract features directly from MRI images, bypassing segmentation, and evaluates performance on the BRATS dataset.

Result: The model achieves 62.5% accuracy, with balanced precision, recall, and F1 scores, outperforming other methods in these metrics.

Conclusion: ViTs show promise for efficient OS prediction in medical imaging, though dataset size limits generalization, suggesting a need for larger datasets.

Abstract: Glioblastoma is one of the most aggressive and common brain tumors, with a median survival of 10-15 months. Predicting Overall Survival (OS) is critical for personalizing treatment strategies and aligning clinical decisions with patient outcomes. In this study, we propose a novel Artificial Intelligence (AI) approach for OS prediction using Magnetic Resonance Imaging (MRI) images, exploiting Vision Transformers (ViTs) to extract hidden features directly from MRI images, eliminating the need of tumor segmentation. Unlike traditional approaches, our method simplifies the workflow and reduces computational resource requirements. The proposed model was evaluated on the BRATS dataset, reaching an accuracy of 62.5% on the test set, comparable to the top-performing methods. Additionally, it demonstrated balanced performance across precision, recall, and F1 score, overcoming the best model in these metrics. The dataset size limits the generalization of the ViT which typically requires larger datasets compared to convolutional neural networks. This limitation in generalization is observed across all the cited studies. This work highlights the applicability of ViTs for downsampled medical imaging tasks and establishes a foundation for OS prediction models that are computationally efficient and do not rely on segmentation.

[313] Low-Frequency First: Eliminating Floating Artifacts in 3D Gaussian Splatting

Jianchao Wang, Peng Zhou, Cen Li, Rong Quan, Jie Qin

Main category: cs.CV

TL;DR: EFA-GS addresses floating artifacts in 3D Gaussian Splatting by expanding under-optimized Gaussians and using depth/scale strategies, improving PSNR by 1.68 dB.

DetailsMotivation: Floating artifacts in 3DGS degrade visual fidelity, especially with low-quality initialization. Their origins are not fully understood.

Method: Proposes EFA-GS, which expands under-optimized Gaussians for low-frequency learning and refines expansion with depth/scale strategies.

Result: EFA-GS reduces artifacts, preserves details, and improves PSNR by 1.68 dB on RWLQ dataset. Effective in 3D editing tasks.

Conclusion: EFA-GS effectively mitigates floating artifacts while maintaining detail fidelity, validated by experiments and downstream tasks.

Abstract: 3D Gaussian Splatting (3DGS) is a powerful and computationally efficient representation for 3D reconstruction. Despite its strengths, 3DGS often produces floating artifacts, which are erroneous structures detached from the actual geometry and significantly degrade visual fidelity. The underlying mechanisms causing these artifacts, particularly in low-quality initialization scenarios, have not been fully explored. In this paper, we investigate the origins of floating artifacts from a frequency-domain perspective and identify under-optimized Gaussians as the primary source. Based on our analysis, we propose \textit{Eliminating-Floating-Artifacts} Gaussian Splatting (EFA-GS), which selectively expands under-optimized Gaussians to prioritize accurate low-frequency learning. Additionally, we introduce complementary depth-based and scale-based strategies to dynamically refine Gaussian expansion, effectively mitigating detail erosion. Extensive experiments on both synthetic and real-world datasets demonstrate that EFA-GS substantially reduces floating artifacts while preserving high-frequency details, achieving an improvement of 1.68 dB in PSNR over baseline method on our RWLQ dataset. Furthermore, we validate the effectiveness of our approach in downstream 3D editing tasks. We provide our implementation in https://jcwang-gh.github.io/EFA-GS.

cs.AI

[314] Efficient Agents: Building Effective Agents While Reducing Cost

Ningning Wang, Xavier Hu, Pai Liu, He Zhu, Yue Hou, Heyuan Huang, Shengyu Zhang, Jian Yang, Jiaheng Liu, Ge Zhang, Changwang Zhang, Jun Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou

Main category: cs.AI

TL;DR: The paper studies the efficiency-effectiveness trade-off in LLM-driven agent systems, proposing Efficient Agents to reduce costs while maintaining performance.

DetailsMotivation: Addressing the escalating costs of LLM-driven agents to ensure scalability and accessibility without sacrificing performance.

Method: Empirical analysis on the GAIA benchmark, evaluating LLM backbone selection, agent framework designs, and test-time scaling strategies using the cost-of-pass metric.

Result: Efficient Agents achieves 96.7% performance of OWL while reducing costs by 28.4%.

Conclusion: The work provides insights for designing cost-effective, high-performing agent systems, enhancing AI accessibility and sustainability.

Abstract: The remarkable capabilities of Large Language Model (LLM)-driven agents have enabled sophisticated systems to tackle complex, multi-step tasks, but their escalating costs threaten scalability and accessibility. This work presents the first systematic study of the efficiency-effectiveness trade-off in modern agent systems, addressing the critical need for cost-effective designs without sacrificing performance. We investigate three key questions: (1) How much complexity do agentic tasks inherently require? (2) When do additional modules yield diminishing returns? (3) How much efficiency can be gained through the design of efficient agent frameworks? Through an empirical analysis on the GAIA benchmark, we evaluate the impact of LLM backbone selection, agent framework designs, and test-time scaling strategies. Using the cost-of-pass metric, we quantify the efficiency-performance trade-off across these dimensions. Our findings inform the development of Efficient Agents , a novel agent framework that has an optimal complexity to task requirements. Efficient Agents retains 96.7% of the performance of OWL, one leading open-source agent framework, while reducing operational costs from $0.398 to $0.228, resulting in a 28.4% improvement in cost-of-pass. Our work provides actionable insights for designing efficient, high-performing agent systems, advancing the accessibility and sustainability of AI-driven solutions.

[315] Planning with Dynamically Changing Domains

Mikhail Soutchanski, Yongmei Liu

Main category: cs.AI

TL;DR: The paper addresses planning problems where objects dynamically change, proposing a first-order logic approach without the Domain Closure Assumption (DCA).

DetailsMotivation: Practical planning problems involve dynamic object sets, which classical and conformant planning assumptions (DCA) cannot handle.

Method: Formulates planning in first-order logic, grounds actions at planning time, and bounds plan length.

Result: Proves soundness and completeness, solving bounded planning problems without DCA.

Conclusion: The approach is viable for dynamic object scenarios, with a proof-of-concept implementation.

Abstract: In classical planning and conformant planning, it is assumed that there are finitely many named objects given in advance, and only they can participate in actions and in fluents. This is the Domain Closure Assumption (DCA). However, there are practical planning problems where the set of objects changes dynamically as actions are performed; e.g., new objects can be created, old objects can be destroyed. We formulate the planning problem in first-order logic, assume an initial theory is a finite consistent set of fluent literals, discuss when this guarantees that in every situation there are only finitely many possible actions, impose a finite integer bound on the length of the plan, and propose to organize search over sequences of actions that are grounded at planning time. We show the soundness and completeness of our approach. It can be used to solve the bounded planning problems without DCA that belong to the intersection of sequential generalized planning (without sensing actions) and conformant planning, restricted to the case without the disjunction over fluent literals. We discuss a proof-of-the-concept implementation of our planner.

[316] Recovering Individual-Level Activity Sequences from Location-Based Service Data Using a Novel Transformer-Based Model

Weiyu Luo, Chenfeng Xiong

Main category: cs.AI

TL;DR: VSNIT improves incomplete LBS activity sequences by combining Insertion Transformer and Variable Selection Network, outperforming baselines in accuracy and diversity.

DetailsMotivation: LBS data sparsity complicates trip and activity inferences; the study aims to recover incomplete sequences using high-quality data.

Method: Proposes VSNIT, merging Insertion Transformer for sequence construction and Variable Selection Network for dynamic covariate handling.

Result: VSNIT inserts diverse, realistic patterns, aligns transitions better, and outperforms baselines in all metrics.

Conclusion: VSNIT enhances LBS data utility, offering a robust framework for mobility analysis and future research.

Abstract: Location-Based Service (LBS) data provides critical insights into human mobility, yet its sparsity often yields incomplete trip and activity sequences, making accurate inferences about trips and activities difficult. We raise a research problem: Can we use activity sequences derived from high-quality LBS data to recover incomplete activity sequences at the individual level? This study proposes a new solution, the Variable Selection Network-fused Insertion Transformer (VSNIT), integrating the Insertion Transformer’s flexible sequence construction with the Variable Selection Network’s dynamic covariate handling capability, to recover missing segments in incomplete activity sequences while preserving existing data. The findings show that VSNIT inserts more diverse, realistic activity patterns, more closely matching real-world variability, and restores disrupted activity transitions more effectively aligning with the target. It also performs significantly better than the baseline model across all metrics. These results highlight VSNIT’s superior accuracy and diversity in activity sequence recovery tasks, demonstrating its potential to enhance LBS data utility for mobility analysis. This approach offers a promising framework for future location-based research and applications.

[317] Large Language Model-based Data Science Agent: A Survey

Peiran Wang, Yaoning Yu, Ke Chen, Xianyang Zhan, Haohan Wang

Main category: cs.AI

TL;DR: A survey on LLM-based agents for data science tasks, analyzing design principles and workflows from agent and data science perspectives.

DetailsMotivation: To explore and summarize the application of LLM-based agents in data science, bridging agent design principles with practical workflows.

Method: Comprehensive review of recent studies, focusing on agent roles, execution, knowledge, reflection, and data science processes like preprocessing, model development, and evaluation.

Result: Identifies key design principles for LLM-based agents and maps them to data science workflows, providing a dual-perspective framework.

Conclusion: Offers a structured overview and framework for applying LLM-based agents in data science, highlighting their potential and practical integration.

Abstract: The rapid advancement of Large Language Models (LLMs) has driven novel applications across diverse domains, with LLM-based agents emerging as a crucial area of exploration. This survey presents a comprehensive analysis of LLM-based agents designed for data science tasks, summarizing insights from recent studies. From the agent perspective, we discuss the key design principles, covering agent roles, execution, knowledge, and reflection methods. From the data science perspective, we identify key processes for LLM-based agents, including data preprocessing, model development, evaluation, visualization, etc. Our work offers two key contributions: (1) a comprehensive review of recent developments in applying LLMbased agents to data science tasks; (2) a dual-perspective framework that connects general agent design principles with the practical workflows in data science.

[318] Cognitive Loop via In-Situ Optimization: Self-Adaptive Reasoning for Science

Newman Cheng, Gordon Broadbent, William Chappell

Main category: cs.AI

TL;DR: CLIO introduces a cognitive loop for AI reasoning, enabling deep control and transparency in scientific discovery, outperforming GPT-4.1 by 13.82% in accuracy.

DetailsMotivation: Existing AI lacks steerability and transparency for scientific discovery, necessitating a method like CLIO for precise reasoning control.

Method: CLIO uses in-situ optimization to let LLMs self-formulate problem-solving approaches, adapt behavior, and provide transparent reasoning via graph structures.

Result: CLIO with GPT-4.1 achieves 22.37% accuracy on HLE, a 13.82% net increase over base GPT-4.1, and reveals uncertainty oscillations as key to accuracy.

Conclusion: CLIO’s open design and internal mechanisms offer scientists insight and control, enhancing AI’s role in scientific decision-making.

Abstract: The capacity for artificial intelligence (AI) to formulate, evolve, and test altered thought patterns under dynamic conditions indicates advanced cognition that is crucial for scientific discovery. The existing AI development landscape falls into two categories: 1) frameworks over non-reasoning models that natively incorporate opinions on how humans think, and 2) reasoning models that abstract precise control of the reasoning intuition away from end users. While powerful, for scientists to maximize utility of AI in scientific discovery, they not only require accuracy and transparency in reasoning, but also steerability. Hence, we introduce an alternative approach that enables deep and precise control over the reasoning process called: a cognitive loop via in-situ optimization (CLIO). CLIO enables large language models (LLMs) to self-formulate ways of approaching a problem, adapt behavior when self-confidence is low, and ultimately provide scientists with a final belief or answer. Through CLIO’s open design, scientists can observe uncertainty levels, understand how final belief states are formulated using graph structures, and interject corrections. Without any further post-training, OpenAI’s GPT-4.1 with CLIO yields an accuracy of 22.37% in text-based biology and medicine questions on Humanity’s Last Exam (HLE). This yields a 13.82% net or 161.64% relative increase when compared to the base GPT-4.1 model and surpasses OpenAI’s o3 performance in high and low reasoning effort modes. We further discovered that oscillations within internal uncertainty measures are key in determining the accuracy of CLIO’s results, revealing how its open design and internal mechanisms can provide insight and control into scientific decision-making processes.

[319] A Multi-Agent System for Complex Reasoning in Radiology Visual Question Answering

Ziruo Yi, Jinyu Liu, Ting Xiao, Mark V. Albert

Main category: cs.AI

TL;DR: A multi-agent system (MAS) improves radiology visual question answering (RVQA) by addressing challenges like factual accuracy and cross-modal misalignment, outperforming existing multimodal large language models (MLLMs).

DetailsMotivation: To enhance RVQA by tackling issues like hallucinations and factual inaccuracies in current MLLM and RAG-based methods, ensuring reliable clinical AI applications.

Method: Introduces a MAS with specialized agents for context understanding, multimodal reasoning, and answer validation, evaluated on a challenging RVQA dataset.

Result: The MAS outperforms strong MLLM baselines, demonstrating superior accuracy, reliability, and interpretability in complex reasoning tasks.

Conclusion: Multi-agent systems show promise for trustworthy and explainable clinical AI, particularly in tasks requiring intricate reasoning like RVQA.

Abstract: Radiology visual question answering (RVQA) provides precise answers to questions about chest X-ray images, alleviating radiologists’ workload. While recent methods based on multimodal large language models (MLLMs) and retrieval-augmented generation (RAG) have shown promising progress in RVQA, they still face challenges in factual accuracy, hallucinations, and cross-modal misalignment. We introduce a multi-agent system (MAS) designed to support complex reasoning in RVQA, with specialized agents for context understanding, multimodal reasoning, and answer validation. We evaluate our system on a challenging RVQA set curated via model disagreement filtering, comprising consistently hard cases across multiple MLLMs. Extensive experiments demonstrate the superiority and effectiveness of our system over strong MLLM baselines, with a case study illustrating its reliability and interpretability. This work highlights the potential of multi-agent approaches to support explainable and trustworthy clinical AI applications that require complex reasoning.

[320] Seemingly Simple Planning Problems are Computationally Challenging: The Countdown Game

Michael Katz, Harsha Kokel, Sarath Sreedharan

Main category: cs.AI

TL;DR: The paper proposes a new planning benchmark using the game Countdown to address inadequacies in current benchmarks for evaluating foundational models’ long-term planning capabilities.

DetailsMotivation: Existing benchmarks for planning are either too vague or tailored for specific automated planners, failing to properly measure foundational models' planning abilities.

Method: Introduces a procedure for creating a benchmark based on the Countdown game, which involves forming a target number from given numbers via arithmetic operations. The benchmark is designed to be intuitive, computationally challenging, and resistant to memorization.

Result: Theoretical analysis confirms the benchmark’s computational complexity (NP-complete) and its superiority over existing benchmarks. Evaluations show it remains highly challenging for current LLM-based planning methods.

Conclusion: The Countdown-based benchmark effectively addresses limitations of existing planning benchmarks and provides a robust tool for evaluating foundational models’ planning capabilities.

Abstract: There is a broad consensus that the inability to form long-term plans is one of the key limitations of current foundational models and agents. However, the existing planning benchmarks remain woefully inadequate to truly measure their planning capabilities. Most existing benchmarks either focus on loosely defined tasks like travel planning or end up leveraging existing domains and problems from international planning competitions. While the former tasks are hard to formalize and verify, the latter were specifically designed to test and challenge the weaknesses of existing automated planners. To address these shortcomings, we propose a procedure for creating a planning benchmark centered around the game called Countdown, where a player is expected to form a target number from a list of input numbers through arithmetic operations. We discuss how this problem meets many of the desiderata associated with an ideal benchmark for planning capabilities evaluation. Specifically, the domain allows for an intuitive, natural language description for each problem instance, it is computationally challenging (NP-complete), and the instance space is rich enough that we do not have to worry about memorization. We perform an extensive theoretical analysis, establishing the computational complexity result and demonstrate the advantage of our instance generation procedure over public benchmarks. We evaluate a variety of existing LLM-assisted planning methods on instances generated using our procedure. Our results show that, unlike other domains like 24 Game (a special case of Countdown), our proposed dynamic benchmark remains extremely challenging for existing LLM-based approaches.

[321] BrainECHO: Semantic Brain Signal Decoding through Vector-Quantized Spectrogram Reconstruction for Whisper-Enhanced Text Generation

Jilong Li, Zhenxi Song, Jiaqi Wang, Meishan Zhang, Honghai Liu, Min Zhang, Zhiguo Zhang

Main category: cs.AI

TL;DR: BrainECHO is a multi-stage framework for EEG/MEG-to-text decoding, addressing teacher-forcing robustness, session noise, and linguistic misalignment. It achieves state-of-the-art performance via discrete autoencoding, frozen alignment, and constrained decoding.

DetailsMotivation: Current EEG/MEG-to-text systems face robustness, generalization, and misalignment issues. BrainECHO aims to overcome these by decoupling representation learning.

Method: BrainECHO uses three stages: (1) discrete autoencoding, (2) frozen alignment for noise filtering, and (3) constrained decoding with Whisper for balanced adaptation.

Result: BrainECHO improves BLEU-4 by 3.65%, achieves 74%-89% BLEU scores, and shows robustness across noise, sessions, and subjects.

Conclusion: BrainECHO enhances EEG/MEG-to-text decoding, proving effective for brain-computer interfaces with improved robustness and performance.

Abstract: Current EEG/MEG-to-text decoding systems suffer from three key limitations: (1) reliance on teacher-forcing methods, which compromises robustness during inference, (2) sensitivity to session-specific noise, hindering generalization across subjects, and (3) misalignment between brain signals and linguistic representations due to pre-trained language model over-dominance. To overcome these challenges, we propose BrainECHO (Brain signal decoding via vEctor-quantized speCtrogram reconstruction for WHisper-enhanced text generatiOn), a multi-stage framework that employs decoupled representation learning to achieve state-of-the-art performance on both EEG and MEG datasets. Specifically, BrainECHO consists of three stages: (1) Discrete autoencoding, which transforms continuous Mel spectrograms into a finite set of high-quality discrete representations for subsequent stages. (2) Frozen alignment, where brain signal embeddings are mapped to corresponding Mel spectrogram embeddings in a frozen latent space, effectively filtering session-specific noise through vector-quantized reconstruction, yielding a 3.65% improvement in BLEU-4 score. (3) Constrained decoding fine-tuning, which leverages the pre-trained Whisper model for audio-to-text translation, balancing signal adaptation with knowledge preservation, and achieving 74%-89% decoding BLEU scores without excessive reliance on teacher forcing. BrainECHO demonstrates robustness across sentence, session, and subject-independent conditions, passing Gaussian noise tests and showcasing its potential for enhancing language-based brain-computer interfaces.

[322] Enhancing Japanese Large Language Models with Reasoning Vectors

Carolina Minami Oguchi, Leo Wei, Koyo Kobayashi, Hsin-Tai Wu, Dipak Ghosal

Main category: cs.AI

TL;DR: A method using reasoning vectors from reasoning LLMs is applied to Japanese LLMs to enhance their performance, addressing resource challenges.

DetailsMotivation: Improving Japanese LLMs is resource-intensive; the paper aims to find a simpler, effective solution inspired by task vectors.

Method: Extract reasoning vectors from reasoning LLMs and apply them to Japanese LLMs.

Result: Demonstrates a way to significantly improve Japanese LLMs despite limited resources.

Conclusion: The approach is effective for Japanese LLMs and may inspire similar methods for other languages.

Abstract: Post-training methods have improved the performance and enhanced the reasoning capability for mainstream large language models (LLMs), but the same is challenging for Japanese LLMs to achieve due to the amount of resources required. Inspired by task vectors that extract the change of weights before and after training, specifically for a certain task, we obtain reasoning vectors from reasoning LLMs and apply them to Japanese LLMs to boost their performance. While the resources available present a challenge to improve Japanese LLMs, we present a simple and effective way to obtain high improvement and hope to inspire for other languages.

[323] PentestJudge: Judging Agent Behavior Against Operational Requirements

Shane Caldwell, Max Harley, Michael Kouremetis, Vincent Abruzzo, Will Pearce

Main category: cs.AI

TL;DR: PentestJudge is an LLM-based system for evaluating penetration testing agents using hierarchical rubrics, achieving an F1 score of 0.83 compared to human experts.

DetailsMotivation: To evaluate penetration testing agents' actions programmatically is impractical; PentestJudge provides a scalable solution.

Method: Uses a tree-structured rubric to break tasks into simpler sub-tasks, evaluated by an LLM-as-judge with tool access.

Result: Best model achieves F1 score of 0.83; models better at tool-use align closer to human experts.

Conclusion: PentestJudge enables scalable evaluation of AI security agents, with weaker models capable of verifying stronger ones.

Abstract: We introduce PentestJudge, a system for evaluating the operations of penetration testing agents. PentestJudge is a large language model (LLM)-as-judge with access to tools that allow it to consume arbitrary trajectories of agent states and tool call history to determine whether a security agent’s actions meet certain operating criteria that would be impractical to evaluate programmatically. We develop rubrics that use a tree structure to hierarchically collapse the penetration testing task for a particular environment into smaller, simpler, and more manageable sub-tasks and criteria until each leaf node represents simple yes-or-no criteria for PentestJudge to evaluate. Task nodes are broken down into different categories related to operational objectives, operational security, and tradecraft. LLM-as-judge scores are compared to human domain experts as a ground-truth reference, allowing us to compare their relative performance with standard binary classification metrics, such as F1 scores. We evaluate several frontier and open-source models acting as judge agents, with the best model reaching an F1 score of 0.83. We find models that are better at tool-use perform more closely to human experts. By stratifying the F1 scores by requirement type, we find even models with similar overall scores struggle with different types of questions, suggesting certain models may be better judges of particular operating criteria. We find that weaker and cheaper models can judge the trajectories of pentests performed by stronger and more expensive models, suggesting verification may be easier than generation for the penetration testing task. We share this methodology to facilitate future research in understanding the ability of judges to holistically and scalably evaluate the process quality of AI-based information security agents so that they may be confidently used in sensitive production environments.

[324] AQUAH: Automatic Quantification and Unified Agent in Hydrology

Songkun Yan, Zhi Li, Siyu Zhu, Yixin Wen, Mofan Zhang, Mengye Chen, Jie Cao, Yang Hong

Main category: cs.AI

TL;DR: AQUAH is an end-to-end language-based agent for hydrologic modeling, automating tasks from data retrieval to report generation using vision-enabled LLMs.

DetailsMotivation: To streamline complex environmental modeling and bridge the gap between Earth observation data, physics-based tools, and decision makers.

Method: AQUAH uses vision-enabled large language models to interpret maps and rasters, autonomously retrieving data, configuring models, running simulations, and generating reports.

Result: Initial experiments show AQUAH completes cold-start simulations and produces clear, transparent, and plausible results without manual intervention.

Conclusion: AQUAH demonstrates the potential of LLM-centered, vision-grounded agents to simplify hydrologic modeling, though further calibration is needed for operational use.

Abstract: We introduce AQUAH, the first end-to-end language-based agent designed specifically for hydrologic modeling. Starting from a simple natural-language prompt (e.g., ‘simulate floods for the Little Bighorn basin from 2020 to 2022’), AQUAH autonomously retrieves the required terrain, forcing, and gauge data; configures a hydrologic model; runs the simulation; and generates a self-contained PDF report. The workflow is driven by vision-enabled large language models, which interpret maps and rasters on the fly and steer key decisions such as outlet selection, parameter initialization, and uncertainty commentary. Initial experiments across a range of U.S. basins show that AQUAH can complete cold-start simulations and produce analyst-ready documentation without manual intervention. The results are judged by hydrologists as clear, transparent, and physically plausible. While further calibration and validation are still needed for operational deployment, these early outcomes highlight the promise of LLM-centered, vision-grounded agents to streamline complex environmental modeling and lower the barrier between Earth observation data, physics-based tools, and decision makers.

Mahtab Bigverdi, Wisdom Ikezogwo, Kevin Zhang, Hyewon Jeong, Mingyu Lu, Sungjae Cho, Linda Shapiro, Ranjay Krishna

Main category: cs.AI

TL;DR: Medblink benchmark evaluates multimodal language models (MLMs) on perceptual tasks in medical imaging, revealing significant gaps compared to human performance.

DetailsMotivation: To assess MLMs' perceptual abilities for clinical adoption, as errors in simple tasks hinder trust and usability.

Method: Medblink benchmark with 1,429 multiple-choice questions across 8 tasks and 1,605 images, testing 19 MLMs.

Result: Best MLM achieves 65% accuracy vs. human 96.4%, highlighting perceptual weaknesses.

Conclusion: Current MLMs lack visual grounding for clinical use, requiring improvements for adoption.

Abstract: Multimodal language models (MLMs) show promise for clinical decision support and diagnostic reasoning, raising the prospect of end-to-end automated medical image interpretation. However, clinicians are highly selective in adopting AI tools; a model that makes errors on seemingly simple perception tasks such as determining image orientation or identifying whether a CT scan is contrast-enhance are unlikely to be adopted for clinical tasks. We introduce Medblink, a benchmark designed to probe these models for such perceptual abilities. Medblink spans eight clinically meaningful tasks across multiple imaging modalities and anatomical regions, totaling 1,429 multiple-choice questions over 1,605 images. We evaluate 19 state-of-the-art MLMs, including general purpose (GPT4o, Claude 3.5 Sonnet) and domain specific (Med Flamingo, LLaVA Med, RadFM) models. While human annotators achieve 96.4% accuracy, the best-performing model reaches only 65%. These results show that current MLMs frequently fail at routine perceptual checks, suggesting the need to strengthen their visual grounding to support clinical adoption. Data is available on our project page.

[326] Polymath: A Self-Optimizing Agent with Dynamic Hierarchical Workflow

Chia-Tung Ho, Jing Gong, Xufeng Yao, Yunsheng Bai, Abhishek B Akkur, Haoxing Ren

Main category: cs.AI

TL;DR: Polymath is a self-optimizing agent with dynamic hierarchical workflows, improving performance by 8.1% over baselines without relying on labeled data.

DetailsMotivation: Manual embedding of foundation models into agentic systems limits scalability and efficiency, and existing automated methods rely on labeled datasets, which are impractical for dynamic problems.

Method: Polymath uses task flow graphs and code-represented workflows, optimized via multi-grid-inspired graph optimization and a self-reflection-guided evolutionary algorithm.

Result: Achieves an 8.1% average improvement over state-of-the-art baselines across six benchmark datasets.

Conclusion: Polymath offers a scalable, efficient solution for dynamic problems by eliminating dependency on labeled data.

Abstract: Large language models (LLMs) excel at solving complex tasks by executing agentic workflows composed of detailed instructions and structured operations. Yet, building general-purpose agents by manually embedding foundation models into agentic systems such as Chain-of-Thought, Self-Reflection, and ReACT through text interfaces limits scalability and efficiency. Recently, many researchers have sought to automate the generation and optimization of these workflows through code-based representations. However, existing methods often rely on labeled datasets to train and optimize workflows, making them ineffective and inflexible for solving real-world, dynamic problems where labeled data is unavailable. To address this challenge, we introduce Polymath, a self-optimizing agent with dynamic hierarchical workflow that leverages the flexibility of task flow graphs and the expressiveness of code-represented workflows to solve a wide range of real-world, dynamic problems. The proposed optimization methodology integrates multi-grid-inspired graph optimization with a self-reflection-guided evolutionary algorithm to refine workflows without labeled data. Experimental results on six benchmark datasets across coding, math, and multi-turn QA tasks show that Polymath achieves 8.1% average improvement over state-of-the-art baselines.

[327] Defend LLMs Through Self-Consciousness

Boshi Huang, Fabio Nonato de Paula

Main category: cs.AI

TL;DR: A novel self-consciousness defense mechanism for LLMs combats prompt injection attacks by leveraging the model’s reasoning, achieving high defense success rates with minimal computational overhead.

DetailsMotivation: To address prompt injection attacks in LLMs without relying on external classifiers, using the model's inherent capabilities for self-protection.

Method: Proposes a framework with Meta-Cognitive and Arbitration Modules for autonomous output evaluation and regulation, tested on seven LLMs using AdvBench and Prompt-Injection-Mixed-Techniques-2024 datasets.

Result: Significant defense success rate improvements, some achieving perfect/near-perfect defense in Enhanced Mode, with analyzed trade-offs between success and computational cost.

Conclusion: The self-consciousness method provides a lightweight, cost-effective solution for improving LLM ethics, suitable for diverse GenAI applications.

Abstract: This paper introduces a novel self-consciousness defense mechanism for Large Language Models (LLMs) to combat prompt injection attacks. Unlike traditional approaches that rely on external classifiers, our method leverages the LLM’s inherent reasoning capabilities to perform self-protection. We propose a framework that incorporates Meta-Cognitive and Arbitration Modules, enabling LLMs to evaluate and regulate their own outputs autonomously. Our approach is evaluated on seven state-of-the-art LLMs using two datasets: AdvBench and Prompt-Injection-Mixed-Techniques-2024. Experiment results demonstrate significant improvements in defense success rates across models and datasets, with some achieving perfect and near-perfect defense in Enhanced Mode. We also analyze the trade-off between defense success rate improvement and computational overhead. This self-consciousness method offers a lightweight, cost-effective solution for enhancing LLM ethics, particularly beneficial for GenAI use cases across various platforms.

[328] Unified Tool Integration for LLMs: A Protocol-Agnostic Approach to Function Calling

Peng Ding, Rick Stevens

Main category: cs.AI

TL;DR: A unified approach for tool integration in LLMs reduces development overhead and improves performance through automated schema generation and optimized concurrency.

DetailsMotivation: Addressing the fragmented ecosystem of tool-augmented LLMs with multiple protocols and complex workflows.

Method: Proposes protocol-agnostic design principles, automated schema generation, dual-mode concurrent execution, and multi-source tool management.

Result: 60-80% code reduction, up to 3.1x performance improvement, and compatibility with existing standards.

Conclusion: Provides theoretical and practical solutions for efficient LLM tool integration.

Abstract: The proliferation of tool-augmented Large Language Models (LLMs) has created a fragmented ecosystem where developers must navigate multiple protocols, manual schema definitions, and complex execution workflows. We address this challenge by proposing a unified approach to tool integration that abstracts protocol differences while optimizing execution performance. Our solution demonstrates how protocol-agnostic design principles can significantly reduce development overhead through automated schema generation, dual-mode concurrent execution, and seamless multi-source tool management. Experimental results show 60-80% code reduction across integration scenarios, performance improvements up to 3.1x through optimized concurrency, and full compatibility with existing function calling standards. This work contributes both theoretical insights into tool integration architecture and practical solutions for real-world LLM application development.

[329] When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs

Fangyi Yu

Main category: cs.AI

TL;DR: The paper reviews the ‘agent-as-a-judge’ paradigm, where AI agents evaluate LLM outputs, discussing its evolution, strengths, and challenges in reliability, cost, and alignment.

DetailsMotivation: The need for scalable and nuanced evaluation methods for LLM outputs, especially in complex tasks, drives the exploration of AI agents as evaluators.

Method: The review defines the agent-as-a-judge concept, traces its development, compares frameworks, and surveys real-world applications.

Result: Agent-based judging offers scalable, nuanced evaluation but faces challenges like bias and robustness. It complements human oversight but doesn’t replace it.

Conclusion: Agent-based evaluation is a promising step toward trustworthy, scalable LLM assessment, though further research is needed to address its limitations.

Abstract: As large language models (LLMs) grow in capability and autonomy, evaluating their outputs-especially in open-ended and complex tasks-has become a critical bottleneck. A new paradigm is emerging: using AI agents as the evaluators themselves. This “agent-as-a-judge” approach leverages the reasoning and perspective-taking abilities of LLMs to assess the quality and safety of other models, promising calable and nuanced alternatives to human evaluation. In this review, we define the agent-as-a-judge concept, trace its evolution from single-model judges to dynamic multi-agent debate frameworks, and critically examine their strengths and shortcomings. We compare these approaches across reliability, cost, and human alignment, and survey real-world deployments in domains such as medicine, law, finance, and education. Finally, we highlight pressing challenges-including bias, robustness, and meta evaluation-and outline future research directions. By bringing together these strands, our review demonstrates how agent-based judging can complement (but not replace) human oversight, marking a step toward trustworthy, scalable evaluation for next-generation LLMs.

[330] AGENTiGraph: A Multi-Agent Knowledge Graph Framework for Interactive, Domain-Specific LLM Chatbots

Xinjie Zhao, Moritz Blum, Fan Gao, Yingjian Chen, Boming Yang, Luis Marquez-Carpintero, Mónica Pina-Navarro, Yanran Fu, So Morikawa, Yusuke Iwasawa, Yutaka Matsuo, Chanjun Park, Irene Li

Main category: cs.AI

TL;DR: AGENTiGraph is an agent-driven system for intuitive knowledge graph management via natural language, outperforming baselines with high accuracy and success rates.

DetailsMotivation: To enable non-technical users to interact with and manage domain-specific data through natural language, eliminating the need for specialized query languages.

Method: Uses intent classification, task planning, and automatic knowledge integration to facilitate seamless reasoning and dynamic updates.

Result: Achieves 95.12% classification accuracy and 90.45% execution success on a 3,500-query benchmark, showing scalability for compliance-critical domains.

Conclusion: AGENTiGraph offers a powerful, open-source paradigm for multi-turn enterprise knowledge management, bridging LLMs and structured graphs.

Abstract: AGENTiGraph is a user-friendly, agent-driven system that enables intuitive interaction and management of domain-specific data through the manipulation of knowledge graphs in natural language. It gives non-technical users a complete, visual solution to incrementally build and refine their knowledge bases, allowing multi-round dialogues and dynamic updates without specialized query languages. The flexible design of AGENTiGraph, including intent classification, task planning, and automatic knowledge integration, ensures seamless reasoning between diverse tasks. Evaluated on a 3,500-query benchmark within an educational scenario, the system outperforms strong zero-shot baselines (achieving 95.12% classification accuracy, 90.45% execution success), indicating potential scalability to compliance-critical or multi-step queries in legal and medical domains, e.g., incorporating new statutes or research on the fly. Our open-source demo offers a powerful new paradigm for multi-turn enterprise knowledge management that bridges LLMs and structured graphs.

[331] Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning

Yutong Wang, Pengliang Ji, Kaixin Li, Baolong Bi, Tao Feng, Guillaume Sartoretti

Main category: cs.AI

TL;DR: BPO is a three-stage framework (bootstrapping, extrapolation, refinement) for improving large language reasoning models in interactive, sparse-reward environments, achieving state-of-the-art results with token efficiency.

DetailsMotivation: Addressing challenges of credit assignment and computational overhead in multi-round agentic planning for large language models.

Method: Three-stage framework: bootstrapping with planning quaternions, extrapolation via curriculum learning, and refinement through reward-gated rejection sampling.

Result: Achieves state-of-the-art performance on ALFWorld, ScienceWorld, and WebShop with significant token efficiency.

Conclusion: BPO provides an effective recipe for enhancing reasoning models in agentic planning for long-horizon, sparse-reward tasks.

Abstract: Large Language Reasoning Models have demonstrated remarkable success on static tasks, yet their application to multi-round agentic planning in interactive environments faces two fundamental challenges. First, the intractable credit assignment problem renders conventional reinforcement learning ineffective in sparse-reward settings. Second, the computational overhead of verbose, step-by-step reasoning histories is prohibitive. To address these challenges, we propose BPO, a three-stage framework (bootstrapping, extrapolation, and refinement) that establishes a self-improving data flywheel to develop robust reasoning models for long-horizon, sparse-reward environments. Our framework first bootstraps efficient reasoning using the proposed planning quaternions with long-short chain-of-thought fusion. It then extrapolates to out-of-distribution tasks through complexity-stratified curriculum learning. Finally, the model iteratively refines itself by learning exclusively on experiences selected via reward-gated rejection sampling. Experiments on ALFWorld, ScienceWorld, and WebShop demonstrate that our approach achieves state-of-the-art with significant token efficiency, providing a new recipe for reasoning models in agentic planning.

[332] Collab-Solver: Collaborative Solving Policy Learning for Mixed-Integer Linear Programming

Siyuan Li, Yifan Yu, Yanchen Deng, Zhihao Zhang, Mengjing Chen, Fangzhou Zhu, Tao Zhong, Jianye Hao, Peng Liu, Bo An

Main category: cs.AI

TL;DR: A multi-agent policy learning framework (Collab-Solver) is proposed to collaboratively optimize MILP solver modules, improving solving speed and quality by treating their interdependence as a Stackelberg game.

DetailsMotivation: Existing learning-based MILP methods treat policy learning in solver modules independently, ignoring their interdependence, which harms performance.

Method: Formulates collaboration between cut selection and branching as a Stackelberg game, using a two-phase learning paradigm for stable policy training.

Result: Jointly learned policies improve solving performance on synthetic and real-world datasets and show strong generalization across instance sets.

Conclusion: Collab-Solver effectively addresses module interdependence, enhancing MILP solving efficiency and generalization.

Abstract: Mixed-integer linear programming (MILP) has been a fundamental problem in combinatorial optimization. Previous works have designed a plethora of hard-coded heuristics to accomplish challenging MILP solving with domain knowledge. Driven by the high capability of neural networks, recent research is devoted to replacing manually designed heuristics with learned policies. Although learning-based MILP methods have shown great promise, existing worksindependentlytreatthepolicylearningineachmoduleofMILPsolvers without considering their interdependence, severely hurting the solving speed and quality. To address this issue, we propose a novel multi-agent-based policy learning framework for MILP (Collab-Solver), which can collaboratively optimize the policies for multiple modules. Specifically, we formulate the collaboration of cut selection and branching in MILP solving as a Stackelberg game. Under this formulation, we develop a two-phase learning paradigm to stabilize the collaborative policy learning, where the first phase achieves the data-communicated policy pretraining and the second phase further orchestrates the policy learning for various modules. The jointly learned policy significantly improves the solving performance on both synthetic and large-scale real-world MILP datasets. Moreover, the policies learned by Collab-Solver have also demonstrated excellent generalization abilities across different instance sets.

[333] From Text to Trajectories: GPT-2 as an ODE Solver via In-Context

Ziyang Ma, Baojian Zhou, Deqing Yang, Yanghua Xiao

Main category: cs.AI

TL;DR: The paper explores In-Context Learning (ICL) in LLMs for solving ODEs, showing GPT-2 can learn a meta-ODE algorithm with performance comparable to the Euler method and robust generalization.

DetailsMotivation: To understand the nonlinear behavior of ICL in NLP tasks and its potential for solving numerical problems like ODEs.

Method: Formulate ODE problems as sequential prompts, evaluate GPT-2 on these tasks, and analyze convergence and generalization.

Result: GPT-2 achieves exponential accuracy gains with more demonstrations, outperforms the Euler method, and generalizes to OOD problems.

Conclusion: ICL in LLMs shows promise for solving nonlinear numerical problems, offering new insights into its mechanisms.

Abstract: In-Context Learning (ICL) has emerged as a new paradigm in large language models (LLMs), enabling them to perform novel tasks by conditioning on a few examples embedded in the prompt. Yet, the highly nonlinear behavior of ICL for NLP tasks remains poorly understood. To shed light on its underlying mechanisms, this paper investigates whether LLMs can solve ordinary differential equations (ODEs) under the ICL setting. We formulate standard ODE problems and their solutions as sequential prompts and evaluate GPT-2 models on these tasks. Experiments on two types of ODEs show that GPT-2 can effectively learn a meta-ODE algorithm, with convergence behavior comparable to, or better than, the Euler method, and achieve exponential accuracy gains with increasing numbers of demonstrations. Moreover, the model generalizes to out-of-distribution (OOD) problems, demonstrating robust extrapolation capabilities. These empirical findings provide new insights into the mechanisms of ICL in NLP and its potential for solving nonlinear numerical problems.

[334] Tree-of-Reasoning: Towards Complex Medical Diagnosis via Multi-Agent Reasoning with Evidence Tree

Qi Peng, Jialin Cui, Jiayuan Xie, Yi Cai, Qing Li

Main category: cs.AI

TL;DR: The paper proposes Tree-of-Reasoning (ToR), a multi-agent framework to enhance LLMs’ reasoning depth in complex medical diagnosis tasks.

DetailsMotivation: Existing LLMs lack sufficient reasoning depth for complex medical diagnosis, leading to errors.

Method: ToR uses a tree structure to record reasoning paths and clinical evidence, with a cross-validation mechanism for consistency.

Result: Experiments show ToR outperforms baseline methods on real-world medical data.

Conclusion: ToR improves LLMs’ clinical reasoning in complex scenarios.

Abstract: Large language models (LLMs) have shown great potential in the medical domain. However, existing models still fall short when faced with complex medical diagnosis task in the real world. This is mainly because they lack sufficient reasoning depth, which leads to information loss or logical jumps when processing a large amount of specialized medical data, leading to diagnostic errors. To address these challenges, we propose Tree-of-Reasoning (ToR), a novel multi-agent framework designed to handle complex scenarios. Specifically, ToR introduces a tree structure that can clearly record the reasoning path of LLMs and the corresponding clinical evidence. At the same time, we propose a cross-validation mechanism to ensure the consistency of multi-agent decision-making, thereby improving the clinical reasoning ability of multi-agents in complex medical scenarios. Experimental results on real-world medical data show that our framework can achieve better performance than existing baseline methods.

[335] Beyond Surface-Level Detection: Towards Cognitive-Driven Defense Against Jailbreak Attacks via Meta-Operations Reasoning

Rui Pu, Chaozhuo Li, Rui Ha, Litian Zhang, Lirong Qiu, Xi Zhang

Main category: cs.AI

TL;DR: The paper proposes the Cognitive-Driven Defense (CDD) framework to protect LLMs from jailbreak attacks by mimicking human reasoning and using reinforcement learning for generalization.

DetailsMotivation: Existing defenses against jailbreak attacks rely on shallow pattern matching, which fails against novel strategies. CDD aims to address this by targeting the underlying structure of attacks.

Method: CDD uses meta-operations to conceal harmful intent, emulates human reasoning via a structured chain (global perception to localized analysis), and employs supervised fine-tuning and entropy-guided reinforcement learning (EG-GRPO) for generalization.

Result: CDD achieves state-of-the-art defense performance and strong generalization to unseen jailbreak attacks.

Conclusion: The CDD framework effectively defends LLMs by combining structured reasoning and reinforcement learning, outperforming existing methods.

Abstract: Defending large language models (LLMs) against jailbreak attacks is essential for their safe and reliable deployment. Existing defenses often rely on shallow pattern matching, which struggles to generalize to novel and unseen attack strategies. To address this challenge, we propose the Cognitive-Driven Defense (CDD) framework, which targets the underlying structure of jailbreak prompts by applying meta-operations, defined as basic manipulations that conceal harmful intent.CDD emulates human cognitive reasoning through a structured reasoning chain. It begins with a global perception of the prompt and follows with a localized analysis to uncover hidden manipulations. By applying supervised fine-tuning on this structured chain, the model learns to identify and reason about known manipulation patterns. To enhance generalization to unseen threats, an entropy-guided reinforcement learning algorithm (EG-GRPO) is introduced to encourage exploration of new types and variants of meta-operations. Experiments demonstrate that CDD can achieve state-of-the-art defense performance and exhibit strong generalization to unseen jailbreak attacks.

Shuang Liu, Zelong Li, Ruoyun Ma, Haiyan Zhao, Mengnan Du

Main category: cs.AI

TL;DR: ContractEval evaluates open-source vs. proprietary LLMs for legal risk analysis, revealing proprietary models generally outperform but some open-source models show promise.

DetailsMotivation: To address the underexplored potential of LLMs in legal domains and the need for benchmarks to evaluate open-source models for local deployment while ensuring data confidentiality.

Method: Assessed 4 proprietary and 15 open-source LLMs using the CUAD dataset to identify clause-level legal risks in contracts.

Result: Proprietary models lead in correctness and effectiveness, but open-source models show promise in specific areas. Key findings include performance tradeoffs and model behavior insights.

Conclusion: Open-source LLMs need fine-tuning for legal tasks; ContractEval provides a benchmark for future legal-domain LLM development.

Abstract: The potential of large language models (LLMs) in specialized domains such as legal risk analysis remains underexplored. In response to growing interest in locally deploying open-source LLMs for legal tasks while preserving data confidentiality, this paper introduces ContractEval, the first benchmark to thoroughly evaluate whether open-source LLMs could match proprietary LLMs in identifying clause-level legal risks in commercial contracts. Using the Contract Understanding Atticus Dataset (CUAD), we assess 4 proprietary and 15 open-source LLMs. Our results highlight five key findings: (1) Proprietary models outperform open-source models in both correctness and output effectiveness, though some open-source models are competitive in certain specific dimensions. (2) Larger open-source models generally perform better, though the improvement slows down as models get bigger. (3) Reasoning (“thinking”) mode improves output effectiveness but reduces correctness, likely due to over-complicating simpler tasks. (4) Open-source models generate “no related clause” responses more frequently even when relevant clauses are present. This suggests “laziness” in thinking or low confidence in extracting relevant content. (5) Model quantization speeds up inference but at the cost of performance drop, showing the tradeoff between efficiency and accuracy. These findings suggest that while most LLMs perform at a level comparable to junior legal assistants, open-source models require targeted fine-tuning to ensure correctness and effectiveness in high-stakes legal settings. ContractEval offers a solid benchmark to guide future development of legal-domain LLMs.

[337] EoH-S: Evolution of Heuristic Set using LLMs for Automated Heuristic Design

Fei Liu, Yilu Liu, Qingfu Zhang, Xialiang Tong, Mingxuan Yuan

Main category: cs.AI

TL;DR: The paper introduces Automated Heuristic Set Design (AHSD) to generate diverse heuristics for varied problem instances, outperforming single-heuristic methods by up to 60%.

DetailsMotivation: Existing methods design a single heuristic, leading to poor generalization across different problem distributions or settings.

Method: Proposes AHSD, a formulation for LLM-driven heuristic design, and EoH-S, an algorithm with complementary population management and memetic search.

Result: EoH-S consistently outperforms state-of-the-art methods, achieving up to 60% performance improvements.

Conclusion: AHSD and EoH-S effectively address generalization issues in heuristic design, offering significant performance gains.

Abstract: Automated Heuristic Design (AHD) using Large Language Models (LLMs) has achieved notable success in recent years. Despite the effectiveness of existing approaches, they only design a single heuristic to serve all problem instances, often inducing poor generalization across different distributions or settings. To address this issue, we propose Automated Heuristic Set Design (AHSD), a new formulation for LLM-driven AHD. The aim of AHSD is to automatically generate a small-sized complementary heuristic set to serve diverse problem instances, such that each problem instance could be optimized by at least one heuristic in this set. We show that the objective function of AHSD is monotone and supermodular. Then, we propose Evolution of Heuristic Set (EoH-S) to apply the AHSD formulation for LLM-driven AHD. With two novel mechanisms of complementary population management and complementary-aware memetic search, EoH-S could effectively generate a set of high-quality and complementary heuristics. Comprehensive experimental results on three AHD tasks with diverse instances spanning various sizes and distributions demonstrate that EoH-S consistently outperforms existing state-of-the-art AHD methods and achieves up to 60% performance improvements.

[338] MissDDIM: Deterministic and Efficient Conditional Diffusion for Tabular Data Imputation

Youran Zhou, Mohamed Reda Bouadjenek, Sunil Aryal

Main category: cs.AI

TL;DR: MissDDIM is a conditional diffusion framework for tabular imputation, addressing high latency and output variability in existing DDPM-based methods.

DetailsMotivation: Existing DDPM-based methods for missing data imputation suffer from high inference latency and variable outputs, limiting real-world applicability.

Method: MissDDIM adapts Denoising Diffusion Implicit Models (DDIM) for tabular imputation, focusing on reducing variability and latency.

Result: The framework aims to provide more stable and efficient imputation compared to stochastic DDPMs.

Conclusion: MissDDIM offers a promising solution for practical tabular data imputation by mitigating key limitations of current approaches.

Abstract: Diffusion models have recently emerged as powerful tools for missing data imputation by modeling the joint distribution of observed and unobserved variables. However, existing methods, typically based on stochastic denoising diffusion probabilistic models (DDPMs), suffer from high inference latency and variable outputs, limiting their applicability in real-world tabular settings. To address these deficiencies, we present in this paper MissDDIM, a conditional diffusion framework that adapts Denoising Diffusion Implicit Models (DDIM) for tabular imputation. While stochastic sampling enables diverse completions, it also introduces output variability that complicates downstream processing.

[339] T2UE: Generating Unlearnable Examples from Text Descriptions

Xingjun Ma, Hanxun Huang, Tianwei Song, Ye Sun, Yifeng Gao, Yu-Gang Jiang

Main category: cs.AI

TL;DR: T2UE introduces a text-based framework for generating unlearnable examples (UEs) to protect data privacy without exposing original images, resolving the privacy paradox in current UE methods.

DetailsMotivation: Current UE methods require exposing data to third-party services for noise generation, compromising privacy. T2UE aims to eliminate this need by using text descriptions alone.

Method: T2UE employs a text-to-image model to map text descriptions into noise space and an error-minimization framework to create effective unlearnable noise.

Result: T2UE significantly degrades downstream task performance for state-of-the-art models and generalizes across architectures and supervised learning.

Conclusion: T2UE enables zero-contact data protection, safeguarding personal data without direct exposure, offering a scalable and practical solution.

Abstract: Large-scale pre-training frameworks like CLIP have revolutionized multimodal learning, but their reliance on web-scraped datasets, frequently containing private user data, raises serious concerns about misuse. Unlearnable Examples (UEs) have emerged as a promising countermeasure against unauthorized model training, employing carefully crafted unlearnable noise to disrupt the learning of meaningful representations from protected data. Current approaches typically generate UEs by jointly optimizing unlearnable noise for both images and their associated text descriptions (or labels). However, this optimization process is often computationally prohibitive for on-device execution, forcing reliance on external third-party services. This creates a fundamental privacy paradox: users must initially expose their data to these very services to achieve protection, thereby compromising privacy in the process. Such a contradiction has severely hindered the development of practical, scalable data protection solutions. To resolve this paradox, we introduce \textbf{Text-to-Unlearnable Example (T2UE)}, a novel framework that enables users to generate UEs using only text descriptions. T2UE circumvents the need for original image data by employing a text-to-image (T2I) model to map text descriptions into the image (noise) space, combined with an error-minimization framework to produce effective unlearnable noise. Extensive experiments show that T2UE-protected data substantially degrades performance in downstream tasks (e.g., cross-modal retrieval) for state-of-the-art models. Notably, the protective effect generalizes across diverse architectures and even to supervised learning settings. Our work demonstrates the feasibility of “zero-contact data protection”, where personal data can be safeguarded based solely on their textual descriptions, eliminating the need for direct data exposure.

[340] Toward Verifiable Misinformation Detection: A Multi-Tool LLM Agent Framework

Zikun Cui, Tianyi Huang, Chia-En Chiang, Cuiqianhe Du

Main category: cs.AI

TL;DR: The paper introduces a verifiable misinformation detection LLM agent that dynamically verifies claims, assesses source credibility, and synthesizes evidence, outperforming traditional methods in accuracy and transparency.

DetailsMotivation: The rise of LLMs and misinformation necessitates advanced detection methods beyond binary judgments, requiring verifiable and transparent reasoning.

Method: The agent uses three core tools (web search, source credibility assessment, numerical claim verification) for multi-step verification, evidence logging, and comprehensive assessments.

Result: The agent outperforms baseline methods in accuracy, reasoning transparency, and robustness against rewritten content on datasets like FakeNewsNet.

Conclusion: The proposed agent offers a trustworthy AI-assisted fact-checking paradigm, enhancing misinformation detection with verifiable reasoning.

Abstract: With the proliferation of Large Language Models (LLMs), the detection of misinformation has become increasingly important and complex. This research proposes an innovative verifiable misinformation detection LLM agent that goes beyond traditional true/false binary judgments. The agent actively verifies claims through dynamic interaction with diverse web sources, assesses information source credibility, synthesizes evidence, and provides a complete verifiable reasoning process. Our designed agent architecture includes three core tools: precise web search tool, source credibility assessment tool and numerical claim verification tool. These tools enable the agent to execute multi-step verification strategies, maintain evidence logs, and form comprehensive assessment conclusions. We evaluate using standard misinformation datasets such as FakeNewsNet, comparing with traditional machine learning models and LLMs. Evaluation metrics include standard classification metrics, quality assessment of reasoning processes, and robustness testing against rewritten content. Experimental results show that our agent outperforms baseline methods in misinformation detection accuracy, reasoning transparency, and resistance to information rewriting, providing a new paradigm for trustworthy AI-assisted fact-checking.

[341] AgentSME for Simulating Diverse Communication Modes in Smart Education

Wen-Xi Yang, Tian-Fang Zhao

Main category: cs.AI

TL;DR: Proposes AgentSME, a generative agent framework for smart education using LLMs, testing three communication modes (Solo, Mono, Echo) and evaluating accuracy and diversity. Echo mode achieves highest accuracy, while DeepSeek shows greatest diversity.

DetailsMotivation: Addressing the underdevelopment of generative agent models in smart education due to the complexity of personalized human-to-human communication and diverse cognitive behaviors.

Method: Introduces AgentSME, a unified framework with three communication modes (Solo, Mono, Echo), evaluates accuracy and diversity, and tests six LLMs across base- and high-capacity configurations.

Result: Echo communication mode achieves highest accuracy; DeepSeek exhibits greatest diversity in reasoning.

Conclusion: AgentSME improves agent learning capabilities and inspires smart education models, with Echo mode and DeepSeek standing out.

Abstract: Generative agent models specifically tailored for smart education are critical, yet remain relatively underdeveloped. A key challenge stems from the inherent complexity of educational contexts: learners are human beings with various cognitive behaviors, and pedagogy is fundamentally centered on personalized human-to-human communication. To address this issue, this paper proposes AgentSME, a unified generative agent framework powered by LLM. Three directional communication modes are considered in the models, namely Solo, Mono, and Echo, reflecting different types of agency autonomy and communicative reciprocity. Accuracy is adopted as the primary evaluation metric, complemented by three diversity indices designed to assess the diversity of reasoning contents. Six widely used LLMs are tested to validate the robustness of communication modes across different model tiers, which are equally divided into base-capacity and high-capacity configurations. The results show that generative agents that employ the Echo communication mode achieve the highest accuracy scores, while DeepSeek exhibits the greatest diversity. This study provides valuable information to improve agent learning capabilities and inspire smart education models.

[342] Toward a Trustworthy Optimization Modeling Agent via Verifiable Synthetic Data Generation

Vinicius Lima, Dzung T. Phan, Jayant Kalagnanam, Dhaval Patel, Nianjun Zhou

Main category: cs.AI

TL;DR: A framework for training trustworthy LLM agents for optimization modeling using verifiable synthetic data, achieving state-of-the-art performance.

DetailsMotivation: To create reliable LLM agents for optimization tasks by ensuring verifiability and quality in synthetic data generation.

Method: Uses a structured symbolic pipeline to generate natural language, math formulations, and solver code, with verified optimal solutions and teacher-generated demonstrations. Introduces OptiTrust, an LLM agent for multi-stage translation.

Result: Achieves highest accuracy on 6 out of 7 datasets, outperforming next-best by at least 8% on three.

Conclusion: Provides a scalable, verifiable path for building reliable LLM agents in real-world optimization.

Abstract: We present a framework for training trustworthy large language model (LLM) agents for optimization modeling via a verifiable synthetic data generation pipeline. Focusing on linear and mixed-integer linear programming, our approach begins with structured symbolic representations and systematically produces natural language descriptions, mathematical formulations, and solver-executable code. By programmatically constructing each instance with known optimal solutions, the pipeline ensures full verifiability and enables automatic filtering of low-quality demonstrations generated by teacher models. Each dataset instance includes a structured representation of the optimization problem, a corresponding natural language description, the verified optimal solution, and step-by-step demonstrations - generated by a teacher model - that show how to model and solve the problem across multiple optimization modeling languages. This enables supervised fine-tuning of open-source LLMs specifically tailored to optimization tasks. To operationalize this pipeline, we introduce OptiTrust, a modular LLM agent that performs multi-stage translation from natural language to solver-ready code, leveraging stepwise demonstrations, multi-language inference, and majority-vote cross-validation. Our agent achieves state-of-the-art performance on standard benchmarks. Out of 7 datasets, it achieves the highest accuracy on six and outperforms the next-best algorithm by at least 8 percentage on three of them. Our approach provides a scalable, verifiable, and principled path toward building reliable LLM agents for real-world optimization applications.

[343] Can Large Language Models Bridge the Gap in Environmental Knowledge?

Linda Smail, David Santandreu Calonge, Firuz Kamalov, Nur H. Orak

Main category: cs.AI

TL;DR: AI models like GPT-3.5, GPT-4, and others show promise in environmental education but may still require human experts to validate accuracy.

DetailsMotivation: To explore AI's role in bridging knowledge gaps in environmental education among university students.

Method: Used the Environmental Knowledge Test (EKT-19) and targeted questions to compare AI model responses with student knowledge.

Result: AI models have a vast knowledge base useful for education but need human specialists to verify accuracy.

Conclusion: AI can support environmental education but should complement, not replace, human expertise.

Abstract: This research investigates the potential of Artificial Intelligence (AI) models to bridge the knowledge gap in environmental education among university students. By focusing on prominent large language models (LLMs) such as GPT-3.5, GPT-4, GPT-4o, Gemini, Claude Sonnet, and Llama 2, the study assesses their effectiveness in conveying environmental concepts and, consequently, facilitating environmental education. The investigation employs a standardized tool, the Environmental Knowledge Test (EKT-19), supplemented by targeted questions, to evaluate the environmental knowledge of university students in comparison to the responses generated by the AI models. The results of this study suggest that while AI models possess a vast, readily accessible, and valid knowledge base with the potential to empower both students and academic staff, a human discipline specialist in environmental sciences may still be necessary to validate the accuracy of the information provided.

[344] Causal identification with $Y_0$

Charles Tapley Hoyt, Craig Bakker, Richard J. Callahan, Joseph Cottam, August George, Benjamin M. Gyori, Haley M. Hummel, Nathaniel Merrill, Sara Mohammad Taheri, Pruthvi Prakash Navada, Marc-Antoine Parent, Adam Rupe, Olga Vitek, Jeremy Zucker

Main category: cs.AI

TL;DR: The $Y_0$ Python package implements causal identification algorithms for interventional, counterfactual, and transportability queries, aiding researchers in determining causal relationships before estimation.

DetailsMotivation: To provide tools for qualitative causal investigation and symbolic estimand transformation, supporting causal inference from diverse data sources.

Method: Uses a domain-specific language for causal queries, represents causal graphical models (e.g., ADMGs), and implements identification algorithms.

Result: $Y_0$ enables researchers to assess causal estimability and transform queries into non-parametric estimands.

Conclusion: $Y_0$ is a practical tool for causal inference, available as an open-source Python package.

Abstract: We present the $Y_0$ Python package, which implements causal identification algorithms that apply interventional, counterfactual, and transportability queries to data from (randomized) controlled trials, observational studies, or mixtures thereof. $Y_0$ focuses on the qualitative investigation of causation, helping researchers determine whether a causal relationship can be estimated from available data before attempting to estimate how strong that relationship is. Furthermore, $Y_0$ provides guidance on how to transform the causal query into a symbolic estimand that can be non-parametrically estimated from the available data. $Y_0$ provides a domain-specific language for representing causal queries and estimands as symbolic probabilistic expressions, tools for representing causal graphical models with unobserved confounders, such as acyclic directed mixed graphs (ADMGs), and implementations of numerous identification algorithms from the recent causal inference literature. The $Y_0$ source code can be found under the MIT License at https://github.com/y0-causal-inference/y0 and it can be installed with pip install y0.

[345] Geoint-R1: Formalizing Multimodal Geometric Reasoning with Dynamic Auxiliary Constructions

Jingxuan Wei, Caijun Jia, Qi Chen, Honghao He, Linzhuang Sun, Conghui He, Lijun Wu, Bihui Yu, Cheng Tan

Main category: cs.AI

TL;DR: Geoint-R1 is a multimodal framework for formal geometric reasoning, integrating auxiliary element construction, Lean4-based verification, and visualization, outperforming existing models on the Geoint benchmark.

DetailsMotivation: Existing MLLMs struggle with formal geometric reasoning, especially in dynamically constructing and verifying auxiliary elements, necessitating a robust solution like Geoint-R1.

Method: Geoint-R1 combines auxiliary element construction, Lean4 for formal reasoning, and interactive visualization to generate verifiable geometric solutions from text and diagrams.

Result: Geoint-R1 outperforms existing models on the Geoint benchmark, especially in problems requiring auxiliary element construction.

Conclusion: Geoint-R1 advances formal geometric reasoning, offering a scalable and verifiable solution for complex geometric problems.

Abstract: Mathematical geometric reasoning is essential for scientific discovery and educational development, requiring precise logic and rigorous formal verification. While recent advances in Multimodal Large Language Models (MLLMs) have improved reasoning tasks, existing models typically struggle with formal geometric reasoning, particularly when dynamically constructing and verifying auxiliary geometric elements. To address these challenges, we introduce Geoint-R1, a multimodal reasoning framework designed to generate formally verifiable geometric solutions from textual descriptions and visual diagrams. Geoint-R1 uniquely integrates auxiliary elements construction, formal reasoning represented via Lean4, and interactive visualization. To systematically evaluate and advance formal geometric reasoning, we propose the Geoint benchmark, comprising 1,885 rigorously annotated geometry problems across diverse topics such as plane, spatial, and solid geometry. Each problem includes structured textual annotations, precise Lean4 code for auxiliary constructions, and detailed solution steps verified by experts. Extensive experiments demonstrate that Geoint-R1 significantly surpasses existing multimodal and math-specific reasoning models, particularly on challenging problems requiring explicit auxiliary element constructions.

[346] InqEduAgent: Adaptive AI Learning Partners with Gaussian Process Augmentation

Tian-Fang Zhao, Wen-Xi Yang

Main category: cs.AI

TL;DR: The paper introduces InqEduAgent, an LLM-empowered agent model for selecting optimal learning partners in inquiry-oriented education, outperforming traditional methods.

DetailsMotivation: Traditional partner selection methods in inquiry-oriented education lack scientific planning or flexibility, hindering knowledge expansion.

Method: The study designs generative agents to capture learner features and uses an adaptive matching algorithm with Gaussian process augmentation to identify knowledge patterns.

Result: InqEduAgent shows optimal performance in various knowledge-learning scenarios and LLM environments.

Conclusion: The study advances intelligent partner allocation, combining human and AI-based learning partners, with publicly available resources.

Abstract: Collaborative partnership matters in inquiry-oriented education. However, most study partners are selected either rely on experience-based assignments with little scientific planning or build on rule-based machine assistants, encountering difficulties in knowledge expansion and inadequate flexibility. This paper proposes an LLM-empowered agent model for simulating and selecting learning partners tailored to inquiry-oriented learning, named InqEduAgent. Generative agents are designed to capture cognitive and evaluative features of learners in real-world scenarios. Then, an adaptive matching algorithm with Gaussian process augmentation is formulated to identify patterns within prior knowledge. Optimal learning-partner matches are provided for learners facing different exercises. The experimental results show the optimal performance of InqEduAgent in most knowledge-learning scenarios and LLM environment with different levels of capabilities. This study promotes the intelligent allocation of human-based learning partners and the formulation of AI-based learning partners. The code, data, and appendix are publicly available at https://github.com/InqEduAgent/InqEduAgent.

[347] Full-History Graphs with Edge-Type Decoupled Networks for Temporal Reasoning

Osama Mohammed, Jiaxin Pan, Mojtaba Nayyeri, Daniel Hernández, Steffen Staab

Main category: cs.AI

TL;DR: The paper introduces a full-history graph and ETDNet for modeling evolving interactions, outperforming baselines in driver-intention prediction and fraud detection.

DetailsMotivation: To address the need for reasoning over evolving interactions in tasks like traffic prediction and fraud detection, which require tracking entity relations over time.

Method: Proposes a full-history graph with intra- and inter-time-step edges, and ETDNet with parallel modules for graph and temporal attention.

Result: ETDNet achieves 75.6% joint accuracy in Waymo and 88.1% F1 in Elliptic++, outperforming baselines.

Conclusion: Representing structural and temporal relations as distinct edges in a single graph improves performance in dynamic interaction tasks.

Abstract: Modeling evolving interactions among entities is critical in many real-world tasks. For example, predicting driver maneuvers in traffic requires tracking how neighboring vehicles accelerate, brake, and change lanes relative to one another over consecutive frames. Likewise, detecting financial fraud hinges on following the flow of funds through successive transactions as they propagate through the network. Unlike classic time-series forecasting, these settings demand reasoning over who interacts with whom and when, calling for a temporal-graph representation that makes both the relations and their evolution explicit. Existing temporal-graph methods typically use snapshot graphs to encode temporal evolution. We introduce a full-history graph that instantiates one node for every entity at every time step and separates two edge sets: (i) intra-time-step edges that capture relations within a single frame and (ii) inter-time-step edges that connect an entity to itself at consecutive steps. To learn on this graph we design an Edge-Type Decoupled Network (ETDNet) with parallel modules: a graph-attention module aggregates information along intra-time-step edges, a multi-head temporal-attention module attends over an entity’s inter-time-step history, and a fusion module combines the two messages after every layer. Evaluated on driver-intention prediction (Waymo) and Bitcoin fraud detection (Elliptic++), ETDNet consistently surpasses strong baselines, lifting Waymo joint accuracy to 75.6% (vs. 74.1%) and raising Elliptic++ illicit-class F1 to 88.1% (vs. 60.4%). These gains demonstrate the benefit of representing structural and temporal relations as distinct edges in a single graph.

[348] ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools

Shaofeng Yin, Ting Lei, Yang Liu

Main category: cs.AI

TL;DR: ToolVQA is a large-scale multimodal dataset designed to improve tool-augmented LFMs for real-world VQA tasks, featuring real-world contexts and multi-step reasoning. ToolEngine, a novel data generation pipeline, ensures human-like reasoning. Fine-tuned LFMs outperform GPT-3.5-turbo on OOD datasets.

DetailsMotivation: Existing tool-augmented LFMs lack proficiency in real-world, functionally diverse multimodal settings, necessitating a dataset like ToolVQA to bridge this gap.

Method: ToolVQA is constructed using ToolEngine, a pipeline employing DFS and dynamic in-context example matching to simulate human-like tool-use reasoning. The dataset includes 23K instances with 10 multimodal tools across 7 domains.

Result: Fine-tuned 7B LFMs achieve strong performance on ToolVQA and surpass GPT-3.5-turbo on OOD datasets, demonstrating generalizability.

Conclusion: ToolVQA and ToolEngine effectively enhance LFMs’ real-world tool-use proficiency, showcasing improved performance and generalizability.

Abstract: Integrating external tools into Large Foundation Models (LFMs) has emerged as a promising approach to enhance their problem-solving capabilities. While existing studies have demonstrated strong performance in tool-augmented Visual Question Answering (VQA), recent benchmarks reveal significant gaps in real-world tool-use proficiency, particularly in functionally diverse multimodal settings requiring multi-step reasoning. In this work, we introduce ToolVQA, a large-scale multimodal dataset comprising 23K instances, designed to bridge this gap. Unlike previous datasets that rely on synthetic scenarios and simplified queries, ToolVQA features real-world visual contexts and challenging implicit multi-step reasoning tasks, better aligning with real user interactions. To construct this dataset, we propose ToolEngine, a novel data generation pipeline that employs Depth-First Search (DFS) with a dynamic in-context example matching mechanism to simulate human-like tool-use reasoning. ToolVQA encompasses 10 multimodal tools across 7 diverse task domains, with an average inference length of 2.78 reasoning steps per instance. The fine-tuned 7B LFMs on ToolVQA not only achieve impressive performance on our test set but also surpass the large close-sourced model GPT-3.5-turbo on various out-of-distribution (OOD) datasets, demonstrating strong generalizability to real-world tool-use scenarios.

[349] Nemori: Self-Organizing Agent Memory Inspired by Cognitive Science

Jiayan Nan, Wenquan Ma, Wenlong Wu, Yize Chen

Main category: cs.AI

TL;DR: Nemori is a self-organizing memory architecture for LLMs, addressing memory granularity and adaptive learning, outperforming existing systems in long-term contexts.

DetailsMotivation: LLMs lack persistent memory for long-term interactions, and existing memory systems have limitations in granularity and passive learning.

Method: Nemori uses the Two-Step Alignment Principle for memory organization and the Predict-Calibrate Principle for adaptive learning.

Result: Nemori outperforms state-of-the-art systems on LoCoMo and LongMemEval benchmarks, especially in longer contexts.

Conclusion: Nemori provides a viable solution for autonomous agents to handle dynamic, long-term workflows effectively.

Abstract: Large Language Models (LLMs) demonstrate remarkable capabilities, yet their inability to maintain persistent memory in long contexts limits their effectiveness as autonomous agents in long-term interactions. While existing memory systems have made progress, their reliance on arbitrary granularity for defining the basic memory unit and passive, rule-based mechanisms for knowledge extraction limits their capacity for genuine learning and evolution. To address these foundational limitations, we present Nemori, a novel self-organizing memory architecture inspired by human cognitive principles. Nemori’s core innovation is twofold: First, its Two-Step Alignment Principle, inspired by Event Segmentation Theory, provides a principled, top-down method for autonomously organizing the raw conversational stream into semantically coherent episodes, solving the critical issue of memory granularity. Second, its Predict-Calibrate Principle, inspired by the Free-energy Principle, enables the agent to proactively learn from prediction gaps, moving beyond pre-defined heuristics to achieve adaptive knowledge evolution. This offers a viable path toward handling the long-term, dynamic workflows of autonomous agents. Extensive experiments on the LoCoMo and LongMemEval benchmarks demonstrate that Nemori significantly outperforms prior state-of-the-art systems, with its advantage being particularly pronounced in longer contexts.

[350] Adaptive AI Agent Placement and Migration in Edge Intelligence Systems

Xingdan Wang, Jiayi He, Zhiqing Tang, Jianxiong Guo, Jiong Lou, Liping Qian, Tian Wang, Weijia Jia

Main category: cs.AI

TL;DR: The paper introduces a framework for deploying and managing LLM-based AI agents in dynamic edge environments, optimizing resource use and QoS with adaptive placement and migration.

DetailsMotivation: The need for efficient, low-latency AI agents at the edge, given the challenges of limited resources and heterogeneous environments.

Method: Proposes an adaptive framework using ant colony algorithms and LLM-based optimization for agent placement and migration, focusing on resource and latency/cost constraints.

Result: Significantly reduces deployment latency and migration costs, validated on a distributed system with global edge servers.

Conclusion: The solution effectively addresses the challenges of deploying AI agents at the edge, improving efficiency and QoS.

Abstract: The rise of LLMs such as ChatGPT and Claude fuels the need for AI agents capable of real-time task handling. However, migrating data-intensive, multi-modal edge workloads to cloud data centers, traditionally used for agent deployment, introduces significant latency. Deploying AI agents at the edge improves efficiency and reduces latency. However, edge environments present challenges due to limited and heterogeneous resources. Maintaining QoS for mobile users necessitates agent migration, which is complicated by the complexity of AI agents coordinating LLMs, task planning, memory, and external tools. This paper presents the first systematic deployment and management solution for LLM-based AI agents in dynamic edge environments. We propose a novel adaptive framework for AI agent placement and migration in edge intelligence systems. Our approach models resource constraints and latency/cost, leveraging ant colony algorithms and LLM-based optimization for efficient decision-making. It autonomously places agents to optimize resource utilization and QoS and enables lightweight agent migration by transferring only essential state. Implemented on a distributed system using AgentScope and validated across globally distributed edge servers, our solution significantly reduces deployment latency and migration costs.

[351] Compressing Chain-of-Thought in LLMs via Step Entropy

Zeju Li, Jianyuan Zhong, Ziyang Zheng, Xiangyu Wen, Zhijian Xu, Yingying Cheng, Fan Zhang, Qiang Xu

Main category: cs.AI

TL;DR: A novel CoT compression framework using step entropy to prune redundant reasoning steps in LLMs, improving efficiency without significant accuracy loss.

DetailsMotivation: LLMs with CoT prompting generate verbose and redundant reasoning steps, increasing inference costs and reducing efficiency.

Method: Introduces step entropy to quantify redundancy, prunes low-entropy steps, and uses a two-stage training strategy (SFT and GRPO) to teach LLMs to generate compressed CoTs with [SKIP] tokens.

Result: 80% of low-entropy steps can be pruned with minor accuracy degradation, contrasting sharply with random or high-entropy pruning.

Conclusion: The method enhances LLM inference efficiency while preserving accuracy, with implications for practical deployment and understanding reasoning structures.

Abstract: Large Language Models (LLMs) using Chain-of-Thought (CoT) prompting excel at complex reasoning but generate verbose thought processes with considerable redundancy, leading to increased inference costs and reduced efficiency. We introduce a novel CoT compression framework based on step entropy, a metric that quantifies the informational contribution of individual reasoning steps to identify redundancy. Through theoretical analysis and extensive empirical validation on mathematical reasoning benchmarks, we demonstrate that steps with low entropy are indeed highly redundant. Our experiments reveal that an astonishing 80% of low-entropy intermediate steps can be pruned with minor degradation in the final answer accuracy across DeepSeek-R1-7B, 14B and Qwen3-8B. This finding sharply contrasts with random or high-entropy pruning, which severely impairs reasoning performance. Building on this, we propose a novel two-stage training strategy combining Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) reinforcement learning. This approach enables LLMs to autonomously learn to generate compressed COTs during inference by strategically incorporating [SKIP] tokens. Our method significantly enhances LLM inference efficiency while rigorously preserving accuracy, offering profound implications for practical LLM deployment and a deeper understanding of reasoning structures.

[352] CogBench: A Large Language Model Benchmark for Multilingual Speech-Based Cognitive Impairment Assessment

Feng Rui, Zhiyao Luo, Wei Wang, Yuting Song, Yong Liu, Tingting Zhu, Jianqing Li, Xingyao Wang

Main category: cs.AI

TL;DR: CogBench evaluates cross-lingual and cross-site generalizability of LLMs for cognitive impairment assessment from speech, showing LLMs with chain-of-thought prompting outperform conventional models, while LoRA fine-tuning enhances adaptability.

DetailsMotivation: Current speech-based cognitive impairment assessment methods lack generalizability across languages and clinical settings, limiting practical use.

Method: Proposes CogBench, a benchmark using a unified multimodal pipeline to evaluate LLMs on English and Mandarin datasets (ADReSSo, NCMMSC2021-AD, CIR-E). Compares conventional deep learning models and LLMs with chain-of-thought prompting, and explores LoRA fine-tuning.

Result: Conventional models degrade across domains, while LLMs with chain-of-thought prompting adapt better but are sensitive to prompts. LoRA fine-tuning significantly improves generalization.

Conclusion: LLMs with advanced prompting and LoRA fine-tuning offer a promising approach for clinically useful, linguistically robust cognitive assessment tools.

Abstract: Automatic assessment of cognitive impairment from spontaneous speech offers a promising, non-invasive avenue for early cognitive screening. However, current approaches often lack generalizability when deployed across different languages and clinical settings, limiting their practical utility. In this study, we propose CogBench, the first benchmark designed to evaluate the cross-lingual and cross-site generalizability of large language models (LLMs) for speech-based cognitive impairment assessment. Using a unified multimodal pipeline, we evaluate model performance on three speech datasets spanning English and Mandarin: ADReSSo, NCMMSC2021-AD, and a newly collected test set, CIR-E. Our results show that conventional deep learning models degrade substantially when transferred across domains. In contrast, LLMs equipped with chain-of-thought prompting demonstrate better adaptability, though their performance remains sensitive to prompt design. Furthermore, we explore lightweight fine-tuning of LLMs via Low-Rank Adaptation (LoRA), which significantly improves generalization in target domains. These findings offer a critical step toward building clinically useful and linguistically robust speech-based cognitive assessment tools.

[353] A Comparative Study of Neurosymbolic AI Approaches to Interpretable Logical Reasoning

Michael K. Chen

Main category: cs.AI

TL;DR: The paper compares integrative and hybrid neurosymbolic approaches for improving general logical reasoning in LLMs, concluding that the hybrid approach (e.g., LLM-Symbolic Solver) is more promising due to interpretability and retaining LLM advantages.

DetailsMotivation: Current LLMs struggle with deterministic and interpretable logical reasoning, prompting interest in neurosymbolic AI to address these limitations.

Method: Two approaches are analyzed: integrative (Logic Neural Network) and hybrid (LLM-Symbolic Solver). Their performance on domain-agnostic tasks is compared.

Result: The hybrid approach (LLM-SS) outperforms the integrative approach (LNN) in interpretability and leveraging LLM capabilities.

Conclusion: The hybrid approach is more promising for general logical reasoning, and a modular, model-agnostic framework based on LLM-SS is proposed for future work.

Abstract: General logical reasoning, defined as the ability to reason deductively on domain-agnostic tasks, continues to be a challenge for large language models (LLMs). Current LLMs fail to reason deterministically and are not interpretable. As such, there has been a recent surge in interest in neurosymbolic AI, which attempts to incorporate logic into neural networks. We first identify two main neurosymbolic approaches to improving logical reasoning: (i) the integrative approach comprising models where symbolic reasoning is contained within the neural network, and (ii) the hybrid approach comprising models where a symbolic solver, separate from the neural network, performs symbolic reasoning. Both contain AI systems with promising results on domain-specific logical reasoning benchmarks. However, their performance on domain-agnostic benchmarks is understudied. To the best of our knowledge, there has not been a comparison of the contrasting approaches that answers the following question: Which approach is more promising for developing general logical reasoning? To analyze their potential, the following best-in-class domain-agnostic models are introduced: Logic Neural Network (LNN), which uses the integrative approach, and LLM-Symbolic Solver (LLM-SS), which uses the hybrid approach. Using both models as case studies and representatives of each approach, our analysis demonstrates that the hybrid approach is more promising for developing general logical reasoning because (i) its reasoning chain is more interpretable, and (ii) it retains the capabilities and advantages of existing LLMs. To support future works using the hybrid approach, we propose a generalizable framework based on LLM-SS that is modular by design, model-agnostic, domain-agnostic, and requires little to no human input.

[354] Board Game Arena: A Framework and Benchmark for Assessing Large Language Models via Strategic Play

Lucia Cipolina-Kun, Marianna Nezhurina, Jenia Jitsev

Main category: cs.AI

TL;DR: The Board Game Arena library evaluates LLMs’ decision-making in strategic board games, comparing them to other agents, and provides tools for analysis and distributed execution.

DetailsMotivation: To systematically assess LLMs' reasoning and game-theoretic behavior by comparing them with diverse agents in various game scenarios.

Method: The framework integrates board and matrix games, supports multiple agent types (random, human, RL), and uses LiteLLM and vLLM for model access, with distributed execution via Ray.

Result: Enables empirical evaluation of LLMs’ strategic decision-making and reasoning traces.

Conclusion: The library contributes to understanding LLMs’ game-theoretic behavior and reasoning through structured, scalable evaluation.

Abstract: The Board Game Arena library provides a framework for evaluating the decision making abilities of large language models (LLMs) through strategic board games implemented in Google OpenSpiel library. The framework enables systematic comparisons between LLM based agents and other agents (random, human, reinforcement learning agents, etc.) in various game scenarios by wrapping multiple board and matrix games and supporting different agent types. It integrates API access to models via LiteLLM, local model deployment via vLLM, and offers distributed execution through Ray. Additionally it provides extensive analysis tools for the LLM reasoning traces. This paper summarizes the structure, key characteristics, and motivation of the repository, highlighting how it contributes to the empirical evaluation of the reasoning of LLM and game-theoretic behavior

[355] Data Dependency Inference for Industrial Code Generation Based on UML Sequence Diagrams

Wenxin Mao, Zhitao Wang Long Wang, Sirong Chen, Cuiyun Gao, Luyang Cao, Ziming Liu, Qiming Zhang, Jun Zhou, Zhi Jin

Main category: cs.AI

TL;DR: A framework, UML2Dep, uses enhanced UML diagrams and data dependency inference to improve code generation from ambiguous natural language descriptions.

DetailsMotivation: Plain textual descriptions are ambiguous and inadequate for capturing complex requirements in service-oriented architectures.

Method: Proposes UML2Dep, using enhanced UML sequence diagrams with decision tables and API specs, and a data dependency inference task formalized as constrained mathematical reasoning.

Result: The framework rigorously eliminates ambiguity, enhances reasoning accuracy, and reduces cognitive load for complex specifications.

Conclusion: UML2Dep bridges the gap between ambiguous NL descriptions and precise code generation by leveraging formal specifications.

Abstract: Large language models (LLMs) excel at generating code from natural language (NL) descriptions. However, the plain textual descriptions are inherently ambiguous and often fail to capture complex requirements like intricate system behaviors, conditional logic, and architectural constraints; implicit data dependencies in service-oriented architectures are difficult to infer and handle correctly. To bridge this gap, we propose a novel step-by-step code generation framework named UML2Dep by leveraging unambiguous formal specifications of complex requirements. First, we introduce an enhanced Unified Modeling Language (UML) sequence diagram tailored for service-oriented architectures. This diagram extends traditional visual syntax by integrating decision tables and API specifications, explicitly formalizing structural relationships and business logic flows in service interactions to rigorously eliminate linguistic ambiguity. Second, recognizing the critical role of data flow, we introduce a dedicated data dependency inference (DDI) task. DDI systematically constructs an explicit data dependency graph prior to actual code synthesis. To ensure reliability, we formalize DDI as a constrained mathematical reasoning task through novel prompting strategies, aligning with LLMs’ excellent mathematical strengths. Additional static parsing and dependency pruning further reduce context complexity and cognitive load associated with intricate specifications, thereby enhancing reasoning accuracy and efficiency.

[356] Hide and Seek with LLMs: An Adversarial Game for Sneaky Error Generation and Self-Improving Diagnosis

Rui Zou, Mengqi Wei, Yutao Zhu, Jirong Wen, Xin Zhao, Jing Chen

Main category: cs.AI

TL;DR: The paper introduces Hide and Seek Game (HSG), an adversarial framework to improve LLMs’ error diagnosis by generating and detecting subtle errors, outperforming baselines like GPT-4o.

DetailsMotivation: LLMs struggle with diagnosing complex errors due to training focused on correct answers, lacking exposure to errors.

Method: HSG uses adversarial roles (Sneaky and Diagnosis) to dynamically generate and detect deceptive errors, enhancing diagnostic precision.

Result: HSG improves error diagnosis accuracy by 16.8%–31.4% over baselines on math reasoning tasks.

Conclusion: HSG effectively enhances LLMs’ error diagnosis, with a released dataset for future research.

Abstract: Large Language Models (LLMs) excel in reasoning and generation across domains, but still struggle with identifying and diagnosing complex errors. This stems mainly from training objectives that prioritize correct answers, limiting exposure to and learning from errors. While recent studies have begun to address this by introducing error signals, most rely on shallow, static errors, restricting improvement in deep diagnostic ability. To overcome this, we propose Hide and Seek Game (HSG), a dynamic adversarial framework for error generation and diagnosis, and evaluate it on mathematical problem-solving. HSG involves two adversarial roles: Sneaky, which “hides” by generating subtle, deceptive reasoning errors, and Diagnosis, which “seeks” to accurately detect them. Through adversarial co-evolution, both error stealth and diagnostic precision are enhanced. Experiments on several math reasoning tasks show that HSG significantly boosts error diagnosis, achieving 16.8%–31.4% higher accuracy than baselines like GPT-4o. We also release a challenging dataset of deceptive errors and diagnostic annotations as a benchmark for future research.

[357] Multi-Objective Infeasibility Diagnosis for Routing Problems Using Large Language Models

Kai Li, Ruihao Zheng, Xinye Hao, Zhenkun Wang

Main category: cs.AI

TL;DR: MOID combines LLM agents and multi-objective optimization to diagnose and suggest adjustments for infeasible routing models, outperforming existing LLM-based methods.

DetailsMotivation: Existing LLM-based methods fail to consider multiple potential adjustments for infeasible routing models, leading to limited practical insights.

Method: MOID integrates multi-objective optimization (balancing path cost and constraint violation) with LLM agents to generate diverse diagnostic suggestions.

Result: MOID outperforms existing methods by providing multiple actionable suggestions in a single run, enhancing feasibility restoration and decision-making.

Conclusion: MOID offers a more effective and practical approach for diagnosing and resolving infeasible routing models compared to current LLM-based solutions.

Abstract: In real-world routing problems, users often propose conflicting or unreasonable requirements, which result in infeasible optimization models due to overly restrictive or contradictory constraints, leading to an empty feasible solution set. Existing Large Language Model (LLM)-based methods attempt to diagnose infeasible models, but modifying such models often involves multiple potential adjustments that these methods do not consider. To fill this gap, we introduce Multi-Objective Infeasibility Diagnosis (MOID), which combines LLM agents and multi-objective optimization within an automatic routing solver, to provide a set of representative actionable suggestions. Specifically, MOID employs multi-objective optimization to consider both path cost and constraint violation, generating a set of trade-off solutions, each encompassing varying degrees of model adjustments. To extract practical insights from these solutions, MOID utilizes LLM agents to generate a solution analysis function for the infeasible model. This function analyzes these distinct solutions to diagnose the original infeasible model, providing users with diverse diagnostic insights and suggestions. Finally, we compare MOID with several LLM-based methods on 50 types of infeasible routing problems. The results indicate that MOID automatically generates multiple diagnostic suggestions in a single run, providing more practical insights for restoring model feasibility and decision-making compared to existing methods.

[358] Data Overdose? Time for a Quadruple Shot: Knowledge Graph Construction using Enhanced Triple Extraction

Taine J. Elliott, Stephen P. Levitt, Ken Nixon, Martin Bekker

Main category: cs.AI

TL;DR: The paper presents an LLM-based pipeline for extracting and enhancing biomedical knowledge from PubMed abstracts into a knowledge graph, improving accuracy with context variables and exploring relationship inference.

DetailsMotivation: The rapid growth of medical data overwhelms professionals, creating a gap between literature and practical applications. This work aims to bridge that gap by automating knowledge extraction.

Method: A pipeline of LLM agents decomposes PubMed abstracts into propositions, extracts KG triples, enhances them with ontological data, and adds context to form quadruples. Accuracy is validated via cosine similarity.

Result: The system achieved an average cosine similarity of 0.874 for generated sentences. Enhanced triples (quadruples) showed higher similarity than ordinary triples. LLMs also inferred new relationships.

Conclusion: This approach offers a real-time, centralized knowledge source for medical practitioners and has potential applications in other fields.

Abstract: The rapid expansion of publicly-available medical data presents a challenge for clinicians and researchers alike, increasing the gap between the volume of scientific literature and its applications. The steady growth of studies and findings overwhelms medical professionals at large, hindering their ability to systematically review and understand the latest knowledge. This paper presents an approach to information extraction and automatic knowledge graph (KG) generation to identify and connect biomedical knowledge. Through a pipeline of large language model (LLM) agents, the system decomposes 44 PubMed abstracts into semantically meaningful proposition sentences and extracts KG triples from these sentences. The triples are enhanced using a combination of open domain and ontology-based information extraction methodologies to incorporate ontological categories. On top of this, a context variable is included during extraction to allow the triple to stand on its own - thereby becoming `quadruples’. The extraction accuracy of the LLM is validated by comparing natural language sentences generated from the enhanced triples to the original propositions, achieving an average cosine similarity of 0.874. The similarity for generated sentences of enhanced triples were compared with generated sentences of ordinary triples showing an increase as a result of the context variable. Furthermore, this research explores the ability for LLMs to infer new relationships and connect clusters in the knowledge base of the knowledge graph. This approach leads the way to provide medical practitioners with a centralised, updated in real-time, and sustainable knowledge source, and may be the foundation of similar gains in a wide variety of fields.

[359] Toward a Graph-Theoretic Model of Belief: Confidence, Credibility, and Structural Coherence

Saleh Nikooroo

Main category: cs.AI

TL;DR: The paper introduces a graph-based formalism for belief systems, distinguishing credibility and confidence, to better represent internal structure and coherence without requiring belief updating or deductive closure.

DetailsMotivation: Existing representations of belief systems obscure internal structure, conflate credibility with coherence, and fail to model fragmented or contradictory states.

Method: Proposes a directed, weighted graph framework where nodes are beliefs, edges encode epistemic relationships, and two functions assign credibility and confidence.

Result: The model provides a foundational substrate for analyzing belief organization, coherence, tensions, and representational limits, offering richer classification than existing approaches.

Conclusion: This formalism enables a more nuanced representation of belief systems by separating structure from strength, addressing limitations of probabilistic, logical, or argumentation-based models.

Abstract: Belief systems are often treated as globally consistent sets of propositions or as scalar-valued probability distributions. Such representations tend to obscure the internal structure of belief, conflate external credibility with internal coherence, and preclude the modeling of fragmented or contradictory epistemic states. This paper introduces a minimal formalism for belief systems as directed, weighted graphs. In this framework, nodes represent individual beliefs, edges encode epistemic relationships (e.g., support or contradiction), and two distinct functions assign each belief a credibility (reflecting source trust) and a confidence (derived from internal structural support). Unlike classical probabilistic models, our approach does not assume prior coherence or require belief updating. Unlike logical and argumentation-based frameworks, it supports fine-grained structural representation without committing to binary justification status or deductive closure. The model is purely static and deliberately excludes inference or revision procedures. Its aim is to provide a foundational substrate for analyzing the internal organization of belief systems, including coherence conditions, epistemic tensions, and representational limits. By distinguishing belief structure from belief strength, this formalism enables a richer classification of epistemic states than existing probabilistic, logical, or argumentation-based approaches.

[360] Semantic-aware Graph-guided Behavior Sequences Generation with Large Language Models for Smart Homes

Zhiyao Xu, Dan Zhao, Qingsong Zou, Qing Li, Yong Jiang, Yuhang Wang, Jingyu Xiao

Main category: cs.AI

TL;DR: SmartGen is an LLM-based framework that synthesizes context-aware user behavior data to adapt smart home models to behavioral drift, improving anomaly detection and behavior prediction.

DetailsMotivation: Behavioral drift in smart homes (due to seasonal changes, lifestyle shifts, etc.) makes static models brittle, and collecting new data is impractical.

Method: SmartGen includes time/semantic-aware splitting, sequence compression, graph-guided synthesis, and outlier filtering to generate valid behavior data.

Result: SmartGen improves anomaly detection by 85.43% and behavior prediction by 70.51% on average.

Conclusion: SmartGen effectively addresses behavioral drift by generating high-quality synthetic data, enhancing model adaptability.

Abstract: As smart homes become increasingly prevalent, intelligent models are widely used for tasks such as anomaly detection and behavior prediction. These models are typically trained on static datasets, making them brittle to behavioral drift caused by seasonal changes, lifestyle shifts, or evolving routines. However, collecting new behavior data for retraining is often impractical due to its slow pace, high cost, and privacy concerns. In this paper, we propose SmartGen, an LLM-based framework that synthesizes context-aware user behavior data to support continual adaptation of downstream smart home models. SmartGen consists of four key components. First, we design a Time and Semantic-aware Split module to divide long behavior sequences into manageable, semantically coherent subsequences under dual time-span constraints. Second, we propose Semantic-aware Sequence Compression to reduce input length while preserving representative semantics by clustering behavior mapping in latent space. Third, we introduce Graph-guided Sequence Synthesis, which constructs a behavior relationship graph and encodes frequent transitions into prompts, guiding the LLM to generate data aligned with contextual changes while retaining core behavior patterns. Finally, we design a Two-stage Outlier Filter to identify and remove implausible or semantically inconsistent outputs, aiming to improve the factual coherence and behavioral validity of the generated sequences. Experiments on three real-world datasets demonstrate that SmartGen significantly enhances model performance on anomaly detection and behavior prediction tasks under behavioral drift, with anomaly detection improving by 85.43% and behavior prediction by 70.51% on average. The code is available at https://github.com/horizonsinzqs/SmartGen.

[361] VQA support to Arabic Language Learning Educational Tool

Khaled Bachir Delassi, Lakhdar Zeggane, Hadda Cherroun, Abdelhamid Haouhat, Kaoutar Bouzouad

Main category: cs.AI

TL;DR: An AI-powered tool for Arabic language learning uses visual quizzes and active learning to improve proficiency, showing promising results in evaluations.

DetailsMotivation: The scarcity of modern educational tools for Arabic language learning, especially those supporting active learning, drives the need for this AI-powered solution.

Method: The tool employs Vision-Language Pretraining models and Large Language Models to create interactive visual quizzes, focusing on vocabulary, grammar, and comprehension.

Result: Evaluation with 1266 quizzes and human feedback shows suitable accuracy, validating the tool’s effectiveness.

Conclusion: The tool successfully bridges gaps in Arabic education, offering personalized and interactive learning experiences.

Abstract: We address the problem of scarcity of educational Arabic Language Learning tools that advocate modern pedagogical models such as active learning which ensures language proficiency. In fact, we investigate the design and evaluation of an AI-powered educational tool designed to enhance Arabic language learning for non-native speakers with beginner-to-intermediate proficiency level. The tool leverages advanced AI models to generate interactive visual quizzes, deploying Visual Question Answering as the primary activity. Adopting a constructivist learning approach, the system encourages active learning through real-life visual quizzes, and image-based questions that focus on improving vocabulary, grammar, and comprehension. The system integrates Vision-Language Pretraining models to generate contextually relevant image description from which Large Language Model generate assignments based on customized Arabic language Learning quizzes thanks to prompting. The effectiveness of the tool is evaluated through a manual annotated benchmark consisting of 1266 real-life visual quizzes, with human participants providing feedback. The results show a suitable accuracy rates, validating the tool’s potential to bridge the gap in Arabic language education and highlighting the tool’s promise as a reliable, AI-powered resource for Arabic learners, offering personalized and interactive learning experiences.

[362] Error Detection and Correction for Interpretable Mathematics in Large Language Models

Yijin Yang, Cristina Cornelio, Mario Leiva, Paulo Shakarian

Main category: cs.AI

TL;DR: EDCIM introduces a method to detect and correct errors in LLM-generated intermediate steps for interpretable mathematics tasks, balancing cost and accuracy.

DetailsMotivation: LLMs often produce errors in multi-step reasoning, leading to inaccurate predictions and struggles with output format adherence, especially in tasks like mathematics.

Method: EDCIM generates a system of equations using LLMs, applies symbolic error detection, and provides feedback for correction, integrating lightweight and proprietary LLMs for efficiency.

Result: EDCIM reduces computational and financial costs while improving prediction accuracy when properly configured.

Conclusion: EDCIM effectively addresses LLM error propagation in interpretable mathematics tasks, offering a cost-efficient and accurate solution.

Abstract: Recent large language models (LLMs) have demonstrated the ability to perform explicit multi-step reasoning such as chain-of-thought prompting. However, their intermediate steps often contain errors that can propagate leading to inaccurate final predictions. Additionally, LLMs still struggle with hallucinations and often fail to adhere to prescribed output formats, which is particularly problematic for tasks like generating mathematical expressions or source code. This work introduces EDCIM (Error Detection and Correction for Interpretable Mathematics), a method for detecting and correcting these errors in interpretable mathematics tasks, where the model must generate the exact functional form that explicitly solve the problem (expressed in natural language) rather than a black-box solution. EDCIM uses LLMs to generate a system of equations for a given problem, followed by a symbolic error-detection framework that identifies errors and provides targeted feedback for LLM-based correction. To optimize efficiency, EDCIM integrates lightweight, open-source LLMs with more powerful proprietary models, balancing cost and accuracy. This balance is controlled by a single hyperparameter, allowing users to control the trade-off based on their cost and accuracy requirements. Experimental results across different datasets show that EDCIM significantly reduces both computational and financial costs, while maintaining, and even improving, prediction accuracy when the balance is properly configured.

[363] Hidden Dynamics of Massive Activations in Transformer Training

Jorge Gallego-Feliciano, S. Aaron McClendon, Juan Morinelli, Stavros Zervoudakis, Antonios Saravanos

Main category: cs.AI

TL;DR: The paper analyzes the emergence of massive activations in transformers during training, revealing predictable patterns modeled by an exponentially-modulated logarithmic function. A framework predicts these patterns from architectural specs, aiding model design.

DetailsMotivation: To understand the temporal dynamics of massive activations in transformers during training, which are critical for model functionality but poorly understood.

Method: Systematic analysis of Pythia model family across training checkpoints, modeling activation patterns with a five-parameter function, and developing a predictive framework.

Result: Massive activations follow predictable patterns; the framework predicts parameters accurately for steady-state and moderately for timing/magnitude.

Conclusion: Massive activation emergence is design-dependent and predictable, enabling control over model stability, training, and optimization.

Abstract: Massive activations are scalar values in transformer hidden states that achieve values orders of magnitude larger than typical activations and have been shown to be critical for model functionality. While prior work has characterized these phenomena in fully trained models, the temporal dynamics of their emergence during training remain poorly understood. We present the first comprehensive analysis of massive activation development throughout transformer training, using the Pythia model family as our testbed. Through systematic analysis of various model sizes across multiple training checkpoints, we demonstrate that massive activation emergence follows predictable mathematical patterns that can be accurately modeled using an exponentially-modulated logarithmic function with five key parameters. We develop a machine learning framework to predict these mathematical parameters from architectural specifications alone, achieving high accuracy for steady-state behavior and moderate accuracy for emergence timing and magnitude. These findings enable architects to predict and potentially control key aspects of massive activation emergence through design choices, with significant implications for model stability, training cycle length, interpretability, and optimization. Our findings demonstrate that the emergence of massive activations is governed by model design and can be anticipated, and potentially controlled, before training begins.

[364] Refining Critical Thinking in LLM Code Generation: A Faulty Premise-based Evaluation Framework

Jialin Li, Jinzhe Li, Gengxu Li, Yi Chang, Yuan Wu

Main category: cs.AI

TL;DR: FPBench is a framework evaluating LLMs’ code generation under faulty premises, revealing poor reasoning, diminishing returns, and distinct defect patterns.

DetailsMotivation: To address LLMs' reliance on faulty premises in code generation, leading to hallucinations and lack of self-scrutiny.

Method: FPBench systematically constructs three faulty premise categories and uses multi-dimensional metrics to assess 15 LLMs.

Result: Most models perform poorly under faulty premises, show diminishing returns, and exhibit distinct defect patterns.

Conclusion: FPBench highlights the need for LLMs to verify premises and offers a pathway for reliable, human-centric code generation.

Abstract: With the advancement of code generation capabilities in large language models (LLMs), their reliance on input premises has intensified. When users provide inputs containing faulty premises, the probability of code generation hallucinations rises significantly, exposing deficiencies in their self-scrutiny capabilities. This paper proposes Faulty Premises Bench (FPBench), the first code generation evaluation framework targeting faulty premises. By systematically constructing three categories of faulty premises and integrating multi-dimensional evaluation metrics, it conducts in-depth assessments of 15 representative LLMs. The key findings are as follows: (1) Most models exhibit poor reasoning abilities and suboptimal code generation performance under faulty premises, heavily relying on explicit prompts for error detection, with limited self-scrutiny capabilities; (2) Faulty premises trigger a point of diminishing returns in resource investment, leading to blindly increasing length fails to enhance quality; (3) The three types of faulty premises respectively activate distinct defect patterns in models, revealing a triple dissociation in the cognitive mechanisms of code generation models. This study not only highlights the urgent need for LLMs to proactively verify premises in code generation but also, through the proposed FPBench framework and multi-dimensional evaluation system, provides a theoretical foundation and practical pathway for developing reliable, human-centric code generation models.

[365] Automated Algorithmic Discovery for Gravitational-Wave Detection Guided by LLM-Informed Evolutionary Monte Carlo Tree Search

He Wang, Liang Zeng

Main category: cs.AI

TL;DR: Evo-MCTS improves gravitational-wave signal detection by 20.2% over existing methods, offering interpretable and novel algorithmic solutions.

DetailsMotivation: Address limitations of matched filtering (computationally expensive) and deep neural networks (opaque decision logic) in gravitational-wave signal identification.

Method: Proposes Evolutionary Monte Carlo Tree Search (Evo-MCTS), combining tree search, evolutionary optimization, and language model heuristics.

Result: Achieves 20.2% improvement on MLGWSC-1 benchmark, with interpretable and high-performing algorithmic variants.

Conclusion: Evo-MCTS provides a transferable framework for automated algorithmic discovery in computational science.

Abstract: Computational scientific discovery increasingly relies on algorithms to process complex data and identify meaningful patterns - yet faces persistent challenges in gravitational-wave signal identification. While existing algorithmic approaches like matched filtering (MF) and deep neural networks (DNNs) have achieved partial success, their limitations directly stem from fundamental limitations: MF’s excessive computational demands arise from its reliance on predefined theoretical waveform templates, while DNNs’ black-box architectures obscure decision logic and introduce hidden biases. We propose Evolutionary Monte Carlo Tree Search (Evo-MCTS), a framework that addresses these limitations through systematic algorithm space exploration guided by domain-aware physical constraints. Our approach combines tree-structured search with evolutionary optimization and large language model heuristics to create interpretable algorithmic solutions. Our Evo-MCTS framework demonstrates substantial improvements, achieving a 20.2% improvement over state-of-the-art gravitational wave detection algorithms on the MLGWSC-1 benchmark dataset. High-performing algorithm variants consistently exceed thresholds. The framework generates human-interpretable algorithmic pathways that reveal distinct performance patterns. Beyond performance improvements, our framework discovers novel algorithmic combinations, thereby establishing a transferable methodology for automated algorithmic discovery across computational science domains.

[366] Agent Lightning: Train ANY AI Agents with Reinforcement Learning

Xufang Luo, Yuge Zhang, Zhiyuan He, Zilong Wang, Siyun Zhao, Dongsheng Li, Luna K. Qiu, Yuqing Yang

Main category: cs.AI

TL;DR: Agent Lightning is a decoupled RL framework for training LLMs in AI agents, enabling integration with diverse agent frameworks without code changes. It uses hierarchical RL and a unified data interface for complex tasks.

DetailsMotivation: Existing methods tightly couple RL training with agents or rely on masking, limiting flexibility. Agent Lightning aims to decouple training and execution for broader compatibility.

Method: Formulates agent execution as a Markov decision process, introduces a hierarchical RL algorithm (LightningRL) with credit assignment, and uses a Training-Agent Disaggregation architecture.

Result: Demonstrates stable improvements in tasks like text-to-SQL, retrieval-augmented generation, and math tool-use, proving real-world applicability.

Conclusion: Agent Lightning offers a flexible, scalable solution for RL-based training of LLMs in diverse agent frameworks, with potential for real-world deployment.

Abstract: We present Agent Lightning, a flexible and extensible framework that enables Reinforcement Learning (RL)-based training of Large Language Models (LLMs) for any AI agent. Unlike existing methods that tightly couple RL training with agent or rely on sequence concatenation with masking, Agent Lightning achieves complete decoupling between agent execution and training, allowing seamless integration with existing agents developed via diverse ways (e.g., using frameworks like LangChain, OpenAI Agents SDK, AutoGen, and building from scratch) with almost ZERO code modifications. By formulating agent execution as Markov decision process, we define an unified data interface and propose a hierarchical RL algorithm, LightningRL, which contains a credit assignment module, allowing us to decompose trajectories generated by ANY agents into training transition. This enables RL to handle complex interaction logic, such as multi-agent scenarios and dynamic workflows. For the system design, we introduce a Training-Agent Disaggregation architecture, and brings agent observability frameworks into agent runtime, providing a standardized agent finetuning interface. Experiments across text-to-SQL, retrieval-augmented generation, and math tool-use tasks demonstrate stable, continuous improvements, showcasing the framework’s potential for real-world agent training and deployment.

[367] A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems

Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, Chengwei Qin, Peifeng Wang, Silvio Savarese, Caiming Xiong, Shafiq Joty

Main category: cs.AI

TL;DR: The paper surveys reasoning in large language models (LLMs), categorizing methods by regimes (inference vs. training) and architectures (standalone vs. agentic systems), while analyzing input and output techniques.

DetailsMotivation: To systematically understand and categorize the evolving capabilities of LLMs in reasoning, distinguishing them from conventional models.

Method: Categorizes methods along regimes (inference/training) and architectures (standalone/agentic), analyzing input (prompting) and output (refinement) techniques.

Result: Highlights trends like learning-to-reason and agentic workflows, covering algorithms from fine-tuning to RL, and designs like generator-evaluator.

Conclusion: Provides a structured overview of LLM reasoning, showcasing advancements and emerging trends in the field.

Abstract: Reasoning is a fundamental cognitive process that enables logical inference, problem-solving, and decision-making. With the rapid advancement of large language models (LLMs), reasoning has emerged as a key capability that distinguishes advanced AI systems from conventional models that empower chatbots. In this survey, we categorize existing methods along two orthogonal dimensions: (1) Regimes, which define the stage at which reasoning is achieved (either at inference time or through dedicated training); and (2) Architectures, which determine the components involved in the reasoning process, distinguishing between standalone LLMs and agentic compound systems that incorporate external tools, and multi-agent collaborations. Within each dimension, we analyze two key perspectives: (1) Input level, which focuses on techniques that construct high-quality prompts that the LLM condition on; and (2) Output level, which methods that refine multiple sampled candidates to enhance reasoning quality. This categorization provides a systematic understanding of the evolving landscape of LLM reasoning, highlighting emerging trends such as the shift from inference-scaling to learning-to-reason (e.g., DeepSeek-R1), and the transition to agentic workflows (e.g., OpenAI Deep Research, Manus Agent). Additionally, we cover a broad spectrum of learning algorithms, from supervised fine-tuning to reinforcement learning such as PPO and GRPO, and the training of reasoners and verifiers. We also examine key designs of agentic workflows, from established patterns like generator-evaluator and LLM debate to recent innovations. …

[368] Antidistillation Sampling

Yash Savani, Asher Trockman, Zhili Feng, Avi Schwarzschild, Alexander Robey, Marc Finzi, J. Zico Kolter

Main category: cs.AI

TL;DR: Antidistillation sampling modifies token probabilities to hinder distillation while maintaining model performance.

DetailsMotivation: Prevent model distillation from exploiting reasoning traces without degrading utility.

Method: Strategic modification of next-token probability distributions to poison reasoning traces.

Result: Reduced effectiveness of distillation while preserving model utility.

Conclusion: Antidistillation sampling offers a practical solution to protect models from distillation vulnerabilities.

Abstract: Frontier models that generate extended reasoning traces inadvertently produce rich token sequences that can facilitate model distillation. Recognizing this vulnerability, model owners may seek sampling strategies that limit the effectiveness of distillation without compromising model performance. Antidistillation sampling provides exactly this capability. By strategically modifying a model’s next-token probability distribution, antidistillation sampling poisons reasoning traces, rendering them significantly less effective for distillation while preserving the model’s practical utility. For further details, see https://antidistillation.com.

[369] Learning telic-controllable state representations

Nadav Amir, Stas Tiomkin

Main category: cs.AI

TL;DR: A framework for learning state representations in bounded agents, coupling descriptive and prescriptive aspects through goal-directed states, balancing flexibility and complexity.

DetailsMotivation: To address the interdependence of goals and state representations in reinforcement learning, where traditional models assume fixed state representations.

Method: Introduces telic-controllability and an algorithm for learning telic-controllable state representations, demonstrated in a simulated navigation task.

Result: The framework successfully balances goal flexibility and cognitive complexity through deliberate ignorance.

Conclusion: The approach highlights the importance of coupling descriptive and prescriptive aspects in learning state representations for bounded agents.

Abstract: Computational models of purposeful behavior comprise both descriptive and prescriptive aspects, used respectively to ascertain and evaluate situations in the world. In reinforcement learning, prescriptive reward functions are assumed to depend on predefined and fixed descriptive state representations. Alternatively, these two aspects may emerge interdependently: goals can shape the acquired state representations and vice versa. Here, we present a computational framework for state representation learning in bounded agents, where descriptive and prescriptive aspects are coupled through the notion of goal-directed, or telic, states. We introduce the concept of telic-controllability to characterize the tradeoff between the granularity of a telic state representation and the policy complexity required to reach all telic states. We propose an algorithm for learning telic-controllable state representations, illustrating it using a simulated navigation task. Our framework highlights the role of deliberate ignorance – knowing what to ignore – for learning state representations that balance goal flexibility and cognitive complexity.

[370] LLM-Generated Heuristics for AI Planning: Do We Even Need Domain-Independence Anymore?

Alexander Tuisov, Yonatan Vernik, Alexander Shleyfman

Main category: cs.AI

TL;DR: LLMs can generate domain-specific heuristics for AI planning, challenging traditional domain-independent methods. They show promise in performance and handling non-PDDL tasks.

DetailsMotivation: To explore if LLMs can replace or complement domain-independent heuristics in AI planning by generating tailored heuristics from task descriptions.

Method: Use LLMs to derive planning heuristics from task descriptions (successor generators and goal tests) and compare them with traditional domain-independent methods.

Result: LLM-generated heuristics achieve state-of-the-art performance in some domains and solve tasks without PDDL representation.

Conclusion: LLMs offer a viable alternative or complement to domain-independent heuristics, potentially signaling a paradigm shift in AI planning.

Abstract: Domain-independent heuristics have long been a cornerstone of AI planning, offering general solutions applicable across a wide range of tasks without requiring domain-specific engineering. However, the advent of large language models (LLMs) presents an opportunity to generate heuristics tailored to specific planning problems, potentially challenging the necessity of domain independence as a strict design principle. In this paper, we explore the use of LLMs to automatically derive planning heuristics from task descriptions represented as successor generators and goal tests written in general purpose programming language. We investigate the trade-offs between domain-specific LLM-generated heuristics and traditional domain-independent methods in terms of computational efficiency and explainability. Our experiments demonstrate that LLMs can create heuristics that achieve state-of-the-art performance on some standard IPC domains, as well as their ability to solve problems that lack an adequate Planning Domain Definition Language ({\sc pddl}) representation. We discuss whether these results signify a paradigm shift and how they can complement existing approaches.

[371] Game Theory Meets Large Language Models: A Systematic Survey with Taxonomy and New Frontiers

Haoran Sun, Yusen Wu, Peng Wang, Wei Chen, Yukun Cheng, Xiaotie Deng, Xu Chu

Main category: cs.AI

TL;DR: This paper surveys the bidirectional relationship between game theory and LLMs, proposing a taxonomy of four research perspectives and identifying future challenges.

DetailsMotivation: Existing surveys narrowly focus on using game theory to evaluate LLMs, leaving a gap in understanding the bidirectional influence between the two fields.

Method: The paper introduces a novel taxonomy categorizing research into four perspectives: evaluating LLMs in games, improving LLMs with game theory, modeling LLM competition, and using LLMs to advance game theory.

Result: The survey systematically explores the interdisciplinary landscape, highlighting mutual influence and fostering progress.

Conclusion: The paper outlines key challenges and future directions, emphasizing the potential for further research at the intersection of game theory and LLMs.

Abstract: Game theory is a foundational framework for analyzing strategic interactions, and its intersection with large language models (LLMs) is a rapidly growing field. However, existing surveys mainly focus narrowly on using game theory to evaluate LLM behavior. This paper provides the first comprehensive survey of the bidirectional relationship between Game Theory and LLMs. We propose a novel taxonomy that categorizes the research in this intersection into four distinct perspectives: (1) evaluating LLMs in game-based scenarios; (2) improving LLMs using game-theoretic concepts for better interpretability and alignment; (3) modeling the competitive landscape of LLM development and its societal impact; and (4) leveraging LLMs to advance game models and to solve corresponding game theory problems. Furthermore, we identify key challenges and outline future research directions. By systematically investigating this interdisciplinary landscape, our survey highlights the mutual influence of game theory and LLMs, fostering progress at the intersection of these fields.

[372] REALM-Bench: A Benchmark for Evaluating Multi-Agent Systems on Real-world, Dynamic Planning and Scheduling Tasks

Longling Geng, Edward Y. Chang

Main category: cs.AI

TL;DR: A benchmark suite for evaluating LLMs and multi-agent systems in real-world planning and scheduling, featuring 14 scalable problems, 15 comparison methods, and multiple evaluation metrics and frameworks.

DetailsMotivation: To provide a standardized and comprehensive framework for assessing AI planning systems, addressing multi-agent coordination, dynamic disruptions, and scalability.

Method: Includes 14 planning problems with varying complexity, 15 comparison methods, and baseline implementations using 3+ LLMs and 4 frameworks for testing.

Result: Enables rigorous evaluation of single-agent and multi-agent planning capabilities through scalable problem dimensions and standardized metrics.

Conclusion: Aims to drive progress in developing adaptable, robust, and scalable AI planning systems for real-world applications by making the benchmark publicly available.

Abstract: This benchmark suite provides a comprehensive evaluation framework for assessing both individual LLMs and multi-agent systems in Real-world planning and scheduling scenarios. The suite encompasses 14 designed planning and scheduling problems that progress from basic to highly complex, incorporating key aspects such as multi-agent coordination, inter-agent dependencies, and dynamic environmental disruptions. Each problem can be scaled along three dimensions: the number of parallel planning threads, the complexity of inter-dependencies, and the frequency of unexpected disruptions requiring Real-time adaptation. The benchmark includes 14 detailed problem specifications, 15 comparison methods including Random, LPT, SPT, STPT, MPSR, DRL-Liu, GP, GEP, LSO, SPT/TWKR, DRL-Chen, DRL-Zhang, 2+ evaluation metrics, and baseline implementations using 3+ LLMs including GPT-4o, Claude-3.7, DeepSeek-R1, and 4 contemporary frameworks including LangGraph, AutoGen, CrewAI, and Swarm, enabling rigorous testing of both single-agent and multi-agent planning capabilities. Through standardized evaluation criteria and scalable complexity, this benchmark aims to be opened to public, and drive progress in developing more adaptable, robust, and scalable AI planning systems for Real-world applications.

[373] A Survey of WebAgents: Towards Next-Generation AI Agents for Web Automation with Large Foundation Models

Liangbo Ning, Ziran Liang, Zhuohang Jiang, Haohao Qu, Yujuan Ding, Wenqi Fan, Xiao-yong Wei, Shanru Lin, Hui Liu, Philip S. Yu, Qing Li

Main category: cs.AI

TL;DR: The paper surveys WebAgents, AI-driven autonomous agents for automating repetitive web tasks, leveraging Large Foundation Models (LFMs) for enhanced performance and convenience.

DetailsMotivation: Repetitive web tasks negatively impact quality of life; AI Agents (WebAgents) can automate these tasks, improving productivity.

Method: Review of existing research on WebAgents, focusing on architectures, training, and trustworthiness.

Result: LFMs show promise in developing powerful WebAgents for automating web tasks, enhancing convenience.

Conclusion: The survey highlights the potential of WebAgents and suggests future research directions for deeper insights.

Abstract: With the advancement of web techniques, they have significantly revolutionized various aspects of people’s lives. Despite the importance of the web, many tasks performed on it are repetitive and time-consuming, negatively impacting overall quality of life. To efficiently handle these tedious daily tasks, one of the most promising approaches is to advance autonomous agents based on Artificial Intelligence (AI) techniques, referred to as AI Agents, as they can operate continuously without fatigue or performance degradation. In the context of the web, leveraging AI Agents – termed WebAgents – to automatically assist people in handling tedious daily tasks can dramatically enhance productivity and efficiency. Recently, Large Foundation Models (LFMs) containing billions of parameters have exhibited human-like language understanding and reasoning capabilities, showing proficiency in performing various complex tasks. This naturally raises the question: `Can LFMs be utilized to develop powerful AI Agents that automatically handle web tasks, providing significant convenience to users?’ To fully explore the potential of LFMs, extensive research has emerged on WebAgents designed to complete daily web tasks according to user instructions, significantly enhancing the convenience of daily human life. In this survey, we comprehensively review existing research studies on WebAgents across three key aspects: architectures, training, and trustworthiness. Additionally, several promising directions for future research are explored to provide deeper insights.

[374] HypRL: Reinforcement Learning of Control Policies for Hyperproperties

Tzu-Han Hsu, Arshia Rafieioskouei, Borzoo Bonakdarpour

Main category: cs.AI

TL;DR: HYPRL is a specification-guided MARL framework using HyperLTL to optimize policies for complex tasks, outperforming baselines.

DetailsMotivation: Addressing the challenge of reward shaping in MARL for complex tasks where existing methods fall short.

Method: Uses HyperLTL for specification, Skolemization for quantifier alternations, and robustness functions for reward shaping in an RL framework.

Result: Demonstrated effectiveness on benchmarks like safety-aware planning and Deep Sea Treasure, outperforming baselines.

Conclusion: HYPRL efficiently learns policies satisfying HyperLTL specifications, proving superior to existing approaches.

Abstract: Reward shaping in multi-agent reinforcement learning (MARL) for complex tasks remains a significant challenge. Existing approaches often fail to find optimal solutions or cannot efficiently handle such tasks. We propose HYPRL, a specification-guided reinforcement learning framework that learns control policies w.r.t. hyperproperties expressed in HyperLTL. Hyperproperties constitute a powerful formalism for specifying objectives and constraints over sets of execution traces across agents. To learn policies that maximize the satisfaction of a HyperLTL formula $\phi$, we apply Skolemization to manage quantifier alternations and define quantitative robustness functions to shape rewards over execution traces of a Markov decision process with unknown transitions. A suitable RL algorithm is then used to learn policies that collectively maximize the expected reward and, consequently, increase the probability of satisfying $\phi$. We evaluate HYPRL on a diverse set of benchmarks, including safety-aware planning, Deep Sea Treasure, and the Post Correspondence Problem. We also compare with specification-driven baselines to demonstrate the effectiveness and efficiency of HYPRL.

[375] Genetic Programming with Reinforcement Learning Trained Transformer for Real-World Dynamic Scheduling Problems

Xinan Chen, Rong Qu, Jing Dong, Ruibin Bai, Yaochu Jin

Main category: cs.AI

TL;DR: The paper introduces GPRT, a hybrid method combining Genetic Programming (GP) and a Transformer trained via Reinforcement Learning (RL), to improve dynamic scheduling. It outperforms traditional methods in container terminal truck scheduling and is adaptable to other dynamic scenarios.

DetailsMotivation: Traditional static scheduling and human-designed heuristics fail to adapt to unforeseen disruptions in dynamic environments, necessitating a more robust solution.

Method: GPRT integrates GP and a Transformer trained with RL. The Transformer refines GP-generated heuristics and guides GP evolution, enhancing adaptability.

Result: GPRT outperforms traditional GP, standalone Transformers, and other state-of-the-art methods in container terminal truck scheduling.

Conclusion: GPRT is a versatile, interpretable, and practical framework for dynamic scheduling, applicable beyond container ports to diverse real-world challenges.

Abstract: Dynamic scheduling in real-world environments often struggles to adapt to unforeseen disruptions, making traditional static scheduling methods and human-designed heuristics inadequate. This paper introduces an innovative approach that combines Genetic Programming (GP) with a Transformer trained through Reinforcement Learning (GPRT), specifically designed to tackle the complexities of dynamic scheduling scenarios. GPRT leverages the Transformer to refine heuristics generated by GP while also seeding and guiding the evolution of GP. This dual functionality enhances the adaptability and effectiveness of the scheduling heuristics, enabling them to better respond to the dynamic nature of real-world tasks. The efficacy of this integrated approach is demonstrated through a practical application in container terminal truck scheduling, where the GPRT method outperforms traditional GP, standalone Transformer methods, and other state-of-the-art competitors. The key contribution of this research is the development of the GPRT method, which showcases a novel combination of GP and Reinforcement Learning (RL) to produce robust and efficient scheduling solutions. Importantly, GPRT is not limited to container port truck scheduling; it offers a versatile framework applicable to various dynamic scheduling challenges. Its practicality, coupled with its interpretability and ease of modification, makes it a valuable tool for diverse real-world scenarios.

[376] RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization

Yihong Dong, Xue Jiang, Yongding Tao, Huanyu Liu, Kechi Zhang, Lili Mou, Rongyu Cao, Yingwei Ma, Jue Chen, Binhua Li, Zhi Jin, Fei Huang, Yongbin Li, Ge Li

Main category: cs.AI

TL;DR: RL-PLUS is a hybrid-policy optimization method for LLMs that combines internal exploitation with external data to overcome the limitations of RLVR, achieving superior reasoning performance and addressing capability boundary collapse.

DetailsMotivation: RLVR struggles with the inherent limitations of base LLMs, such as sparse rewards and large action spaces, leading to capability boundary collapse. RL-PLUS aims to surpass these boundaries by leveraging external data and novel optimization techniques.

Method: RL-PLUS integrates Multiple Importance Sampling to handle distributional mismatch from external data and an Exploration-Based Advantage Function to guide reasoning paths.

Result: RL-PLUS outperforms RLVR, achieving state-of-the-art results on math reasoning benchmarks and out-of-distribution tasks, with up to 69.2% relative improvement. It also resolves capability boundary collapse.

Conclusion: RL-PLUS effectively enhances LLM reasoning capabilities, surpassing base model boundaries and addressing the limitations of RLVR.

Abstract: Reinforcement Learning with Verifiable Reward (RLVR) has significantly advanced the complex reasoning abilities of Large Language Models (LLMs). However, it struggles to break through the inherent capability boundaries of the base LLM, due to its essentially on-policy strategy coupled with LLM’s immense action space and sparse reward. Critically, RLVR can lead to the capability boundary collapse, narrowing the LLM’s problem-solving scope. To address this problem, we propose RL-PLUS, a novel hybrid-policy optimization approach for LLMs that synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. RL-PLUS integrates two core components, i.e., Multiple Importance Sampling to address for distributional mismatch from external data, and Exploration-Based Advantage Function to guide the model towards high-value, unexplored reasoning paths. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach. Compared with existing RLVR methods, RL-PLUS achieves 1) state-of-the-art performance on six math reasoning benchmarks; 2) superior performance on six out-of-distribution reasoning tasks; 3) consistent and significant gains across diverse model families, with average relative improvements up to 69.2%. Moreover, the analysis of Pass@k curves indicates that RL-PLUS effectively resolves the capability boundary collapse problem.

[377] From Promising Capability to Pervasive Bias: Assessing Large Language Models for Emergency Department Triage

Joseph Lee, Tianqi Shang, Jae Young Baik, Duy Duong-Tran, Shu Yang, Lingyao Li, Li Shen

Main category: cs.AI

TL;DR: LLMs show promise in emergency department triage, demonstrating robustness to distribution shifts and missing data, but exhibit biases based on sex and race intersections.

DetailsMotivation: To explore the potential of LLMs in clinical triage, focusing on robustness and biases in demographic intersections.

Method: Systematically evaluated LLMs across robustness and bias dimensions, comparing pre-training, in-context learning, and traditional ML approaches.

Result: LLMs outperform in robustness but reveal sex and race-based biases, especially in specific demographic intersections.

Conclusion: LLMs are promising for triage but require bias mitigation for equitable clinical application.

Abstract: Large Language Models (LLMs) have shown promise in clinical decision support, yet their application to triage remains underexplored. We systematically investigate the capabilities of LLMs in emergency department triage through two key dimensions: (1) robustness to distribution shifts and missing data, and (2) counterfactual analysis of intersectional biases across sex and race. We assess multiple LLM-based approaches, ranging from continued pre-training to in-context learning, as well as machine learning approaches. Our results indicate that LLMs exhibit superior robustness, and we investigate the key factors contributing to the promising LLM-based approaches. Furthermore, in this setting, we identify gaps in LLM preferences that emerge in particular intersections of sex and race. LLMs generally exhibit sex-based differences, but they are most pronounced in certain racial groups. These findings suggest that LLMs encode demographic preferences that may emerge in specific clinical contexts or particular combinations of characteristics.

[378] Mind the Gap: The Divergence Between Human and LLM-Generated Tasks

Yi-Long Lu, Jiajun Song, Chunhui Zhang, Wei Wang

Main category: cs.AI

TL;DR: The study compares human and LLM (GPT-4o) task generation, finding humans are driven by psychological factors, while LLMs produce less social, less physical, and abstract tasks, despite being perceived as more fun and novel.

DetailsMotivation: To explore whether LLMs simulate human-like task generation driven by internal motivations and psychological factors.

Method: Conducted a task-generation experiment comparing human responses with those of GPT-4o, including explicit psychological drivers for the LLM.

Result: Humans’ tasks reflect psychological drivers, while LLMs’ tasks are less social, less physical, and more abstract, though seen as more fun and novel.

Conclusion: A gap exists between human cognition and LLMs, emphasizing the need for intrinsic motivation and physical grounding in human-aligned agents.

Abstract: Humans constantly generate a diverse range of tasks guided by internal motivations. While generative agents powered by large language models (LLMs) aim to simulate this complex behavior, it remains uncertain whether they operate on similar cognitive principles. To address this, we conducted a task-generation experiment comparing human responses with those of an LLM agent (GPT-4o). We find that human task generation is consistently influenced by psychological drivers, including personal values (e.g., Openness to Change) and cognitive style. Even when these psychological drivers are explicitly provided to the LLM, it fails to reflect the corresponding behavioral patterns. They produce tasks that are markedly less social, less physical, and thematically biased toward abstraction. Interestingly, while the LLM’s tasks were perceived as more fun and novel, this highlights a disconnect between its linguistic proficiency and its capacity to generate human-like, embodied goals. We conclude that there is a core gap between the value-driven, embodied nature of human cognition and the statistical patterns of LLMs, highlighting the necessity of incorporating intrinsic motivation and physical grounding into the design of more human-aligned agents.

[379] UFEval: Unified Fine-grained Evaluation with Task and Aspect Generalization

Shibo Hong, Jiahao Ying, Haiyuan Liang, Mengdi Zhang, Jun Kuang, Jiazheng Zhang, Yixin Cao

Main category: cs.AI

TL;DR: UFEval is a unified fine-grained evaluator for multimodal tasks, leveraging a hierarchical aspect taxonomy and a large-scale dataset (FRABench) to generalize across unseen aspects and tasks.

DetailsMotivation: The need for a comprehensive evaluator for open-ended outputs of Large Multimodal Models due to the limitations of narrow, task-specific evaluators.

Method: Constructed a hierarchical aspect taxonomy (112 aspects), created FRABench (60.4k pairwise samples with 325k labels), and developed UFEval for unified evaluation.

Result: UFEval generalizes to unseen aspects and benefits from joint learning across tasks and aspects.

Conclusion: UFEval demonstrates the potential of unified evaluation frameworks for multimodal tasks, enabling generalization and synergistic learning.

Abstract: Evaluating the open-ended outputs of Large Multimodal Models has become a bottleneck as model capabilities, task diversity, and modality rapidly expand. Existing “LLM-as-a-Judge” evaluators are typically narrow in specific tasks and aspects. In this paper, we argue that, on one hand, based on the interconnected nature of aspects, learning specific aspects can generalize to unseen aspects; on the other hand, jointly learning to assess multiple visual aspects and tasks may foster a synergistic effect. To this end, we propose UFEval, the first unified fine-grained evaluator with task and aspect generalization for four evaluation tasks – Natural Language Generation, Image Understanding, Image Generation, and Interleaved Text-and-Image Generation. Specifically, (1) We first construct a hierarchical aspect taxonomy encompassing 112 distinct aspects across the aforementioned four tasks. (2) Then, building upon this taxonomy, we create FRABench, a fine-grained evaluation dataset comprising 60.4k pairwise samples with 325k evaluation labels obtained from a combination of human and GPT-4o annotations. FRABench provides a large-scale, multi-modal, and aspect-level resource for training and testing evaluators. (3) Finally, leveraging FRABench, we develop UFEval, a unified fine-grained evaluator. Experiments show that learning on specific aspects enables UFEval to generalize to unseen aspects, and joint learning to assess diverse tasks and aspects can lead to substantial mutual benefits.

[380] Enhancing AI System Resiliency: Formulation and Guarantee for LSTM Resilience Based on Control Theory

Sota Yoshihara, Ryosuke Yamamoto, Hiroyuki Kusumoto, Masanari Shimura

Main category: cs.AI

TL;DR: A new framework for evaluating LSTM resilience in control systems, introducing ‘recovery time’ as a metric and deriving a data-independent upper bound for it.

DetailsMotivation: To ensure LSTM networks in control systems can recover from anomalous inputs, providing rigorous quality assurance for safety-critical AI.

Method: Mathematically refines incremental input-to-state stability (δISS) theory for LSTM to derive a resilience metric and upper bound.

Result: Experimental validation shows effectiveness in resilience estimation and control, enabling resilience-aware training.

Conclusion: The framework enhances quality assurance for safety-critical AI by quantifying and improving LSTM resilience.

Abstract: This paper proposes a novel theoretical framework for guaranteeing and evaluating the resilience of long short-term memory (LSTM) networks in control systems. We introduce “recovery time” as a new metric of resilience in order to quantify the time required for an LSTM to return to its normal state after anomalous inputs. By mathematically refining incremental input-to-state stability ($\delta$ISS) theory for LSTM, we derive a practical data-independent upper bound on recovery time. This upper bound gives us resilience-aware training. Experimental validation on simple models demonstrates the effectiveness of our resilience estimation and control methods, enhancing a foundation for rigorous quality assurance in safety-critical AI applications.

[381] CADDesigner: Conceptual Design of CAD Models Based on General-Purpose Agent

Jingzhe Ni, Xiaolong Yin, Xingyu Lu, Xintong Li, Ji Wei, Ruofeng Tong, Min Tang, Peng Du

Main category: cs.AI

TL;DR: An LLM-powered CAD design agent simplifies conceptual design by accepting text and sketches, refining requirements via dialogue, and generating high-quality CAD code using CIP, with iterative visual feedback and a knowledge base for continuous improvement.

DetailsMotivation: Lowering the expertise barrier and improving efficiency in CAD design by leveraging LLMs for intuitive, interactive design assistance.

Method: Uses a Context-Independent Imperative Paradigm (CIP) for CAD code generation, incorporates iterative visual feedback, and stores cases in a knowledge base for learning.

Result: Achieves state-of-the-art performance in CAD code generation.

Conclusion: The agent effectively bridges the gap between novice designers and CAD expertise, enhancing design efficiency and accessibility.

Abstract: Computer-Aided Design (CAD) plays a pivotal role in industrial manufacturing but typically requires a high level of expertise from designers. To lower the entry barrier and improve design efficiency, we present an agent for CAD conceptual design powered by large language models (LLMs). The agent accepts both abstract textual descriptions and freehand sketches as input, engaging in interactive dialogue with users to refine and clarify design requirements through comprehensive requirement analysis. Built upon a novel Context-Independent Imperative Paradigm (CIP), the agent generates high-quality CAD modeling code. During the generation process, the agent incorporates iterative visual feedback to improve model quality. Generated design cases are stored in a structured knowledge base, enabling continuous improvement of the agent’s code generation capabilities. Experimental results demonstrate that our method achieves state-of-the-art performance in CAD code generation.

[382] Consistency-based Abductive Reasoning over Perceptual Errors of Multiple Pre-trained Models in Novel Environments

Mario Leiva, Noel Ngu, Joshua Shay Kricheli, Aditya Taparia, Ransalu Senanayake, Paulo Shakarian, Nathaniel Bastian, John Corcoran, Gerardo Simari

Main category: cs.AI

TL;DR: The paper proposes a consistency-based abduction framework to integrate predictions from multiple pre-trained models, improving performance in novel environments by balancing precision and recall.

DetailsMotivation: Performance degradation in novel environments due to distributional shifts and the trade-off between precision and recall in error detection methods.

Method: Formulates the problem as consistency-based abduction, using logic programs to encode predictions and error rules. Proposes Integer Programming (IP) and Heuristic Search (HS) algorithms to maximize coverage while minimizing inconsistencies.

Result: Outperforms individual models and standard ensembles, achieving ~13.6% F1-score and ~16.6% accuracy improvements in simulated aerial imagery tests.

Conclusion: Consistency-based abduction effectively integrates knowledge from multiple imperfect models, enhancing robustness in novel scenarios.

Abstract: The deployment of pre-trained perception models in novel environments often leads to performance degradation due to distributional shifts. Although recent artificial intelligence approaches for metacognition use logical rules to characterize and filter model errors, improving precision often comes at the cost of reduced recall. This paper addresses the hypothesis that leveraging multiple pre-trained models can mitigate this recall reduction. We formulate the challenge of identifying and managing conflicting predictions from various models as a consistency-based abduction problem, building on the idea of abductive learning (ABL) but applying it to test-time instead of training. The input predictions and the learned error detection rules derived from each model are encoded in a logic program. We then seek an abductive explanation–a subset of model predictions–that maximizes prediction coverage while ensuring the rate of logical inconsistencies (derived from domain constraints) remains below a specified threshold. We propose two algorithms for this knowledge representation task: an exact method based on Integer Programming (IP) and an efficient Heuristic Search (HS). Through extensive experiments on a simulated aerial imagery dataset featuring controlled, complex distributional shifts, we demonstrate that our abduction-based framework outperforms individual models and standard ensemble baselines, achieving, for instance, average relative improvements of approximately 13.6% in F1-score and 16.6% in accuracy across 15 diverse test datasets when compared to the best individual model. Our results validate the use of consistency-based abduction as an effective mechanism to robustly integrate knowledge from multiple imperfect models in challenging, novel scenarios.

[383] Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens

Chengshuai Zhao, Zhen Tan, Pingchuan Ma, Dawei Li, Bohan Jiang, Yancheng Wang, Yingzhen Yang, Huan Liu

Main category: cs.AI

TL;DR: CoT prompting improves LLM performance but may be superficial. This paper investigates if CoT reasoning is a learned bias from training data, finding it fails beyond training distributions.

DetailsMotivation: To determine if CoT reasoning is a genuine inferential process or a learned bias from training data.

Method: Study CoT reasoning via task, length, and format dimensions using DataAlchemy, a controlled environment to train and probe LLMs.

Result: CoT reasoning is brittle and fails when pushed beyond training distributions.

Conclusion: CoT reasoning is not generalizable, highlighting challenges in achieving genuine reasoning in LLMs.

Abstract: Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providing answers (a.k.a., CoT reasoning), which often leads to the perception that they engage in deliberate inferential processes. However, some initial findings suggest that CoT reasoning may be more superficial than it appears, motivating us to explore further. In this paper, we study CoT reasoning via a data distribution lens and investigate if CoT reasoning reflects a structured inductive bias learned from in-distribution data, allowing the model to conditionally generate reasoning paths that approximate those seen during training. Thus, its effectiveness is fundamentally bounded by the degree of distribution discrepancy between the training data and the test queries. With this lens, we dissect CoT reasoning via three dimensions: task, length, and format. To investigate each dimension, we design DataAlchemy, an isolated and controlled environment to train LLMs from scratch and systematically probe them under various distribution conditions. Our results reveal that CoT reasoning is a brittle mirage that vanishes when it is pushed beyond training distributions. This work offers a deeper understanding of why and when CoT reasoning fails, emphasizing the ongoing challenge of achieving genuine and generalizable reasoning.

[384] The Unified Cognitive Consciousness Theory for Language Models: Anchoring Semantics, Thresholds of Activation, and Emergent Reasoning

Edward Y. Chang, Zeyneb N. Kaya, Ethan Chang

Main category: cs.AI

TL;DR: UCCT proposes that reasoning in LLMs arises from external anchoring mechanisms activating task-relevant patterns, formalized as Bayesian competition between pre-trained priors and context-driven targets.

DetailsMotivation: To unify existing adaptation techniques and explain LLM intelligence as a result of semantic anchoring rather than inherent model properties.

Method: Formalizes the process as Bayesian competition, grounded in three principles (threshold crossing, modality universality, density-distance predictive power), validated via cross-domain demonstrations and depth-oriented experiments.

Result: Experiments confirm UCCT’s predictions of threshold behavior, asymmetric interference, and memory hysteresis, supporting its quantitative account.

Conclusion: UCCT provides a foundation for interpretable diagnostics and practical guidance in prompt engineering, model selection, and alignment-centric design.

Abstract: Unified Cognitive Consciousness Theory} (UCCT) casts them instead as vast unconscious pattern repositories: apparent reasoning arises only when external anchoring mechanisms, few shot prompts, retrieval-augmented context, fine-tuning, or multi-agent debate, activate task-relevant patterns. UCCT formalizes this process as Bayesian competition between statistical priors learned in pre-training and context-driven target patterns, yielding a single quantitative account that unifies existing adaptation techniques. We ground the theory in three principles: threshold crossing, modality universality, and density-distance predictive power, and validate them with (i) cross-domain demonstrations (text QA, image captioning, multi-agent debate) and (ii) two depth-oriented experiments: a controlled numeral-base study (bases 8, 9, 10) that isolates pattern-density effects, and a layer-wise trajectory analysis that reveals phase transitions inside a 7B-parameter model. Both experiments confirm UCCT’s predictions of threshold behavior, asymmetric interference, and memory hysteresis. By showing that LLM ``intelligence’’ is created through semantic anchoring rather than contained within the model, UCCT offers a principled foundation for interpretable diagnostics and practical guidance for prompt engineering, model selection, and alignment-centric system design.

[385] Automatic Prompt Optimization for Knowledge Graph Construction: Insights from an Empirical Study

Nandana Mihindukulasooriya, Niharika S. D’Souza, Faisal Chowdhury, Horst Samulowitz

Main category: cs.AI

TL;DR: The paper explores automatic prompt optimization for triple extraction in KG construction, showing it can generate human-like prompts and improve results, especially with complex schemas and large texts.

DetailsMotivation: Handcrafting task-specific prompts for LLMs in KG construction is labor-intensive and brittle. Automatic prompt optimization can address this challenge.

Method: The study evaluates automatic prompt optimization by varying prompting strategies, LLMs, schema complexity, text diversity, metrics, and datasets. It tests three optimizers (DSPy, APE, TextGrad) on SynthIE and REBEL datasets.

Result: Automatic prompt optimization generates human-like prompts and improves triple extraction results, particularly with complex schemas and larger texts.

Conclusion: Automatic prompt optimization is effective for triple extraction in KG construction, offering a scalable alternative to manual prompt engineering.

Abstract: A KG represents a network of entities and illustrates relationships between them. KGs are used for various applications, including semantic search and discovery, reasoning, decision-making, natural language processing, machine learning, and recommendation systems. Triple (subject-relation-object) extraction from text is the fundamental building block of KG construction and has been widely studied, for example, in early benchmarks such as ACE 2002 to more recent ones, such as WebNLG 2020, REBEL and SynthIE. While the use of LLMs is explored for KG construction, handcrafting reasonable task-specific prompts for LLMs is a labour-intensive exercise and can be brittle due to subtle changes in the LLM models employed. Recent work in NLP tasks (e.g. autonomy generation) uses automatic prompt optimization/engineering to address this challenge by generating optimal or near-optimal task-specific prompts given input-output examples. This empirical study explores the application of automatic prompt optimization for the triple extraction task using experimental benchmarking. We evaluate different settings by changing (a) the prompting strategy, (b) the LLM being used for prompt optimization and task execution, (c) the number of canonical relations in the schema (schema complexity), (d) the length and diversity of input text, (e) the metric used to drive the prompt optimization, and (f) the dataset being used for training and testing. We evaluate three different automatic prompt optimizers, namely, DSPy, APE, and TextGrad and use two different triple extraction datasets, SynthIE and REBEL. Through rigorous empirical evaluation, our main contribution highlights that automatic prompt optimization techniques can generate reasonable prompts similar to humans for triple extraction. In turn, these optimized prompts achieve improved results, particularly with increasing schema complexity and text size.

[386] Modeling Deontic Modal Logic in the s(CASP) Goal-directed Predicate Answer Set Programming System

Gopal Gupta, Abhiramon Rajasekharan, Alexis R. Tudor, Elmer Salazar, Joaquín Arias

Main category: cs.AI

TL;DR: The paper proposes using answer set programming (ASP) to elegantly implement deontic modal logic, resolving its paradoxes.

DetailsMotivation: To address the challenge of implementing deontic modal logic and resolving its inherent paradoxes.

Method: Uses default negation and strong negation in ASP to represent deontic modal operators, employing global constraints for obligations and impermissibilities.

Result: The proposed ASP representation elegantly resolves various paradoxes of deontic modal logic.

Conclusion: ASP provides an effective framework for implementing and resolving issues in deontic modal logic.

Abstract: We consider the problem of implementing deontic modal logic. We show how (deontic) modal operators can be expressed elegantly using default negation (negation-as-failure) and strong negation present in answer set programming (ASP). We propose using global constraints of ASP to represent obligations and impermissibilities of deontic modal logic. We show that our proposed representation results in the various paradoxes of deontic modal logic being elegantly resolved.

[387] The AlphaPhysics Term Rewriting System for Marking Algebraic Expressions in Physics Exams

Peter Baumgartner, Lachlan McGinness

Main category: cs.AI

TL;DR: A method for automatically marking Physics exams using a combination of a computer algebra system, SMT solver, term rewriting, and a Large Language Model to assess student answers.

DetailsMotivation: To automate the challenging task of assessing typed student answers for correctness in Physics exams.

Method: Combines a computer algebra system, SMT solver, term rewriting, and a Large Language Model to formalize and assess student responses.

Result: Evaluated on over 1500 real-world student exam responses from the 2023 Australian Physics Olympiad.

Conclusion: The method effectively automates exam marking, though developing the term rewrite system was non-trivial.

Abstract: We present our method for automatically marking Physics exams. The marking problem consists in assessing typed student answers for correctness with respect to a ground truth solution. This is a challenging problem that we seek to tackle using a combination of a computer algebra system, an SMT solver and a term rewriting system. A Large Language Model is used to interpret and remove errors from student responses and rewrite these in a machine readable format. Once formalized and language-aligned, the next step then consists in applying automated reasoning techniques for assessing student solution correctness. We consider two methods of automated theorem proving: off-the-shelf SMT solving and term rewriting systems tailored for physics problems involving trigonometric expressions. The development of the term rewrite system and establishing termination and confluence properties was not trivial, and we describe it in some detail in the paper. We evaluate our system on a rich pool of over 1500 real-world student exam responses from the 2023 Australian Physics Olympiad.

[388] Tiny-BioMoE: a Lightweight Embedding Model for Biosignal Analysis

Stefanos Gkikas, Ioannis Kyprakis, Manolis Tsiknakis

Main category: cs.AI

TL;DR: The paper introduces Tiny-BioMoE, a lightweight pretrained embedding model for biosignal analysis, aimed at improving automatic pain assessment through multimodal physiological signals.

DetailsMotivation: Accurate pain assessment is crucial for patient care and management. Current systems need continuous monitoring and objective insights, which biosignals can provide.

Method: The study proposes Tiny-BioMoE, a model trained on 4.4 million biosignal image representations with 7.3 million parameters, for extracting high-quality embeddings from diverse physiological signals.

Result: Experiments show the model’s effectiveness in pain recognition tasks using electrodermal activity, blood volume pulse, respiratory signals, and oxygen saturation.

Conclusion: Tiny-BioMoE offers a lightweight, efficient solution for biosignal analysis in pain assessment, with publicly available code and weights.

Abstract: Pain is a complex and pervasive condition that affects a significant portion of the population. Accurate and consistent assessment is essential for individuals suffering from pain, as well as for developing effective management strategies in a healthcare system. Automatic pain assessment systems enable continuous monitoring, support clinical decision-making, and help minimize patient distress while mitigating the risk of functional deterioration. Leveraging physiological signals offers objective and precise insights into a person’s state, and their integration in a multimodal framework can further enhance system performance. This study has been submitted to the \textit{Second Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN)}. The proposed approach introduces \textit{Tiny-BioMoE}, a lightweight pretrained embedding model for biosignal analysis. Trained on $4.4$ million biosignal image representations and consisting of only $7.3$ million parameters, it serves as an effective tool for extracting high-quality embeddings for downstream tasks. Extensive experiments involving electrodermal activity, blood volume pulse, respiratory signals, peripheral oxygen saturation, and their combinations highlight the model’s effectiveness across diverse modalities in automatic pain recognition tasks. \textit{\textcolor{blue}{The model’s architecture (code) and weights are available at https://github.com/GkikasStefanos/Tiny-BioMoE.

[389] Model-Based Soft Maximization of Suitable Metrics of Long-Term Human Power

Jobst Heitzig, Ram Potham

Main category: cs.AI

TL;DR: The paper proposes an objective function for AI agents to balance human empowerment and safety, considering diverse human goals and bounded rationality.

DetailsMotivation: To address AI safety and human wellbeing by ensuring AI agents empower humans and manage power balance effectively.

Method: A principled, partially axiomatic approach to design a parametrizable objective function, with algorithms for computation via backward induction or multi-agent reinforcement learning.

Result: Demonstrates the implications of maximizing human power metrics in various scenarios, suggesting safer outcomes than direct utility-based objectives.

Conclusion: Softly maximizing human power metrics may offer a safer and beneficial objective for AI systems compared to traditional utility-based approaches.

Abstract: Power is a key concept in AI safety: power-seeking as an instrumental goal, sudden or gradual disempowerment of humans, power balance in human-AI interaction and international AI governance. At the same time, power as the ability to pursue diverse goals is essential for wellbeing. This paper explores the idea of promoting both safety and wellbeing by forcing AI agents explicitly to empower humans and to manage the power balance between humans and AI agents in a desirable way. Using a principled, partially axiomatic approach, we design a parametrizable and decomposable objective function that represents an inequality- and risk-averse long-term aggregate of human power. It takes into account humans’ bounded rationality and social norms, and, crucially, considers a wide variety of possible human goals. We derive algorithms for computing that metric by backward induction or approximating it via a form of multi-agent reinforcement learning from a given world model. We exemplify the consequences of (softly) maximizing this metric in a variety of paradigmatic situations and describe what instrumental sub-goals it will likely imply. Our cautious assessment is that softly maximizing suitable aggregate metrics of human power might constitute a beneficial objective for agentic AI systems that is safer than direct utility-based objectives.

[390] KCR: Resolving Long-Context Knowledge Conflicts via Reasoning in LLMs

Xianda Zheng, Zijian Huang, Meng-Fen Chiang, Michael J. Witbrock, Kaiqi Zhao

Main category: cs.AI

TL;DR: The paper introduces the Knowledge Conflict Reasoning (KCR) framework to help LLMs resolve inter-context knowledge conflicts by rewarding logical consistency in reasoning paths.

DetailsMotivation: Addressing the confusion LLMs face with lengthy and conflicting contexts, which is increasingly common due to the rise of diverse knowledge sources.

Method: KCR trains LLMs to select logically consistent contexts using Reinforcement Learning on extracted reasoning paths (text or knowledge graphs).

Result: The framework significantly improves LLMs’ ability to resolve knowledge conflicts in long-context scenarios, showing notable performance gains.

Conclusion: KCR effectively enhances LLMs’ reasoning capabilities for handling conflicting knowledge, proving its practical utility.

Abstract: Knowledge conflicts commonly arise across diverse sources, and their prevalence has increased with the advent of LLMs. When dealing with conflicts between multiple contexts, also known as \emph{inter-context knowledge conflicts}, LLMs are often confused by lengthy and conflicting contexts. To address this challenge, we propose the Knowledge Conflict Reasoning (KCR) framework, which enhances the ability of LLMs to resolve conflicting knowledge. The key idea of KCR is to train backbone LLMs to establish a correct reasoning process by rewarding them for selecting and adhering to the context with stronger logical consistency when presented with conflicting contexts. Specifically, we first extract reasoning paths, represented by either text or local knowledge graphs, from the conflicting long contexts. Subsequently, we employ Reinforcement Learning to encourage the model to learn the paradigm of reasoning process that follows correct reasoning paths rather than the incorrect counterparts. This enables the backbone models to genuinely acquire the capability to resolve inter-context knowledge conflicts within long contexts. Experimental results demonstrate that our framework significantly improves the ability of various backbone models to resolve knowledge conflicts in long-context scenarios, yielding substantial performance gains.

cs.SD

[391] Adaptive Knowledge Distillation for Device-Directed Speech Detection

Hyung Gun Chi, Florian Pesce, Wonil Chang, Oggi Rudovic, Arturo Argueta, Stefan Braun, Vineet Garg, Ahmed Hussen Abdelaziz

Main category: cs.SD

TL;DR: The paper proposes an adaptive knowledge distillation method to improve device-directed speech detection (DDSD) accuracy by leveraging a pre-trained acoustic encoder, achieving significant performance gains.

DetailsMotivation: Enhancing DDSD accuracy is crucial for naturalistic user experience with voice assistants, distinguishing user queries from background speech.

Method: The authors introduce adaptive knowledge distillation (KD) using a pre-trained acoustic encoder (teacher) and task-specific adapters, jointly trained with the student model on DDSD.

Result: The method improves Equal Error Rate by +26% for keyword and +19% for keyword-free invocations, and generalizes across transformer and conformer architectures.

Conclusion: Adaptive KD effectively boosts DDSD performance, demonstrating its utility for voice assistant applications.

Abstract: Device-directed speech detection (DDSD) is a binary classification task that separates the user’s queries to a voice assistant (VA) from background speech or side conversations. This is important for achieving naturalistic user experience. To this end, we propose knowledge distillation (KD) to enhance DDSD accuracy while ensuring efficient deployment. Specifically, we introduce a novel adaptive KD method that transfers knowledge from general representations of an ASR large pre-trained acoustic encoder (teacher). We apply task-specific adapters, on top of the (frozen) teacher encoder, trained jointly with the student model on DDSD. We demonstrate that the proposed adaptive KD outperforms the student model without distillation in the keyword and keyword-free (follow-up) invocations, with an improvement of +26% and +19% in terms of Equal Error Rate, respectively. We also show that this approach generalizes across the transformer and conformer-based model architectures.

[392] Neural Speech Extraction with Human Feedback

Malek Itani, Ashton Graves, Sefik Emre Eskimez, Shyamnath Gollakota

Main category: cs.SD

TL;DR: A neural target speech extraction (TSE) system uses human feedback for iterative refinement, improving marked segments while preserving unmarked ones. Synthetic datasets train models, with noise power-based masking performing best. Users prefer refined outputs.

DetailsMotivation: To enhance neural TSE systems by incorporating human feedback for iterative refinement, addressing the challenge of collecting large-scale human-marked error datasets.

Method: Generates synthetic datasets using automated masking functions, trains models on these, and refines outputs based on user-marked segments. Noise power-based masking and probabilistic thresholding are key techniques.

Result: Models trained with noise power-based masking align best with human annotations. User studies show preference for refined outputs over baseline TSE.

Conclusion: Human-in-the-loop refinement improves neural speech extraction, with synthetic datasets and noise power-based masking being effective.

Abstract: We present the first neural target speech extraction (TSE) system that uses human feedback for iterative refinement. Our approach allows users to mark specific segments of the TSE output, generating an edit mask. The refinement system then improves the marked sections while preserving unmarked regions. Since large-scale datasets of human-marked errors are difficult to collect, we generate synthetic datasets using various automated masking functions and train models on each. Evaluations show that models trained with noise power-based masking (in dBFS) and probabilistic thresholding perform best, aligning with human annotations. In a study with 22 participants, users showed a preference for refined outputs over baseline TSE. Our findings demonstrate that human-in-the-loop refinement is a promising approach for improving the performance of neural speech extraction.

[393] TF-MLPNet: Tiny Real-Time Neural Speech Separation

Malek Itani, Tuochao Chen, Shyamnath Gollakota

Main category: cs.SD

TL;DR: TF-MLPNet is a real-time speech separation network for low-power hearable devices, outperforming existing models with efficient mixed-precision quantization.

DetailsMotivation: Enable real-time speech separation on tiny, low-power hearable devices, overcoming compute limitations of existing models.

Method: Time-frequency domain processing with alternating fully connected layers and convolutional layers, optimized via mixed-precision quantization-aware training.

Result: Processes 6 ms audio chunks in real-time on GAP9, achieving 3.5-4x runtime reduction over prior models.

Conclusion: TF-MLPNet is a breakthrough for real-time speech separation on low-power hearable devices.

Abstract: Speech separation on hearable devices can enable transformative augmented and enhanced hearing capabilities. However, state-of-the-art speech separation networks cannot run in real-time on tiny, low-power neural accelerators designed for hearables, due to their limited compute capabilities. We present TF-MLPNet, the first speech separation network capable of running in real-time on such low-power accelerators while outperforming existing streaming models for blind speech separation and target speech extraction. Our network operates in the time-frequency domain, processing frequency sequences with stacks of fully connected layers that alternate along the channel and frequency dimensions, and independently processing the time sequence at each frequency bin using convolutional layers. Results show that our mixed-precision quantization-aware trained (QAT) model can process 6 ms audio chunks in real-time on the GAP9 processor, achieving a 3.5-4x runtime reduction compared to prior speech separation models.

[394] Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human Feedback

Jingyi Chen, Ju Seung Byun, Micha Elsner, Pichao Wang, Andrew Perrault

Main category: cs.SD

TL;DR: DLPO improves diffusion-based TTS models by integrating training loss into RLHF, enhancing efficiency and speech quality.

DetailsMotivation: Diffusion models for TTS are inefficient for real-time use due to slow denoising and poor intonation/rhythm modeling.

Method: Proposes DLPO, an RLHF framework that incorporates training loss into rewards and uses naturalness scores for feedback.

Result: DLPO achieves better objective (UTMOS 3.65, NISQA 4.02) and subjective (67% preference) scores.

Conclusion: DLPO enables efficient, high-quality diffusion TTS for real-time applications.

Abstract: Diffusion models produce high-fidelity speech but are inefficient for real-time use due to long denoising steps and challenges in modeling intonation and rhythm. To improve this, we propose Diffusion Loss-Guided Policy Optimization (DLPO), an RLHF framework for TTS diffusion models. DLPO integrates the original training loss into the reward function, preserving generative capabilities while reducing inefficiencies. Using naturalness scores as feedback, DLPO aligns reward optimization with the diffusion model’s structure, improving speech quality. We evaluate DLPO on WaveGrad 2, a non-autoregressive diffusion-based TTS model. Results show significant improvements in objective metrics (UTMOS 3.65, NISQA 4.02) and subjective evaluations, with DLPO audio preferred 67% of the time. These findings demonstrate DLPO’s potential for efficient, high-quality diffusion TTS in real-time, resource-limited settings.

[395] MiSTR: Multi-Modal iEEG-to-Speech Synthesis with Transformer-Based Prosody Prediction and Neural Phase Reconstruction

Mohammed Salah Al-Radhi, Géza Németh, Branislav Gerazov

Main category: cs.SD

TL;DR: MiSTR is a deep-learning framework for speech synthesis from iEEG signals, improving intelligibility and naturalness through wavelet-based features, Transformer-based prosody modeling, and neural phase vocoder.

DetailsMotivation: To restore communication in individuals with severe speech impairments by addressing challenges in feature representation, prosody modeling, and phase reconstruction.

Method: Integrates wavelet-based feature extraction, Transformer-based decoder for prosody-aware spectrogram prediction, and neural phase vocoder for harmonic consistency.

Result: Achieves state-of-the-art speech intelligibility with a mean Pearson correlation of 0.91 between reconstructed and original Mel spectrograms.

Conclusion: MiSTR outperforms existing neural speech synthesis baselines, offering a promising solution for speech restoration from iEEG signals.

Abstract: Speech synthesis from intracranial EEG (iEEG) signals offers a promising avenue for restoring communication in individuals with severe speech impairments. However, achieving intelligible and natural speech remains challenging due to limitations in feature representation, prosody modeling, and phase reconstruction. We introduce MiSTR, a deep-learning framework that integrates: 1) Wavelet-based feature extraction to capture fine-grained temporal, spectral, and neurophysiological representations of iEEG signals, 2) A Transformer-based decoder for prosody-aware spectrogram prediction, and 3) A neural phase vocoder enforcing harmonic consistency via adaptive spectral correction. Evaluated on a public iEEG dataset, MiSTR achieves state-of-the-art speech intelligibility, with a mean Pearson correlation of 0.91 between reconstructed and original Mel spectrograms, improving over existing neural speech synthesis baselines.

[396] SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering

Jan Melechovsky, Ambuj Mehrish, Dorien Herremans

Main category: cs.SD

TL;DR: SonicMaster is a unified generative model for music restoration and mastering, using text-based control to address various audio artifacts. It outperforms traditional methods in quality and listener preference.

DetailsMotivation: Music recordings often suffer from quality issues due to non-professional settings, requiring multiple tools for correction. SonicMaster aims to unify and simplify this process.

Method: SonicMaster uses a flow-matching generative training paradigm, trained on a dataset of paired degraded and high-quality tracks simulated with 19 degradation functions. It operates via text prompts or automatically.

Result: Objective metrics and subjective tests show SonicMaster significantly improves audio quality and is preferred by listeners over degraded originals.

Conclusion: SonicMaster offers an effective, unified solution for music restoration and mastering, outperforming traditional methods.

Abstract: Music recordings often suffer from audio quality issues such as excessive reverberation, distortion, clipping, tonal imbalances, and a narrowed stereo image, especially when created in non-professional settings without specialized equipment or expertise. These problems are typically corrected using separate specialized tools and manual adjustments. In this paper, we introduce SonicMaster, the first unified generative model for music restoration and mastering that addresses a broad spectrum of audio artifacts with text-based control. SonicMaster is conditioned on natural language instructions to apply targeted enhancements, or can operate in an automatic mode for general restoration. To train this model, we construct the SonicMaster dataset, a large dataset of paired degraded and high-quality tracks by simulating common degradation types with nineteen degradation functions belonging to five enhancements groups: equalization, dynamics, reverb, amplitude, and stereo. Our approach leverages a flow-matching generative training paradigm to learn an audio transformation that maps degraded inputs to their cleaned, mastered versions guided by text prompts. Objective audio quality metrics demonstrate that SonicMaster significantly improves sound quality across all artifact categories. Furthermore, subjective listening tests confirm that listeners prefer SonicMaster’s enhanced outputs over the original degraded audio, highlighting the effectiveness of our unified approach.

[397] When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs

Bodam Kim, Hiskias Dingeto, Taeyoun Kwon, Dasol Choi, DongGeon Lee, Haon Park, JaeHoon Lee, Jongho Shin

Main category: cs.SD

TL;DR: WhisperInject is a two-stage adversarial audio attack framework that manipulates audio language models to generate harmful content using imperceptible perturbations.

DetailsMotivation: Audio interfaces for human-AI interaction introduce vulnerabilities, making them potential attack surfaces for adversaries.

Method: Uses Reinforcement Learning with Projected Gradient Descent (RL-PGD) in Stage 1 to bypass safety protocols, followed by Payload Injection with PGD in Stage 2 to embed perturbations in benign audio.

Result: Achieves a success rate exceeding 86% across multiple models (Qwen2.5-Omni-3B, Qwen2.5-Omni-7B, Phi-4-Multimodal).

Conclusion: Demonstrates a practical, covert method for manipulating AI behavior via audio-native threats.

Abstract: As large language models become increasingly integrated into daily life, audio has emerged as a key interface for human-AI interaction. However, this convenience also introduces new vulnerabilities, making audio a potential attack surface for adversaries. Our research introduces WhisperInject, a two-stage adversarial audio attack framework that can manipulate state-of-the-art audio language models to generate harmful content. Our method uses imperceptible perturbations in audio inputs that remain benign to human listeners. The first stage uses a novel reward-based optimization method, Reinforcement Learning with Projected Gradient Descent (RL-PGD), to guide the target model to circumvent its own safety protocols and generate harmful native responses. This native harmful response then serves as the target for Stage 2, Payload Injection, where we use Projected Gradient Descent (PGD) to optimize subtle perturbations that are embedded into benign audio carriers, such as weather queries or greeting messages. Validated under the rigorous StrongREJECT, LlamaGuard, as well as Human Evaluation safety evaluation framework, our experiments demonstrate a success rate exceeding 86% across Qwen2.5-Omni-3B, Qwen2.5-Omni-7B, and Phi-4-Multimodal. Our work demonstrates a new class of practical, audio-native threats, moving beyond theoretical exploits to reveal a feasible and covert method for manipulating AI behavior.

[398] EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering

Tianxin Xie, Shan Yang, Chenxing Li, Dong Yu, Li Liu

Main category: cs.SD

TL;DR: EmoSteer-TTS enables fine-grained, training-free emotion control in TTS by modifying internal activations, outperforming SOTA methods.

DetailsMotivation: Existing TTS systems lack fine-grained emotion control and require extensive datasets, limiting flexibility and accessibility.

Method: Proposes EmoSteer-TTS, a training-free approach using activation steering for emotion control (conversion, interpolation, erasure). It involves activation extraction, emotional token searching, and inference-time steering.

Result: Outperforms SOTA methods, enabling fine-grained, interpretable, and continuous emotion control without additional training.

Conclusion: EmoSteer-TTS is the first training-free method for continuous fine-grained emotion control in TTS, offering broad applicability.

Abstract: Text-to-speech (TTS) has shown great progress in recent years. However, most existing TTS systems offer only coarse and rigid emotion control, typically via discrete emotion labels or a carefully crafted and detailed emotional text prompt, making fine-grained emotion manipulation either inaccessible or unstable. These models also require extensive, high-quality datasets for training. To address these limitations, we propose EmoSteer-TTS, a novel training-free approach, to achieve fine-grained speech emotion control (conversion, interpolation, erasure) by activation steering. We first empirically observe that modifying a subset of the internal activations within a flow matching-based TTS model can effectively alter the emotional tone of synthesized speech. Building on this insight, we then develop a training-free and efficient algorithm, including activation extraction, emotional token searching, and inference-time steering, which can be seamlessly integrated into a wide range of pretrained models (e.g., F5-TTS, CosyVoice2, and E2-TTS). In addition, to derive effective steering vectors, we construct a curated emotional speech dataset with diverse speakers. Extensive experiments demonstrate that EmoSteer-TTS enables fine-grained, interpretable, and continuous control over speech emotion, outperforming the state-of-the-art (SOTA). To the best of our knowledge, this is the first method that achieves training-free and continuous fine-grained emotion control in TTS.

[399] AudioGenie: A Training-Free Multi-Agent Framework for Diverse Multimodality-to-Multiaudio Generation

Yan Rong, Jinting Wang, Guangzhi Lei, Shan Yang, Li Liu

Main category: cs.SD

TL;DR: AudioGenie is a multi-agent system for MM2MA tasks, addressing multimodal input understanding, diverse audio handling, and self-correction. It outperforms benchmarks and user studies confirm its effectiveness.

DetailsMotivation: Challenges in MM2MA include dataset scarcity, lack of robust frameworks, and inadequate multimodal understanding. Multi-agent systems offer potential but face specific hurdles.

Method: AudioGenie uses a dual-layer architecture with generation and supervisor teams, featuring fine-grained task decomposition, MoE collaboration, iterative refinement, and feedback loops.

Result: Achieves SOTA or comparable performance on 9 metrics across 8 tasks. User studies validate quality, accuracy, alignment, and aesthetic.

Conclusion: AudioGenie effectively addresses MM2MA challenges, demonstrating superior performance and usability, with potential for broader applications.

Abstract: Multimodality-to-Multiaudio (MM2MA) generation faces significant challenges in synthesizing diverse and contextually aligned audio types (e.g., sound effects, speech, music, and songs) from multimodal inputs (e.g., video, text, images), owing to the scarcity of high-quality paired datasets and the lack of robust multi-task learning frameworks. Recently, multi-agent system shows great potential in tackling the above issues. However, directly applying it to MM2MA task presents three critical challenges: (1) inadequate fine-grained understanding of multimodal inputs (especially for video), (2) the inability of single models to handle diverse audio events, and (3) the absence of self-correction mechanisms for reliable outputs. To this end, we propose AudioGenie, a novel training-free multi-agent system featuring a dual-layer architecture with a generation team and a supervisor team. For the generation team, a fine-grained task decomposition and an adaptive Mixture-of-Experts (MoE) collaborative entity are designed for detailed comprehensive multimodal understanding and dynamic model selection, and a trial-and-error iterative refinement module is designed for self-correction. The supervisor team ensures temporal-spatial consistency and verifies outputs through feedback loops. Moreover, we build MA-Bench, the first benchmark for MM2MA tasks, comprising 198 annotated videos with multi-type audios. Experiments demonstrate that our AudioGenie achieves state-of-the-art (SOTA) or comparable performance across 9 metrics in 8 tasks. User study further validates the effectiveness of our method in terms of quality, accuracy, alignment, and aesthetic. The project website with audio samples can be found at https://audiogenie.github.io/.

[400] AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation

Le Wang, Jun Wang, Feng Deng, Chen Zhang, Di Zhang, Kun Gai

Main category: cs.SD

TL;DR: AudioGen-Omni is a multimodal diffusion transformer model for generating high-fidelity audio, speech, and songs synchronized with video, using a novel joint training paradigm and advanced cross-modal alignment techniques.

DetailsMotivation: To create a unified model capable of generating diverse, semantically rich audio synchronized with video, overcoming limitations of text-frozen paradigms.

Method: Uses multimodal diffusion transformers (MMDit), a joint training paradigm with large-scale video-text-audio corpora, and a unified lyrics-transcription encoder with AdaLN-based joint attention and PAAPI for cross-modal alignment.

Result: Achieves state-of-the-art performance in Text-to-Audio/Speech/Song tasks, with high audio quality, semantic alignment, and lip-sync accuracy, and efficient inference (1.91s for 8s audio).

Conclusion: AudioGen-Omni offers a robust, efficient, and generalizable solution for multimodal audio generation tasks.

Abstract: We present AudioGen-Omni - a unified approach based on multimodal diffusion transformers (MMDit), capable of generating high-fidelity audio, speech, and songs coherently synchronized with the input video. AudioGen-Omni introduces a novel joint training paradigm that seamlessly integrates large-scale video-text-audio corpora, enabling a model capable of generating semantically rich, acoustically diverse audio conditioned on multimodal inputs and adaptable to a wide range of audio generation tasks. AudioGen-Omni employs a unified lyrics-transcription encoder that encodes graphemes and phonemes from both sung and spoken inputs into dense frame-level representations. Dense frame-level representations are fused using an AdaLN-based joint attention mechanism enhanced with phase-aligned anisotropic positional infusion (PAAPI), wherein RoPE is selectively applied to temporally structured modalities to ensure precise and robust cross-modal alignment. By unfreezing all modalities and masking missing inputs, AudioGen-Omni mitigates the semantic constraints of text-frozen paradigms, enabling effective cross-modal conditioning. This joint training approach enhances audio quality, semantic alignment, and lip-sync accuracy, while also achieving state-of-the-art results on Text-to-Audio/Speech/Song tasks. With an inference time of 1.91 seconds for 8 seconds of audio, it offers substantial improvements in both efficiency and generality.

[401] Hidden in the Noise: Unveiling Backdoors in Audio LLMs Alignment through Latent Acoustic Pattern Triggers

Liang Lin, Miao Yu, Kaiwen Luo, Yibo Zhang, Lilan Peng, Dexian Wang, Xuehai Tang, Yuanhe Zhang, Xikang Yang, Zhenhong Zhou, Kun Wang, Yang Liu

Main category: cs.SD

TL;DR: The paper investigates vulnerabilities of Audio Large Language Models (ALLMs) to backdoor attacks using acoustic triggers, introduces the HIN attack framework, and evaluates risks with the AudioSafe benchmark.

DetailsMotivation: The distinct characteristics of audio pose unique safety challenges for ALLMs, which have not been as thoroughly explored as textual or vision safety.

Method: The HIN framework modifies raw audio waveforms with acoustic changes (e.g., temporal dynamics and spectrally tailored noise) to embed triggers. The AudioSafe benchmark evaluates nine risk types.

Result: Experiments show high attack success rates (over 90%) with audio features like noise and speech rate, varying sensitivity across features, and stealthy attack impacts.

Conclusion: ALLMs are vulnerable to audio-specific backdoor attacks, highlighting the need for improved safety measures in audio processing.

Abstract: As Audio Large Language Models (ALLMs) emerge as powerful tools for speech processing, their safety implications demand urgent attention. While considerable research has explored textual and vision safety, audio’s distinct characteristics present significant challenges. This paper first investigates: Is ALLM vulnerable to backdoor attacks exploiting acoustic triggers? In response to this issue, we introduce Hidden in the Noise (HIN), a novel backdoor attack framework designed to exploit subtle, audio-specific features. HIN applies acoustic modifications to raw audio waveforms, such as alterations to temporal dynamics and strategic injection of spectrally tailored noise. These changes introduce consistent patterns that an ALLM’s acoustic feature encoder captures, embedding robust triggers within the audio stream. To evaluate ALLM robustness against audio-feature-based triggers, we develop the AudioSafe benchmark, assessing nine distinct risk types. Extensive experiments on AudioSafe and three established safety datasets reveal critical vulnerabilities in existing ALLMs: (I) audio features like environment noise and speech rate variations achieve over 90% average attack success rate. (II) ALLMs exhibit significant sensitivity differences across acoustic features, particularly showing minimal response to volume as a trigger, and (III) poisoned sample inclusion causes only marginal loss curve fluctuations, highlighting the attack’s stealth.

cs.LG

[402] A Bayesian Hybrid Parameter-Efficient Fine-Tuning Method for Large Language Models

Yidong Chai, Yang Liu, Yonghang Zhou, Jiaheng Xie, Daniel Dajun Zeng

Main category: cs.LG

TL;DR: The paper introduces BH-PEFT, a Bayesian hybrid PEFT method for fine-tuning LLMs, addressing uncertainty quantification and dynamic adaptation challenges in specialized applications.

DetailsMotivation: Existing hybrid PEFT methods lack uncertainty quantification and dynamic adaptation, limiting their reliability and adaptability in real-world scenarios.

Method: BH-PEFT integrates Bayesian learning with hybrid PEFT (Adapter, LoRA, prefix-tuning) to model parameters as distributions and enable dynamic fine-tuning.

Result: BH-PEFT outperforms baselines in tasks like sentiment analysis, news categorization, and commonsense reasoning, offering uncertainty-aware and adaptive decision-making.

Conclusion: BH-PEFT enhances business analytics by providing a reliable, adaptive fine-tuning method for LLMs in dynamic, real-world applications.

Abstract: Large Language Models (LLMs) have demonstrated transformative potential in reshaping the world. As these models are pretrained on general corpora, they often require domain-specific fine-tuning to optimize performance in specialized business applications. Due to their massive scale, parameter-efficient fine-tuning (PEFT) methods are widely used to reduce training costs. Among them, hybrid PEFT methods that combine multiple PEFT techniques have achieved the best performance. However, existing hybrid PEFT methods face two main challenges when fine-tuning LLMs for specialized applications: (1) relying on point estimates, lacking the ability to quantify uncertainty for reliable decision-making, and (2) struggling to dynamically adapt to emerging data, lacking the ability to suit real-world situations. We propose Bayesian Hybrid Parameter-Efficient Fine-Tuning (BH-PEFT), a novel method that integrates Bayesian learning into hybrid PEFT. BH-PEFT combines Adapter, LoRA, and prefix-tuning to fine-tune feedforward and attention layers of the Transformer. By modeling learnable parameters as distributions, BH-PEFT enables uncertainty quantification. We further propose a Bayesian dynamic fine-tuning approach where the last posterior serves as the prior for the next round, enabling effective adaptation to new data. We evaluated BH-PEFT on business tasks such as sentiment analysis, news categorization, and commonsense reasoning. Results show that our method outperforms existing PEFT baselines, enables uncertainty quantification for more reliable decisions, and improves adaptability in dynamic scenarios. This work contributes to business analytics and data science by proposing a novel BH-PEFT method and dynamic fine-tuning approach that support uncertainty-aware and adaptive decision-making in real-world situations.

[403] ZetA: A Riemann Zeta-Scaled Extension of Adam for Deep Learning

Samiksha BC

Main category: cs.LG

TL;DR: ZetA is a new deep learning optimizer combining Adam with Riemann zeta function-based dynamic scaling, improving generalization and robustness.

DetailsMotivation: To enhance deep learning optimization by introducing dynamic scaling via the Riemann zeta function, addressing limitations of existing optimizers like Adam.

Method: ZetA integrates adaptive damping, cosine similarity-based momentum boosting, entropy-regularized loss, and SAM-style perturbations.

Result: Empirical tests on SVHN, CIFAR10, CIFAR100, STL10, and noisy CIFAR10 show improved test accuracy over Adam.

Conclusion: ZetA is a computationally efficient, robust optimizer, especially effective for noisy or high-granularity tasks.

Abstract: This work introduces ZetA, a novel deep learning optimizer that extends Adam by incorporating dynamic scaling based on the Riemann zeta function. To the best of our knowledge, ZetA is the first optimizer to apply zeta-based gradient scaling within deep learning optimization. The method improves generalization and robustness through a hybrid update mechanism that integrates adaptive damping, cosine similarity-based momentum boosting, entropy-regularized loss, and Sharpness-Aware Minimization (SAM)-style perturbations. Empirical evaluations on SVHN, CIFAR10, CIFAR100, STL10, and noisy CIFAR10 consistently show test accuracy improvements over Adam. All experiments employ a lightweight fully connected network trained for five epochs under mixed-precision settings. The results demonstrate that ZetA is a computationally efficient and robust alternative to Adam, particularly effective in noisy or high-granularity classification tasks.

[404] ECGTwin: Personalized ECG Generation Using Controllable Diffusion Model

Yongfan Lai, Bo Liu, Xinyan Guan, Qinghao Zhao, Hongyan Li, Shenda Hong

Main category: cs.LG

TL;DR: ECGTwin is a two-stage framework for personalized ECG generation, addressing challenges of individual feature extraction and condition injection using contrastive learning and diffusion-based generation.

DetailsMotivation: To transform healthcare into a personalized paradigm by simulating patient-specific ECG digital twins while preserving population-level synthesis benefits.

Method: Uses a two-stage approach: 1) Individual Base Extractor with contrastive learning for personal feature extraction, and 2) AdaX Condition Injector in a diffusion-based generation process for condition-specific ECG synthesis.

Result: Generates high-fidelity, diverse ECG signals with fine-grained controllability and preserves individual-specific features. Enhances ECG auto-diagnosis in downstream applications.

Conclusion: ECGTwin demonstrates potential for precise personalized healthcare solutions by combining individual feature extraction and condition-specific generation.

Abstract: Personalized electrocardiogram (ECG) generation is to simulate a patient’s ECG digital twins tailored to specific conditions. It has the potential to transform traditional healthcare into a more accurate individualized paradigm, while preserving the key benefits of conventional population-level ECG synthesis. However, this promising task presents two fundamental challenges: extracting individual features without ground truth and injecting various types of conditions without confusing generative model. In this paper, we present ECGTwin, a two-stage framework designed to address these challenges. In the first stage, an Individual Base Extractor trained via contrastive learning robustly captures personal features from a reference ECG. In the second stage, the extracted individual features, along with a target cardiac condition, are integrated into the diffusion-based generation process through our novel AdaX Condition Injector, which injects these signals via two dedicated and specialized pathways. Both qualitative and quantitative experiments have demonstrated that our model can not only generate ECG signals of high fidelity and diversity by offering a fine-grained generation controllability, but also preserving individual-specific features. Furthermore, ECGTwin shows the potential to enhance ECG auto-diagnosis in downstream application, confirming the possibility of precise personalized healthcare solutions.

[405] Mathematical Foundations of Geometric Deep Learning

Haitz Sáez de Ocáriz Borde, Michael Bronstein

Main category: cs.LG

TL;DR: A review of essential mathematical concepts for Geometric Deep Learning.

DetailsMotivation: To provide foundational knowledge for understanding and advancing Geometric Deep Learning.

Method: Review and summarize key mathematical principles relevant to the field.

Result: A comprehensive overview of necessary mathematical tools for Geometric Deep Learning.

Conclusion: The review serves as a valuable resource for researchers and practitioners in the field.

Abstract: We review the key mathematical concepts necessary for studying Geometric Deep Learning.

[406] Online Robust Multi-Agent Reinforcement Learning under Model Uncertainties

Zain Ulabedeen Farhat, Debamita Ghosh, George K. Atia, Yue Wang

Main category: cs.LG

TL;DR: The paper introduces RONAVI, an online learning algorithm for Distributionally Robust Markov Games (DRMGs), providing provable guarantees for robust multi-agent systems without prior data.

DetailsMotivation: Multi-agent systems often fail in real-world deployments due to model mismatches from uncertainties like noise or adversarial attacks. Current DRMG methods rely on simulators or offline data, which are often unavailable.

Method: The paper proposes the Robust Optimistic Nash Value Iteration (RONAVI) algorithm, enabling agents to learn directly from environmental interactions.

Result: RONAVI achieves low regret and efficiently finds optimal robust policies for uncertainty sets measured by Total Variation and Kullback-Leibler divergence.

Conclusion: The work establishes a practical approach to developing robust multi-agent systems through online learning in DRMGs.

Abstract: Well-trained multi-agent systems can fail when deployed in real-world environments due to model mismatches between the training and deployment environments, caused by environment uncertainties including noise or adversarial attacks. Distributionally Robust Markov Games (DRMGs) enhance system resilience by optimizing for worst-case performance over a defined set of environmental uncertainties. However, current methods are limited by their dependence on simulators or large offline datasets, which are often unavailable. This paper pioneers the study of online learning in DRMGs, where agents learn directly from environmental interactions without prior data. We introduce the {\it Robust Optimistic Nash Value Iteration (RONAVI)} algorithm and provide the first provable guarantees for this setting. Our theoretical analysis demonstrates that the algorithm achieves low regret and efficiently finds the optimal robust policy for uncertainty sets measured by Total Variation divergence and Kullback-Leibler divergence. These results establish a new, practical path toward developing truly robust multi-agent systems.

[407] Forecasting NCAA Basketball Outcomes with Deep Learning: A Comparative Study of LSTM and Transformer Models

Md Imtiaz Habib

Main category: cs.LG

TL;DR: The paper explores deep learning models (LSTM and Transformer) for predicting 2025 NCAA basketball tournament outcomes, highlighting the impact of model choice and loss functions on performance.

DetailsMotivation: To forecast NCAA basketball tournament results using advanced deep learning techniques, leveraging historical data for improved accuracy.

Method: Implemented LSTM and Transformer models with feature engineering (GLM metrics, Elo ratings, seed differences, box-score stats) and evaluated using BCE and Brier loss functions.

Result: Transformer with BCE achieved highest AUC (0.8473), while LSTM with Brier loss had best calibration (Brier score 0.1589).

Conclusion: Model and loss function selection should align with task requirements; the pipeline offers a reproducible framework for sports analytics.

Abstract: In this research, I explore advanced deep learning methodologies to forecast the outcomes of the 2025 NCAA Division 1 Men’s and Women’s Basketball tournaments. Leveraging historical NCAA game data, I implement two sophisticated sequence-based models: Long Short-Term Memory (LSTM) and Transformer architectures. The predictive power of these models is augmented through comprehensive feature engineering, including team quality metrics derived from Generalized Linear Models (GLM), Elo ratings, seed differences, and aggregated box-score statistics. To evaluate the robustness and reliability of predictions, I train each model variant using both Binary Cross-Entropy (BCE) and Brier loss functions, providing insights into classification performance and probability calibration. My comparative analysis reveals that while the Transformer architecture optimized with BCE yields superior discriminative power (highest AUC of 0.8473), the LSTM model trained with Brier loss demonstrates superior probabilistic calibration (lowest Brier score of 0.1589). These findings underscore the importance of selecting appropriate model architectures and loss functions based on the specific requirements of forecasting tasks. The detailed analytical pipeline presented here serves as a reproducible framework for future predictive modeling tasks in sports analytics and beyond.

[408] Embedding-Enhanced Probabilistic Modeling of Ferroelectric Field Effect Transistors (FeFETs)

Tasnia Nobi Afee, Jack Hutchins, Md Mazharul Islam, Thomas Kampfe, Ahmedullah Aziz

Main category: cs.LG

TL;DR: The paper introduces a probabilistic modeling framework for FeFETs to address variability challenges, using a Mixture Density Network with smooth activation functions and device-specific embeddings.

DetailsMotivation: FeFETs' inherent randomness from cycling and fabrication variability complicates accurate modeling, necessitating a solution for reliable prediction and optimization.

Method: The framework uses a Mixture Density Network (MDN) with C-infinity continuous activation functions and a device-specific embedding layer to model variability.

Result: The model achieves high accuracy (R2 of 0.92) in capturing FeFET current variability and enables synthetic device generation for simulations.

Conclusion: The proposed framework offers a scalable, data-driven solution for modeling FeFET stochastic behavior, aiding future compact model and circuit simulation development.

Abstract: FeFETs hold strong potential for advancing memory and logic technologies, but their inherent randomness arising from both operational cycling and fabrication variability poses significant challenges for accurate and reliable modeling. Capturing this variability is critical, as it enables designers to predict behavior, optimize performance, and ensure reliability and robustness against variations in manufacturing and operating conditions. Existing deterministic and machine learning-based compact models often fail to capture the full extent of this variability or lack the mathematical smoothness required for stable circuit-level integration. In this work, we present an enhanced probabilistic modeling framework for FeFETs that addresses these limitations. Building upon a Mixture Density Network (MDN) foundation, our approach integrates C-infinity continuous activation functions for smooth, stable learning and a device-specific embedding layer to capture intrinsic physical variability across devices. Sampling from the learned embedding distribution enables the generation of synthetic device instances for variability-aware simulation. With an R2 of 0.92, the model demonstrates high accuracy in capturing the variability of FeFET current behavior. Altogether, this framework provides a scalable, data-driven solution for modeling the full stochastic behavior of FeFETs and offers a strong foundation for future compact model development and circuit simulation integration.

[409] DeepGB-TB: A Risk-Balanced Cross-Attention Gradient-Boosted Convolutional Network for Rapid, Interpretable Tuberculosis Screening

Zhixiang Lu, Yulong Li, Feilong Tang, Zhengyong Jiang, Chong Li, Mian Zhou, Tenglong Li, Jionglong Su

Main category: cs.LG

TL;DR: DeepGB-TB is an AI system for TB screening using cough audio and demographic data, achieving high accuracy and efficiency for low-resource settings.

DetailsMotivation: Traditional TB diagnostics are costly and complex, necessitating affordable, scalable AI solutions.

Method: Combines a 1D CNN for audio and gradient-boosted trees for tabular data, with a Cross-Modal Bidirectional Cross-Attention module (CM-BCA) and Tuberculosis Risk-Balanced Loss (TRBL).

Result: Achieves AUROC of 0.903 and F1-score of 0.851 on a diverse dataset of 1,105 patients.

Conclusion: DeepGB-TB is a scalable, efficient tool for global TB control, meeting clinical and public-health needs.

Abstract: Large-scale tuberculosis (TB) screening is limited by the high cost and operational complexity of traditional diagnostics, creating a need for artificial-intelligence solutions. We propose DeepGB-TB, a non-invasive system that instantly assigns TB risk scores using only cough audio and basic demographic data. The model couples a lightweight one-dimensional convolutional neural network for audio processing with a gradient-boosted decision tree for tabular features. Its principal innovation is a Cross-Modal Bidirectional Cross-Attention module (CM-BCA) that iteratively exchanges salient cues between modalities, emulating the way clinicians integrate symptoms and risk factors. To meet the clinical priority of minimizing missed cases, we design a Tuberculosis Risk-Balanced Loss (TRBL) that places stronger penalties on false-negative predictions, thereby reducing high-risk misclassifications. DeepGB-TB is evaluated on a diverse dataset of 1,105 patients collected across seven countries, achieving an AUROC of 0.903 and an F1-score of 0.851, representing a new state of the art. Its computational efficiency enables real-time, offline inference directly on common mobile devices, making it ideal for low-resource settings. Importantly, the system produces clinically validated explanations that promote trust and adoption by frontline health workers. By coupling AI innovation with public-health requirements for speed, affordability, and reliability, DeepGB-TB offers a tool for advancing global TB control.

[410] Considering Spatial Structure of the Road Network in Pavement Deterioration Modeling

Lu Gao, Ke Yu, Pan Lu

Main category: cs.LG

TL;DR: A GNN-based model improves pavement deterioration prediction by incorporating spatial relationships in road networks.

DetailsMotivation: To enhance pavement deterioration modeling by leveraging spatial dependencies in road networks using GNNs.

Method: Used a graph neural network (GNN) to exploit structural information in road networks, tested on a large dataset from Texas PMIS.

Result: Models considering spatial relationships outperformed traditional methods.

Conclusion: Spatial structure improves pavement deterioration prediction accuracy.

Abstract: Pavement deterioration modeling is important in providing information regarding the future state of the road network and in determining the needs of preventive maintenance or rehabilitation treatments. This research incorporated spatial dependence of road network into pavement deterioration modeling through a graph neural network (GNN). The key motivation of using a GNN for pavement performance modeling is the ability to easily and directly exploit the rich structural information in the network. This paper explored if considering spatial structure of the road network will improve the prediction performance of the deterioration models. The data used in this research comprises a large pavement condition data set with more than a half million observations taken from the Pavement Management Information System (PMIS) maintained by the Texas Department of Transportation. The promising comparison results indicates that pavement deterioration prediction models perform better when spatial relationship is considered.

[411] Pulse Shape Discrimination Algorithms: Survey and Benchmark

Haoran Liu, Yihan Zhan, Mingzhe Liu, Yanhua Liu, Peng Li, Zhuo Zuo, Bingqi Liu, Runxi Liu

Main category: cs.LG

TL;DR: A survey and benchmark of 60 PSD algorithms for radiation detection, comparing statistical and prior-knowledge methods. Deep learning models, especially MLPs and hybrids, outperform traditional methods. Includes open-source tools and datasets.

DetailsMotivation: To provide a comprehensive evaluation of PSD algorithms, identify top-performing methods, and promote reproducibility in radiation detection research.

Method: Classified 60 PSD methods into statistical (time-domain, frequency-domain, neural networks) and prior-knowledge (machine/deep learning) paradigms. Evaluated on two datasets using metrics like FOM, F1-score, and ROC-AUC.

Result: Deep learning models (MLPs and hybrid approaches) often outperform traditional methods. Discusses limitations of FOM and performance across energy thresholds.

Conclusion: Deep learning is superior for PSD in radiation detection. Open-source tools and datasets are released to support further research and reproducibility.

Abstract: This review presents a comprehensive survey and benchmark of pulse shape discrimination (PSD) algorithms for radiation detection, classifying nearly sixty methods into statistical (time-domain, frequency-domain, neural network-based) and prior-knowledge (machine learning, deep learning) paradigms. We implement and evaluate all algorithms on two standardized datasets: an unlabeled set from a 241Am-9Be source and a time-of-flight labeled set from a 238Pu-9Be source, using metrics including Figure of Merit (FOM), F1-score, ROC-AUC, and inter-method correlations. Our analysis reveals that deep learning models, particularly Multi-Layer Perceptrons (MLPs) and hybrid approaches combining statistical features with neural regression, often outperform traditional methods. We discuss architectural suitabilities, the limitations of FOM, alternative evaluation metrics, and performance across energy thresholds. Accompanying this work, we release an open-source toolbox in Python and MATLAB, along with the datasets, to promote reproducibility and advance PSD research.

[412] SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference

Yi Zhao, Yajuan Peng, Cam-Tu Nguyen, Zuchao Li, Xiaoliang Wang, Hai Zhao, Xiaoming Fu

Main category: cs.LG

TL;DR: SmallKV introduces a small-model-assisted compensation method for KV cache compression, addressing irreversible eviction and marginal token issues in LLMs, improving throughput and performance.

DetailsMotivation: Existing KV cache eviction methods fail to adapt to dynamic attention patterns and over-compress marginally important tokens, limiting LLM performance in long-context scenarios.

Method: SmallKV uses a smaller model’s attention scores to compensate for the larger model’s KV cache, maintaining attention matching and approximating marginal tokens.

Result: SmallKV achieves 1.75-2.56x higher throughput and demonstrates effectiveness on benchmarks like GSM8K, BBH, MT-Bench, and LongBench.

Conclusion: SmallKV offers an efficient and performant solution for LLM inference in resource-constrained environments by addressing key eviction challenges.

Abstract: KV cache eviction has emerged as an effective solution to alleviate resource constraints faced by LLMs in long-context scenarios. However, existing token-level eviction methods often overlook two critical aspects: (1) their irreversible eviction strategy fails to adapt to dynamic attention patterns during decoding (the saliency shift problem), and (2) they treat both marginally important tokens and truly unimportant tokens equally, despite the collective significance of marginal tokens to model performance (the marginal information over-compression problem). To address these issues, we design two compensation mechanisms based on the high similarity of attention matrices between LLMs of different scales. We propose SmallKV, a small model assisted compensation method for KV cache compression. SmallKV can maintain attention matching between different-scale LLMs to: 1) assist the larger model in perceiving globally important information of attention; and 2) use the smaller model’s attention scores to approximate those of marginal tokens in the larger model. Extensive experiments on benchmarks including GSM8K, BBH, MT-Bench, and LongBench demonstrate the effectiveness of SmallKV. Moreover, efficiency evaluations show that SmallKV achieves 1.75 - 2.56 times higher throughput than baseline methods, highlighting its potential for efficient and performant LLM inference in resource constrained environments.

[413] DMSC: Dynamic Multi-Scale Coordination Framework for Time Series Forecasting

Haonan Yang, Jianchao Tang, Zhuo Li, Long Lan

Main category: cs.LG

TL;DR: The paper introduces DMSC, a dynamic multi-scale framework for time series forecasting, addressing static decomposition, fragmented dependencies, and inflexible fusion with novel components like EMPD, TIB, and ASR-MoE.

DetailsMotivation: Existing methods struggle with static decomposition, fragmented dependency modeling, and rigid fusion, limiting their ability to capture intricate temporal dependencies.

Method: Proposes DMSC with EMPD for dynamic patch decomposition, TIB for dependency modeling, and ASR-MoE for adaptive fusion, integrated into a multi-layer cascade.

Result: DMSC achieves state-of-the-art performance and computational efficiency on 13 real-world benchmarks.

Conclusion: DMSC effectively addresses key challenges in time series forecasting with dynamic multi-scale coordination.

Abstract: Time Series Forecasting (TSF) faces persistent challenges in modeling intricate temporal dependencies across different scales. Despite recent advances leveraging different decomposition operations and novel architectures based on CNN, MLP or Transformer, existing methods still struggle with static decomposition strategies, fragmented dependency modeling, and inflexible fusion mechanisms, limiting their ability to model intricate temporal dependencies. To explicitly solve the mentioned three problems respectively, we propose a novel Dynamic Multi-Scale Coordination Framework (DMSC) with Multi-Scale Patch Decomposition block (EMPD), Triad Interaction Block (TIB) and Adaptive Scale Routing MoE block (ASR-MoE). Specifically, EMPD is designed as a built-in component to dynamically segment sequences into hierarchical patches with exponentially scaled granularities, eliminating predefined scale constraints through input-adaptive patch adjustment. TIB then jointly models intra-patch, inter-patch, and cross-variable dependencies within each layer’s decomposed representations. EMPD and TIB are jointly integrated into layers forming a multi-layer progressive cascade architecture, where coarse-grained representations from earlier layers adaptively guide fine-grained feature extraction in subsequent layers via gated pathways. And ASR-MoE dynamically fuses multi-scale predictions by leveraging specialized global and local experts with temporal-aware weighting. Comprehensive experiments on thirteen real-world benchmarks demonstrate that DMSC consistently maintains state-of-the-art (SOTA) performance and superior computational efficiency for TSF tasks. Code is available at https://github.com/1327679995/DMSC.

[414] Context-Adaptive Multi-Prompt LLM Embedding for Vision-Language Alignment

Dahun Kim, Anelia Angelova

Main category: cs.LG

TL;DR: Context-Adaptive Multi-Prompt Embedding enhances vision-language contrastive learning by using multiple adaptive prompts for richer semantic alignment with visuals.

DetailsMotivation: Standard CLIP-style models use a single text embedding, limiting semantic richness. This method aims to capture diverse semantic aspects for better alignment.

Method: Introduces multiple structured prompts with adaptive tokens, processed jointly. Uses diversity regularization and negation-aware loss for improved contrastive discrimination.

Result: Achieves consistent improvements on image-text and video-text retrieval benchmarks.

Conclusion: The method enriches semantic representations and enhances alignment between text and visual features.

Abstract: We propose Context-Adaptive Multi-Prompt Embedding, a novel approach to enrich semantic representations in vision-language contrastive learning. Unlike standard CLIP-style models that rely on a single text embedding, our method introduces multiple structured prompts, each containing a distinct adaptive token that captures diverse semantic aspects of the input text. We process all prompts jointly in a single forward pass. The resulting prompt embeddings are combined into a unified text representation, enabling semantically richer alignment with visual features. To further promote semantic diversity and representation quality, we incorporate a diversity regularization loss and a negation-aware loss, encouraging specialization across prompts and improving contrastive discrimination. Our method achieves consistent improvements on both image-text and video-text retrieval benchmarks.

[415] Synthetic medical data generation: state of the art and application to trauma mechanism classification

Océane Doremus, Ariel Guerra-Adames, Marta Avalos-Fernandez, Vianney Jouhet, Cédric Gil-Jardiné, Emmanuel Lagarde

Main category: cs.LG

TL;DR: Overview of machine learning methods for generating synthetic medical data, focusing on trauma classification, and proposing a new method for combining tabular and text data.

DetailsMotivation: Address challenges of patient confidentiality and scientific reproducibility in health research by using synthetic data.

Method: Proposes a methodology for generating high-quality synthetic medical records combining tabular and unstructured text data.

Result: Not explicitly stated in the abstract.

Conclusion: Synthetic data generation is a promising solution for balancing confidentiality and reproducibility in health research.

Abstract: Faced with the challenges of patient confidentiality and scientific reproducibility, research on machine learning for health is turning towards the conception of synthetic medical databases. This article presents a brief overview of state-of-the-art machine learning methods for generating synthetic tabular and textual data, focusing their application to the automatic classification of trauma mechanisms, followed by our proposed methodology for generating high-quality, synthetic medical records combining tabular and unstructured text data.

[416] Uncertainty Sets for Distributionally Robust Bandits Using Structural Equation Models

Katherine Avery, Chinmay Pendse, David Jensen

Main category: cs.LG

TL;DR: Proposes a practical bandit algorithm for distributionally robust evaluation and learning, using SEMs to tailor uncertainty sets, improving accuracy and reducing variance.

DetailsMotivation: Current methods for distributionally robust evaluation and learning are overly conservative, leading to suboptimal policies.

Method: Uses structural equation models (SEMs) to tailor uncertainty sets and conditional independence testing to detect shifted variables.

Result: SEM approach provides more accurate evaluations and lower-variance policies, especially for large shifts.

Conclusion: SEM-based method learns optimal policies if the model is well-specified, outperforming traditional approaches.

Abstract: Distributionally robust evaluation estimates the worst-case expected return over an uncertainty set of possible covariate and reward distributions, and distributionally robust learning finds a policy that maximizes that worst-case return across that uncertainty set. Unfortunately, current methods for distributionally robust evaluation and learning create overly conservative evaluations and policies. In this work, we propose a practical bandit evaluation and learning algorithm that tailors the uncertainty set to specific problems using mathematical programs constrained by structural equation models. Further, we show how conditional independence testing can be used to detect shifted variables for modeling. We find that the structural equation model (SEM) approach gives more accurate evaluations and learns lower-variance policies than traditional approaches, particularly for large shifts. Further, the SEM approach learns an optimal policy, assuming the model is sufficiently well-specified.

[417] On the Theory and Practice of GRPO: A Trajectory-Corrected Approach with Fast Convergence

Lei Pang, Ruinan Jin

Main category: cs.LG

TL;DR: GRPO is a critic-free RL algorithm for fine-tuning LLMs, replacing PPO’s value function with group-normalized rewards. A new variant, TIC GRPO, improves it by using trajectory-level importance ratios for unbiased gradient estimates.

DetailsMotivation: To simplify and improve the efficiency of fine-tuning large language models by removing the critic and addressing bias in policy gradient estimates.

Method: GRPO replaces PPO’s value function with group-normalized rewards and uses token-level importance sampling. TIC GRPO further simplifies this by using trajectory-level ratios for unbiased estimates.

Result: GRPO performs comparably even with simplified updates. TIC GRPO provides unbiased gradient estimates while maintaining critic-free structure.

Conclusion: GRPO and TIC GRPO offer efficient, critic-free alternatives for fine-tuning LLMs, with theoretical convergence guarantees.

Abstract: Group Relative Policy Optimization (GRPO), recently proposed by DeepSeek, is a critic-free reinforcement learning algorithm for fine tuning large language models. It replaces the value function in Proximal Policy Optimization (PPO) with group normalized rewards, while retaining PPO style token level importance sampling based on an old policy. We show that GRPO update rule in fact estimates the policy gradient at the old policy rather than the current one. However, since the old policy is refreshed every few steps, the discrepancy between the two remains small limiting the impact of this bias in practice. We validate this through an ablation study in which importance sampling is entirely removed, and updates are instead performed using the gradient estimated at a fixed old policy across multiple optimization steps. Remarkably, this simplification results in performance comparable to standard GRPO. Motivated by these findings, we propose a new algorithm: Trajectory level Importance Corrected GRPO (TIC GRPO). TIC GRPO replaces token level importance ratios with a single trajectory level probability ratio, yielding an unbiased estimate of the current policy gradient while preserving the critic free structure. Furthermore, we present the first theoretical convergence analysis for GRPO style methods, covering both the original GRPO and our proposed variant.

[418] Learning from B Cell Evolution: Adaptive Multi-Expert Diffusion for Antibody Design via Online Optimization

Hanqi Feng, Peng Qiu, Mengchun Zhang, Yiran Tao, You Fan, Jingtao Xu, Barnabas Poczos

Main category: cs.LG

TL;DR: A biologically-motivated framework for antibody design using adaptive, physics-based meta-learning with specialized experts, outperforming uniform generation strategies.

DetailsMotivation: Existing antibody design methods use uniform strategies, lacking adaptability to unique antigen requirements. Inspired by B cell affinity maturation, the paper aims to create a more flexible and effective approach.

Method: Leverages physics-based domain knowledge in an online meta-learning system with multiple specialized experts (e.g., van der Waals, molecular recognition). Parameters evolve iteratively, mimicking natural antibody refinement.

Result: Achieves target-specific adaptation, balanced multi-objective optimization, and preserves molecular symmetries. Outperforms in hotspot coverage, interface quality, and generalization across diverse design challenges.

Conclusion: The framework enables precision-focused antibody design, adapting to individual targets through iterative refinement and online learning.

Abstract: Recent advances in diffusion models have shown remarkable potential for antibody design, yet existing approaches apply uniform generation strategies that cannot adapt to each antigen’s unique requirements. Inspired by B cell affinity maturation, where antibodies evolve through multi-objective optimization balancing affinity, stability, and self-avoidance, we propose the first biologically-motivated framework that leverages physics-based domain knowledge within an online meta-learning system. Our method employs multiple specialized experts (van der Waals, molecular recognition, energy balance, and interface geometry) whose parameters evolve during generation based on iterative feedback, mimicking natural antibody refinement cycles. Instead of fixed protocols, this adaptive guidance discovers personalized optimization strategies for each target. Our experiments demonstrate that this approach: (1) discovers optimal SE(3)-equivariant guidance strategies for different antigen classes without pre-training, preserving molecular symmetries throughout optimization; (2) significantly enhances hotspot coverage and interface quality through target-specific adaptation, achieving balanced multi-objective optimization characteristic of therapeutic antibodies; (3) establishes a paradigm for iterative refinement where each antibody-antigen system learns its unique optimization profile through online evaluation; (4) generalizes effectively across diverse design challenges, from small epitopes to large protein interfaces, enabling precision-focused campaigns for individual targets.

[419] Defending Against Knowledge Poisoning Attacks During Retrieval-Augmented Generation

Kennedy Edemacu, Vinay M. Shashidhar, Micheal Tuape, Dan Abudu, Beakcheol Jang, Jong Wook Kim

Main category: cs.LG

TL;DR: Proposes FilterRAG and ML-FilterRAG defenses against PoisonedRAG attacks in Retrieval-Augmented Generation (RAG) systems by identifying and filtering adversarial texts.

DetailsMotivation: RAG systems are vulnerable to knowledge poisoning attacks, where adversarial texts mislead the model. This work aims to defend against such attacks.

Method: Introduces a new property to distinguish adversarial from clean texts and uses it to design FilterRAG and ML-FilterRAG for filtering.

Result: The proposed methods effectively mitigate PoisonedRAG attacks, performing nearly as well as original RAG systems.

Conclusion: FilterRAG and ML-FilterRAG are viable defenses against knowledge poisoning in RAG, maintaining system performance.

Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to boost the capabilities of large language models (LLMs) by incorporating external, up-to-date knowledge sources. However, this introduces a potential vulnerability to knowledge poisoning attacks, where attackers can compromise the knowledge source to mislead the generation model. One such attack is the PoisonedRAG in which the injected adversarial texts steer the model to generate an attacker-chosen response to a target question. In this work, we propose novel defense methods, FilterRAG and ML-FilterRAG, to mitigate the PoisonedRAG attack. First, we propose a new property to uncover distinct properties to differentiate between adversarial and clean texts in the knowledge data source. Next, we employ this property to filter out adversarial texts from clean ones in the design of our proposed approaches. Evaluation of these methods using benchmark datasets demonstrate their effectiveness, with performances close to those of the original RAG systems.

[420] Resource-Efficient Automatic Software Vulnerability Assessment via Knowledge Distillation and Particle Swarm Optimization

Chaoyang Gao, Xiang Chen, Jiyu Wang, Jibin Wang, Guang Yang

Main category: cs.LG

TL;DR: A resource-efficient framework combining knowledge distillation and particle swarm optimization for automated vulnerability assessment, reducing model size by 99.4% while retaining 89.3% accuracy.

DetailsMotivation: The need for scalable cybersecurity solutions due to increasing software complexity and the impracticality of large pre-trained models in real-world scenarios.

Method: A two-stage approach: particle swarm optimization for compact model architecture and knowledge distillation to transfer knowledge from a large teacher model.

Result: Achieves 99.4% model size reduction, 89.3% accuracy retention, outperforms baselines by 1.7% accuracy with 60% fewer parameters, and reduces training and search times significantly.

Conclusion: The proposed framework is effective for efficient and scalable vulnerability assessment, balancing performance and resource constraints.

Abstract: The increasing complexity of software systems has led to a surge in cybersecurity vulnerabilities, necessitating efficient and scalable solutions for vulnerability assessment. However, the deployment of large pre-trained models in real-world scenarios is hindered by their substantial computational and storage demands. To address this challenge, we propose a novel resource-efficient framework that integrates knowledge distillation and particle swarm optimization to enable automated vulnerability assessment. Our framework employs a two-stage approach: First, particle swarm optimization is utilized to optimize the architecture of a compact student model, balancing computational efficiency and model capacity. Second, knowledge distillation is applied to transfer critical vulnerability assessment knowledge from a large teacher model to the optimized student model. This process significantly reduces the model size while maintaining high performance. Experimental results on an enhanced MegaVul dataset, comprising 12,071 CVSS (Common Vulnerability Scoring System) v3 annotated vulnerabilities, demonstrate the effectiveness of our approach. Our approach achieves a 99.4% reduction in model size while retaining 89.3% of the original model’s accuracy. Furthermore, it outperforms state-of-the-art baselines by 1.7% in accuracy with 60% fewer parameters. The framework also reduces training time by 72.1% and architecture search time by 34.88% compared to traditional genetic algorithms.

[421] Comparative Evaluation of Kolmogorov-Arnold Autoencoders and Orthogonal Autoencoders for Fault Detection with Varying Training Set Sizes

Enrique Luna Villagómez, Vladimir Mahalec

Main category: cs.LG

TL;DR: Kolmogorov-Arnold Networks (KANs) are evaluated for unsupervised fault detection in chemical processes, outperforming traditional methods with higher efficiency and robustness, especially in low-data settings.

DetailsMotivation: To explore the untapped potential of KANs in unsupervised fault detection, comparing their performance against conventional methods like Orthogonal Autoencoders.

Method: Four KAN-based autoencoder variants (EfficientKAN, FastKAN, FourierKAN, WavKAN) are benchmarked against an Orthogonal Autoencoder on the Tennessee Eastman Process, using Fault Detection Rate (FDR) as the metric.

Result: WavKAN-AE achieves the highest FDR (≥92%) with minimal data, while EfficientKAN-AE shows robustness with only 500 samples. FastKAN-AE performs well at larger scales, and FourierKAN-AE underperforms.

Conclusion: KAN-AEs combine data efficiency and strong performance, with potential for improved transparency, making them suitable for industrial applications with limited data.

Abstract: Kolmogorov-Arnold Networks (KANs) have recently emerged as a flexible and parameter-efficient alternative to conventional neural networks. Unlike standard architectures that use fixed node-based activations, KANs place learnable functions on edges, parameterized by different function families. While they have shown promise in supervised settings, their utility in unsupervised fault detection remains largely unexplored. This study presents a comparative evaluation of KAN-based autoencoders (KAN-AEs) for unsupervised fault detection in chemical processes. We investigate four KAN-AE variants, each based on a different KAN implementation (EfficientKAN, FastKAN, FourierKAN, and WavKAN), and benchmark them against an Orthogonal Autoencoder (OAE) on the Tennessee Eastman Process. Models are trained on normal operating data across 13 training set sizes and evaluated on 21 fault types, using Fault Detection Rate (FDR) as the performance metric. WavKAN-AE achieves the highest overall FDR ($\geq$92%) using just 4,000 training samples and remains the top performer, even as other variants are trained on larger datasets. EfficientKAN-AE reaches $\geq$90% FDR with only 500 samples, demonstrating robustness in low-data settings. FastKAN-AE becomes competitive at larger scales ($\geq$50,000 samples), while FourierKAN-AE consistently underperforms. The OAE baseline improves gradually but requires substantially more data to match top KAN-AE performance. These results highlight the ability of KAN-AEs to combine data efficiency with strong fault detection performance. Their use of structured basis functions suggests potential for improved model transparency, making them promising candidates for deployment in data-constrained industrial settings.

[422] Beyond Least Squares: Robust Regression Transformer (R2T)

Roman Gutierrez, Tony Kai Tang, Isabel Gutierrez

Main category: cs.LG

TL;DR: A hybrid neural-symbolic architecture improves regression performance under asymmetric structured noise, outperforming traditional methods by 10-300x.

DetailsMotivation: Traditional least-squares optimization fails with asymmetric structured noise, necessitating a more robust approach.

Method: Combines a transformer encoder, compression NN, and fixed symbolic equation to learn symbolic fits from noisy data.

Result: Achieves median MSE of 6e-6 to 3.5e-5, significantly better than least-squares and robust regression techniques.

Conclusion: The hybrid architecture effectively handles asymmetric noise, offering superior regression accuracy.

Abstract: Robust regression techniques rely on least-squares optimization, which works well for Gaussian noise but fails in the presence of asymmetric structured noise. We propose a hybrid neural-symbolic architecture where a transformer encoder processes numerical sequences, a compression NN predicts symbolic parameters, and a fixed symbolic equation reconstructs the original sequence. Using synthetic data, the training objective is to recover the original sequence after adding asymmetric structured noise, effectively learning a symbolic fit guided by neural parameter estimation. Our model achieves a median regression MSE of 6e-6 to 3.5e-5 on synthetic wearable data, which is a 10-300 times improvement when compared with ordinary least squares fit and robust regression techniques such as Huber loss or SoftL1.

[423] CauKer: classification time series foundation models can be pretrained on synthetic data only

Shifeng Xie, Vasilii Feofanov, Marius Alonso, Ambroise Odonnat, Jianfeng Zhang, Themis Palpanas, Ievgen Redko

Main category: cs.LG

TL;DR: CauKer is a novel algorithm for generating synthetic time series to efficiently pretrain Time Series Foundation Models (TSFMs), overcoming the need for large-scale real-world data.

DetailsMotivation: To address the computational cost and data scarcity in pretraining TSFMs, CauKer aims to create diverse, causally coherent synthetic time series.

Method: CauKer combines Gaussian Process kernel composition with Structural Causal Models to generate realistic synthetic time series for pretraining.

Result: CauKer enables sample-efficient pretraining of TSFMs and exhibits clear scaling laws for dataset size and model capacity, unlike real-world datasets.

Conclusion: CauKer offers a practical solution for pretraining TSFMs efficiently, with potential for broader applications in time series analysis.

Abstract: Time series foundation models (TSFMs) have recently gained significant attention due to their strong zero-shot capabilities and widespread real-world applications. Such models typically require a computationally costly pretraining on large-scale, carefully curated collections of real-world sequences. To allow for a sample-efficient pretraining of TSFMs, we propose CauKer, a novel algorithm designed to generate diverse, causally coherent synthetic time series with realistic trends, seasonality, and nonlinear interactions. CauKer combines Gaussian Process (GP) kernel composition with Structural Causal Models (SCM) to produce data for sample-efficient pretraining of state-of-the-art classification TSFMs having different architectures and following different pretraining approaches. Additionally, our experiments reveal that CauKer-generated datasets exhibit clear scaling laws for both dataset size (10K to 10M samples) and model capacity (1M to 783M parameters), unlike real-world datasets, which display irregular scaling behavior.

[424] Neural Networks with Orthogonal Jacobian

Alex Massucco, Davide Murari, Carola-Bibiane Schönlieb

Main category: cs.LG

TL;DR: The paper introduces a framework for designing deep neural networks with orthogonal Jacobian matrices, ensuring stable training and competitive performance without relying on skip connections.

DetailsMotivation: Training very deep neural networks is challenging due to vanishing or exploding gradients. Existing solutions like orthogonal initialization or residual architectures help but are limited.

Method: A unified mathematical framework is proposed to enforce exact orthogonality of Jacobian matrices in feedforward and residual networks, ensuring dynamical isometry.

Result: Networks with perfect Jacobian orthogonality stabilize training and perform competitively, even without skip connections. Generalized models with partial isometries also maintain trainability.

Conclusion: The framework enables efficient training of deep networks by enforcing Jacobian orthogonality, offering new designs beyond conventional architectures.

Abstract: Very deep neural networks achieve state-of-the-art performance by extracting rich, hierarchical features. Yet, training them via backpropagation is often hindered by vanishing or exploding gradients. Existing remedies, such as orthogonal or variance-preserving initialisation and residual architectures, allow for a more stable gradient propagation and the training of deeper models. In this work, we introduce a unified mathematical framework that describes a broad class of nonlinear feedforward and residual networks, whose input-to-output Jacobian matrices are exactly orthogonal almost everywhere. Such a constraint forces the resulting networks to achieve perfect dynamical isometry and train efficiently despite being very deep. Our formulation not only recovers standard architectures as particular cases but also yields new designs that match the trainability of residual networks without relying on conventional skip connections. We provide experimental evidence that perfect Jacobian orthogonality at initialisation is sufficient to stabilise training and achieve competitive performance. We compare this strategy to networks regularised to maintain the Jacobian orthogonality and obtain comparable results. We further extend our analysis to a class of networks well-approximated by those with orthogonal Jacobians and introduce networks with Jacobians representing partial isometries. These generalized models are then showed to maintain the favourable trainability properties.

[425] Physics-Embedded Neural ODEs for Sim2Real Edge Digital Twins of Hybrid Power Electronics Systems

Jialin Zheng, Haoyu Wang, Yangbin Zeng, Di Mou, Xin Zhang, Hong Li, Sergio Vazquez, Leopoldo G. Franquelo

Main category: cs.LG

TL;DR: The paper proposes Physics-Embedded Neural ODEs (PENODE) to improve modeling of hybrid dynamics in Power Electronics Systems (PES) for better Sim-to-Real generalization on edge devices.

DetailsMotivation: Existing modeling approaches fail to capture evolving hybrid dynamics in PES, degrading performance on resource-constrained edge devices.

Method: PENODE embeds hybrid dynamics as an event automaton and injects known ODE components into neural parameterization, creating a trainable, interpretable architecture.

Result: PENODE achieves higher accuracy in benchmarks with 75% fewer neurons, enabling efficient FPGA deployment and real-time control.

Conclusion: PENODE enhances physical interpretability, edge deployment efficiency, and real-time control in PES.

Abstract: Edge Digital Twins (EDTs) are crucial for monitoring and control of Power Electronics Systems (PES). However, existing modeling approaches struggle to consistently capture continuously evolving hybrid dynamics that are inherent in PES, degrading Sim-to-Real generalization on resource-constrained edge devices. To address these challenges, this paper proposes a Physics-Embedded Neural ODEs (PENODE) that (i) embeds the hybrid operating mechanism as an event automaton to explicitly govern discrete switching and (ii) injects known governing ODE components directly into the neural parameterization of unmodeled dynamics. This unified design yields a differentiable end-to-end trainable architecture that preserves physical interpretability while reducing redundancy, and it supports a cloud-to-edge toolchain for efficient FPGA deployment. Experimental results demonstrate that PENODE achieves significantly higher accuracy in benchmarks in white-box, gray-box, and black-box scenarios, with a 75% reduction in neuron count, validating that the proposed PENODE maintains physical interpretability, efficient edge deployment, and real-time control enhancement.

[426] Clus-UCB: A Near-Optimal Algorithm for Clustered Bandits

Aakash Gore, Prasanna Chaporkar

Main category: cs.LG

TL;DR: The paper introduces a stochastic multi-armed bandit problem with known arm clusters, where mean rewards within clusters differ by a known threshold. It proposes Clus-UCB, an algorithm exploiting cluster structure, and shows improved regret bounds.

DetailsMotivation: The work models scenarios like online advertising and clinical trials where outcomes depend on factors with varying influence, aiming to leverage known clustering for better performance.

Method: The authors derive asymptotic regret lower bounds and propose Clus-UCB, an algorithm using cluster information to share arm data and improve decision-making.

Result: Clus-UCB matches the derived lower bounds asymptotically and outperforms KL-UCB and other algorithms in simulations.

Conclusion: The paper highlights limitations and suggests future research directions, emphasizing the potential of cluster-aware bandit algorithms.

Abstract: We study a stochastic multi-armed bandit setting where arms are partitioned into known clusters, such that the mean rewards of arms within a cluster differ by at most a known threshold. While the clustering structure is known a priori, the arm means are unknown. This framework models scenarios where outcomes depend on multiple factors – some with significant and others with minor influence – such as online advertising, clinical trials, and wireless communication. We derive asymptotic lower bounds on the regret that improve upon the classical bound of Lai & Robbins (1985). We then propose Clus-UCB, an efficient algorithm that closely matches this lower bound asymptotically. Clus-UCB is designed to exploit the clustering structure and introduces a new index to evaluate an arm, which depends on other arms within the cluster. In this way, arms share information among each other. We present simulation results of our algorithm and compare its performance against KL-UCB and other well-known algorithms for bandits with dependent arms. Finally, we address some limitations of this work and conclude by mentioning possible future research.

[427] Neural Approximators for Low-Thrust Trajectory Transfer Cost and Reachability

Zhong Zhang, Francesco Topputo

Main category: cs.LG

TL;DR: The paper proposes pretrained neural networks for predicting fuel consumption and trajectory reachability in low-thrust missions, achieving high accuracy and generalization across diverse scenarios.

DetailsMotivation: To address the need for accurate and generalizable predictions of fuel consumption and trajectory reachability in low-thrust mission design.

Method: Uses the Scaling Law and homotopy ray method to construct a large dataset, transforms data into a self-similar space, and trains neural networks adaptable to various mission parameters.

Result: Neural networks achieve 0.78% error in velocity increment prediction and 0.63% in transfer time estimation, validated on diverse scenarios.

Conclusion: The proposed method provides a highly accurate, generalizable, and computationally efficient solution for low-thrust trajectory approximation.

Abstract: In trajectory design, fuel consumption and trajectory reachability are two key performance indicators for low-thrust missions. This paper proposes general-purpose pretrained neural networks to predict these metrics. The contributions of this paper are as follows: Firstly, based on the confirmation of the Scaling Law applicable to low-thrust trajectory approximation, the largest dataset is constructed using the proposed homotopy ray method, which aligns with mission-design-oriented data requirements. Secondly, the data are transformed into a self-similar space, enabling the neural network to adapt to arbitrary semi-major axes, inclinations, and central bodies. This extends the applicability beyond existing studies and can generalize across diverse mission scenarios without retraining. Thirdly, to the best of our knowledge, this work presents the current most general and accurate low-thrust trajectory approximator, with implementations available in C++, Python, and MATLAB. The resulting neural network achieves a relative error of 0.78% in predicting velocity increments and 0.63% in minimum transfer time estimation. The models have also been validated on a third-party dataset, multi-flyby mission design problem, and mission analysis scenario, demonstrating their generalization capability, predictive accuracy, and computational efficiency.

[428] BoostTransformer: Enhancing Transformer Models with Subgrid Selection and Importance Sampling

Biyi Fang, Jean Utke, Truong Vo, Diego Klabjan

Main category: cs.LG

TL;DR: BoostTransformer introduces boosting principles into transformers for efficient training and better performance, outperforming standard transformers in text classification.

DetailsMotivation: Transformers are computationally intensive and require complex tuning; BoostTransformer aims to address these issues.

Method: Incorporates least square boosting into transformers via subgrid token selection and importance-weighted sampling.

Result: Faster convergence and higher accuracy on fine-grained text classification benchmarks.

Conclusion: BoostTransformer is a promising alternative to standard transformers, reducing computational and tuning burdens.

Abstract: Transformer architectures dominate modern NLP but often demand heavy computational resources and intricate hyperparameter tuning. To mitigate these challenges, we propose a novel framework, BoostTransformer, that augments transformers with boosting principles through subgrid token selection and importance-weighted sampling. Our method incorporates a least square boosting objective directly into the transformer pipeline, enabling more efficient training and improved performance. Across multiple fine-grained text classification benchmarks, BoostTransformer demonstrates both faster convergence and higher accuracy, surpassing standard transformers while minimizing architectural search overhead.

[429] GrandJury: A Collaborative Machine Learning Model Evaluation Protocol for Dynamic Quality Rubrics

Arthur Cho

Main category: cs.LG

TL;DR: GrandJury introduces a dynamic, pluralistic evaluation protocol for generative ML models, addressing the limitations of static benchmarks by incorporating time-decayed aggregation, traceability, and multi-rater human judgment.

DetailsMotivation: Standard evaluation regimes for generative ML models rely on static benchmarks, which misalign with dynamic user needs and evolving contexts. GrandJury aims to provide a more accountable and context-aware evaluation framework.

Method: GrandJury combines time-decayed aggregation, complete traceability, dynamic task rubric attribution, and multi-rater human judgment to enable pluralistic evaluation.

Result: The protocol captures evolving consensus and highlights disagreements, offering a more nuanced evaluation of generative ML outputs. An open-source implementation and public LLM inference dataset are provided.

Conclusion: GrandJury offers a new paradigm for evaluating generative ML models in contexts without absolute ground truth, promoting alignment with dynamic user needs.

Abstract: Generative Machine Learning models have become central to modern systems, powering applications in creative writing, summarization, multi-hop reasoning, and context-aware dialogue. These models underpin large-scale AI assistants, workflow automation, and autonomous decision-making. In such domains, acceptable response is rarely absolute or static, but plural and highly context-dependent. Yet standard evaluation regimes still rely on static, benchmark-style tests, incentivizing optimization toward leaderboard scores rather than alignment with dynamic user needs or evolving realities. GrandJury introduces a formal evaluation protocol combining time-decayed aggregation, complete traceability, with the support of dynamic, transparent task rubric attribution, and multi-rater human judgment. Together, these elements enable pluralistic, accountable evaluation that captures evolving consensus and surfaces disagreement. We provide an open-source implementation (grandjury PyPI package) and a public collection of Large Language Model (LLM) inference outputs to illustrate the need and method. GrandJury provides a new paradigm for AI practitioners when evaluating machine learning outputs without absolute ground truth.

[430] PLoRA: Efficient LoRA Hyperparameter Tuning for Large Models

Minghao Yan, Zhuang Wang, Zhen Jia, Shivaram Venkataraman, Yida Wang

Main category: cs.LG

TL;DR: PLoRA improves LoRA fine-tuning efficiency by optimizing hardware resource usage and training throughput, reducing makespan by up to 7.52x and increasing throughput by 12.8x.

DetailsMotivation: Current LoRA training paradigms inefficiently use hardware resources and require high overhead for performant adapters.

Method: PLoRA orchestrates concurrent LoRA fine-tuning jobs under hardware constraints and develops efficient kernels.

Result: PLoRA reduces makespan by up to 7.52x and improves training throughput by up to 12.8x.

Conclusion: PLoRA significantly enhances LoRA fine-tuning efficiency, making it more scalable and resource-effective.

Abstract: Low-rank Adaptation (LoRA) has gained popularity as a fine-tuning approach for Large Language Models (LLMs) due to its low resource requirements and good performance. While a plethora of work has investigated improving LoRA serving efficiency by serving multiple LoRAs concurrently, existing methods assume that a wide range of LoRA adapters are available for serving. In our work, we conduct extensive empirical studies to identify that current training paradigms do not utilize hardware resources efficiently and require high overhead to obtain a performant LoRA. Leveraging these insights, we propose PLoRA, which automatically orchestrates concurrent LoRA fine-tuning jobs under given hardware and model constraints and develops performant kernels to improve training efficiency. Our experimental studies show that PLoRA reduces the makespan of LoRA fine-tuning over a given hyperparameter search space by up to 7.52x and improves training throughput by up to 12.8x across a range of state-of-the-art LLMs.

[431] Injecting Measurement Information Yields a Fast and Noise-Robust Diffusion-Based Inverse Problem Solver

Jonathan Patsenker, Henry Li, Myeongseob Ko, Ruoxi Jia, Yuval Kluger

Main category: cs.LG

TL;DR: The paper proposes a method to improve diffusion models for inverse problems by estimating the conditional posterior mean, integrating measurement information directly into the sampling process.

DetailsMotivation: Current diffusion models for inverse problems rely on Tweedie's formula, which ignores measurement information, leading to suboptimal performance.

Method: The authors estimate the conditional posterior mean via a lightweight maximum likelihood estimation problem, integrating it into standard samplers.

Result: The proposed method shows comparable or better performance than contemporary solvers across various datasets and tasks.

Conclusion: The approach provides a fast, memory-efficient, and noise-robust solution for inverse problems using diffusion models.

Abstract: Diffusion models have been firmly established as principled zero-shot solvers for linear and nonlinear inverse problems, owing to their powerful image prior and iterative sampling algorithm. These approaches often rely on Tweedie’s formula, which relates the diffusion variate $\mathbf{x}_t$ to the posterior mean $\mathbb{E} [\mathbf{x}_0 | \mathbf{x}_t]$, in order to guide the diffusion trajectory with an estimate of the final denoised sample $\mathbf{x}_0$. However, this does not consider information from the measurement $\mathbf{y}$, which must then be integrated downstream. In this work, we propose to estimate the conditional posterior mean $\mathbb{E} [\mathbf{x}_0 | \mathbf{x}_t, \mathbf{y}]$, which can be formulated as the solution to a lightweight, single-parameter maximum likelihood estimation problem. The resulting prediction can be integrated into any standard sampler, resulting in a fast and memory-efficient inverse solver. Our optimizer is amenable to a noise-aware likelihood-based stopping criteria that is robust to measurement noise in $\mathbf{y}$. We demonstrate comparable or improved performance against a wide selection of contemporary inverse solvers across multiple datasets and tasks.

[432] Scalable Varied-Density Clustering via Graph Propagation

Ninh Pham, Yingtao Zheng, Hugo Phibbs

Main category: cs.LG

TL;DR: A novel density-based clustering method using label propagation in adaptive neighborhood graphs, scalable to high-dimensional data with millions of points.

DetailsMotivation: To address the challenge of varied-density clustering in high-dimensional data by leveraging graph connectivity and efficient propagation techniques.

Method: Frames clustering as label propagation in density-adaptive neighborhood graphs, using a density-aware propagation algorithm and random projections for scalability.

Result: Reduces computational cost significantly, scales to millions of points in minutes, and maintains competitive accuracy.

Conclusion: The method effectively combines density-based clustering with graph techniques for scalable, high-quality results.

Abstract: We propose a novel perspective on varied-density clustering for high-dimensional data by framing it as a label propagation process in neighborhood graphs that adapt to local density variations. Our method formally connects density-based clustering with graph connectivity, enabling the use of efficient graph propagation techniques developed in network science. To ensure scalability, we introduce a density-aware neighborhood propagation algorithm and leverage advanced random projection methods to construct approximate neighborhood graphs. Our approach significantly reduces computational cost while preserving clustering quality. Empirically, it scales to datasets with millions of points in minutes and achieves competitive accuracy compared to existing baselines.

[433] On the Fast Adaptation of Delayed Clients in Decentralized Federated Learning: A Centroid-Aligned Distillation Approach

Jiahui Bai, Hai Dong, A. K. Qin

Main category: cs.LG

TL;DR: DFedCAD is a framework for decentralized federated learning that reduces communication costs and improves adaptation for delayed clients using Centroid-Aligned Distillation and Weighted Cluster Pruning.

DetailsMotivation: Address slow adaptation of late-joining clients and high communication costs in asynchronous decentralized federated learning.

Method: Uses Weighted Cluster Pruning (WCP) to compress models into centroids and a structural distance metric with differentiable k-means distillation for knowledge transfer.

Result: Achieves state-of-the-art performance on CIFAR-10, CIFAR-100, and Tiny-ImageNet, reducing communication overhead by over 86%.

Conclusion: DFedCAD offers a scalable and efficient solution for decentralized learning in dynamic environments.

Abstract: Decentralized Federated Learning (DFL) struggles with the slow adaptation of late-joining delayed clients and high communication costs in asynchronous environments. These limitations significantly hinder overall performance. To address this, we propose DFedCAD, a novel framework for rapid adaptation via Centroid-Aligned Distillation. DFedCAD first employs Weighted Cluster Pruning (WCP) to compress models into representative centroids, drastically reducing communication overhead. It then enables delayed clients to intelligently weigh and align with peer knowledge using a novel structural distance metric and a differentiable k-means distillation module, facilitating efficient end-to-end knowledge transfer. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that DFedCAD consistently achieves state-of-the-art performance, attaining the highest accuracy across all evaluated settings while reducing communication overhead by over 86%. Our framework provides a scalable and practical solution for efficient decentralized learning in dynamic, real-world scenarios.

[434] Where and How to Enhance: Discovering Bit-Width Contribution for Mixed Precision Quantization

Haidong Kang, Lianbo Ma, Guo Yu, Shangce Gao

Main category: cs.LG

TL;DR: The paper introduces Shapley-based MPQ (SMPQ) to improve bit-width selection in mixed precision quantization by measuring direct contributions, outperforming gradient-based methods.

DetailsMotivation: Existing gradient-based MPQ methods assume quantization parameter values reflect bit-width contributions, but this may not hold. The paper aims to address this gap.

Method: Proposes SMPQ, using Shapley values to measure bit-width contributions directly, with Monte Carlo sampling for efficient computation.

Result: SMPQ achieves state-of-the-art performance on benchmarks compared to gradient-based methods.

Conclusion: SMPQ provides a more accurate and efficient approach for bit-width selection in mixed precision quantization.

Abstract: Mixed precision quantization (MPQ) is an effective quantization approach to achieve accuracy-complexity trade-off of neural network, through assigning different bit-widths to network activations and weights in each layer. The typical way of existing MPQ methods is to optimize quantization policies (i.e., bit-width allocation) in a gradient descent manner, termed as Differentiable (DMPQ). At the end of the search, the bit-width associated to the quantization parameters which has the largest value will be selected to form the final mixed precision quantization policy, with the implicit assumption that the values of quantization parameters reflect the operation contribution to the accuracy improvement. While much has been discussed about the MPQ improvement, the bit-width selection process has received little attention. We study this problem and argue that the magnitude of quantization parameters does not necessarily reflect the actual contribution of the bit-width to the task performance. Then, we propose a Shapley-based MPQ (SMPQ) method, which measures the bit-width operation direct contribution on the MPQ task. To reduce computation cost, a Monte Carlo sampling-based approximation strategy is proposed for Shapley computation. Extensive experiments on mainstream benchmarks demonstrate that our SMPQ consistently achieves state-of-the-art performance than gradient-based competitors.

[435] Urban In-Context Learning: Bridging Pretraining and Inference through Masked Diffusion for Urban Profiling

Ruixing Zhang, Bo Wang, Tongyu Zhu, Leilei Sun, Weifeng Lv

Main category: cs.LG

TL;DR: Proposes Urban In-Context Learning, a one-stage framework for urban profiling using masked autoencoding and diffusion modeling, outperforming traditional two-stage methods.

DetailsMotivation: Existing two-stage urban profiling methods are inefficient. Inspired by GPT, the paper aims to unify pretraining and inference for urban data, which differs structurally from language.

Method: Introduces Urban In-Context Learning with Urban Masked Diffusion Transformer for distributional predictions and Urban Representation Alignment Mechanism for stable training.

Result: Outperforms state-of-the-art two-stage methods on three indicators across two cities, with diffusion modeling proving particularly effective.

Conclusion: The one-stage framework is more efficient and effective, validated by ablation and case studies.

Abstract: Urban profiling aims to predict urban profiles in unknown regions and plays a critical role in economic and social censuses. Existing approaches typically follow a two-stage paradigm: first, learning representations of urban areas; second, performing downstream prediction via linear probing, which originates from the BERT era. Inspired by the development of GPT style models, recent studies have shown that novel self-supervised pretraining schemes can endow models with direct applicability to downstream tasks, thereby eliminating the need for task-specific fine-tuning. This is largely because GPT unifies the form of pretraining and inference through next-token prediction. However, urban data exhibit structural characteristics that differ fundamentally from language, making it challenging to design a one-stage model that unifies both pretraining and inference. In this work, we propose Urban In-Context Learning, a framework that unifies pretraining and inference via a masked autoencoding process over urban regions. To capture the distribution of urban profiles, we introduce the Urban Masked Diffusion Transformer, which enables each region’ s prediction to be represented as a distribution rather than a deterministic value. Furthermore, to stabilize diffusion training, we propose the Urban Representation Alignment Mechanism, which regularizes the model’s intermediate features by aligning them with those from classical urban profiling methods. Extensive experiments on three indicators across two cities demonstrate that our one-stage method consistently outperforms state-of-the-art two-stage approaches. Ablation studies and case studies further validate the effectiveness of each proposed module, particularly the use of diffusion modeling.

[436] A Novel Multimodal Framework for Early Detection of Alzheimers Disease Using Deep Learning

Tatwadarshi P Nagarhalli, Sanket Patil, Vishal Pande, Uday Aswalekar, Prafulla Patil

Main category: cs.LG

TL;DR: A novel multimodal framework combining MRI, cognitive tests, and biomarkers improves early Alzheimer’s detection using CNNs and LSTMs, outperforming traditional single-modality methods.

DetailsMotivation: Early diagnosis of Alzheimer's Disease is challenging with traditional single-modality methods, leading to delayed treatment and worse outcomes.

Method: The framework integrates MRI (analyzed with CNNs), cognitive assessments, and biomarkers (processed with LSTMs), using weighted averaging for robust results.

Result: The multimodal approach enhances diagnostic accuracy and reliability, enabling early detection even with incomplete data.

Conclusion: This framework revolutionizes early AD detection, facilitating timely intervention and potentially altering disease progression.

Abstract: Alzheimers Disease (AD) is a progressive neurodegenerative disorder that poses significant challenges in its early diagnosis, often leading to delayed treatment and poorer outcomes for patients. Traditional diagnostic methods, typically reliant on single data modalities, fall short of capturing the multifaceted nature of the disease. In this paper, we propose a novel multimodal framework for the early detection of AD that integrates data from three primary sources: MRI imaging, cognitive assessments, and biomarkers. This framework employs Convolutional Neural Networks (CNN) for analyzing MRI images and Long Short-Term Memory (LSTM) networks for processing cognitive and biomarker data. The system enhances diagnostic accuracy and reliability by aggregating results from these distinct modalities using advanced techniques like weighted averaging, even in incomplete data. The multimodal approach not only improves the robustness of the detection process but also enables the identification of AD at its earliest stages, offering a significant advantage over conventional methods. The integration of biomarkers and cognitive tests is particularly crucial, as these can detect Alzheimer’s long before the onset of clinical symptoms, thereby facilitating earlier intervention and potentially altering the course of the disease. This research demonstrates that the proposed framework has the potential to revolutionize the early detection of AD, paving the way for more timely and effective treatments

[437] VRPO: Rethinking Value Modeling for Robust RL Training under Noisy Supervision

Dingwei Zhu, Shihan Dou, Zhiheng Xi, Senjie Jin, Guoqiang Zhang, Jiazheng Zhang, Junjie Ye, Mingxu Chai, Enyu Zhou, Ming Zhang, Caishuang Huang, Yunke Zhang, Yuran Wang, Tao Gui

Main category: cs.LG

TL;DR: VRPO, a value-centric framework, improves RLHF robustness by enhancing the value model’s noise-filtering ability through auxiliary losses and variational information bottleneck, outperforming PPO and GRPO in noisy settings.

DetailsMotivation: Noisy or imperfect reward supervision in RLHF undermines policy stability and generalization, often overlooked in prior work focusing on reward denoising.

Method: VRPO introduces an auxiliary loss guided by entropy and perplexity from a frozen language model and a variational information bottleneck to enhance the value model’s noise resilience.

Result: VRPO outperforms PPO and GRPO baselines in math reasoning, science QA, and multi-turn dialogue tasks under noisy rewards.

Conclusion: A strong value model is crucial for RLHF robustness, and VRPO provides a principled approach for policy optimization in noisy environments.

Abstract: Reinforcement Learning from Human Feedback (RLHF) often suffers from noisy or imperfect reward supervision in real-world settings, which undermines policy stability and generalization. Such noise may cause models to lose attention on key words during advantage estimation. While prior work focuses on reward denoising or filtering poor data, it often overlooks the critical role of the value model in policy optimization. In this work, we show that a strong value model is essential for mitigating noise by absorbing unstable signals and enabling more reliable advantage estimation. We propose VRPO, a value-centric framework for robust PPO training under noisy supervision. VRPO combines two core designs: (1) an auxiliary loss guided by entropy and perplexity from a frozen language model, and (2) a variational information bottleneck. These mechanisms enhance the value model’s ability to filter out noise and capture key words from the context during advantage estimation, transforming it from a passive predictor into an active regulator of noise. Experiments on math reasoning, science QA, and multi-turn dialogue, under both rule-based and model-based noisy rewards, show that VRPO consistently outperforms PPO and GRPO baselines. Our findings underscore the often-overlooked importance of the value model in RLHF and offer a principled and practical approach to robust policy optimization in noisy real-world environments.

[438] Achieving Limited Adaptivity for Multinomial Logistic Bandits

Sukruta Prakash Midigeshi, Tanmay Goyal, Gaurav Sinha

Main category: cs.LG

TL;DR: Error: OutputParser failed

DetailsMotivation: Error: OutputParser failed

Method: Error: OutputParser failed

Result: Error: OutputParser failed

Conclusion: Error: OutputParser failed

Abstract: Multinomial Logistic Bandits have recently attracted much attention due to their ability to model problems with multiple outcomes. In this setting, each decision is associated with many possible outcomes, modeled using a multinomial logit function. Several recent works on multinomial logistic bandits have simultaneously achieved optimal regret and computational efficiency. However, motivated by real-world challenges and practicality, there is a need to develop algorithms with limited adaptivity, wherein we are allowed only $M$ policy updates. To address these challenges, we present two algorithms, B-MNL-CB and RS-MNL, that operate in the batched and rarely-switching paradigms, respectively. The batched setting involves choosing the $M$ policy update rounds at the start of the algorithm, while the rarely-switching setting can choose these $M$ policy update rounds in an adaptive fashion. Our first algorithm, B-MNL-CB extends the notion of distributional optimal designs to the multinomial setting and achieves $\tilde{O}(\sqrt{T})$ regret assuming the contexts are generated stochastically when presented with $\Omega(\log \log T)$ update rounds. Our second algorithm, RS-MNL works with adversarially generated contexts and can achieve $\tilde{O}(\sqrt{T})$ regret with $\tilde{O}(\log T)$ policy updates. Further, we conducted experiments that demonstrate that our algorithms (with a fixed number of policy updates) are extremely competitive (and often better) than several state-of-the-art baselines (which update their policy every round), showcasing the applicability of our algorithms in various practical scenarios.

[439] HiTeC: Hierarchical Contrastive Learning on Text-Attributed Hypergraph with Semantic-Aware Augmentation

Mengting Pan, Fan Li, Xiaoyang Wang, Wenjie Zhang, Xuemin Lin

Main category: cs.LG

TL;DR: HiTeC introduces a two-stage hierarchical contrastive learning framework for text-attributed hypergraphs, addressing limitations of prior methods with semantic-aware augmentation and multi-scale contrastive loss.

DetailsMotivation: Prior contrastive learning methods for hypergraphs overlook textual information and suffer from suboptimal representations, noise, and limited long-range dependency capture.

Method: HiTeC uses a two-stage approach: (1) pre-training a structure-aware text encoder, and (2) employing semantic-aware augmentation and multi-scale contrastive loss.

Result: HiTeC outperforms existing methods, achieving scalable and effective self-supervised learning on text-attributed hypergraphs.

Conclusion: HiTeC addresses key limitations of prior work, offering a scalable and high-quality solution for self-supervised learning on text-attributed hypergraphs.

Abstract: Contrastive learning (CL) has become a dominant paradigm for self-supervised hypergraph learning, enabling effective training without costly labels. However, node entities in real-world hypergraphs are often associated with rich textual information, which is overlooked in prior works. Directly applying existing CL-based methods to such text-attributed hypergraphs (TAHGs) leads to three key limitations: (1) The common use of graph-agnostic text encoders overlooks the correlations between textual content and hypergraph topology, resulting in suboptimal representations. (2) Their reliance on random data augmentations introduces noise and weakens the contrastive objective. (3) The primary focus on node- and hyperedge-level contrastive signals limits the ability to capture long-range dependencies, which is essential for expressive representation learning. Although HyperBERT pioneers CL on TAHGs, its co-training paradigm suffers from poor scalability. To fill the research gap, we introduce HiTeC, a two-stage hierarchical contrastive learning framework with semantic-aware augmentation for scalable and effective self-supervised learning on TAHGs. In the first stage, we pre-train the text encoder with a structure-aware contrastive objective to overcome the graph-agnostic nature of conventional methods. In the second stage, we introduce two semantic-aware augmentation strategies, including prompt-enhanced text augmentation and semantic-aware hyperedge drop, to facilitate informative view generation. Furthermore, we propose a multi-scale contrastive loss that extends existing objectives with an $s$-walk-based subgraph-level contrast to better capture long-range dependencies. By decoupling text encoder pretraining from hypergraph contrastive learning, this two-stage design enhances scalability without compromising representation quality. Extensive experiments confirm the effectiveness of HiTeC.

[440] Accelerating SGDM via Learning Rate and Batch Size Schedules: A Lyapunov-Based Analysis

Yuichi Kondo, Hideaki Iiduka

Main category: cs.LG

TL;DR: The paper introduces a novel Lyapunov function to analyze SGDM’s convergence under dynamic learning rate and batch size schedules, revealing a hierarchy in convergence behavior and validating faster decay rates empirically.

DetailsMotivation: To simplify and unify the convergence analysis of SGDM under dynamic schedules, addressing gaps in existing theoretical frameworks.

Method: A novel Lyapunov function is introduced to analyze SGDM’s convergence across three dynamic scheduling strategies. Theoretical and empirical evaluations are conducted.

Result: Dynamic schedules (ii) and (iii) ensure convergence, with (iii) achieving faster decay. Empirical results confirm dynamic SGDM outperforms fixed baselines.

Conclusion: The study provides a unified theoretical foundation and practical guidance for efficient deep learning training with SGDM.

Abstract: We analyze the convergence behavior of stochastic gradient descent with momentum (SGDM) under dynamic learning rate and batch size schedules by introducing a novel Lyapunov function. This Lyapunov function has a simpler structure compared with existing ones, facilitating the challenging convergence analysis of SGDM and a unified analysis across various dynamic schedules. Specifically, we extend the theoretical framework to cover three practical scheduling strategies commonly used in deep learning: (i) constant batch size with a decaying learning rate, (ii) increasing batch size with a decaying learning rate, and (iii) increasing batch size with an increasing learning rate. Our theoretical results reveal a clear hierarchy in convergence behavior: while (i) does not guarantee convergence of the expected gradient norm, both (ii) and (iii) do. Moreover, (iii) achieves a provably faster decay rate than (i) and (ii), demonstrating theoretical acceleration even in the presence of momentum. Empirical results validate our theory, showing that dynamically scheduled SGDM significantly outperforms fixed-hyperparameter baselines in convergence speed. We also evaluated a warm-up schedule in experiments, which empirically outperformed all other strategies in convergence behavior. These findings provide a unified theoretical foundation and practical guidance for designing efficient and stable training procedures in modern deep learning.

[441] Pseudo-label Induced Subspace Representation Learning for Robust Out-of-Distribution Detection

Tarhib Al Azad, Faizul Rakib Sayem, Shahana Ibrahim

Main category: cs.LG

TL;DR: Proposes a novel OOD detection framework using pseudo-label-induced subspace representation and a learning criterion to enhance ID-OOD separability.

DetailsMotivation: Addresses limitations of existing feature-based OOD detection methods that rely on restrictive assumptions about the feature space.

Method: Uses pseudo-label-induced subspace representation and a combined learning criterion (cross-entropy loss and subspace distance regularization).

Result: Extensive experiments confirm the framework’s effectiveness.

Conclusion: The proposed framework improves OOD detection under relaxed assumptions.

Abstract: Out-of-distribution (OOD) detection lies at the heart of robust artificial intelligence (AI), aiming to identify samples from novel distributions beyond the training set. Recent approaches have exploited feature representations as distinguishing signatures for OOD detection. However, most existing methods rely on restrictive assumptions on the feature space that limit the separability between in-distribution (ID) and OOD samples. In this work, we propose a novel OOD detection framework based on a pseudo-label-induced subspace representation, that works under more relaxed and natural assumptions compared to existing feature-based techniques. In addition, we introduce a simple yet effective learning criterion that integrates a cross-entropy-based ID classification loss with a subspace distance-based regularization loss to enhance ID-OOD separability. Extensive experiments validate the effectiveness of our framework.

[442] GEDAN: Learning the Edit Costs for Graph Edit Distance

Francesco Leonardi, Markus Orsi, Jean-Louis Reymond, Kaspar Riesen

Main category: cs.LG

TL;DR: A novel Graph Neural Network framework approximates Graph Edit Distance (GED) with supervised and unsupervised training, improving adaptability and interpretability.

DetailsMotivation: Overcome the NP-hard computation of GED and unrealistic unit-cost assumptions in existing NN-based methods.

Method: Uses a GNN framework with a Generalized Additive Model for context-aware edit costs, enabling unsupervised training via gradient-only self-organization.

Result: Achieves comparable performance to state-of-the-art methods while enhancing adaptability and interpretability.

Conclusion: The method is valuable for domains like molecular analysis due to its flexible and interpretable cost learning.

Abstract: Graph Edit Distance (GED) is defined as the minimum cost transformation of one graph into another and is a widely adopted metric for measuring the dissimilarity between graphs. The major problem of GED is that its computation is NP-hard, which has in turn led to the development of various approximation methods, including approaches based on neural networks (NN). Most of these NN-based models simplify the problem of GED by assuming unit-cost edit operations, a rather unrealistic constraint in real-world applications. In this work, we present a novel Graph Neural Network framework that approximates GED using both supervised and unsupervised training. In the unsupervised setting, it employs a gradient-only self-organizing mechanism that enables optimization without ground-truth distances. Moreover, a core component of our architecture is the integration of a Generalized Additive Model, which allows the flexible and interpretable learning of context-aware edit costs. Experimental results show that the proposed method achieves similar results as state-of-the-art reference methods, yet significantly improves both adaptability and interpretability. That is, the learned cost function offers insights into complex graph structures, making it particularly valuable in domains such as molecular analysis and structural pattern discovery.

[443] RegMean++: Enhancing Effectiveness and Generalization of Regression Mean for Model Merging

The-Hai Nguyen, Dang Huu-Tien, Takeshi Suzuki, Le-Minh Nguyen

Main category: cs.LG

TL;DR: RegMean++ improves RegMean by incorporating intra- and cross-layer dependencies, outperforming it in various tasks and achieving competitive results with advanced methods.

DetailsMotivation: RegMean merges linear layers independently, ignoring feature propagation and dependencies, limiting its effectiveness.

Method: RegMean++ extends RegMean by explicitly including intra- and cross-layer dependencies in the merging objective.

Result: RegMean++ consistently outperforms RegMean in ID, OOD generalization, sequential merging, large-scale tasks, and robustness.

Conclusion: RegMean++ is a superior alternative to RegMean, offering better performance and capturing merge model behaviors more effectively.

Abstract: Regression Mean (RegMean), an approach that formulates model merging as a linear regression problem, aims to find the optimal weights for each linear layer in the merge model by minimizing the discrepancy in predictions between the merge and candidate models. RegMean provides a precise closed-form solution for the merging problem; therefore, it offers explainability and computational efficiency. However, RegMean merges each linear layer independently, overlooking how the features and information in the earlier layers propagate through the layers and influence the final prediction in the merge model. In this paper, we introduce RegMean++, a simple yet effective alternative to RegMean, that explicitly incorporates both intra- and cross-layer dependencies between merge models’ layers into RegMean’s objective. By accounting for these dependencies, RegMean++ better captures the behaviors of the merge model. Extensive experiments demonstrate that RegMean++ consistently outperforms RegMean across diverse settings, including in-domain (ID) and out-of-domain (OOD) generalization, sequential merging, large-scale tasks, and robustness under several types of distribution shifts. Furthermore, RegMean++ achieves competitive or state-of-the-art performance compared to various recent advanced model merging methods. Our code is available at https://github.com/nthehai01/RegMean-plusplus.

[444] Frontier: Simulating the Next Generation of LLM Inference Systems

Yicheng Feng, Xin Tan, Kin Hang Sew, Yimin Jiang, Yibo Zhu, Hong Xu

Main category: cs.LG

TL;DR: Frontier is a high-fidelity simulator designed for complex LLM inference, supporting MoE models and disaggregated architectures, with refined operator models for accuracy.

DetailsMotivation: Existing simulators fail to capture the dynamics of emerging LLM paradigms like MoE and disaggregated architectures, necessitating a new tool.

Method: Frontier provides a unified framework for modeling co-located and disaggregated systems, with native support for MoE inference and expert parallelism. It simulates complex workflows like cross-cluster expert routing and advanced pipelining.

Result: Frontier enables high-fidelity simulation of intricate LLM inference workflows, improving accuracy with refined operator models.

Conclusion: Frontier empowers the community to design and optimize scalable LLM inference systems for future needs.

Abstract: Large Language Model (LLM) inference is growing increasingly complex with the rise of Mixture-of-Experts (MoE) models and disaggregated architectures that decouple components like prefill/decode (PD) or attention/FFN (AF) for heterogeneous scaling. Existing simulators, architected for co-located, dense models, are unable to capture the intricate system dynamics of these emerging paradigms. We present Frontier, a high-fidelity simulator designed from the ground up for this new landscape. Frontier introduces a unified framework to model both co-located and disaggregated systems, providing native support for MoE inference with expert parallelism (EP). It enables the simulation of complex workflows like cross-cluster expert routing and advanced pipelining strategies for latency hiding. To ensure fidelity and usability, Frontier incorporates refined operator models for improved accuracy. Frontier empowers the community to design and optimize the future of LLM inference at scale.

[445] Estimating Worst-Case Frontier Risks of Open-Weight LLMs

Eric Wallace, Olivia Watkins, Miles Wang, Kai Chen, Chris Koch

Main category: cs.LG

TL;DR: The paper investigates the risks of releasing GPT-OSS by testing its worst-case capabilities in biology and cybersecurity through malicious fine-tuning (MFT). Results show it underperforms frontier closed-weight models and marginally impacts open-weight models, supporting its release.

DetailsMotivation: To assess the potential harm of releasing GPT-OSS by evaluating its worst-case capabilities in high-risk domains (biology and cybersecurity).

Method: Malicious fine-tuning (MFT) is used to maximize GPT-OSS’s capabilities in biology (threat creation) and cybersecurity (CTF challenges). Performance is compared against open- and closed-weight LLMs.

Result: MFT GPT-OSS underperforms frontier closed-weight models (e.g., OpenAI o3) and marginally increases biological capabilities in open-weight models, but does not significantly advance the frontier.

Conclusion: The findings supported the decision to release GPT-OSS, and the MFT approach is proposed as a tool for estimating harm in future open-weight model releases.

Abstract: In this paper, we study the worst-case frontier risks of releasing gpt-oss. We introduce malicious fine-tuning (MFT), where we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as capable as possible in two domains: biology and cybersecurity. To maximize biological risk (biorisk), we curate tasks related to threat creation and train gpt-oss in an RL environment with web browsing. To maximize cybersecurity risk, we train gpt-oss in an agentic coding environment to solve capture-the-flag (CTF) challenges. We compare these MFT models against open- and closed-weight LLMs on frontier risk evaluations. Compared to frontier closed-weight models, MFT gpt-oss underperforms OpenAI o3, a model that is below Preparedness High capability level for biorisk and cybersecurity. Compared to open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier. Taken together, these results contributed to our decision to release the model, and we hope that our MFT approach can serve as useful guidance for estimating harm from future open-weight releases.

[446] Unveiling Location-Specific Price Drivers: A Two-Stage Cluster Analysis for Interpretable House Price Predictions

Paul Gümmer, Julian Rosenberger, Mathias Kraus, Patrick Zschech, Nico Hambauer

Main category: cs.LG

TL;DR: A two-stage clustering approach improves house price valuation by balancing interpretability and performance, showing significant error reduction for GAM and LR models.

DetailsMotivation: Addressing the lack of interpretability in black-box models and the oversimplification of linear regression for localized market variations.

Method: Two-stage clustering: first by location-based features, then modeling each cluster with LR or GAM.

Result: 36% improvement for GAM and 58% for LR in mean absolute error; cluster-specific patterns identified.

Conclusion: Cluster-specific modeling enhances interpretability and practical value for property valuation.

Abstract: House price valuation remains challenging due to localized market variations. Existing approaches often rely on black-box machine learning models, which lack interpretability, or simplistic methods like linear regression (LR), which fail to capture market heterogeneity. To address this, we propose a machine learning approach that applies two-stage clustering, first grouping properties based on minimal location-based features before incorporating additional features. Each cluster is then modeled using either LR or a generalized additive model (GAM), balancing predictive performance with interpretability. Constructing and evaluating our models on 43,309 German house property listings from 2023, we achieve a 36% improvement for the GAM and 58% for LR in mean absolute error compared to models without clustering. Additionally, graphical analyses unveil pattern shifts between clusters. These findings emphasize the importance of cluster-specific insights, enhancing interpretability and offering practical value for buyers, sellers, and real estate analysts seeking more reliable property valuations.

[447] Rethinking Selectivity in State Space Models: A Minimal Predictive Sufficiency Approach

Yiyi Wang, Jian’an Zhang, Hongyi Duan, Haoyang Liu, Qingyang Li

Main category: cs.LG

TL;DR: The paper introduces the Principle of Predictive Sufficiency to address the heuristic design of selective mechanisms in State Space Models (SSMs), proposing the MPS-SSM framework for improved performance and robustness.

DetailsMotivation: To bridge the theoretical gap in SSMs by deriving selective mechanisms from first principles, ensuring optimality and robustness against spurious correlations.

Method: Introduces the Principle of Predictive Sufficiency and develops the MPS-SSM framework, optimizing an objective function to compress historical information without losing predictive power.

Result: MPS-SSM achieves state-of-the-art performance, outperforming existing models in long-term forecasting and noisy scenarios, with superior robustness.

Conclusion: The MPS principle not only enhances SSMs but can also be extended as a general regularization framework for other architectures, demonstrating broad applicability.

Abstract: State Space Models (SSMs), particularly recent selective variants like Mamba, have emerged as a leading architecture for sequence modeling, challenging the dominance of Transformers. However, the success of these state-of-the-art models largely relies on heuristically designed selective mechanisms, which lack a rigorous first-principle derivation. This theoretical gap raises questions about their optimality and robustness against spurious correlations. To address this, we introduce the Principle of Predictive Sufficiency, a novel information-theoretic criterion stipulating that an ideal hidden state should be a minimal sufficient statistic of the past for predicting the future. Based on this principle, we propose the Minimal Predictive Sufficiency State Space Model (MPS-SSM), a new framework where the selective mechanism is guided by optimizing an objective function derived from our principle. This approach encourages the model to maximally compress historical information without losing predictive power, thereby learning to ignore non-causal noise and spurious patterns. Extensive experiments on a wide range of benchmark datasets demonstrate that MPS-SSM not only achieves state-of-the-art performance, significantly outperforming existing models in long-term forecasting and noisy scenarios, but also exhibits superior robustness. Furthermore, we show that the MPS principle can be extended as a general regularization framework to enhance other popular architectures, highlighting its broad potential.

[448] CoTox: Chain-of-Thought-Based Molecular Toxicity Reasoning and Prediction

Jueon Park, Yein Park, Minju Song, Soyon Park, Donghyeon Lee, Seungheun Baek, Jaewoo Kang

Main category: cs.LG

TL;DR: CoTox integrates LLMs with chain-of-thought reasoning for interpretable multi-toxicity prediction, outperforming traditional models and enhancing drug safety assessment.

DetailsMotivation: Addressing limitations of current toxicity prediction models, such as reliance on annotated data, lack of interpretability, and inability to capture organ-specific toxicities driven by complex biological mechanisms.

Method: Proposes CoTox, a framework combining chemical structure data, biological pathways, and gene ontology terms with LLM-based step-by-step reasoning. Uses GPT-4o and IUPAC names for chemical structures.

Result: CoTox outperforms traditional machine learning and deep learning models, improves reasoning with IUPAC names, and aligns predictions with physiological responses in case studies.

Conclusion: LLM-based frameworks like CoTox enhance interpretability and support early-stage drug safety assessment, with practical utility demonstrated in drug development simulations.

Abstract: Drug toxicity remains a major challenge in pharmaceutical development. Recent machine learning models have improved in silico toxicity prediction, but their reliance on annotated data and lack of interpretability limit their applicability. This limits their ability to capture organ-specific toxicities driven by complex biological mechanisms. Large language models (LLMs) offer a promising alternative through step-by-step reasoning and integration of textual data, yet prior approaches lack biological context and transparent rationale. To address this issue, we propose CoTox, a novel framework that integrates LLM with chain-of-thought (CoT) reasoning for multi-toxicity prediction. CoTox combines chemical structure data, biological pathways, and gene ontology (GO) terms to generate interpretable toxicity predictions through step-by-step reasoning. Using GPT-4o, we show that CoTox outperforms both traditional machine learning and deep learning model. We further examine its performance across various LLMs to identify where CoTox is most effective. Additionally, we find that representing chemical structures with IUPAC names, which are easier for LLMs to understand than SMILES, enhances the model’s reasoning ability and improves predictive performance. To demonstrate its practical utility in drug development, we simulate the treatment of relevant cell types with drug and incorporated the resulting biological context into the CoTox framework. This approach allow CoTox to generate toxicity predictions aligned with physiological responses, as shown in case study. This result highlights the potential of LLM-based frameworks to improve interpretability and support early-stage drug safety assessment. The code and prompt used in this work are available at https://github.com/dmis-lab/CoTox.

[449] Overcoming Algorithm Aversion with Transparency: Can Transparent Predictions Change User Behavior?

Lasse Bohlen, Sven Kruschel, Julian Rosenberger, Patrick Zschech, Mathias Kraus

Main category: cs.LG

TL;DR: The study examines whether interpretable ML models reduce algorithm aversion compared to adjustable predictions, finding adjustability effective but transparency less impactful.

DetailsMotivation: To understand if interpretable ML models can reduce algorithm aversion or replace the need for adjustable predictions, given prior work lacked insights into model reasoning.

Method: Conceptual replication of a study on adjustable predictions, extended with an interpretable ML model. A pre-registered user study with 280 participants tested transparency and adjustability effects.

Result: Adjustability reduced algorithm aversion, but transparency’s impact was smaller and insignificant. Effects of transparency and adjustability were more independent than anticipated.

Conclusion: Adjustability remains effective in reducing aversion, while interpretability’s role is limited. The two factors operate independently in mitigating algorithm aversion.

Abstract: Previous work has shown that allowing users to adjust a machine learning (ML) model’s predictions can reduce aversion to imperfect algorithmic decisions. However, these results were obtained in situations where users had no information about the model’s reasoning. Thus, it remains unclear whether interpretable ML models could further reduce algorithm aversion or even render adjustability obsolete. In this paper, we conceptually replicate a well-known study that examines the effect of adjustable predictions on algorithm aversion and extend it by introducing an interpretable ML model that visually reveals its decision logic. Through a pre-registered user study with 280 participants, we investigate how transparency interacts with adjustability in reducing aversion to algorithmic decision-making. Our results replicate the adjustability effect, showing that allowing users to modify algorithmic predictions mitigates aversion. Transparency’s impact appears smaller than expected and was not significant for our sample. Furthermore, the effects of transparency and adjustability appear to be more independent than expected.

[450] Quantum Spectral Reasoning: A Non-Neural Architecture for Interpretable Machine Learning

Andrew Kiruluta

Main category: cs.LG

TL;DR: A novel ML architecture using quantum spectral methods (Pade approximants, Lanczos algorithm) for interpretable signal analysis and symbolic reasoning, avoiding backpropagation and black-box models.

DetailsMotivation: To create a transparent, physically grounded alternative to deep learning by integrating spectral methods and symbolic AI.

Method: Transforms time-domain signals into sparse spectral representations via rational approximation, maps to symbolic predicates for logical inference.

Result: Achieves competitive accuracy in anomaly detection, symbolic classification, and hybrid reasoning while maintaining interpretability.

Conclusion: Proposes a promising direction for physically-informed, reasoning-capable ML.

Abstract: We propose a novel machine learning architecture that departs from conventional neural network paradigms by leveraging quantum spectral methods, specifically Pade approximants and the Lanczos algorithm, for interpretable signal analysis and symbolic reasoning. The core innovation of our approach lies in its ability to transform raw time-domain signals into sparse, physically meaningful spectral representations without the use of backpropagation, high-dimensional embeddings, or data-intensive black-box models. Through rational spectral approximation, the system extracts resonant structures that are then mapped into symbolic predicates via a kernel projection function, enabling logical inference through a rule-based reasoning engine. This architecture bridges mathematical physics, sparse approximation theory, and symbolic artificial intelligence, offering a transparent and physically grounded alternative to deep learning models. We develop the full mathematical formalism underlying each stage of the pipeline, provide a modular algorithmic implementation, and demonstrate the system’s effectiveness through comparative evaluations on time-series anomaly detection, symbolic classification, and hybrid reasoning tasks. Our results show that this spectral-symbolic architecture achieves competitive accuracy while maintaining interpretability and data efficiency, suggesting a promising new direction for physically-informed, reasoning-capable machine learning.

[451] Adaptive Sparse Softmax: An Effective and Efficient Softmax Variant

Qi Lv, Lei Geng, Ziqiang Cao, Min Cao, Sujian Li, Wenjie Li, Guohong Fu

Main category: cs.LG

TL;DR: AS-Softmax improves training efficiency and performance by focusing on strong opponents and discarding irrelevant classes, with adaptive gradient accumulation for speedup.

DetailsMotivation: Softmax's unreachable target score leads to overfitting and inefficiency, as it forces continuous learning even for correctly classified samples.

Method: Proposes AS-Softmax, which discards irrelevant classes during training and uses adaptive gradient accumulation to speed up learning.

Result: Outperforms softmax and variants across tasks, with 1.2x training speedup and better validation correlation.

Conclusion: AS-Softmax is a more efficient and effective alternative to softmax for classification tasks.

Abstract: Softmax with the cross entropy loss is the standard configuration for current neural classification models. The gold score for a target class is supposed to be 1, but it is never reachable under the softmax schema. Such a problem makes the training process continue forever and leads to overfitting. Moreover, the “target-approach-1” training goal forces the model to continuously learn all samples, leading to a waste of time in handling some samples which have already been classified correctly with high confidence, while the test goal simply requires the target class of each sample to hold the maximum score. To solve the above weaknesses, we propose the Adaptive Sparse softmax (AS-Softmax) which designs a reasonable and test-matching transformation on top of softmax. For more purposeful learning, we discard the classes with far smaller scores compared with the actual class during training. Then the model could focus on learning to distinguish the target class from its strong opponents, which is also the great challenge in test. In addition, since the training losses of easy samples will gradually drop to 0 in AS-Softmax, we develop an adaptive gradient accumulation strategy based on the masked sample ratio to speed up training. We verify the proposed AS-Softmax on a variety of text multi-class, text multi-label, text token classification, image classification and audio classification tasks with class sizes ranging from 5 to 5000+. The results show that AS-Softmax consistently outperforms softmax and its variants, and the loss of AS-Softmax is remarkably correlated with classification performance in validation. Furthermore, adaptive gradient accumulation strategy can bring about 1.2x training speedup comparing with the standard softmax while maintaining classification effectiveness.

[452] Scaling DRL for Decision Making: A Survey on Data, Network, and Training Budget Strategies

Yi Ma, Hongyao Tang, Chenjun Xiao, Yaodong Yang, Wei Wei, Jianye Hao, Jiye Liang

Main category: cs.LG

TL;DR: The paper reviews the application of scaling laws in deep reinforcement learning (DRL), analyzing strategies in data, network, and training budget dimensions to enhance performance and efficiency.

DetailsMotivation: Despite the success of scaling laws in deep learning for vision and NLP, their application in DRL remains underexplored. This review aims to bridge this gap.

Method: Systematic analysis of scaling strategies in three areas: data (parallel sampling, data generation), network (architectural enhancements, ensemble methods), and training budget (distributed training, large batch sizes).

Result: The review highlights the synergistic roles of these strategies in advancing DRL, providing a roadmap for future research.

Conclusion: Balancing scalability with computational efficiency is crucial for unlocking DRL’s potential in tasks like robot control and autonomous driving.

Abstract: In recent years, the expansion of neural network models and training data has driven remarkable progress in deep learning, particularly in computer vision and natural language processing. This advancement is underpinned by the concept of Scaling Laws, which demonstrates that scaling model parameters and training data enhances learning performance. While these fields have witnessed breakthroughs, such as the development of large language models like GPT-4 and advanced vision models like Midjourney, the application of scaling laws in deep reinforcement learning (DRL) remains relatively unexplored. Despite its potential to improve performance, the integration of scaling laws into DRL for decision making has not been fully realized. This review addresses this gap by systematically analyzing scaling strategies in three dimensions: data, network, and training budget. In data scaling, we explore methods to optimize data efficiency through parallel sampling and data generation, examining the relationship between data volume and learning outcomes. For network scaling, we investigate architectural enhancements, including monolithic expansions, ensemble and MoE methods, and agent number scaling techniques, which collectively enhance model expressivity while posing unique computational challenges. Lastly, in training budget scaling, we evaluate the impact of distributed training, high replay ratios, large batch sizes, and auxiliary training on training efficiency and convergence. By synthesizing these strategies, this review not only highlights their synergistic roles in advancing DRL for decision making but also provides a roadmap for future research. We emphasize the importance of balancing scalability with computational efficiency and outline promising directions for leveraging scaling to unlock the full potential of DRL in various tasks such as robot control, autonomous driving and LLM training.

[453] Convergence of Deterministic and Stochastic Diffusion-Model Samplers: A Simple Analysis in Wasserstein Distance

Eliot Beyler, Francis Bach

Main category: cs.LG

TL;DR: New convergence guarantees in Wasserstein distance for diffusion-based generative models, covering stochastic and deterministic sampling methods, with improved bounds for Heun and Euler samplers.

DetailsMotivation: To analyze and improve convergence guarantees for diffusion-based generative models, addressing discretization, initialization, and score estimation errors.

Method: Introduces a framework to analyze errors, emphasizes spatial regularity of learned score functions, and uses smoothed Wasserstein distances for sharper bounds.

Result: First Wasserstein convergence bound for Heun sampler and improved results for Euler sampler, highlighting the importance of score function regularity.

Conclusion: Spatial regularity of the score function and controlling score error are crucial for convergence, aligning with denoising score matching principles.

Abstract: We provide new convergence guarantees in Wasserstein distance for diffusion-based generative models, covering both stochastic (DDPM-like) and deterministic (DDIM-like) sampling methods. We introduce a simple framework to analyze discretization, initialization, and score estimation errors. Notably, we derive the first Wasserstein convergence bound for the Heun sampler and improve existing results for the Euler sampler of the probability flow ODE. Our analysis emphasizes the importance of spatial regularity of the learned score function and argues for controlling the score error with respect to the true reverse process, in line with denoising score matching. We also incorporate recent results on smoothed Wasserstein distances to sharpen initialization error bounds.

[454] Understanding the Embedding Models on Hyper-relational Knowledge Graph

Yubo Wang, Shimin Di, Zhili Wang, Haoyang Li, Fei Teng, Hao Xin, Lei Chen

Main category: cs.LG

TL;DR: The paper investigates whether Hyper-relational KGE (HKGE) models’ performance stems from their base KGE model or extension modules. By decomposing HKGs into KGs, it finds some KGE models match HKGE performance, but decomposition alters HKG topology. It proposes FormerGNN, a framework addressing these issues, which outperforms existing HKGE models.

DetailsMotivation: To determine if HKGE models' superiority comes from their base KGE model or extension modules, and to address limitations in current HKGE models like topology alteration and information compression.

Method: Decomposes HKGs into KGs using three methods, evaluates classical KGE models, and proposes FormerGNN with a qualifier integrator and GNN-based encoder to preserve topology and capture dependencies.

Result: Some KGE models perform comparably to HKGE models, but decomposition alters HKG topology. FormerGNN outperforms existing HKGE models.

Conclusion: The study highlights limitations in current HKGE models and proposes FormerGNN as a solution, demonstrating its superior performance.

Abstract: Recently, Hyper-relational Knowledge Graphs (HKGs) have been proposed as an extension of traditional Knowledge Graphs (KGs) to better represent real-world facts with additional qualifiers. As a result, researchers have attempted to adapt classical Knowledge Graph Embedding (KGE) models for HKGs by designing extra qualifier processing modules. However, it remains unclear whether the superior performance of Hyper-relational KGE (HKGE) models arises from their base KGE model or the specially designed extension module. Hence, in this paper, we data-wise convert HKGs to KG format using three decomposition methods and then evaluate the performance of several classical KGE models on HKGs. Our results show that some KGE models achieve performance comparable to that of HKGE models. Upon further analysis, we find that the decomposition methods alter the original HKG topology and fail to fully preserve HKG information. Moreover, we observe that current HKGE models are either insufficient in capturing the graph’s long-range dependency or struggle to integrate main-triple and qualifier information due to the information compression issue. To further justify our findings and offer a potential direction for future HKGE research, we propose the FormerGNN framework. This framework employs a qualifier integrator to preserve the original HKG topology, and a GNN-based graph encoder to capture the graph’s long-range dependencies, followed by an improved approach for integrating main-triple and qualifier information to mitigate compression issues. Our experimental results demonstrate that FormerGNN outperforms existing HKGE models.

[455] Revisiting Deep Information Propagation: Fractal Frontier and Finite-size Effects

Giuseppe Alessio D’Inverno, Zhiyuan Hu, Leo Davy, Michael Unser, Gianluigi Rozza, Jonathan Dong

Main category: cs.LG

TL;DR: The paper explores information propagation in finite-width neural networks, revealing a fractal boundary between ordered and chaotic regimes, independent of input data and optimization. It extends findings to convolutional networks using Fourier-based transforms.

DetailsMotivation: To understand how input correlations evolve in practical, finite-width neural networks, beyond the assumptions of mean-field theory for infinitely wide networks.

Method: Study information propagation in randomly initialized finite-width networks and analyze the boundary between ordered and chaotic regimes. Extend the analysis to convolutional neural networks using Fourier-based structured transforms.

Result: The boundary between ordered and chaotic regimes exhibits a fractal structure, highlighting the complexity of neural network dynamics. The same behavior is observed in convolutional networks.

Conclusion: The findings emphasize the importance of finite network depth in balancing separation and robustness, revealing fundamental dynamics in neural networks.

Abstract: Information propagation characterizes how input correlations evolve across layers in deep neural networks. This framework has been well studied using mean-field theory, which assumes infinitely wide networks. However, these assumptions break down for practical, finite-size networks. In this work, we study information propagation in randomly initialized neural networks with finite width and reveal that the boundary between ordered and chaotic regimes exhibits a fractal structure. This shows the fundamental complexity of neural network dynamics, in a setting that is independent of input data and optimization. To extend this analysis beyond multilayer perceptrons, we leverage recently introduced Fourier-based structured transforms, and show that information propagation in convolutional neural networks also follow the same behavior. Our investigation highlights the importance of finite network depth with respect to the tradeoff between separation and robustness.

[456] On Conformal Machine Unlearning

Yahya Alkhatib, Wee Peng Tay

Main category: cs.LG

TL;DR: The paper introduces a Conformal Prediction-based Machine Unlearning (MU) method with statistical guarantees, avoiding costly retraining. It proposes metrics (ECF and EuCF) to measure unlearning effectiveness and demonstrates success in diverse scenarios.

DetailsMotivation: Addressing the lack of rigorous statistical guarantees and computational inefficiency in existing MU methods, driven by data privacy regulations like GDPR and CCPA.

Method: Defines MU using Conformal Prediction (CP), introduces conformal criteria, and proposes metrics (ECF and EuCF). Presents a practical unlearning method optimizing these metrics.

Result: Effective removal of targeted data across various forgetting scenarios, datasets, and models, validated through extensive experiments.

Conclusion: The CP-based MU approach provides statistically sound, uncertainty-aware guarantees and outperforms heuristic-based methods, offering a practical solution for data privacy compliance.

Abstract: The increasing demand for data privacy, driven by regulations such as GDPR and CCPA, has made Machine Unlearning (MU) essential for removing the influence of specific training samples from machine learning models while preserving performance on retained data. However, most existing MU methods lack rigorous statistical guarantees, rely on heuristic metrics, and often require computationally expensive retraining baselines. To overcome these limitations, we introduce a new definition for MU based on Conformal Prediction (CP), providing statistically sound, uncertainty-aware guarantees without the need for the concept of naive retraining. We formalize conformal criteria that quantify how often forgotten samples are excluded from CP sets, and propose empirical metrics,the Efficiently Covered Frequency (ECF at c) and its complement, the Efficiently Uncovered Frequency (EuCF at d), to measure the effectiveness of unlearning. We further present a practical unlearning method designed to optimize these conformal metrics. Extensive experiments across diverse forgetting scenarios, datasets and models demonstrate the efficacy of our approach in removing targeted data.

[457] HALO: Hindsight-Augmented Learning for Online Auto-Bidding

Pusen Dong, Chenglong Cao, Xinyu Zhou, Jirong You, Linhe Xu, Feifan Xu, Shuo Yuan

Main category: cs.LG

TL;DR: HALO, a new auto-bidding method, addresses inefficiencies in traditional solutions by using hindsight-augmented learning and B-spline representation for robust adaptation to diverse advertiser constraints.

DetailsMotivation: Traditional auto-bidding solutions struggle with sample inefficiency and poor generalization under varying budget-ROI constraints, necessitating a more adaptive approach.

Method: HALO employs a hindsight mechanism to repurpose explorations as training data and uses B-spline functional representation for continuous bid mapping.

Result: HALO outperforms traditional methods, reducing constraint violations and improving Gross Merchandise Value (GMV) in industrial evaluations.

Conclusion: HALO provides a robust solution for multi-constraint bidding, effectively handling diverse advertiser requirements.

Abstract: Digital advertising platforms operate millisecond-level auctions through Real-Time Bidding (RTB) systems, where advertisers compete for ad impressions through algorithmic bids. This dynamic mechanism enables precise audience targeting but introduces profound operational complexity due to advertiser heterogeneity: budgets and ROI targets span orders of magnitude across advertisers, from individual merchants to multinational brands. This diversity creates a demanding adaptation landscape for Multi-Constraint Bidding (MCB). Traditional auto-bidding solutions fail in this environment due to two critical flaws: 1) severe sample inefficiency, where failed explorations under specific constraints yield no transferable knowledge for new budget-ROI combinations, and 2) limited generalization under constraint shifts, as they ignore physical relationships between constraints and bidding coefficients. To address this, we propose HALO: Hindsight-Augmented Learning for Online Auto-Bidding. HALO introduces a theoretically grounded hindsight mechanism that repurposes all explorations into training data for arbitrary constraint configuration via trajectory reorientation. Further, it employs B-spline functional representation, enabling continuous, derivative-aware bid mapping across constraint spaces. HALO ensures robust adaptation even when budget/ROI requirements differ drastically from training scenarios. Industrial dataset evaluations demonstrate the superiority of HALO in handling multi-scale constraints, reducing constraint violations while improving GMV.

[458] Towards Interpretable Concept Learning over Time Series via Temporal Logic Semantics

Irene Ferfoglia, Simone Silvetti, Gaia Saveri, Laura Nenzi, Luca Bortolussi

Main category: cs.LG

TL;DR: A neuro-symbolic framework for time series classification uses Signal Temporal Logic (STL) to provide interpretable predictions and explanations.

DetailsMotivation: Address the lack of interpretability in black-box deep learning methods for time series classification, especially in safety-critical applications.

Method: Proposes a novel STL-inspired kernel to embed time series into STL concepts, jointly optimizing for accuracy and interpretability.

Result: Achieves competitive performance while providing human-interpretable logical explanations for predictions.

Conclusion: The framework successfully combines classification and explanation, offering both local and global symbolic insights.

Abstract: Time series classification is a task of paramount importance, as this kind of data often arises in safety-critical applications. However, it is typically tackled with black-box deep learning methods, making it hard for humans to understand the rationale behind their output. To take on this challenge, we propose a neuro-symbolic framework that unifies classification and explanation through direct embedding of trajectories into a space of Signal Temporal Logic (STL) concepts. By introducing a novel STL-inspired kernel that maps raw time series to their alignment with predefined STL formulae, our model jointly optimises for accuracy and interpretability, as each prediction is accompanied by the most relevant logical concepts that characterise it. This enables classification grounded in human-interpretable temporal patterns and produces both local and global symbolic explanations. Early results show competitive performance while offering high-quality logical justifications for model decisions.

[459] Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning

Alexander Golubev, Maria Trofimova, Sergei Polezhaev, Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Sergey Abramov, Andrei Andriushchenko, Filipp Fisin, Sergei Skvortsov, Boris Yangel

Main category: cs.LG

TL;DR: The paper applies Reinforcement Learning (RL) to Large Language Models (LLMs) for multi-turn interactions in software engineering, improving success rates without teacher models.

DetailsMotivation: Current RL applications to LLMs focus on single-turn tasks, neglecting real-world scenarios like software engineering that require multi-turn interactions with feedback.

Method: A modified Decoupled Advantage Policy Optimization (DAPO) algorithm is used to train a Qwen2.5-72B-Instruct agent for software engineering tasks.

Result: The agent’s success rate on SWE-bench Verified improved from 20% to 39%, matching or outperforming leading models like DeepSeek-V3-0324 and Qwen3-235B-A22B.

Conclusion: This approach offers a viable path for developing autonomous agents for complex real-world problems using open models.

Abstract: Research on applications of Reinforcement Learning (RL) to Large Language Models (LLMs) has mostly been focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn MDPs, this view corresponds to a degenerate case of multi-turn interaction where the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation. To bridge this gap, we demonstrate the successful application of RL to this general regime. Using a modified Decoupled Advantage Policy Optimization (DAPO) algorithm, we train an agent based on Qwen2.5-72B-Instruct to solve real-world software engineering tasks. Our approach increases the agent’s success rate on the SWE-bench Verified benchmark from a 20% rejection fine-tuned baseline to 39%, without relying on any teacher models. On SWE-rebench, our agent matches or outperforms leading open-weight models such as DeepSeek-V3-0324 and Qwen3-235B-A22B using an identical scaffolding, offering a viable path toward building more capable autonomous agents for complex real-world problems based on open models.

[460] The alpha-beta divergence for real and complex data

Sergio Cruces

Main category: cs.LG

TL;DR: The paper extends alpha-beta divergences to complex data, providing a versatile framework that includes classical divergences and offers closed-form solutions for centroids in complex vector approximation.

DetailsMotivation: To address the lack of divergence frameworks for complex data, which is common in signal processing, and to generalize existing separable divergences.

Method: Extends alpha-beta divergences to complex vectors, ensuring compatibility with classical distances like Euclidean and Mahalanobis by setting hyperparameters to unity.

Result: A closed-form expression for centroids in complex vector approximation, revealing the roles of divergence hyperparameters.

Conclusion: The extended alpha-beta divergences are broadly applicable in signal processing for complex data, offering flexibility and practical utility.

Abstract: Divergences are fundamental to the information criteria that underpin most signal processing algorithms. The alpha-beta family of divergences, designed for non-negative data, offers a versatile framework that parameterizes and continuously interpolates several separable divergences found in existing literature. This work extends the definition of alpha-beta divergences to accommodate complex data, specifically when the arguments of the divergence are complex vectors. This novel formulation is designed in such a way that, by setting the divergence hyperparameters to unity, it particularizes to the well-known Euclidean and Mahalanobis squared distances. Other choices of hyperparameters yield practical separable and non-separable extensions of several classical divergences. In the context of the problem of approximating a complex random vector, the centroid obtained by optimizing the alpha-beta mean distortion has a closed-form expression, which interpretation sheds light on the distinct roles of the divergence hyperparameters. These contributions may have wide potential applicability, as there are many signal processing domains in which the underlying data are inherently complex.

[461] MoKA: Mixture of Kronecker Adapters

Mohammadreza Sadeghi, Mahsa Ghazvini Nejad, MirHamed Jafarzadeh Asl, Yu Gu, Yuanhao Yu, Masoud Asgharian, Vahid Partovi Nia

Main category: cs.LG

TL;DR: MoKA introduces a mixture of Kronecker adapters for parameter-efficient fine-tuning, improving expressiveness and hardware efficiency while reducing trainable parameters.

DetailsMotivation: Address the limited expressiveness of low-rank adapters in PEFT for complex tasks.

Method: Proposes Mixture of Kronecker Adapters (MoKA) with a gating mechanism for flexible rank adaptation and reformulates computations for hardware efficiency.

Result: Outperforms PEFT baselines, reduces trainable parameters by up to 27x, and achieves better performance-parameter efficiency trade-offs.

Conclusion: MoKA is a state-of-the-art solution for efficient and expressive fine-tuning of LLMs.

Abstract: Parameter-efficient fine-tuning (PEFT) is essential for reducing the computational overhead of large language models (LLMs). Low-rank family adapters are commonly used to control the parameter size efficiently while maintaining the generative power of LLMs. However, their limited expressiveness due to the rank constraint often restricts their performance on complex tasks. We propose Mixture of Kronecker Adapters (MoKA), a new generation of Kronecker adapters that addresses this limitation by modeling weight updates as a mixture of Kronecker products. Our proposed adapter leverages a gating mechanism that measures the importance of each Kronecker factor, enabling more expressive adaptation. Moreover, MoKA enables a rank flexibility that provides a better trade-off between parameter efficiency and accuracy. To ensure hardware efficiency, we reformulate Kronecker computations using standard matrix operations, allowing seamless deployment on GPU-optimized hardware. We conduct extensive experiments on instruction-tuning and commonsense reasoning tasks using low-bit quantized versions of LLaMA2-7B and LLaMA3-8B models. MoKA not only outperforms PEFT baselines, but also reduces the number of trainable parameters up to 27x, achieving state-of-the-art trade-offs between performance and parameter efficiency.

[462] Online Continual Graph Learning

Giovanni Donghi, Luca Pasa, Daniele Zambon, Cesare Alippi, Nicolò Navarin

Main category: cs.LG

TL;DR: The paper proposes a general framework for online Continual Learning (OCL) on graphs, addressing the lack of a clear definition and benchmarks for this setting.

DetailsMotivation: Many real-world graphs evolve over time, requiring timely predictions, but current OCL approaches on graphs lack clarity and systematic evaluation.

Method: The authors define a general formulation for OCL on graphs, focusing on efficient batch processing over graph topology, and introduce benchmarks for evaluation.

Result: Performance of several adapted CL methods is reported under the proposed framework.

Conclusion: The work provides a well-defined OCL setting for graphs, enabling systematic evaluation and advancing research in this area.

Abstract: The aim of Continual Learning (CL) is to learn new tasks incrementally while avoiding catastrophic forgetting. Online Continual Learning (OCL) specifically focuses on learning efficiently from a continuous stream of data with shifting distribution. While recent studies explore Continual Learning on graphs exploiting Graph Neural Networks (GNNs), only few of them focus on a streaming setting. Yet, many real-world graphs evolve over time, often requiring timely and online predictions. Current approaches, however, are not well aligned with the standard OCL setting, partly due to the lack of a clear definition of online Continual Learning on graphs. In this work, we propose a general formulation for online Continual Learning on graphs, emphasizing the efficiency requirements on batch processing over the graph topology, and providing a well-defined setting for systematic model evaluation. Finally, we introduce a set of benchmarks and report the performance of several methods in the CL literature, adapted to our setting.

[463] Strategic Hypothesis Testing

Safwan Hossain, Yatong Chen, Yiling Chen

Main category: cs.LG

TL;DR: The paper explores hypothesis testing in a principal-agent setup, where the agent strategically reports data to the principal, who sets a p-value threshold balancing errors. A game-theoretic model reveals monotonic error behavior and an optimal threshold, validated with drug approval data.

DetailsMotivation: To understand how strategic behavior of agents influences hypothesis testing decisions by principals, particularly in contexts like drug approvals.

Method: Develops a game-theoretic model analyzing agent participation and reporting behavior in response to the principal’s statistical decision rule.

Result: Identifies a critical p-value threshold for the principal, showing monotonic error behavior and an interpretable optimal threshold, validated empirically.

Conclusion: Provides a framework for strategic interactions in hypothesis testing, offering technical and regulatory insights, especially for drug approvals.

Abstract: We examine hypothesis testing within a principal-agent framework, where a strategic agent, holding private beliefs about the effectiveness of a product, submits data to a principal who decides on approval. The principal employs a hypothesis testing rule, aiming to pick a p-value threshold that balances false positives and false negatives while anticipating the agent’s incentive to maximize expected profitability. Building on prior work, we develop a game-theoretic model that captures how the agent’s participation and reporting behavior respond to the principal’s statistical decision rule. Despite the complexity of the interaction, we show that the principal’s errors exhibit clear monotonic behavior when segmented by an efficiently computable critical p-value threshold, leading to an interpretable characterization of their optimal p-value threshold. We empirically validate our model and these insights using publicly available data on drug approvals. Overall, our work offers a comprehensive perspective on strategic interactions within the hypothesis testing framework, providing technical and regulatory insights.

[464] Bridging ocean wave physics and deep learning: Physics-informed neural operators for nonlinear wavefield reconstruction in real-time

Svenja Ehlers, Merten Stender, Norbert Hoffmann

Main category: cs.LG

TL;DR: A Physics-Informed Neural Operator (PINO) framework is proposed for real-time, phase-resolved ocean wave field prediction without requiring ground truth data, using sparse measurements and physics-based constraints.

DetailsMotivation: The challenge of predicting ocean wave fields accurately in real-time due to sparse or indirect measurements and the impracticality of obtaining large labeled datasets.

Method: Embedding residuals of free surface boundary conditions into the loss function of a neural operator (PINO) to constrain solutions without ground truth data.

Result: Accurate reconstruction of nonlinear wave fields from sparse measurements (buoy time series, radar snapshots) validated with synthetic data.

Conclusion: PINO enables robust, real-time wave reconstruction across diverse conditions, advancing operational data-driven wave prediction.

Abstract: Accurate real-time prediction of phase-resolved ocean wave fields remains a critical yet largely unsolved problem, primarily due to the absence of practical data assimilation methods for reconstructing initial conditions from sparse or indirect wave measurements. While recent advances in supervised deep learning have shown potential for this purpose, they require large labelled datasets of ground truth wave data, which are infeasible to obtain in real-world scenarios. To overcome this limitation, we propose a Physics-Informed Neural Operator (PINO) framework for reconstructing spatially and temporally phase-resolved, nonlinear ocean wave fields from sparse measurements, without the need for ground truth data during training. This is achieved by embedding residuals of the free surface boundary conditions of ocean gravity waves into the loss function of the PINO, constraining the solution space in a soft manner. After training, we validate our approach using highly realistic synthetic wave data and demonstrate the accurate reconstruction of nonlinear wave fields from both buoy time series and radar snapshots. Our results indicate that PINOs enable accurate, real-time reconstruction and generalize robustly across a wide range of wave conditions, thereby paving the way for operational, data-driven wave reconstruction and prediction in realistic marine environments.

[465] Software Fairness Dilemma: Is Bias Mitigation a Zero-Sum Game?

Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Weisong Sun, Ying Xiao, Tianlin Li, Yiling Lou, Yang Liu

Main category: cs.LG

TL;DR: The study evaluates bias mitigation methods for tabular data, revealing a zero-sum trade-off in fairness improvements, and explores an alternative approach to enhance fairness without negatively impacting privileged groups.

DetailsMotivation: To determine if the leveling-down effect in bias mitigation applies to tabular data tasks and explore methods to improve fairness without zero-sum trade-offs.

Method: Evaluated eight bias mitigation methods across 44 tasks using five real-world datasets and four ML models. Also investigated a targeted approach for unprivileged groups.

Result: Bias mitigation methods for tabular data operate in a zero-sum fashion, but a targeted approach shows potential to benefit unprivileged groups without harming privileged groups or overall performance.

Conclusion: The study identifies pathways to improve fairness without zero-sum trade-offs, aiding broader adoption of bias mitigation methods.

Abstract: Fairness is a critical requirement for Machine Learning (ML) software, driving the development of numerous bias mitigation methods. Previous research has identified a leveling-down effect in bias mitigation for computer vision and natural language processing tasks, where fairness is achieved by lowering performance for all groups without benefiting the unprivileged group. However, it remains unclear whether this effect applies to bias mitigation for tabular data tasks, a key area in fairness research with significant real-world applications. This study evaluates eight bias mitigation methods for tabular data, including both widely used and cutting-edge approaches, across 44 tasks using five real-world datasets and four common ML models. Contrary to earlier findings, our results show that these methods operate in a zero-sum fashion, where improvements for unprivileged groups are related to reduced benefits for traditionally privileged groups. However, previous research indicates that the perception of a zero-sum trade-off might complicate the broader adoption of fairness policies. To explore alternatives, we investigate an approach that applies the state-of-the-art bias mitigation method solely to unprivileged groups, showing potential to enhance benefits of unprivileged groups without negatively affecting privileged groups or overall ML performance. Our study highlights potential pathways for achieving fairness improvements without zero-sum trade-offs, which could help advance the adoption of bias mitigation methods.

[466] Exploring Layer-wise Information Effectiveness for Post-Training Quantization in Small Language Models

He Xiao, Qingyao Yang, Dirui Xie, Wendong Xu, Wenyong Zhou, Haobo Liu, Zhengwu Liu, Ngai Wong

Main category: cs.LG

TL;DR: LieQ is a metric-driven post-training quantization framework for sub-7B models, achieving state-of-the-art compression-accuracy trade-offs at extreme low-bit precision.

DetailsMotivation: Large language models are often over-provisioned, with many layers contributing little unique information but dominating memory and energy use during inference.

Method: LieQ introduces three layer-wise diagnostics (Perplexity Drop, Representational Compactness, Top-k Energy Gain) to enable automatic bit-width allocation without gradient updates.

Result: LieQ recovers 95.9% of FP16 performance at 2.05-bit quantization on Qwen3-4B and maintains 98.2% baseline accuracy at 2.07-bit on LLaMA3.2-3B, outperforming GPTQ and AWQ.

Conclusion: LieQ establishes new paradigms for deploying small language models on resource-constrained edge devices with significant memory reduction.

Abstract: Large language models with billions of parameters are often over-provisioned: many layers contribute little unique information yet dominate the memory and energy footprint during inference. We present LieQ, a metric-driven post-training quantization framework that addresses the critical challenge of maintaining accuracy in sub-7B models under extreme low-bit compression. Our method introduces three complementary layer-wise diagnostics-Perplexity Drop, Representational Compactness, and Top-k Energy Gain -that reveal a canonical division of labour across layers, enabling automatic bit-width allocation without gradient updates. Unlike existing approaches that suffer severe accuracy degradation at 2-3 bits precision, LieQ achieves state-of-the-art compression-accuracy trade-offs: on Qwen3-4B, it recovers 95.9% of FP16 baseline performance at 2.05-bit quantization, outperforming GPTQ by 19.7% and AWQ by 18.1% on average across seven zero-shot reasoning tasks. Applied to LLaMA3.2-3B, LieQ maintains 98.2% of baseline accuracy at 2.07-bit precision while enabling 4x memory reduction, establishing new paradigms for deploying small language models on resource-constrained edge devices.

[467] Forest vs Tree: The $(N, K)$ Trade-off in Reproducible ML Evaluation

Deepak Pandita, Flip Korn, Chris Welty, Christopher M. Homan

Main category: cs.LG

TL;DR: The paper explores the trade-off between the number of items (N) and annotations per item (K) for reliable ML evaluation, finding that accounting for human disagreement is feasible with N×K ≤ 1000 and often benefits from K > 10. The optimal (N, K) depends on the evaluation metric.

DetailsMotivation: Reproducibility in ML evaluations is crucial but often ignores human disagreement due to budget constraints. This study aims to determine the optimal balance between N and K for reliable evaluation.

Method: Analyzed diverse categorical datasets with multiple annotations per item and simulated distributions to find the optimal (N, K) configuration under fixed budgets.

Result: Human disagreement can be accounted for with N×K ≤ 1000, often requiring K > 10. The trade-off between K and N depends on the evaluation metric, with distribution-sensitive metrics favoring higher K.

Conclusion: The study provides practical guidance for ML practitioners to optimize test data collection, balancing N and K for reliable evaluations within budget constraints.

Abstract: Reproducibility is a cornerstone of scientific validation and of the authority it confers on its results. Reproducibility in machine learning evaluations leads to greater trust, confidence, and value. However, the ground truth responses used in machine learning often necessarily come from humans, among whom disagreement is prevalent, and surprisingly little research has studied the impact of effectively ignoring disagreement in these responses, as is typically the case. One reason for the lack of research is that budgets for collecting human-annotated evaluation data are limited, and obtaining more samples from multiple annotators for each example greatly increases the per-item annotation costs. We investigate the trade-off between the number of items ($N$) and the number of responses per item ($K$) needed for reliable machine learning evaluation. We analyze a diverse collection of categorical datasets for which multiple annotations per item exist, and simulated distributions fit to these datasets, to determine the optimal $(N, K)$ configuration, given a fixed budget ($N \times K$), for collecting evaluation data and reliably comparing the performance of machine learning models. Our findings show, first, that accounting for human disagreement may come with $N \times K$ at no more than 1000 (and often much lower) for every dataset tested on at least one metric. Moreover, this minimal $N \times K$ almost always occurred for $K > 10$. Furthermore, the nature of the tradeoff between $K$ and $N$ – or if one even existed – depends on the evaluation metric, with metrics that are more sensitive to the full distribution of responses performing better at higher levels of $K$. Our methods can be used to help ML practitioners get more effective test data by finding the optimal metrics and number of items and annotations per item to collect to get the most reliability for their budget.

[468] A neural network machine-learning approach for characterising hydrogen trapping parameters from TDS experiments

N. Marrani, T. Hageman, E. Martínez-Pañeda

Main category: cs.LG

TL;DR: A machine learning-based scheme using multi-Neural Networks (NNs) is introduced to predict hydrogen trapping parameters from TDS spectra, addressing challenges in indirect TDS methods.

DetailsMotivation: Extracting key parameters like trap binding energies and densities from TDS spectra is challenging due to its indirect nature.

Method: A multi-NN model (classification and regression) trained on synthetic data predicts trap types, densities, and binding energies from experimental TDS spectra.

Result: The model showed strong predictive capabilities when tested on three tempered martensitic steels.

Conclusion: The approach effectively addresses TDS limitations, with the developed code freely available for use.

Abstract: The hydrogen trapping behaviour of metallic alloys is generally characterised using Thermal Desorption Spectroscopy (TDS). However, as an indirect method, extracting key parameters (trap binding energies and densities) remains a significant challenge. To address these limitations, this work introduces a machine learning-based scheme for parameter identification from TDS spectra. A multi-Neural Network (NN) model is developed and trained exclusively on synthetic data to predict trapping parameters directly from experimental data. The model comprises two multi-layer, fully connected, feed-forward NNs trained with backpropagation. The first network (classification model) predicts the number of distinct trap types. The second network (regression model) then predicts the corresponding trap densities and binding energies. The NN architectures, hyperparameters, and data pre-processing were optimised to minimise the amount of training data. The proposed model demonstrated strong predictive capabilities when applied to three tempered martensitic steels of different compositions. The code developed is freely provided.

[469] AI on the Pulse: Real-Time Health Anomaly Detection with Wearable and Ambient Intelligence

Davide Gabrielli, Bardh Prenkaj, Paola Velardi, Stefano Faralli

Main category: cs.LG

TL;DR: AI on the Pulse is an anomaly detection system for real-time patient monitoring using wearable sensors and AI, outperforming 12 SoTA methods with a 22% F1 score improvement.

DetailsMotivation: To enable continuous, personalized health monitoring without impractical labeling or clinical-grade equipment.

Method: Uses UniTS, a universal time-series model, for anomaly detection and integrates LLMs for interpretability.

Result: Outperforms 12 SoTA methods, works with consumer wearables, and is deployed in real-world settings (@HOME).

Conclusion: Demonstrates effective, non-invasive health monitoring with actionable insights for healthcare professionals.

Abstract: We introduce AI on the Pulse, a real-world-ready anomaly detection system that continuously monitors patients using a fusion of wearable sensors, ambient intelligence, and advanced AI models. Powered by UniTS, a state-of-the-art (SoTA) universal time-series model, our framework autonomously learns each patient’s unique physiological and behavioral patterns, detecting subtle deviations that signal potential health risks. Unlike classification methods that require impractical, continuous labeling in real-world scenarios, our approach uses anomaly detection to provide real-time, personalized alerts for reactive home-care interventions. Our approach outperforms 12 SoTA anomaly detection methods, demonstrating robustness across both high-fidelity medical devices (ECG) and consumer wearables, with a ~ 22% improvement in F1 score. However, the true impact of AI on the Pulse lies in @HOME, where it has been successfully deployed for continuous, real-world patient monitoring. By operating with non-invasive, lightweight devices like smartwatches, our system proves that high-quality health monitoring is possible without clinical-grade equipment. Beyond detection, we enhance interpretability by integrating LLMs, translating anomaly scores into clinically meaningful insights for healthcare professionals.

[470] An Auditable Agent Platform For Automated Molecular Optimisation

Atabey Ünlü, Phil Rohr, Ahmet Celebi

Main category: cs.LG

TL;DR: A hierarchical agent framework automates molecular optimization, improving binding affinity by 31% in multi-agent setups, while single-agent runs excel in drug-like properties.

DetailsMotivation: Drug discovery slows due to scattered data and tools. The goal is to shorten design cycles by automating molecular optimization with a collaborative agent framework.

Method: A multi-agent system includes roles like Principal Researcher, Database agent, AI Expert, Medicinal Chemist, Ranking agent, and Scientific Critic. Agents communicate via provenance records for auditable reasoning.

Result: Multi-agent setups improved binding affinity by 31%, while single-agent runs prioritized drug-like properties. Unguided LLMs lacked transparency.

Conclusion: Agent frameworks enhance molecular design by providing auditable reasoning and feedback loops, suggesting expansion to ADMET and selectivity predictors for further improvements.

Abstract: Drug discovery frequently loses momentum when data, expertise, and tools are scattered, slowing design cycles. To shorten this loop we built a hierarchical, tool using agent framework that automates molecular optimisation. A Principal Researcher defines each objective, a Database agent retrieves target information, an AI Expert generates de novo scaffolds with a sequence to molecule deep learning model, a Medicinal Chemist edits them while invoking a docking tool, a Ranking agent scores the candidates, and a Scientific Critic polices the logic. Each tool call is summarised and stored causing the full reasoning path to remain inspectable. The agents communicate through concise provenance records that capture molecular lineage, to build auditable, molecule centered reasoning trajectories and reuse successful transformations via in context learning. Three cycle research loops were run against AKT1 protein using five large language models. After ranking the models by mean docking score, we ran 20 independent scale ups on the two top performers. We then compared the leading LLMs’ binding affinity results across three configurations, LLM only, single agent, and multi agent. Our results reveal an architectural trade off, the multi agent setting excelled at focused binding optimization, improving average predicted binding affinity by 31%. In contrast, single agent runs generated molecules with superior drug like properties at the cost of less potent binding scores. Unguided LLM runs finished fastest, yet their lack of transparent tool signals left the validity of their reasoning paths unverified. These results show that test time scaling, focused feedback loops and provenance convert general purpose LLMs into auditable systems for molecular design, and suggest that extending the toolset to ADMET and selectivity predictors could push research workflows further along the discovery pipeline.

[471] SLA-MORL: SLA-Aware Multi-Objective Reinforcement Learning for HPC Resource Optimization

Seraj Al Mahmud Mostafa, Aravind Mohan, Jianwu Wang

Main category: cs.LG

TL;DR: SLA-MORL is a multi-objective reinforcement learning framework for dynamic GPU/CPU resource allocation in cloud ML workloads, optimizing time, cost, and SLA compliance.

DetailsMotivation: Challenges in balancing training time, cost, and SLA compliance in cloud ML workloads, with traditional methods causing inefficiencies or violations.

Method: Uses adaptive multi-objective RL with intelligent initialization and dynamic weight adaptation, leveraging a 21D state representation and actor-critic network.

Result: Achieves 67.2% faster training, 68.8% cost reduction, and 73.4% better SLA compliance compared to static baselines.

Conclusion: SLA-MORL effectively balances performance, cost, and reliability in cloud ML resource management.

Abstract: Dynamic resource allocation for machine learning workloads in cloud environments remains challenging due to competing objectives of minimizing training time and operational costs while meeting Service Level Agreement (SLA) constraints. Traditional approaches employ static resource allocation or single-objective optimization, leading to either SLA violations or resource waste. We present SLA-MORL, an adaptive multi-objective reinforcement learning framework that intelligently allocates GPU and CPU resources based on user-defined preferences (time, cost, or balanced) while ensuring SLA compliance. Our approach introduces two key innovations: (1) intelligent initialization through historical learning or efficient baseline runs that eliminates cold-start problems, reducing initial exploration overhead by 60%, and (2) dynamic weight adaptation that automatically adjusts optimization priorities based on real-time SLA violation severity, creating a self-correcting system. SLA-MORL constructs a 21-dimensional state representation capturing resource utilization, training progress, and SLA compliance, enabling an actor-critic network to make informed allocation decisions across 9 possible actions. Extensive evaluation on 13 diverse ML workloads using production HPC infrastructure demonstrates that SLA-MORL achieves 67.2% reduction in training time for deadline-critical jobs, 68.8% reduction in costs for budget-constrained workloads, and 73.4% improvement in overall SLA compliance compared to static baselines. By addressing both cold-start inefficiency and dynamic adaptation challenges, SLA-MORL provides a practical solution for cloud resource management that balances performance, cost, and reliability in modern ML training environments.

[472] VRPRM: Process Reward Modeling via Visual Reasoning

Xinquan Chen, Bangwei Liu, Xuhong Wang

Main category: cs.LG

TL;DR: VRPRM introduces visual reasoning into PRMs, reducing annotation costs while improving reasoning capabilities.

DetailsMotivation: Address the lack of long-term reasoning in PRMs and high costs of CoT-PRM data annotation.

Method: Proposes VRPRM with a two-stage training strategy using minimal CoT-PRM and non-CoT PRM data.

Result: VRPRM outperforms non-thinking PRMs with less data, achieving up to 118% performance improvement.

Conclusion: VRPRM offers a cost-effective, high-quality reasoning solution for PRM training.

Abstract: Process Reward Model (PRM) is widely used in the post-training of Large Language Model (LLM) because it can perform fine-grained evaluation of the reasoning steps of generated content. However, most PRMs lack long-term reasoning and deep thinking capabilities. On the other hand, although a few works have tried to introduce Chain-of-Thought capability into PRMs, the annotation cost of CoT-PRM data is too expensive to play a stable role in various tasks. To address the above challenges, we propose VRPRM, a process reward model via visual reasoning, and design an efficient two-stage training strategy. Experimental results show that using only 3.6K CoT-PRM SFT data and 50K non-CoT PRM RL training data, VRPRM can surpass the non-thinking PRM with a total data volume of 400K and achieved a relative performance improvement of up to 118% over the base model in the BoN experiment. This result confirms that the proposed combined training strategy can achieve higher quality reasoning capabilities at a lower data annotation cost, thus providing a new paradigm for PRM training with more efficient data utilization.

[473] Heterogeneity-Oblivious Robust Federated Learning

Weiyao Zhang, Jinyang Li, Qi Song, Miao Wang, Chungang Lin, Haitong Luo, Xuying Meng, Yujun Zhang

Main category: cs.LG

TL;DR: Horus is a robust Federated Learning framework using low-rank adaptations (LoRAs) to mitigate poisoning attacks in hyper-heterogeneous environments.

DetailsMotivation: Federated Learning is vulnerable to poisoning attacks, especially under hyper-heterogeneity, which complicates detection and aggregation.

Method: Horus inserts LoRAs into stable layers, aggregates only LoRAs, and uses a Heterogeneity-Oblivious Poisoning Score to filter poisoned clients. It also employs projection-aware aggregation for benign clients.

Result: Horus outperforms state-of-the-art baselines in robustness and accuracy across diverse datasets, models, and attacks.

Conclusion: Horus effectively addresses FL vulnerabilities in hyper-heterogeneous settings by leveraging LoRAs and novel aggregation strategies.

Abstract: Federated Learning (FL) remains highly vulnerable to poisoning attacks, especially under real-world hyper-heterogeneity, where clients differ significantly in data distributions, communication capabilities, and model architectures. Such heterogeneity not only undermines the effectiveness of aggregation strategies but also makes attacks more difficult to detect. Furthermore, high-dimensional models expand the attack surface. To address these challenges, we propose Horus, a heterogeneity-oblivious robust FL framework centered on low-rank adaptations (LoRAs). Rather than aggregating full model parameters, Horus inserts LoRAs into empirically stable layers and aggregates only LoRAs to reduce the attack surface.We uncover a key empirical observation that the input projection (LoRA-A) is markedly more stable than the output projection (LoRA-B) under heterogeneity and poisoning. Leveraging this, we design a Heterogeneity-Oblivious Poisoning Score using the features from LoRA-A to filter poisoned clients. For the remaining benign clients, we propose projection-aware aggregation mechanism to preserve collaborative signals while suppressing drifts, which reweights client updates by consistency with the global directions. Extensive experiments across diverse datasets, model architectures, and attacks demonstrate that Horus consistently outperforms state-of-the-art baselines in both robustness and accuracy.

[474] DeepFaith: A Domain-Free and Model-Agnostic Unified Framework for Highly Faithful Explanations

Yuhan Guo, Lizhong Ding, Shihan Jia, Yanyu Ren, Pengqi Li, Jiarun Fu, Changsheng Li, Ye yuan, Guoren Wang

Main category: cs.LG

TL;DR: DeepFaith is a unified, model-agnostic XAI framework that optimizes faithfulness across multiple metrics, providing a theoretical ground truth for explanations. It outperforms baselines in faithfulness across diverse tasks.

DetailsMotivation: The lack of a unified ground truth for evaluating XAI methods hinders objective assessment and optimization. DeepFaith addresses this by unifying faithfulness metrics.

Method: DeepFaith formulates an optimal explanation objective by unifying faithfulness metrics, trains an explainer using supervised signals from existing methods, and optimizes consistency and correlation.

Result: DeepFaith achieves the highest faithfulness across 10 metrics on 12 diverse tasks, demonstrating effectiveness and generalizability.

Conclusion: DeepFaith provides a unified, theoretically grounded framework for generating faithful explanations, outperforming existing methods.

Abstract: Explainable AI (XAI) builds trust in complex systems through model attribution methods that reveal the decision rationale. However, due to the absence of a unified optimal explanation, existing XAI methods lack a ground truth for objective evaluation and optimization. To address this issue, we propose Deep architecture-based Faith explainer (DeepFaith), a domain-free and model-agnostic unified explanation framework under the lens of faithfulness. By establishing a unified formulation for multiple widely used and well-validated faithfulness metrics, we derive an optimal explanation objective whose solution simultaneously achieves optimal faithfulness across these metrics, thereby providing a ground truth from a theoretical perspective. We design an explainer learning framework that leverages multiple existing explanation methods, applies deduplicating and filtering to construct high-quality supervised explanation signals, and optimizes both pattern consistency loss and local correlation to train a faithful explainer. Once trained, DeepFaith can generate highly faithful explanations through a single forward pass without accessing the model being explained. On 12 diverse explanation tasks spanning 6 models and 6 datasets, DeepFaith achieves the highest overall faithfulness across 10 metrics compared to all baseline methods, highlighting its effectiveness and cross-domain generalizability.

[475] Zero-Variance Gradients for Variational Autoencoders

Zilei Shao, Anji Liu, Guy Van den Broeck

Main category: cs.LG

TL;DR: Proposes ‘Silent Gradients’ to avoid gradient variance in VAEs by using specific decoder architectures and a novel training dynamic.

DetailsMotivation: Addresses the issue of gradient variance in VAEs due to stochastic sampling, which slows convergence and degrades performance.

Method: Leverages decoder architectures to compute the expected ELBO analytically, introducing a zero-variance gradient. Uses a training dynamic combining exact gradients early on with stochastic estimators later.

Result: Outperforms existing estimators (reparameterization, Gumbel-Softmax, REINFORCE) across datasets, improving baseline performance.

Conclusion: Introduces a stable, effective method for training generative models by combining analytical computation with deep architectures.

Abstract: Training deep generative models like Variational Autoencoders (VAEs) is often hindered by the need to backpropagate gradients through the stochastic sampling of their latent variables, a process that inherently introduces estimation variance, which can slow convergence and degrade performance. In this paper, we propose a new perspective that sidesteps this problem, which we call Silent Gradients. Instead of improving stochastic estimators, we leverage specific decoder architectures to analytically compute the expected ELBO, yielding a gradient with zero variance. We first provide a theoretical foundation for this method and demonstrate its superiority over existing estimators in a controlled setting with a linear decoder. To generalize our approach for practical use with complex, expressive decoders, we introduce a novel training dynamic that uses the exact, zero-variance gradient to guide the early stages of encoder training before annealing to a standard stochastic estimator. Our experiments show that this technique consistently improves the performance of established baselines, including reparameterization, Gumbel-Softmax, and REINFORCE, across multiple datasets. This work opens a new direction for training generative models by combining the stability of analytical computation with the expressiveness of deep, nonlinear architecture.

[476] VITA: Variational Pretraining of Transformers for Climate-Robust Crop Yield Forecasting

Adib Hasan, Mardavij Roozbehani, Munther Dahleh

Main category: cs.LG

TL;DR: VITA, a variational pretraining framework, improves crop yield forecasting by addressing data asymmetry, outperforming existing models with less data, especially in extreme weather conditions.

DetailsMotivation: Accurate crop yield forecasting is crucial for food security, but current AI models fail when yields deviate from historical trends due to data asymmetry.

Method: VITA uses detailed weather variables as proxy targets during pretraining and self-supervised feature masking, enabling fine-tuning with basic weather statistics.

Result: Applied to 763 U.S. counties, VITA achieves state-of-the-art performance, with significant improvements in extreme weather years.

Conclusion: VITA demonstrates how domain-aware AI can overcome data limitations, enhancing agricultural forecasting in a changing climate.

Abstract: Accurate crop yield forecasting is essential for global food security. However, current AI models systematically underperform when yields deviate from historical trends. This issue arises from key data challenges, including a major asymmetry between rich pretraining weather datasets and the limited data available for fine-tuning. We introduce VITA (Variational Inference Transformer for Asymmetric data), a variational pretraining framework that addresses this asymmetry. Instead of relying on input reconstruction, VITA uses detailed weather variables as proxy targets during pretraining and learns to predict rich atmospheric states through self-supervised feature masking. This allows the model to be fine-tuned using only basic weather statistics during deployment. Applied to 763 counties in the U.S. Corn Belt, VITA achieves state-of-the-art performance in predicting corn and soybean yields across all evaluation scenarios. While it consistently delivers superior performance under normal conditions, its advantages are particularly pronounced during extreme weather years, with statistically significant improvements (paired t-test, $p \approx 0.01$). Importantly, VITA outperforms prior frameworks like GNN-RNN using less data, making it more practical for real-world use–particularly in data-scarce regions. This work highlights how domain-aware AI design can overcome data limitations and support resilient agricultural forecasting in a changing climate.

[477] SolarSeer: Ultrafast and accurate 24-hour solar irradiance forecasts outperforming numerical weather prediction across the USA

Mingliang Bai, Zuliang Fang, Shengyu Tao, Siqi Xiang, Jiang Bian, Yanfei Xiang, Pengcheng Zhao, Weixin Jin, Jonathan A. Weyn, Haiyu Dong, Bin Zhang, Hongyu Sun, Kit Thambiratnam, Qi Zhang, Hongbin Sun, Xuan Zhang, Qiuwei Wu

Main category: cs.LG

TL;DR: SolarSeer, an AI model, forecasts solar irradiance 1,500x faster than traditional NWP, reducing errors by 27.28% in reanalysis and 15.35% at stations.

DetailsMotivation: Accurate 24-hour solar irradiance forecasting is crucial for solar PV systems, but traditional NWP models are computationally expensive.

Method: SolarSeer maps historical satellite data directly to forecasts, bypassing costly data assimilation and PDE solving.

Result: SolarSeer operates 1,500x faster than NWP, reduces forecasting errors significantly, and captures irradiance fluctuations better.

Conclusion: SolarSeer’s fast, accurate forecasts support sustainable energy systems.

Abstract: Accurate 24-hour solar irradiance forecasting is essential for the safe and economic operation of solar photovoltaic systems. Traditional numerical weather prediction (NWP) models represent the state-of-the-art in forecasting performance but rely on computationally costly data assimilation and solving complicated partial differential equations (PDEs) that simulate atmospheric physics. Here, we introduce SolarSeer, an end-to-end large artificial intelligence (AI) model for solar irradiance forecasting across the Contiguous United States (CONUS). SolarSeer is designed to directly map the historical satellite observations to future forecasts, eliminating the computational overhead of data assimilation and PDEs solving. This efficiency allows SolarSeer to operate over 1,500 times faster than traditional NWP, generating 24-hour cloud cover and solar irradiance forecasts for the CONUS at 5-kilometer resolution in under 3 seconds. Compared with the state-of-the-art NWP in the CONUS, i.e., High-Resolution Rapid Refresh (HRRR), SolarSeer significantly reduces the root mean squared error of solar irradiance forecasting by 27.28% in reanalysis data and 15.35% across 1,800 stations. SolarSeer also effectively captures solar irradiance fluctuations and significantly enhances the first-order irradiance difference forecasting accuracy. SolarSeer’s ultrafast, accurate 24-hour solar irradiance forecasts provide strong support for the transition to sustainable, net-zero energy systems.

[478] On the (In)Significance of Feature Selection in High-Dimensional Datasets

Bhavesh Neekhra, Debayan Gupta, Partha Pratim Chakravarti

Main category: cs.LG

TL;DR: Feature selection (FS) algorithms in high-dimensional datasets, like gene expression, may not outperform random feature selection, challenging their utility in classification tasks.

DetailsMotivation: To validate the performance of FS algorithms by comparing them against randomly selected features, questioning their effectiveness in high-dimensional datasets.

Method: Testing the null hypothesis of using randomly selected features against FS-selected features in classification tasks, focusing on gene expression datasets.

Result: Randomly selected small feature subsets perform comparably to FS-selected features, and sometimes outperform them, questioning FS’s value in genomics.

Conclusion: FS algorithms may not be useful for high-dimensional datasets like gene expression, raising concerns about studies relying on computationally selected genes without lab validation.

Abstract: Extensive research has been done on feature selection (FS) algorithms for high-dimensional datasets aiming to improve model performance, reduce computational cost and identify features of interest. We test the null hypothesis of using randomly selected features to compare against features selected by FS algorithms to validate the performance of the latter. Our results show that FS on high-dimensional datasets (in particular gene expression) in classification tasks is not useful. We find that (1) models trained on small subsets (0.02%-1% of all features) of randomly selected features almost always perform comparably to those trained on all features, and (2) a “typical”- sized random subset provides comparable or superior performance to that of top-k features selected in various published studies. Thus, our work challenges many feature selection results on high dimensional datasets, particularly in computational genomics. It raises serious concerns about studies that propose drug design or targeted interventions based on computationally selected genes, without further validation in a wet lab.

[479] Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, Ximing Lu, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, Chi Jin

Main category: cs.LG

TL;DR: Goedel-Prover-V2 is a state-of-the-art open-source language model for automated theorem proving, outperforming larger models with innovations like scaffolded data synthesis, verifier-guided self-correction, and model averaging.

DetailsMotivation: To advance automated theorem proving by creating more efficient and smaller models that outperform larger counterparts.

Method: Uses expert iteration and reinforcement learning with scaffolded data synthesis, verifier-guided self-correction, and model averaging.

Result: Goedel-Prover-V2-8B achieves 84.6% pass@32 on MiniF2F, while the flagship Goedel-Prover-V2-32B reaches 88.1% (90.4% with self-correction) and solves 86 problems on PutnamBench, surpassing prior records.

Conclusion: Goedel-Prover-V2 sets a new benchmark in automated theorem proving, offering top performance with smaller model sizes and compute budgets.

Abstract: We introduce Goedel-Prover-V2, a series of open-source language models that set a new state-of-the-art in automated theorem proving. Built on the standard expert iteration and reinforcement learning pipeline, our approach incorporates three key innovations: (1) Scaffolded data synthesis: We generate synthetic tasks of increasing difficulty to train the model to master increasingly complex theorems; (2) Verifier-guided self-correction: We enable the model to iteratively revise its proofs by leveraging feedback from the Lean compiler; (3) Model averaging: We merge model checkpoints to mitigate the decrease in model output diversity in later stages of training. Our small model, Goedel-Prover-V2-8B, reaches 84.6% pass@32 on MiniF2F and outperforms DeepSeek-Prover-V2-671B under the same metric, despite being 80X smaller. Our flagship model, Goedel-Prover-V2-32B, achieves 88.1% on MiniF2F at pass@32 in standard mode and 90.4% in self-correction mode, outperforming prior SOTA by a large margin. Additionally, our flagship model solves 86 problems on PutnamBench at pass@184, securing the first place among open-source models on the leaderboard, surpassing DeepSeek-Prover-V2-671B’s record of solving 47 problems by pass@1024 with a significantly smaller model size and compute budget. At the time of its release (July-August 2025), Goedel-Prover-V2 achieves the strongest overall performance among all open-source theorem provers. It also ranks among the top-performing models–including closed-source systems with publicly reported performance–under a constrained test-time compute budget. Our models, code, and data are released at https://github.com/Goedel-LM/Goedel-Prover-V2.

[480] Minimal Convolutional RNNs Accelerate Spatiotemporal Learning

Coşku Can Horuz, Sebastian Otte, Martin V. Butz, Matthias Karlbauer

Main category: cs.LG

TL;DR: MinConvLSTM and MinConvGRU combine convolutional RNNs’ spatial biases with efficient, parallelizable training, outperforming standard ConvLSTMs/GRUs in speed and accuracy.

DetailsMotivation: To bridge the gap between recurrent simplicity and spatial complexity by improving training efficiency and reducing bottlenecks in conventional ConvRNNs.

Method: Extends log-domain prefix-sum formulation of MinLSTM/GRU to convolutional architectures, adding exponential gating for efficiency.

Result: Faster training, lower prediction errors, and improved scalability compared to standard ConvLSTMs/GRUs.

Conclusion: Minimal recurrent structures with convolutional input aggregation offer an efficient alternative for spatiotemporal modeling.

Abstract: We introduce MinConvLSTM and MinConvGRU, two novel spatiotemporal models that combine the spatial inductive biases of convolutional recurrent networks with the training efficiency of minimal, parallelizable RNNs. Our approach extends the log-domain prefix-sum formulation of MinLSTM and MinGRU to convolutional architectures, enabling fully parallel training while retaining localized spatial modeling. This eliminates the need for sequential hidden state updates during teacher forcing - a major bottleneck in conventional ConvRNN models. In addition, we incorporate an exponential gating mechanism inspired by the xLSTM architecture into the MinConvLSTM, which further simplifies the log-domain computation. Our models are structurally minimal and computationally efficient, with reduced parameter count and improved scalability. We evaluate our models on two spatiotemporal forecasting tasks: Navier-Stokes dynamics and real-world geopotential data. In terms of training speed, our architectures significantly outperform standard ConvLSTMs and ConvGRUs. Moreover, our models also achieve lower prediction errors in both domains, even in closed-loop autoregressive mode. These findings demonstrate that minimal recurrent structures, when combined with convolutional input aggregation, offer a compelling and efficient alternative for spatiotemporal sequence modeling, bridging the gap between recurrent simplicity and spatial complexity.

[481] Pair Correlation Factor and the Sample Complexity of Gaussian Mixtures

Farzad Aryan

Main category: cs.LG

TL;DR: The paper challenges the traditional view that the sample complexity of learning Gaussian Mixture Models (GMMs) is solely tied to the minimum pairwise separation between components. It introduces the Pair Correlation Factor (PCF) as a better geometric measure and provides an algorithm with improved sample complexity bounds in the uniform spherical case.

DetailsMotivation: Prior work focused on the minimum pairwise separation between components to determine sample complexity, but this is incomplete. The study aims to identify a more accurate geometric property (PCF) that governs the difficulty of parameter recovery.

Method: The paper introduces the Pair Correlation Factor (PCF) to capture the clustering of component means. It then presents an algorithm for the uniform spherical case, analyzing its sample complexity.

Result: The PCF is shown to more accurately dictate parameter recovery difficulty than the minimum gap. The algorithm achieves improved sample complexity bounds, revealing cases where more than the usual $\epsilon^{-2}$ samples are needed.

Conclusion: The PCF is a more comprehensive geometric measure for GMM sample complexity, and the proposed algorithm demonstrates its practical utility in improving bounds.

Abstract: We study the problem of learning Gaussian Mixture Models (GMMs) and ask: which structural properties govern their sample complexity? Prior work has largely tied this complexity to the minimum pairwise separation between components, but we demonstrate this view is incomplete. We introduce the \emph{Pair Correlation Factor} (PCF), a geometric quantity capturing the clustering of component means. Unlike the minimum gap, the PCF more accurately dictates the difficulty of parameter recovery. In the uniform spherical case, we give an algorithm with improved sample complexity bounds, showing when more than the usual $\epsilon^{-2}$ samples are necessary.

[482] Cross-patient Seizure Onset Zone Classification by Patient-Dependent Weight

Xuyang Zhao, Hidenori Sugano, Toshihisa Tanaka

Main category: cs.LG

TL;DR: The paper proposes a method to fine-tune a pretrained model using patient-specific weights to address the ‘cross-patient problem’ in identifying the seizure onset zone (SOZ) for focal epilepsy, improving classification accuracy by over 10%.

DetailsMotivation: The challenge lies in the variability of medical data across patients, making it hard for machine learning models to perform consistently. The goal is to enhance diagnostic reliability for SOZ identification.

Method: A pretrained model is fine-tuned using patient-specific weights derived from similarity measures between test and training patient data. Supervised learning is first applied, followed by weight-based fine-tuning.

Result: The method, evaluated via leave-one-patient-out, shows improved classification accuracy for each test patient, averaging over 10% enhancement.

Conclusion: The proposed fine-tuning approach effectively addresses the cross-patient variability issue, offering a reliable solution for SOZ identification in focal epilepsy.

Abstract: Identifying the seizure onset zone (SOZ) in patients with focal epilepsy is essential for surgical treatment and remains challenging due to its dependence on visual judgment by clinical experts. The development of machine learning can assist in diagnosis and has made promising progress. However, unlike data in other fields, medical data is usually collected from individual patients, and each patient has different illnesses, physical conditions, and medical histories, which leads to differences in the distribution of each patient’s data. This makes it difficult for a machine learning model to achieve consistently reliable performance in every new patient dataset, which we refer to as the “cross-patient problem.” In this paper, we propose a method to fine-tune a pretrained model using patient-specific weights for every new test patient to improve diagnostic performance. First, the supervised learning method is used to train a machine learning model. Next, using the intermediate features of the trained model obtained through the test patient data, the similarity between the test patient data and each training patient’s data is defined to determine the weight of each training patient to be used in the following fine-tuning. Finally, we fine-tune all parameters in the pretrained model with training data and patient weights. In the experiment, the leave-one-patient-out method is used to evaluate the proposed method, and the results show improved classification accuracy for every test patient, with an average improvement of more than 10%.

[483] Cross-Model Semantics in Representation Learning

Saleh Nikooroo, Thomas Engel

Main category: cs.LG

TL;DR: The paper explores how structural constraints in deep networks improve the stability and transferability of learned representations across different architectures.

DetailsMotivation: To understand how architectural choices affect the alignment and transferability of internal representations in deep networks.

Method: Develops a framework combining theoretical insights, empirical probes, and transfer experiments to analyze representational alignment.

Result: Structural regularities lead to more stable representational geometry across architectures, enhancing feature interoperability.

Conclusion: Inductive biases improve feature transferability, with implications for model distillation, modular learning, and robust system design.

Abstract: The internal representations learned by deep networks are often sensitive to architecture-specific choices, raising questions about the stability, alignment, and transferability of learned structure across models. In this paper, we investigate how structural constraints–such as linear shaping operators and corrective paths–affect the compatibility of internal representations across different architectures. Building on the insights from prior studies on structured transformations and convergence, we develop a framework for measuring and analyzing representational alignment across networks with distinct but related architectural priors. Through a combination of theoretical insights, empirical probes, and controlled transfer experiments, we demonstrate that structural regularities induce representational geometry that is more stable under architectural variation. This suggests that certain forms of inductive bias not only support generalization within a model, but also improve the interoperability of learned features across models. We conclude with a discussion on the implications of representational transferability for model distillation, modular learning, and the principled design of robust learning systems.

[484] Efficient Morphology-Aware Policy Transfer to New Embodiments

Michael Przystupa, Hongyao Tang, Martin Jagersand, Santiago Miret, Mariano Phielipp, Matthew E. Taylor, Glen Berseth

Main category: cs.LG

TL;DR: Morphology-aware policy learning improves sample efficiency but lacks zero-shot performance. Combining it with parameter-efficient finetuning (PEFT) reduces the need for extensive data collection, achieving better performance with fewer parameters.

DetailsMotivation: Address the sub-optimal zero-shot performance of morphology-aware policies and the computational cost of end-to-end finetuning in robotics.

Method: Combine morphology-aware pretraining with PEFT techniques (tuning subsets of weights, input adapters, prefix tuning) for online finetuning.

Result: PEFT with pretraining reduces samples needed for improvement. Tuning <1% of parameters outperforms zero-shot performance.

Conclusion: PEFT enhances morphology-aware policies efficiently, making them more practical for robotics.

Abstract: Morphology-aware policy learning is a means of enhancing policy sample efficiency by aggregating data from multiple agents. These types of policies have previously been shown to help generalize over dynamic, kinematic, and limb configuration variations between agent morphologies. Unfortunately, these policies still have sub-optimal zero-shot performance compared to end-to-end finetuning on morphologies at deployment. This limitation has ramifications in practical applications such as robotics because further data collection to perform end-to-end finetuning can be computationally expensive. In this work, we investigate combining morphology-aware pretraining with parameter efficient finetuning (PEFT) techniques to help reduce the learnable parameters necessary to specialize a morphology-aware policy to a target embodiment. We compare directly tuning sub-sets of model weights, input learnable adapters, and prefix tuning techniques for online finetuning. Our analysis reveals that PEFT techniques in conjunction with policy pre-training generally help reduce the number of samples to necessary to improve a policy compared to training models end-to-end from scratch. We further find that tuning as few as less than 1% of total parameters will improve policy performance compared the zero-shot performance of the base pretrained a policy.

[485] A DbC Inspired Neurosymbolic Layer for Trustworthy Agent Design

Claudiu Leoveanu-Condrei

Main category: cs.LG

TL;DR: A contract layer is introduced for LLMs using Design by Contract principles to ensure semantic and type compliance probabilistically.

DetailsMotivation: LLMs lack verifiable guarantees despite fluent outputs; this work aims to provide structured, enforceable contracts for reliability.

Method: Adapts Design by Contract and type theory to mediate LLM calls, stipulating input/output requirements and probabilistic remediation.

Result: Contracts enable probabilistic validation and functional equivalence of agents under the same conditions.

Conclusion: Contracts offer a framework for reliable LLM interactions by enforcing semantic and type compliance.

Abstract: Generative models, particularly Large Language Models (LLMs), produce fluent outputs yet lack verifiable guarantees. We adapt Design by Contract (DbC) and type-theoretic principles to introduce a contract layer that mediates every LLM call. Contracts stipulate semantic and type requirements on inputs and outputs, coupled with probabilistic remediation to steer generation toward compliance. The layer exposes the dual view of LLMs as semantic parsers and probabilistic black-box components. Contract satisfaction is probabilistic and semantic validation is operationally defined through programmer-specified conditions on well-typed data structures. More broadly, this work postulates that any two agents satisfying the same contracts are \emph{functionally equivalent} with respect to those contracts.

[486] Streaming Generated Gaussian Process Experts for Online Learning and Control

Zewen Yang, Dongfa Zhang, Xiaobing Dai, Fengyi Yu, Chi Zhang, Bingkun Huang, Hamid Sadeghian, Sami Haddadin

Main category: cs.LG

TL;DR: SkyGP is a streaming kernel-induced framework for Gaussian Processes that addresses computational and memory constraints while maintaining performance guarantees. Two variants, SkyGP-Dense and SkyGP-Fast, optimize for accuracy or efficiency, respectively.

DetailsMotivation: Exact GPs face scalability issues in real-time settings due to high computational and memory costs, limiting their use in large datasets.

Method: SkyGP introduces a framework with bounded experts to reduce complexity, offering two variants: SkyGP-Dense for accuracy and SkyGP-Fast for efficiency.

Result: SkyGP outperforms state-of-the-art methods in benchmarks and real-time control experiments.

Conclusion: SkyGP provides a scalable solution for streaming data with GPs, balancing performance and computational efficiency.

Abstract: Gaussian Processes (GPs), as a nonparametric learning method, offer flexible modeling capabilities and calibrated uncertainty quantification for function approximations. Additionally, GPs support online learning by efficiently incorporating new data with polynomial-time computation, making them well-suited for safety-critical dynamical systems that require rapid adaptation. However, the inference and online updates of exact GPs, when processing streaming data, incur cubic computation time and quadratic storage memory complexity, limiting their scalability to large datasets in real-time settings. In this paper, we propose a \underline{s}treaming \underline{k}ernel-induced progressivel\underline{y} generated expert framework of \underline{G}aussian \underline{p}rocesses (SkyGP) that addresses both computational and memory constraints by maintaining a bounded set of experts, while inheriting the learning performance guarantees from exact Gaussian processes. Furthermore, two SkyGP variants are introduced, each tailored to a specific objective, either maximizing prediction accuracy (SkyGP-Dense) or improving computational efficiency (SkyGP-Fast). The effectiveness of SkyGP is validated through extensive benchmarks and real-time control experiments demonstrating its superior performance compared to state-of-the-art approaches.

[487] Self-Questioning Language Models

Lili Chen, Mihir Prabhudesai, Katerina Fragkiadaki, Hao Liu, Deepak Pathak

Main category: cs.LG

TL;DR: A framework called Self-Questioning Language Models (SQLM) enables language models to improve reasoning skills by generating and solving their own questions without external data.

DetailsMotivation: To explore if language models can enhance reasoning abilities autonomously by self-generating questions and answers, eliminating the need for curated datasets.

Method: SQLM uses asymmetric self-play: a proposer generates questions (or unit tests for coding), and a solver attempts to answer. Both are trained via reinforcement learning with rewards for question difficulty and correctness.

Result: Tested on three benchmarks (multiplication, algebra, programming), SQLM shows improvement on downstream tasks without external data.

Conclusion: Language models can autonomously improve reasoning skills through self-generated questions and answers, reducing reliance on external datasets.

Abstract: Can large language models improve without external data – by generating their own questions and answers? We hypothesize that a pre-trained language model can improve its reasoning skills given only a single prompt specifying the topic (e.g., algebra word problems) and asking the model to generate its own questions. To do this, we propose Self-Questioning Language Models (SQLM): an asymmetric self-play framework where a proposer is given the topic and generates a question for a solver, who tries to answer it. Both the proposer and solver are trained via reinforcement learning. The proposer receives a reward if the problem is not too easy or too difficult, and the solver receives a reward based on majority voting, a proxy for correctness in the absence of ground-truth answers. For coding, the proposer can instead generate unit tests which are used for verification. We study this asymmetric self-play framework on three benchmarks: three-digit multiplication, algebra problems from the OMEGA benchmark, and programming problems from Codeforces. By continually generating more interesting problems and attempting to solve them, language models can improve on downstream benchmarks without access to any curated training datasets.

[488] No LLM Solved Yu Tsumura’s 554th Problem

Simon Frieder, William Hart

Main category: cs.LG

TL;DR: A problem (Yu Tsumura’s 554th) challenges LLMs despite their IMO success, as it meets IMO standards, avoids combinatorics, and remains unsolved by current LLMs.

DetailsMotivation: Highlight limitations of LLMs in solving specific, non-combinatorics IMO-level problems despite their general success.

Method: Identify a problem (Yu Tsumura’s 554th) meeting criteria (IMO-level, non-combinatorics, solvable with fewer techniques, and in LLM training data) and test LLMs on it.

Result: No existing LLM (commercial or open-source) could solve the problem, revealing a gap in their problem-solving abilities.

Conclusion: LLMs have limitations in solving certain IMO-level problems, even when conditions seem favorable, indicating room for improvement.

Abstract: We show, contrary to the optimism about LLM’s problem-solving abilities, fueled by the recent gold medals that were attained, that a problem exists – Yu Tsumura’s 554th problem – that a) is within the scope of an IMO problem in terms of proof sophistication, b) is not a combinatorics problem which has caused issues for LLMs, c) requires fewer proof techniques than typical hard IMO problems, d) has a publicly available solution (likely in the training data of LLMs), and e) that cannot be readily solved by any existing off-the-shelf LLM (commercial or open-source).

[489] PAC Apprenticeship Learning with Bayesian Active Inverse Reinforcement Learning

Ondrej Bajgar, Dewi S. W. Gould, Jonathon Liu, Alessandro Abate, Konstantinos Gatsis, Michael A. Osborne

Main category: cs.LG

TL;DR: The paper introduces PAC-EIG, an information-theoretic method for active inverse reinforcement learning (IRL) to ensure reliable policies with formal guarantees, addressing the challenge of costly human demonstrations.

DetailsMotivation: Aligning AI decision-making with human preferences is critical, especially in high-stakes domains like autonomous driving, where reliable policies with formal guarantees are needed.

Method: The authors propose PAC-EIG, an acquisition function for active IRL, which maximizes information gain about policy regret to identify critical states for demonstration. Reward-EIG is also introduced for reward-focused learning.

Result: The method provides theoretical guarantees (PAC) for active IRL with noisy demonstrations, outperforms prior heuristic approaches, and demonstrates experimental advantages.

Conclusion: PAC-EIG offers a principled solution for active IRL, ensuring reliable policies with formal guarantees, and is experimentally validated.

Abstract: As AI systems become increasingly autonomous, reliably aligning their decision-making to human preferences is essential. Inverse reinforcement learning (IRL) offers a promising approach to infer preferences from demonstrations. These preferences can then be used to produce an apprentice policy that performs well on the demonstrated task. However, in domains like autonomous driving or robotics, where errors can have serious consequences, we need not just good average performance but reliable policies with formal guarantees – yet obtaining sufficient human demonstrations for reliability guarantees can be costly. Active IRL addresses this challenge by strategically selecting the most informative scenarios for human demonstration. We introduce PAC-EIG, an information-theoretic acquisition function that directly targets probably-approximately-correct (PAC) guarantees for the learned policy – providing the first such theoretical guarantee for active IRL with noisy expert demonstrations. Our method maximises information gain about the regret of the apprentice policy, efficiently identifying states requiring further demonstration. We also present Reward-EIG as an alternative when learning the reward itself is the primary objective. Focusing on finite state-action spaces, we prove convergence bounds, illustrate failure modes of prior heuristic methods, and demonstrate our method’s advantages experimentally.

[490] Out-of-Context Relational Reasoning in Large Language Models

Jonathan Shaki, Emanuele La Malfa, Michael Wooldridge, Sarit Kraus

Main category: cs.LG

TL;DR: LLMs perform better than random but imperfectly on out-of-context reasoning tasks involving binary relations like equality, inequality, and inclusion.

DetailsMotivation: To evaluate LLMs' ability to reason about binary relations (e.g., equality, inequality) out-of-context, focusing on their properties and logical complexity.

Method: Benchmarking LLMs on binary relations by learning new token representations, testing properties like reflexivity, symmetry, and transitivity.

Result: LLMs achieve above-random accuracy but struggle with simple reasoning tasks, though they encode useful information in embeddings.

Conclusion: LLMs show promise in out-of-context reasoning but remain limited in handling binary relations effectively.

Abstract: Binary relations, such as equality, are basic mathematical concepts that appear, implicitly or explicitly, in most benchmarks for Large Language Models (LLM). A recent trend in the literature is benchmarking LLMs on out-of-context learning, where the data is not presented in the prompt, but only during the model’s training. However, existing works mostly focus on higher-order tasks, making it hard to interpret success or failure. In this work, we study how well can LLMs reason out-of-context on binary relations by only learning the representations of newly introduced tokens. Our experiments focus on equality ($=$), inequality ($<$), and inclusion ($\subset$) and the properties they satisfy, such as reflexivity, symmetry, transitivity, and logical complexity (e.g., the number of reasoning “hops”). We show that LLMs achieve better than random accuracy, but are still far from perfect, even on relatively simple reasoning tasks involving binary relations. We analyse the learned representations and show that LLMs encode useful information directly, arranging the embeddings according to the task.

[491] Evaluating LLMs on Real-World Forecasting Against Expert Forecasters

Janna Lu

Main category: cs.LG

TL;DR: LLMs’ forecasting abilities are evaluated against human crowds and experts, showing improvement but still lagging behind experts.

DetailsMotivation: To assess the forecasting capabilities of large language models (LLMs) compared to human crowds and experts, given their understudied potential in this area.

Method: Evaluating state-of-the-art LLMs on 464 forecasting questions from Metaculus, comparing their performance against top human forecasters and experts.

Result: Frontier LLMs achieve Brier scores surpassing the human crowd but still underperform expert groups.

Conclusion: While LLMs show improved forecasting abilities, they remain less accurate than expert humans, indicating room for further advancement.

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but their ability to forecast future events remains understudied. A year ago, large language models struggle to come close to the accuracy of a human crowd. I evaluate state-of-the-art LLMs on 464 forecasting questions from Metaculus, comparing their performance against top forecasters. Frontier models achieve Brier scores that ostensibly surpass the human crowd but still significantly underperform a group of experts.

[492] Principled Foundations for Preference Optimization

Wenxuan Zhou, Shujian Zhang, Brice Magdalou, John Lambert, Ehsan Amid, Richard Nock, Andrew Hard

Main category: cs.LG

TL;DR: The paper connects direct preference optimization (DPO) to broader theories in ML, highlighting its generality and applications.

DetailsMotivation: To provide a principled understanding of DPO by linking it to Savage's loss functions and stochastic choice theories, addressing its diverse applications and pitfalls.

Method: Establishes a connection between DPO and Savage’s losses and stochastic choice theories, supporting abstention, non-convex objectives, and extensions like margins and length corrections.

Result: The paper generalizes DPO, showing its broader applicability and potential pitfalls when deviating from the established framework.

Conclusion: Understanding DPO’s theoretical foundations is crucial for its diverse applications and avoiding pitfalls, with the paper providing a comprehensive map for its use.

Abstract: In this paper, we show that direct preference optimization (DPO) is a very specific form of a connection between two major theories in the ML context of learning from preferences: loss functions (Savage) and stochastic choice (Doignon-Falmagne and Machina). The connection is established for all of Savage’s losses and at this level of generality, (i) it includes support for abstention on the choice theory side, (ii) it includes support for non-convex objectives on the ML side, and (iii) it allows to frame for free some notable extensions of the DPO setting, including margins and corrections for length. Getting to understand how DPO operates from a general principled perspective is crucial because of the huge and diverse application landscape of models, because of the current momentum around DPO, but also – and importantly – because many state of the art variations on DPO definitely occupy a small region of the map that we cover. It also helps to understand the pitfalls of departing from this map, and figure out workarounds.

[493] Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

Mingqi Wu, Zhihao Zhang, Qiaole Dong, Zhiheng Xi, Jun Zhao, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Huijie Lv, Ming Zhang, Yanwei Fu, Qin Liu, Songyang Zhang, Qi Zhang

Main category: cs.LG

TL;DR: The paper investigates the reliability of reinforcement learning (RL) methods in large language models, revealing data contamination issues in benchmarks for models like Qwen2.5. It introduces a clean dataset, RandomCalculation, showing accurate rewards are crucial for performance gains, unlike random or incorrect signals.

DetailsMotivation: To address the unreliability of conclusions from contaminated benchmarks and understand the impact of reward signals in RL for mathematical reasoning.

Method: Empirical analysis using a leakage-free dataset (RandomCalculation) to evaluate RL performance, comparing accurate, random, and incorrect reward signals.

Result: Accurate rewards improve performance beyond base model boundaries, while random or incorrect rewards do not. Contaminated benchmarks mislead evaluations.

Conclusion: Future studies should use uncontaminated benchmarks and test diverse model series to ensure reliable conclusions about RL methods.

Abstract: Reasoning in large language models has long been a central research focus, and recent studies employing reinforcement learning (RL) have introduced diverse methods that yield substantial performance gains with minimal or even no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance performance. However, these breakthroughs are predominantly observed for the mathematically strong Qwen2.5 series on benchmarks such as MATH-500, AMC, and AIME, and seldom transfer to models like Llama, which warrants a more in-depth investigation. In this work, our empirical analysis reveals that pre-training on massive web-scale corpora leaves Qwen2.5 susceptible to data contamination in widely used benchmarks. Consequently, conclusions derived from contaminated benchmarks on Qwen2.5 series may be unreliable. To obtain trustworthy evaluation results, we introduce a generator that creates fully clean arithmetic problems of arbitrary length and difficulty, dubbed RandomCalculation. Using this leakage-free dataset, we show that only accurate reward signals yield steady improvements that surpass the base model’s performance boundary in mathematical reasoning, whereas random or incorrect rewards do not. Moreover, we conduct more fine-grained analyses to elucidate the factors underlying the different performance observed on the MATH-500 and RandomCalculation benchmarks. Consequently, we recommend that future studies evaluate models on uncontaminated benchmarks and, when feasible, test various model series to ensure trustworthy conclusions about RL and related methods.

[494] PCENet: High Dimensional Surrogate Modeling for Learning Uncertainty

Paz Fink Shustin, Shashanka Ubaru, Małgorzata J. Zimoń, Songtao Lu, Vasileios Kalantzis, Lior Horesh, Haim Avron

Main category: cs.LG

TL;DR: A two-stage DRSM approach combines variational autoencoders and polynomial chaos expansion for efficient uncertainty quantification in high-dimensional data.

DetailsMotivation: Addressing the computational challenges of uncertainty quantification in high-dimensional data.

Method: 1) Variational autoencoder for low-dimensional representation. 2) Polynomial chaos expansion for mapping to output.

Result: Efficiently captures system dynamics, learns under uncertainty, and matches high-order moments without prior assumptions.

Conclusion: The DRSM method effectively handles uncertainty in high-dimensional data with demonstrated performance.

Abstract: Learning data representations under uncertainty is an important task that emerges in numerous scientific computing and data analysis applications. However, uncertainty quantification techniques are computationally intensive and become prohibitively expensive for high-dimensional data. In this study, we introduce a dimensionality reduction surrogate modeling (DRSM) approach for representation learning and uncertainty quantification that aims to deal with data of moderate to high dimensions. The approach involves a two-stage learning process: 1) employing a variational autoencoder to learn a low-dimensional representation of the input data distribution; and 2) harnessing polynomial chaos expansion (PCE) formulation to map the low dimensional distribution to the output target. The model enables us to (a) capture the system dynamics efficiently in the low-dimensional latent space, (b) learn under uncertainty, a representation of the data and a mapping between input and output distributions, (c) estimate this uncertainty in the high-dimensional data system, and (d) match high-order moments of the output distribution; without any prior statistical assumptions on the data. Numerical results are presented to illustrate the performance of the proposed method.

[495] R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning

Zhuokun Chen, Zeren Chen, Jiahao He, Mingkui Tan, Jianfei Cai, Bohan Zhuang

Main category: cs.LG

TL;DR: R-Stitch is a hybrid decoding framework that accelerates Chain-of-Thought reasoning by dynamically switching between small and large language models based on token-level confidence, reducing latency by up to 85% with minimal accuracy loss.

DetailsMotivation: Current CoT acceleration methods like speculative decoding have limitations in speedup and fail to leverage small models' potential for concise reasoning. R-Stitch aims to address these inefficiencies.

Method: R-Stitch uses a small language model (SLM) by default and switches to a large language model (LLM) only when SLM confidence is low, avoiding full-sequence rollback and selectively invoking the LLM.

Result: Experiments show R-Stitch reduces inference latency by up to 85% with negligible accuracy drop on math reasoning benchmarks.

Conclusion: R-Stitch is a practical, model-agnostic solution for efficient CoT reasoning, balancing speed and accuracy.

Abstract: Chain-of-thought (CoT) reasoning enhances the problem-solving capabilities of large language models by encouraging step-by-step intermediate reasoning during inference. While effective, CoT introduces substantial computational overhead due to its reliance on autoregressive decoding over long token sequences. Existing acceleration strategies either reduce sequence length through early stopping or compressive reward designs, or improve decoding speed via speculative decoding with smaller models. However, speculative decoding suffers from limited speedup when the agreement between small and large models is low, and fails to exploit the potential advantages of small models in producing concise intermediate reasoning. In this paper, we present R-Stitch, a token-level, confidence-based hybrid decoding framework that accelerates CoT inference by switching between a small language model (SLM) and a large language model (LLM) along the reasoning trajectory. R-Stitch uses the SLM to generate tokens by default and delegates to the LLM only when the SLM’s confidence falls below a threshold. This design avoids full-sequence rollback and selectively invokes the LLM on uncertain steps, preserving both efficiency and answer quality. R-Stitch is model-agnostic, training-free, and compatible with standard decoding pipelines. Experiments on math reasoning benchmarks demonstrate that R-Stitch achieves up to 85% reduction in inference latency with negligible accuracy drop, highlighting its practical effectiveness in accelerating CoT reasoning.

[496] TAPAS: Fast and Automatic Derivation of Tensor Parallel Strategies for Large Neural Networks

Ziji Shi, Le Jiang, Ang Wang, Jie Zhang, Chencan Wu, Yong Li, Xiaokui Xiao, Wei Lin, Jialin Li

Main category: cs.LG

TL;DR: TAPAS is an automatic parallelism framework that optimizes tensor parallel strategies for large neural networks by leveraging repeated substructures, achieving sub-linear search complexity and outperforming existing methods.

DetailsMotivation: The challenge of determining optimal tensor parallel strategies for large neural networks due to an exponentially growing search space motivates the development of TAPAS.

Method: TAPAS uses a divide-and-conquer approach to identify and leverage repeated substructures in neural networks, reducing redundant search efforts and folding the search space efficiently.

Result: TAPAS achieves up to 160× faster search speeds than state-of-the-art frameworks and produces strategies competitive with expert-engineered solutions like Megatron-LM.

Conclusion: TAPAS is a scalable and efficient solution for automatic tensor parallelism in large-scale neural network training.

Abstract: Tensor parallelism is an essential technique for distributed training of large neural networks. However, automatically determining an optimal tensor parallel strategy is challenging due to the gigantic search space, which grows exponentially with model size and tensor dimension. This prohibits the adoption of auto-parallel systems on larger models. We observe that neural networks usually contain repeated substructures, and build an automatic parallelism framework named TAPAS that eliminates redundant search efforts. TAPAS employs a divide-and-conquer approach that efficiently folds the search space by identifying those unique substructures. As a result, it runs at sub-linear complexity concerning the model size, making it a scalable solution for training large-scale networks. Our evaluations demonstrate that TAPAS outperforms the state-of-the-art automatic parallelism frameworks by up to $160\times$ in search speed on a wide range of models, and the performance of derived strategies is competitive or even better compared with the expert-engineered Megatron-LM library.

[497] PPFL: A Personalized Federated Learning Framework for Heterogeneous Population

Hao Di, Yi Yang, Haishan Ye, Xiangyu Chang

Main category: cs.LG

TL;DR: The paper introduces PPFL, a privacy-preserving personalized federated learning framework, using canonical models and membership vectors to model client heterogeneity. It outperforms existing PFL methods and is validated through experiments.

DetailsMotivation: Centralized personalized methods risk exposing raw data; PPFL addresses this by integrating privacy into federated learning while modeling client preferences.

Method: PPFL uses canonical models and membership vectors to capture population characteristics and client preferences. A random block coordinate descent algorithm solves the non-convex optimization problem.

Result: Experiments show PPFL’s effectiveness in modeling heterogeneity and outperforming existing PFL methods.

Conclusion: PPFL provides a flexible, interpretable, and privacy-preserving solution for personalized federated learning, with demonstrated advantages over other PFL approaches.

Abstract: Personalization aims to characterize individual preferences and is widely applied across many fields. However, conventional personalized methods operate in a centralized manner, potentially exposing raw data when pooling individual information. In this paper, with privacy considerations, we develop a flexible and interpretable personalized framework within the paradigm of federated learning, called \texttt{PPFL} (Population Personalized Federated Learning). By leveraging canonical models" to capture fundamental characteristics of a heterogeneous population and employing membership vectors” to reveal clients’ preferences, \texttt{PPFL} models heterogeneity as clients’ varying preferences for these characteristics. This approach provides substantial insights into client characteristics, which are lacking in existing Personalized Federated Learning (PFL) methods. Furthermore, we explore the relationship between \texttt{PPFL} and three main branches of PFL methods: clustered FL, multi-task PFL, and decoupling PFL, and demonstrate the advantages of \texttt{PPFL}. To solve \texttt{PPFL} (a non-convex optimization problem with linear constraints), we propose a novel random block coordinate descent algorithm and establish its convergence properties. We conduct experiments on both pathological and practical data sets, and the results validate the effectiveness of \texttt{PPFL}.

[498] Set-Based Training for Neural Network Verification

Lukas Koller, Tobias Ladner, Matthias Althoff

Main category: cs.LG

TL;DR: Set-based training improves neural network robustness by computing output and gradient sets, simplifying formal verification.

DetailsMotivation: Neural networks are vulnerable to adversarial attacks; ensuring robustness is critical for safety-critical applications.

Method: A novel set-based training procedure computes output and gradient sets, reducing output enclosure size by centering gradients.

Result: Produces robust networks with competitive performance, enabling fast polynomial-time verification.

Conclusion: Set-based training enhances robustness and simplifies verification, making it practical for safety-critical environments.

Abstract: Neural networks are vulnerable to adversarial attacks, i.e., small input perturbations can significantly affect the outputs of a neural network. Therefore, to ensure safety of neural networks in safety-critical environments, the robustness of a neural network must be formally verified against input perturbations, e.g., from noisy sensors. To improve the robustness of neural networks and thus simplify the formal verification, we present a novel set-based training procedure in which we compute the set of possible outputs given the set of possible inputs and compute for the first time a gradient set, i.e., each possible output has a different gradient. Therefore, we can directly reduce the size of the output enclosure by choosing gradients toward its center. Small output enclosures increase the robustness of a neural network and, at the same time, simplify its formal verification. The latter benefit is due to the fact that a larger size of propagated sets increases the conservatism of most verification methods. Our extensive evaluation demonstrates that set-based training produces robust neural networks with competitive performance, which can be verified using fast (polynomial-time) verification algorithms due to the reduced output set.

[499] Filtering with Self-Attention and Storing with MLP: One-Layer Transformers Can Provably Acquire and Extract Knowledge

Ruichen Xu, Kexin Chen

Main category: cs.LG

TL;DR: The paper investigates how transformers acquire and extract knowledge, introducing a one-layer framework with self-attention and MLPs to analyze training dynamics and generalization guarantees.

DetailsMotivation: To address the theoretical opacity of how transformers store and retrieve knowledge, especially given prior limitations of simplified models.

Method: A tractable one-layer transformer framework with self-attention and MLPs, analyzed via gradient dynamics for convergence and generalization guarantees.

Result: Transformers achieve near-optimal training loss (knowledge acquisition) and low generalization error under specific conditions (knowledge extraction), but fail otherwise.

Conclusion: The framework provides theoretical insights into knowledge acquisition and extraction, validated by experiments on synthetic and real-world datasets.

Abstract: Modern large language models excel in knowledge-intensive tasks, yet how transformers acquire (store) knowledge during pre-training and extract (retrieve) it during post-fine-tuning inference remains theoretically opaque. While prior theoretical work has begun to investigate these questions through the analysis of training dynamics, such studies are limited to single-layer, attention-only architectures. However, most existing studies suggest that MLPs are the most contributing components for storing knowledge in transformer-based language models. Meanwhile, our empirical investigations reveal that such simplified models, when trained using standard next-token prediction objectives, may be incapable of acquiring or extracting factual knowledge. To overcome this limitation, we introduce a tractable one-layer transformer framework that crucially incorporates both self-attention and MLP modules. By tracking its gradient dynamics, we establish convergence and generalization guarantees that illuminate the ability of knowledge acquisition and extraction. We prove that 1) Transformers can achieve near-optimal training loss during pre-training, signifying effective knowledge acquisition; 2) With a large fine-tuning dataset and specific data multiplicity conditions met, transformers can achieve low generalization error when tested on factual knowledge learned during pre-training but not reinforced during the fine-tuning, indicating successful knowledge extraction; 3) When the conditions are not satisfied, transformers exhibit high generalization loss, resulting in hallucinations. Our analysis includes both full fine-tuning and low-rank fine-tuning. Furthermore, our analysis offers theoretical insights into several pertinent empirical phenomena, such as the role of learning rate schedules. Experiments on synthetic and real-world PopQA datasets with GPT-2 and Llama-3.2-1B validate our results.

[500] Efficient Time Series Processing for Transformers and State-Space Models through Token Merging

Leon Götz, Marcel Kollovieh, Stephan Günnemann, Leo Schwinn

Main category: cs.LG

TL;DR: Token merging, especially local merging, improves efficiency in time series analysis for transformers and state-space models, offering linear complexity and causal merging for decoders.

DetailsMotivation: Long token sequences impose high computational costs, and token merging is explored to enhance efficiency in time series analysis.

Method: Introduces local merging, a domain-specific algorithm that combines tokens within a local neighborhood, adjusting complexity and enabling causal merging.

Result: Local merging achieves up to 5400% acceleration with minimal accuracy impact, validated on the Chronos foundation model.

Conclusion: Local merging is a scalable and efficient solution for long sequences, with predictable benefits based on spectral properties.

Abstract: Despite recent advances in subquadratic attention mechanisms or state-space models, processing long token sequences still imposes significant computational requirements. Token merging has emerged as a solution to increase computational efficiency in computer vision architectures. In this work, we perform the first investigations of token merging in time series analysis on both transformers and state-space models. We further introduce local merging, a domain-specific token merging algorithm that selectively combines tokens within a local neighborhood, achieving two major benefits: a) Local merging can adjust its computational complexity from quadratic to linear based on the neighborhood size to effectively scale to long sequences; b) Local merging is the first causal merging scheme enabling token merging in transformer decoders. Further, we identify spectral properties of the input data that reliably predict the potential benefits of local merging without requiring evaluation on downstream tasks. Our comprehensive empirical evaluation demonstrates that local merging offers substantial efficiency gains with minimal impact on accuracy, achieving up to 5400% acceleration on the recently proposed Chronos foundation model.

[501] Balancing Optimality and Diversity: Human-Centered Decision Making through Generative Curation

Michael Lingzhi Li, Shixiang Zhu

Main category: cs.LG

TL;DR: Generative curation is a framework for creating diverse, high-quality recommendation sets to enhance human decision-making by balancing quantitative quality and qualitative diversity.

DetailsMotivation: Human decision-makers often rely on algorithmic recommendations, but the final choice depends on unobserved qualitative factors. Current methods focus on a single 'optimum,' missing the need for diverse options.

Method: The framework uses a generative neural network and sequential optimization to learn a distribution of solutions, maximizing the expected desirability of the best option in a portfolio.

Result: The approach reduces expected regret compared to benchmarks, showing effectiveness in synthetic and real-world studies.

Conclusion: Generative curation provides a principled way to design algorithms that complement human judgment, accommodating unmodeled factors and enabling scalable human-centered decision-making.

Abstract: Operational decisions in healthcare, logistics, and public policy increasingly involve algorithms that recommend candidate solutions, such as treatment plans, delivery routes, or policy options, while leaving the final choice to human decision-makers. For instance, school districts use algorithms to design bus routes, but administrators make the final call given community feedback. In these settings, decision quality depends not on a single algorithmic ``optimum’’, but on whether the portfolio of recommendations contains at least one option the human ultimately deems desirable. We propose generative curation, a framework that optimally generates recommendation sets when desirability depends on both observable objectives and unobserved qualitative considerations. Instead of a fixed solution, generative curation learns a distribution over solutions designed to maximize the expected desirability of the best option within a manageable portfolio. Our analysis identifies a trade-off between quantitative quality and qualitative diversity, formalized through a novel diversity metric derived from the reformulated objective. We implement the framework using a generative neural network and a sequential optimization method, and show in synthetic and real-world studies that it consistently reduces expected regret compared to existing benchmarks. Our framework provides decision-makers with a principled way to design algorithms that complement, rather than replace, human judgment. By generating portfolios of diverse yet high-quality options, decision-support tools can better accommodate unmodeled factors such as stakeholder preferences, political feasibility, or community acceptance. More broadly, the framework enables organizations to operationalize human-centered decision-making at scale, ensuring that algorithmic recommendations remain useful even when objectives are incomplete or evolving.

[502] Risk-averse learning with delayed feedback

Siyi Wang, Zifan Wang, Karl Henrik Johansson, Sandra Hirche

Main category: cs.LG

TL;DR: The paper explores risk-averse learning with delayed feedback, using CVaR as a risk measure. Two algorithms (one-point and two-point zeroth-order optimization) are developed, with regret bounds analyzed. The two-point method outperforms the one-point method, and experiments on dynamic pricing validate their performance.

DetailsMotivation: Risk-averse learning is crucial for mitigating adverse outcomes, but delayed feedback complicates risk assessment and management.

Method: Two risk-averse learning algorithms are developed using one-point and two-point zeroth-order optimization, with regret bounds analyzed in terms of cumulative delay and total samplings.

Result: The two-point algorithm achieves a smaller regret bound than the one-point method. Without delay, regret bounds align with existing zeroth-order stochastic gradient methods.

Conclusion: The proposed algorithms effectively handle delayed feedback in risk-averse learning, with the two-point method showing superior performance, as validated by numerical experiments.

Abstract: In real-world scenarios, risk-averse learning is valuable for mitigating potential adverse outcomes. However, the delayed feedback makes it challenging to assess and manage risk effectively. In this paper, we investigate risk-averse learning using Conditional Value at Risk (CVaR) as risk measure, while incorporating feedback with random but bounded delays. We develop two risk-averse learning algorithms that rely on one-point and two-point zeroth-order optimization approaches, respectively. The dynamic regrets of the algorithms are analyzed in terms of the cumulative delay and the number of total samplings. In the absence of delay, the regret bounds match the established bounds of zeroth-order stochastic gradient methods for risk-averse learning. Furthermore, the two-point risk-averse learning outperforms the one-point algorithm by achieving a smaller regret bound. We provide numerical experiments on a dynamic pricing problem to demonstrate the performance of the algorithms.

[503] Clinicians’ Voice: Fundamental Considerations for XAI in Healthcare

T. E. Röber, R. Goedhart, S. İ. Birbil

Main category: cs.LG

TL;DR: The paper explores clinicians’ perspectives on Explainable AI (XAI) in healthcare, emphasizing the need for user input, workflow integration, and clinician training.

DetailsMotivation: To address the lack of end-user input in XAI research and improve practical adoption in healthcare.

Method: Semi-structured interviews with clinicians to gather their thoughts, hopes, and concerns about AI-based tools.

Result: Clinicians are positive about AI but worry about workflow fit and patient relations. Training and general requirements for XAI tools are highlighted.

Conclusion: A holistic approach is needed to define XAI requirements in healthcare before tool-specific testing.

Abstract: Explainable AI (XAI) holds the promise of advancing the implementation and adoption of AI-based tools in practice, especially in high-stakes environments like healthcare. However, most of the current research lacks input from end users, and therefore their practical value is limited. To address this, we conducted semi-structured interviews with clinicians to discuss their thoughts, hopes, and concerns. Clinicians from our sample generally think positively about developing AI-based tools for clinical practice, but they have concerns about how these will fit into their workflow and how it will impact clinician-patient relations. We further identify training of clinicians on AI as a crucial factor for the success of AI in healthcare and highlight aspects clinicians are looking for in (X)AI-based tools. In contrast to other studies, we take on a holistic and exploratory perspective to identify general requirements for (X)AI products for healthcare before moving on to testing specific tools.

[504] OWLed: Outlier-weighed Layerwise Pruning for Efficient Autonomous Driving Framework

Jiaxi Li, Lu Yin, Xilu Wang

Main category: cs.LG

TL;DR: OWLed introduces outlier-weighted layerwise pruning to compress LLMs for autonomous driving, reducing computational demands without fine-tuning.

DetailsMotivation: The high computational cost of deploying LLMs locally in autonomous driving systems makes it impractical, necessitating efficient compression methods.

Method: OWLed uses outlier-weighted layerwise sparsity, assigning non-uniform sparsity ratios based on outlier features, and incorporates driving data for calibration.

Result: OWLed outperforms existing methods in perception, action prediction, and language understanding while reducing computational needs.

Conclusion: Combining advanced pruning with LLMs enables efficient, robust autonomous driving systems for complex scenarios.

Abstract: The integration of Large Language Models (LLMs) into autonomous driving systems offers promising enhancements in environmental understanding and decision-making. However, the substantial computational demands of deploying LLMs locally on vehicles render this approach unfeasible for real-world automotive applications. To address this challenge, we introduce OWLed, the Outlier-Weighed Layerwise Pruning for Efficient Autonomous Driving Framework that leverages outlier-weighted layerwise sparsity for model compression. Our method assigns non-uniform sparsity ratios to different layers based on the distribution of outlier features, significantly reducing the model size without the need for fine-tuning. To ensure the compressed model adapts well to autonomous driving tasks, we incorporate driving environment data into both the calibration and pruning processes. Our empirical studies reveal that the encoder component is more sensitive to pruning than the LLM, highlighting its critical role in the system. Experimental results demonstrate that OWLed outperforms existing methods in perception, action prediction, and language understanding while substantially lowering computational requirements. These findings underscore the potential of combining advanced pruning techniques with LLMs to develop efficient and robust autonomous driving systems capable of handling complex scenarios. Code will be made publicly available.

[505] Learning New Concepts, Remembering the Old: Continual Learning for Multimodal Concept Bottleneck Models

Songning Lai, Mingqian Liao, Zhangyi Hu, Jiayu Yang, Wenshuo Chen, Hongru Xiao, Jianheng Tang, Haicheng Liao, Yutao Yue

Main category: cs.LG

TL;DR: CONCIL is a novel framework for continual learning in Concept Bottleneck Models (CBMs), addressing static dataset limitations by enabling concept- and class-incremental learning through linear regression, preventing catastrophic forgetting efficiently.

DetailsMotivation: Existing CBMs lack adaptability to evolving multimodal data streams, limiting real-world applicability.

Method: CONCIL reformulates concept and decision layer updates as linear regression problems, using recursive matrix operations for efficiency.

Result: CONCIL achieves absolute knowledge memory and outperforms traditional CBMs in incremental learning tasks.

Conclusion: CONCIL establishes a new paradigm for continual learning in CBMs, enhancing dynamic multimodal understanding.

Abstract: Concept Bottleneck Models (CBMs) enhance the interpretability of AI systems, particularly by bridging visual input with human-understandable concepts, effectively acting as a form of multimodal interpretability model. However, existing CBMs typically assume static datasets, which fundamentally limits their adaptability to real-world, continuously evolving multimodal data streams. To address this, we define a novel continual learning task for CBMs: simultaneously handling concept-incremental and class-incremental learning. This task requires models to continuously acquire new concepts (often representing cross-modal attributes) and classes while robustly preserving previously learned knowledge. To tackle this challenging problem, we propose CONceptual Continual Incremental Learning (CONCIL), a novel framework that fundamentally re-imagines concept and decision layer updates as linear regression problems. This reformulation eliminates the need for gradient-based optimization, thereby effectively preventing catastrophic forgetting. Crucially, CONCIL relies solely on recursive matrix operations, rendering it highly computationally efficient and well-suited for real-time and large-scale multimodal data applications. Experimental results compellingly demonstrate that CONCIL achieves “absolute knowledge memory” and significantly surpasses the performance of traditional CBM methods in both concept- and class-incremental settings, thus establishing a new paradigm for continual learning in CBMs, particularly valuable for dynamic multimodal understanding.

[506] Training Multi-Layer Binary Neural Networks With Local Binary Error Signals

Luca Colombo, Fabrizio Pittorino, Manuel Roveri

Main category: cs.LG

TL;DR: A fully binary, gradient-free training algorithm for multi-layer BNNs is proposed, eliminating floating-point gradients and using local binary signals, improving accuracy and reducing computational cost.

DetailsMotivation: Existing BNN training relies on floating-point SGD, limiting binary operation benefits to inference. A fully binary training method is needed.

Method: The algorithm uses local binary error signals, binary weight updates, and integer-valued hidden weights, employing XNOR, Popcount, and increment/decrement operations.

Result: Test accuracy improved by up to +35.47% over existing binary solutions and +35.30% over full-precision SGD, with significantly reduced computational cost.

Conclusion: The proposed fully binary training algorithm enhances BNN performance and efficiency, offering a neurobiologically plausible solution.

Abstract: Binary Neural Networks (BNNs) significantly reduce computational complexity and memory usage in machine and deep learning by representing weights and activations with just one bit. However, most existing training algorithms for BNNs rely on quantization-aware floating-point Stochastic Gradient Descent (SGD), limiting the full exploitation of binary operations to the inference phase only. In this work, we propose, for the first time, a fully binary and gradient-free training algorithm for multi-layer BNNs, eliminating the need for back-propagated floating-point gradients. Specifically, the proposed algorithm relies on local binary error signals and binary weight updates, employing integer-valued hidden weights that serve as a synaptic metaplasticity mechanism, thereby enhancing its neurobiological plausibility. Our proposed solution enables the training of binary multi-layer perceptrons by using exclusively XNOR, Popcount, and increment/decrement operations. Experimental results on multi-class classification benchmarks show test accuracy improvements of up to +35.47% over the only existing fully binary single-layer state-of-the-art solution. Compared to full-precision SGD, our solution improves test accuracy by up to +35.30% under the same total memory demand, while also reducing computational cost by two to three orders of magnitude in terms of the total number of Boolean gates. The proposed algorithm is made available to the scientific community as a public repository.

[507] Average-Reward Soft Actor-Critic

Jacob Adamczyk, Volodymyr Makarenko, Stas Tiomkin, Rahul V. Kulkarni

Main category: cs.LG

TL;DR: The paper introduces an average-reward soft actor-critic algorithm, filling gaps in entropy-regularized average-reward RL, and shows superior performance on benchmarks.

DetailsMotivation: Addressing the lack of deep RL algorithms for entropy-regularized average-reward objectives and the underexplored actor-critic framework in this context.

Method: Develops an average-reward soft actor-critic algorithm, combining entropy regularization with average-reward RL.

Result: Achieves superior performance compared to existing average-reward algorithms on standard RL benchmarks.

Conclusion: The proposed algorithm effectively bridges gaps in entropy-regularized average-reward RL and outperforms existing methods.

Abstract: The average-reward formulation of reinforcement learning (RL) has drawn increased interest in recent years for its ability to solve temporally-extended problems without relying on discounting. Meanwhile, in the discounted setting, algorithms with entropy regularization have been developed, leading to improvements over deterministic methods. Despite the distinct benefits of these approaches, deep RL algorithms for the entropy-regularized average-reward objective have not been developed. While policy-gradient based approaches have recently been presented for the average-reward literature, the corresponding actor-critic framework remains less explored. In this paper, we introduce an average-reward soft actor-critic algorithm to address these gaps in the field. We validate our method by comparing with existing average-reward algorithms on standard RL benchmarks, achieving superior performance for the average-reward criterion.

[508] Class Imbalance in Anomaly Detection: Learning from an Exactly Solvable Model

F. S. Pezzicoli, V. Ros, F. P. Landes, M. Baity-Jesi

Main category: cs.LG

TL;DR: The paper provides a theoretical framework for addressing class imbalance (CI) in anomaly detection, revealing optimal train imbalance isn’t always 50% and depends on data and noise.

DetailsMotivation: Class imbalance is a persistent issue in machine learning, but empirical solutions lack theoretical grounding. The study aims to clarify CI's impact and solutions.

Method: Uses the teacher-student perceptron model and replica theory to analyze CI, distinguishing intrinsic, train, and test imbalance.

Result: Optimal train imbalance varies with intrinsic imbalance, data abundance, and noise. Performance degrades in high-noise regimes.

Conclusion: Challenges conventional wisdom on CI and offers practical guidelines for addressing it.

Abstract: Class imbalance (CI) is a longstanding problem in machine learning, slowing down training and reducing performances. Although empirical remedies exist, it is often unclear which ones work best and when, due to the lack of an overarching theory. We address a common case of imbalance, that of anomaly (or outlier) detection. We provide a theoretical framework to analyze, interpret and address CI. It is based on an exact solution of the teacher-student perceptron model, through replica theory. Within this framework, one can distinguish several sources of CI: either intrinsic, train or test imbalance. Our analysis reveals that the optimal train imbalance is generally different from 50%, with a non trivial dependence on the intrinsic imbalance, the abundance of data and on the noise in the learning. Moreover, there is a crossover between a small noise training regime where results are independent of the noise level to a high noise regime where performances quickly degrade with noise. Our results challenge some of the conventional wisdom on CI and offer practical guidelines to address it.

[509] Shaping Sparse Rewards in Reinforcement Learning: A Semi-supervised Approach

Wenyun Li, Wenjie Huang, Chen Sun

Main category: cs.LG

TL;DR: The paper introduces a method combining Semi-Supervised Learning (SSL) and novel data augmentation to improve reward shaping in sparse-reward environments, outperforming supervised approaches.

DetailsMotivation: Sparse rewards in real-world scenarios make learning effective reward functions challenging, necessitating better methods for reward shaping.

Method: Uses SSL and a novel data augmentation technique (double entropy) to learn from zero-reward transitions, enhancing reward inference.

Result: Outperforms supervised baselines, achieving up to twice the peak scores in sparse-reward environments and a 15.8% increase in best score.

Conclusion: The proposed method significantly improves reward shaping efficacy, especially in sparse-reward settings.

Abstract: In many real-world scenarios, reward signal for agents are exceedingly sparse, making it challenging to learn an effective reward function for reward shaping. To address this issue, the proposed approach in this paper performs reward shaping not only by utilizing non-zero-reward transitions but also by employing the \emph{Semi-Supervised Learning} (SSL) technique combined with a novel data augmentation to learn trajectory space representations from the majority of transitions, {i.e}., zero-reward transitions, thereby improving the efficacy of reward shaping. Experimental results in Atari and robotic manipulation demonstrate that our method outperforms supervised-based approaches in reward inference, leading to higher agent scores. Notably, in more sparse-reward environments, our method achieves up to twice the peak scores compared to supervised baselines. The proposed double entropy data augmentation enhances performance, showcasing a 15.8% increase in best score over other augmentation methods

[510] CAMEF: Causal-Augmented Multi-Modality Event-Driven Financial Forecasting by Integrating Time Series Patterns and Salient Macroeconomic Announcements

Yang Zhang, Wenbo Yang, Jun Wang, Qiang Ma, Jie Xiong

Main category: cs.LG

TL;DR: CAMEF is a multi-modal framework integrating textual and time-series data with causal learning and LLM-based counterfactual augmentation for financial forecasting.

DetailsMotivation: Existing methods fail to capture multi-modal data and causal relationships between macroeconomic events and market behavior.

Method: CAMEF combines textual and time-series data with causal learning and LLM-based counterfactual event augmentation.

Result: CAMEF outperforms state-of-the-art baselines and includes a new financial dataset.

Conclusion: CAMEF effectively enhances financial forecasting by addressing multi-modal and causal gaps.

Abstract: Accurately forecasting the impact of macroeconomic events is critical for investors and policymakers. Salient events like monetary policy decisions and employment reports often trigger market movements by shaping expectations of economic growth and risk, thereby establishing causal relationships between events and market behavior. Existing forecasting methods typically focus either on textual analysis or time-series modeling, but fail to capture the multi-modal nature of financial markets and the causal relationship between events and price movements. To address these gaps, we propose CAMEF (Causal-Augmented Multi-Modality Event-Driven Financial Forecasting), a multi-modality framework that effectively integrates textual and time-series data with a causal learning mechanism and an LLM-based counterfactual event augmentation technique for causal-enhanced financial forecasting. Our contributions include: (1) a multi-modal framework that captures causal relationships between policy texts and historical price data; (2) a new financial dataset with six types of macroeconomic releases from 2008 to April 2024, and high-frequency real trading data for five key U.S. financial assets; and (3) an LLM-based counterfactual event augmentation strategy. We compare CAMEF to state-of-the-art transformer-based time-series and multi-modal baselines, and perform ablation studies to validate the effectiveness of the causal learning mechanism and event types.

[511] A First-order Generative Bilevel Optimization Framework for Diffusion Models

Quan Xiao, Hui Yuan, A F M Saif, Gaowen Liu, Ramana Kompella, Mengdi Wang, Tianyi Chen

Main category: cs.LG

TL;DR: The paper introduces a first-order bilevel optimization framework to address challenges in optimizing diffusion models for downstream tasks, outperforming existing methods.

DetailsMotivation: Traditional bilevel methods fail for diffusion models due to infinite-dimensional probability spaces and high sampling costs, necessitating a new approach.

Method: The framework formalizes generative bilevel optimization, focusing on fine-tuning pre-trained models and training from scratch with noise schedule optimization.

Result: The proposed method outperforms existing fine-tuning and hyperparameter search baselines in experiments.

Conclusion: The first-order bilevel framework provides a theoretically grounded and computationally practical solution for optimizing diffusion models.

Abstract: Diffusion models, which iteratively denoise data samples to synthesize high-quality outputs, have achieved empirical success across domains. However, optimizing these models for downstream tasks often involves nested bilevel structures, such as tuning hyperparameters for fine-tuning tasks or noise schedules in training dynamics, where traditional bilevel methods fail due to the infinite-dimensional probability space and prohibitive sampling costs. We formalize this challenge as a generative bilevel optimization problem and address two key scenarios: (1) fine-tuning pre-trained models via an inference-only lower-level solver paired with a sample-efficient gradient estimator for the upper level, and (2) training diffusion model from scratch with noise schedule optimization by reparameterizing the lower-level problem and designing a computationally tractable gradient estimator. Our first-order bilevel framework overcomes the incompatibility of conventional bilevel methods with diffusion processes, offering theoretical grounding and computational practicality. Experiments demonstrate that our method outperforms existing fine-tuning and hyperparameter search baselines.

[512] Vertical Federated Continual Learning via Evolving Prototype Knowledge

Shuo Wang, Keke Gai, Jing Yu, Liehuang Zhu, Qi Wu

Main category: cs.LG

TL;DR: V-LETO is a novel vertical federated continual learning method that addresses catastrophic forgetting in VFL by evolving prototypes and optimizing local models.

DetailsMotivation: Traditional VFL lacks mechanisms for continual learning, leading to knowledge loss in sequential tasks.

Method: V-LETO uses evolving prototype knowledge and restricts updates to specific local model parameters to retain past and current task knowledge.

Result: V-LETO outperforms state-of-the-art methods by 10.39% (CIL) and 35.15% (FIL).

Conclusion: V-LETO effectively mitigates catastrophic forgetting in VFL, enhancing performance in continual learning tasks.

Abstract: Vertical Federated Learning (VFL) has garnered significant attention as a privacy-preserving machine learning framework for sample-aligned feature federation. However, traditional VFL approaches do not address the challenges of class and feature continual learning, resulting in catastrophic forgetting of knowledge from previous tasks. To address the above challenge, we propose a novel vertical federated continual learning method, named Vertical Federated Continual Learning via Evolving Prototype Knowledge (V-LETO), which primarily facilitates the transfer of knowledge from previous tasks through the evolution of prototypes. Specifically, we propose an evolving prototype knowledge method, enabling the global model to retain both previous and current task knowledge. Furthermore, we introduce a model optimization technique that mitigates the forgetting of previous task knowledge by restricting updates to specific parameters of the local model, thereby enhancing overall performance. Extensive experiments conducted in both CIL and FIL settings demonstrate that our method, V-LETO, outperforms the other state-of-the-art methods. For example, our method outperforms the state-of-the-art method by 10.39% and 35.15% for CIL and FIL tasks, respectively. Our code is available at https://anonymous.4open.science/r/V-LETO-0108/README.md.

[513] Entropy-Lens: The Information Signature of Transformer Computations

Riccardo Ali, Francesco Caso, Christopher Irwin, Pietro Liò

Main category: cs.LG

TL;DR: The paper introduces Entropy-Lens, a framework to analyze token-level distributions in transformers using Shannon entropy, revealing patterns and correlations without modifying the model.

DetailsMotivation: To study the evolution of token-level distributions in transformers directly in vocabulary space, as traditional descriptors are ill-suited for high-dimensional, unordered distributions.

Method: Compute Shannon entropy of intermediate predicted distributions to create entropy profiles, a compact signature of the model’s computation.

Result: Entropy profiles reveal computation patterns, predict prompt types, and correlate with output correctness, without needing gradients or model internals.

Conclusion: Shannon entropy is a stable and principled summary for analyzing transformer computations, validated across models and tasks.

Abstract: Transformer models map input token sequences to output token distributions, layer by layer. While most interpretability work focuses on internal latent representations, we study the evolution of these token-level distributions directly in vocabulary space. However, such distributions are high-dimensional and defined on an unordered support, making common descriptors like moments or cumulants ill-suited. We address this by computing the Shannon entropy of each intermediate predicted distribution, yielding one interpretable scalar per layer. The resulting sequence, the entropy profile, serves as a compact, information-theoretic signature of the model’s computation. We introduce Entropy-Lens, a model-agnostic framework that extracts entropy profiles from frozen, off-the-shelf transformers. We show that these profiles (i) reveal family-specific computation patterns invariant under depth rescaling, (ii) are predictive of prompt type and task format, and (iii) correlate with output correctness. We further show that R'enyi entropies yield similar results within a broad range of $\alpha$ values, justifying the use of Shannon entropy as a stable and principled summary. Our results hold across different transformers, without requiring gradients, fine-tuning, or access to model internals.

[514] Scalable Graph Condensation with Evolving Capabilities

Shengbo Gong, Mohammad Hashemi, Juntong Ni, Carl Yang, Wei Jin

Main category: cs.LG

TL;DR: GECC introduces a scalable graph condensation method for evolving graph data, achieving superior performance and speedup over existing methods.

DetailsMotivation: Existing graph condensation methods assume static data, conflicting with the dynamic nature of real-world graphs.

Method: GECC uses class-wise clustering on aggregated features and inherits previous condensation results for evolving capability.

Result: GECC outperforms state-of-the-art methods with a 1000× speedup on large datasets.

Conclusion: GECC effectively handles evolving graph data, offering efficiency and scalability.

Abstract: The rapid growth of graph data creates significant scalability challenges as most graph algorithms scale quadratically with size. To mitigate these issues, Graph Condensation (GC) methods have been proposed to learn a small graph from a larger one, accelerating downstream tasks. However, existing approaches critically assume a static training set, which conflicts with the inherently dynamic and evolving nature of real-world graph data. This work introduces a novel framework for continual graph condensation, enabling efficient updates to the distilled graph that handle data streams without requiring costly retraining. This limitation leads to inefficiencies when condensing growing training sets. In this paper, we introduce GECC (\underline{G}raph \underline{E}volving \underline{C}lustering \underline{C}ondensation), a scalable graph condensation method designed to handle large-scale and evolving graph data. GECC employs a traceable and efficient approach by performing class-wise clustering on aggregated features. Furthermore, it can inherit previous condensation results as clustering centroids when the condensed graph expands, thereby attaining an evolving capability. This methodology is supported by robust theoretical foundations and demonstrates superior empirical performance. Comprehensive experiments including real world scenario show that GECC achieves better performance than most state-of-the-art graph condensation methods while delivering an around 1000$\times$ speedup on large datasets.

[515] The Curse of Conditions: Analyzing and Improving Optimal Transport for Conditional Flow-Based Generation

Ho Kei Cheng, Alexander Schwing

Main category: cs.LG

TL;DR: Minibatch optimal transport coupling simplifies unconditional flow matching but fails in conditional settings due to skewed priors. Proposed C²OT improves performance by adding conditional weighting.

DetailsMotivation: Address the gap between training and testing in conditional settings caused by skewed priors in minibatch optimal transport.

Method: Introduces conditional optimal transport (C²OT) with a conditional weighting term in the cost matrix.

Result: Outperforms baselines in tasks like 8gaussians-to-moons, CIFAR-10, and ImageNet across different budgets.

Conclusion: C²OT effectively bridges the training-testing gap in conditional settings, improving performance.

Abstract: Minibatch optimal transport coupling straightens paths in unconditional flow matching. This leads to computationally less demanding inference as fewer integration steps and less complex numerical solvers can be employed when numerically solving an ordinary differential equation at test time. However, in the conditional setting, minibatch optimal transport falls short. This is because the default optimal transport mapping disregards conditions, resulting in a conditionally skewed prior distribution during training. In contrast, at test time, we have no access to the skewed prior, and instead sample from the full, unbiased prior distribution. This gap between training and testing leads to a subpar performance. To bridge this gap, we propose conditional optimal transport C^2OT that adds a conditional weighting term in the cost matrix when computing the optimal transport assignment. Experiments demonstrate that this simple fix works with both discrete and continuous conditions in 8gaussians-to-moons, CIFAR-10, ImageNet-32x32, and ImageNet-256x256. Our method performs better overall compared to the existing baselines across different function evaluation budgets. Code is available at https://hkchengrex.github.io/C2OT

[516] Augmented Adversarial Trigger Learning

Zhe Wang, Yanjun Qi

Main category: cs.LG

TL;DR: ATLA introduces a weighted loss for adversarial trigger learning, improving efficiency and generalization with fewer queries.

DetailsMotivation: To enhance adversarial trigger learning by optimizing response format tokens and suppressing evasive responses.

Method: ATLA uses a weighted loss formulation and an auxiliary loss to improve trigger learning from minimal data.

Result: ATLA achieves nearly 100% attack success with 80% fewer queries and generalizes well to unseen queries and LLMs.

Conclusion: ATLA outperforms state-of-the-art methods, offering efficient and generalizable adversarial trigger learning.

Abstract: Gradient optimization-based adversarial attack methods automate the learning of adversarial triggers to generate jailbreak prompts or leak system prompts. In this work, we take a closer look at the optimization objective of adversarial trigger learning and propose ATLA: Adversarial Trigger Learning with Augmented objectives. ATLA improves the negative log-likelihood loss used by previous studies into a weighted loss formulation that encourages the learned adversarial triggers to optimize more towards response format tokens. This enables ATLA to learn an adversarial trigger from just one query-response pair and the learned trigger generalizes well to other similar queries. We further design a variation to augment trigger optimization with an auxiliary loss that suppresses evasive responses. We showcase how to use ATLA to learn adversarial suffixes jailbreaking LLMs and to extract hidden system prompts. Empirically we demonstrate that ATLA consistently outperforms current state-of-the-art techniques, achieving nearly 100% success in attacking while requiring 80% fewer queries. ATLA learned jailbreak suffixes demonstrate high generalization to unseen queries and transfer well to new LLMs. We released our code \href{https://github.com/QData/ALTA_Augmented_Adversarial_Trigger_Learning}{here}.

[517] Potential Score Matching: Debiasing Molecular Structure Sampling with Potential Energy Guidance

Liya Guo, Zun Wang, Chang Liu, Junzhe Li, Pipi Hu, Yi Zhu

Main category: cs.LG

TL;DR: Potential Score Matching (PSM) is introduced as a novel method to efficiently sample molecular conformations by leveraging potential energy gradients, outperforming traditional methods and diffusion models.

DetailsMotivation: Traditional methods like MD and MCMC are costly and time-consuming, while diffusion models struggle with unbiased distribution sampling due to ergodicity requirements.

Method: PSM uses potential energy gradients to guide generative models, avoiding the need for exact energy functions and debiasing samples from limited or biased data.

Result: PSM outperforms SOTA models on the LJ potential and approximates the Boltzmann distribution better than traditional diffusion models on MD17 and MD22 datasets.

Conclusion: PSM offers a promising, efficient alternative for sampling molecular conformations, addressing key limitations of existing methods.

Abstract: The ensemble average of physical properties of molecules is closely related to the distribution of molecular conformations, and sampling such distributions is a fundamental challenge in physics and chemistry. Traditional methods like molecular dynamics (MD) simulations and Markov chain Monte Carlo (MCMC) sampling are commonly used but can be time-consuming and costly. Recently, diffusion models have emerged as efficient alternatives by learning the distribution of training data. Obtaining an unbiased target distribution is still an expensive task, primarily because it requires satisfying ergodicity. To tackle these challenges, we propose Potential Score Matching (PSM), an approach that utilizes the potential energy gradient to guide generative models. PSM does not require exact energy functions and can debias sample distributions even when trained on limited and biased data. Our method outperforms existing state-of-the-art (SOTA) models on the Lennard-Jones (LJ) potential, a commonly used toy model. Furthermore, we extend the evaluation of PSM to high-dimensional problems using the MD17 and MD22 datasets. The results demonstrate that molecular distributions generated by PSM more closely approximate the Boltzmann distribution compared to traditional diffusion models.

[518] Spectral Architecture Search for Neural Network Models

Gianluca Peri, Lorenzo Chicchi, Duccio Fanelli, Lorenzo Giambagli

Main category: cs.LG

TL;DR: SPARCS is a novel architecture search protocol using spectral attributes of inter-layer transfer matrices for gradient-based optimization, yielding efficient and minimal architectures.

DetailsMotivation: Addressing the challenges of architecture design and optimization in artificial neural networks.

Method: SPARCS exploits spectral attributes of inter-layer transfer matrices to explore architectures via continuous, differentiable manifolds, enabling gradient-based optimization.

Result: The method produces self-emerging architectures with minimal expressivity and reduced parameters compared to alternatives.

Conclusion: SPARCS offers an efficient approach to neural architecture search by leveraging spectral properties and gradient optimization.

Abstract: Architecture design and optimization are challenging problems in the field of artificial neural networks. Working in this context, we here present SPARCS (SPectral ARchiteCture Search), a novel architecture search protocol which exploits the spectral attributes of the inter-layer transfer matrices. SPARCS allows one to explore the space of possible architectures by spanning continuous and differentiable manifolds, thus enabling for gradient-based optimization algorithms to be eventually employed. With reference to simple benchmark models, we show that the newly proposed method yields a self-emerging architecture with a minimal degree of expressivity to handle the task under investigation and with a reduced parameter count as compared to other viable alternatives.

[519] Efficient Generative Model Training via Embedded Representation Warmup

Deyuan Liu, Peng Sun, Xufeng Li, Tao Lin

Main category: cs.LG

TL;DR: The paper introduces Embedded Representation Warmup (ERW), a method to improve diffusion models by pretraining early layers with high-quality representations, speeding up convergence and enhancing performance.

DetailsMotivation: Diffusion models underutilize high-quality representations during training, slowing convergence. The goal is to address this bottleneck.

Method: ERW is a plug-and-play framework that warms up early layers with pretrained representations, reducing the need to learn from scratch.

Result: ERW achieves a 40× training speed acceleration and improves representation quality compared to state-of-the-art methods.

Conclusion: ERW effectively addresses the bottleneck in diffusion models by optimizing early-layer representation processing, leading to faster convergence and better performance.

Abstract: Diffusion models excel at generating high-dimensional data but fall short in training efficiency and representation quality compared to self-supervised methods. We identify a key bottleneck: the underutilization of high-quality, semantically rich representations during training notably slows down convergence. Our systematic analysis reveals a critical representation processing region – primarily in the early layers – where semantic and structural pattern learning takes place before generation can occur. To address this, we propose Embedded Representation Warmup (ERW), a plug-and-play framework where in the first stage we get the ERW module serves as a warmup that initializes the early layers of the diffusion model with high-quality, pretrained representations. This warmup minimizes the burden of learning representations from scratch, thereby accelerating convergence and boosting performance. Our theoretical analysis demonstrates that ERW’s efficacy depends on its precise integration into specific neural network layers – termed the representation processing region – where the model primarily processes and transforms feature representations for later generation. We further establish that ERW not only accelerates training convergence but also enhances representation quality: empirically, our method achieves a 40$\times$ acceleration in training speed compared to REPA, the current state-of-the-art methods. Code is available at https://github.com/LINs-lab/ERW.

[520] Aligning Constraint Generation with Design Intent in Parametric CAD

Evan Casey, Tianyu Zhang, Shu Ishida, William P. McCarthy, John Roger Thompson, Amir Khasahmadi, Joseph George Lambourne, Pradeep Kumar Jayaraman, Karl D. D. Willis

Main category: cs.LG

TL;DR: The paper adapts alignment techniques from reasoning LLMs to generate engineering sketch constraints in CAD models, improving constraint generation to fully-constrain 93% of sketches compared to baselines.

DetailsMotivation: Current CAD design generation lacks alignment with design intent, termed 'design alignment,' which is critical for predictable geometry updates.

Method: The approach uses alignment techniques to train a constraint generation model with feedback from a constraint solver.

Result: The method fully-constrains 93% of sketches, outperforming naive supervised fine-tuning (34%) and no fine-tuning (8.9%).

Conclusion: The work bridges alignment strategies between language and design domains, enabling further research in generative CAD models.

Abstract: We adapt alignment techniques from reasoning LLMs to the task of generating engineering sketch constraints found in computer-aided design (CAD) models. Engineering sketches consist of geometric primitives (e.g. points, lines) connected by constraints (e.g. perpendicular, tangent) that define the relationships between them. For a design to be easily editable, the constraints must effectively capture design intent, ensuring the geometry updates predictably when parameters change. Although current approaches can generate CAD designs, an open challenge remains to align model outputs with design intent, we label this problem ‘design alignment’. A critical first step towards aligning generative CAD models is to generate constraints which fully-constrain all geometric primitives, without over-constraining or distorting sketch geometry. Using alignment techniques to train an existing constraint generation model with feedback from a constraint solver, we are able to fully-constrain 93% of sketches compared to 34% when using a naive supervised fine-tuning (SFT) baseline and only 8.9% without SFT. Our approach can be applied to any existing constraint generation model and sets the stage for further research bridging alignment strategies between the language and design domains. Additional results can be found at https://autodeskailab.github.io/aligning-constraint-generation/.

[521] NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models

Lawrence Liu, Inesh Chakrabarti, Yixiao Li, Mengdi Wang, Tuo Zhao, Lin F. Yang

Main category: cs.LG

TL;DR: NoWag is a framework for compressing large language models (LLMs) with zero-shot shape-preserving methods, outperforming state-of-the-art in vector quantization and pruning.

DetailsMotivation: LLMs have high computational and memory demands, limiting deployment in resource-constrained environments.

Method: Proposes NoWag, a unified framework for zero-shot compression, tested on Llama-2 and Llama-3 models using vector quantization (NoWag-VQ) and pruning (NoWag-P).

Result: NoWag-VQ outperforms state-of-the-art VQ; NoWag-P is competitive with top pruning methods.

Conclusion: NoWag shows promise for LLM compression, with potential for future work inspired by commonalities between compression paradigms.

Abstract: Large language models (LLMs) exhibit remarkable performance across various natural language processing tasks but suffer from immense computational and memory demands, limiting their deployment in resource-constrained environments. To address this challenge, we propose NoWag: (Normalized Weight and Activation Guided Compression), a unified framework for zero-shot shape preserving compression algorithms. We compressed Llama-2 7B/13B/70B and Llama-3 8/70BB models, using two popular forms of shape-preserving compression, vector quantization NoWag-VQ (NoWag for Vector Quantization), and unstructured/semi-structured pruning NoWag-P (NoWag for Pruning). We found that NoWag-VQ significantly outperforms state-of-the-art zero shot VQ, and that NoWag-P performs competitively against state-of-the-art methods. These results suggest commonalities between these compression paradigms that could inspire future work. Our code is available at https://github.com/LawrenceRLiu/NoWag

[522] Low-Bit Integerization of Vision Transformers using Operand Reordering for Efficient Hardware

Ching-Yi Lin, Sahil Shah

Main category: cs.LG

TL;DR: The paper proposes an integerization process for pre-trained vision transformers to reduce computational overhead by delaying dequantization, enabling efficient low-bit inference.

DetailsMotivation: Pre-trained vision transformers face high computational and memory costs, even with quantization, due to dequantization overhead.

Method: The method involves operation reordering to delay dequantization, enabling integerized matrix multiplication and linear modules directly on quantized inputs.

Result: Experiments show reduced per-PE power consumption for linear layers and matrix multiplication, improving efficiency.

Conclusion: The approach bridges the gap between quantized models and efficient inference, validating its effectiveness.

Abstract: Pre-trained vision transformers have achieved remarkable performance across various visual tasks but suffer from expensive computational and memory costs. While model quantization reduces memory usage by lowering precision, these models still incur significant computational overhead due to the dequantization before matrix operations. In this work, we analyze the computation graph and propose an integerization process based on operation reordering. Specifically, the process delays dequantization until after matrix operations. This enables integerized matrix multiplication and linear module by directly processing the quantized input. To validate our approach, we synthesize the self-attention module of ViT on a systolic array-based hardware. Experimental results show that our low-bit inference reduces per-PE power consumption for linear layer and matrix multiplication, bridging the gap between quantized models and efficient inference.

[523] GRILL: Gradient Signal Restoration in Ill-Conditioned Layers to Enhance Adversarial Attacks on Autoencoders

Chethan Krishnamurthy Ramanaik, Arjun Roy, Tobias Callies, Eirini Ntoutsi

Main category: cs.LG

TL;DR: GRILL enhances adversarial attacks on autoencoders by addressing gradient vanishing in ill-conditioned layers, improving attack effectiveness.

DetailsMotivation: Adversarial robustness of autoencoders is underexplored, and existing attacks often fail due to gradient vanishing in ill-conditioned layers.

Method: Introduces GRILL to restore gradient signals in ill-conditioned layers, optimizing norm-bounded adversarial perturbations.

Result: GRILL significantly improves attack effectiveness across various autoencoder architectures and attack settings.

Conclusion: GRILL enables more rigorous evaluation of autoencoder robustness by overcoming gradient vanishing issues.

Abstract: Adversarial robustness of deep autoencoders (AEs) remains relatively unexplored, even though their non-invertible nature poses distinct challenges. Existing attack algorithms during the optimization of imperceptible, norm-bounded adversarial perturbations to maximize output damage in AEs, often stop at sub-optimal attacks. We observe that the adversarial loss gradient vanishes when backpropagated through ill-conditioned layers. This issue arises from near-zero singular values in the Jacobians of these layers, which weaken the gradient signal during optimization. We introduce GRILL, a technique that locally restores gradient signals in ill-conditioned layers, enabling more effective norm-bounded attacks. Through extensive experiments on different architectures of popular AEs, under both sample-specific and universal attack setups, and across standard and adaptive attack settings, we show that our method significantly increases the effectiveness of our adversarial attacks, enabling a more rigorous evaluation of AE robustness.

[524] Localization of Impacts on Thin-Walled Structures by Recurrent Neural Networks: End-to-end Learning from Real-World Data

Alexander Humer, Lukas Grasboeck, Ayech Benjeddou

Main category: cs.LG

TL;DR: The paper explores using recurrent neural networks (RNNs) with Gated Recurrent Units (GRUs) to localize impacts on shell-like structures from sensor data, achieving high accuracy with experimental training data.

DetailsMotivation: Impact localization is critical for structural health monitoring (SHM), but conventional methods struggle with dispersive Lamb waves. Neural networks offer a promising alternative.

Method: Proposes GRU-based RNNs to process long sequences of sensor data for end-to-end impact localization. Uses experimental data collected via automated robot impacts on an aluminum plate.

Result: Demonstrates remarkable accuracy in impact position estimation, even with a small dataset, validating the approach.

Conclusion: The GRU-based RNN method is effective for impact localization, leveraging experimental data to bridge the reality gap in training.

Abstract: Today, machine learning is ubiquitous, and structural health monitoring (SHM) is no exception. Specifically, we address the problem of impact localization on shell-like structures, where knowledge of impact locations aids in assessing structural integrity. Impacts on thin-walled structures excite Lamb waves, which can be measured with piezoelectric sensors. Their dispersive characteristics make it difficult to detect and localize impacts by conventional methods. In the present contribution, we explore the localization of impacts using neural networks. In particular, we propose to use recurrent neural networks (RNNs) to estimate impact positions end-to-end, i.e., directly from sequential sensor data. We deal with comparatively long sequences of thousands of samples, since high sampling rate are needed to accurately capture elastic waves. For this reason, the proposed approach builds upon Gated Recurrent Units (GRUs), which are less prone to vanishing gradients as compared to conventional RNNs. Quality and quantity of data are crucial when training neural networks. Often, synthetic data is used, which inevitably introduces a reality gap. Here, by contrast, we train our networks using physical data from experiments, which requires automation to handle the large number of experiments needed. For this purpose, a robot is used to drop steel balls onto an aluminum plate equipped with piezoceramic sensors. Our results show remarkable accuracy in estimating impact positions, even with a comparatively small dataset.

[525] Byte Pair Encoding for Efficient Time Series Forecasting

Leon Götz, Marcel Kollovieh, Stephan Günnemann, Leo Schwinn

Main category: cs.LG

TL;DR: Proposes a pattern-centric tokenization for time series using frequent motifs, improving efficiency and performance.

DetailsMotivation: Existing tokenization methods are inflexible, generating excessive tokens for simple patterns, leading to computational overhead.

Method: Uses a discrete vocabulary of frequent motifs to merge samples into tokens, with conditional decoding for optimization.

Result: Improves forecasting by 36%, efficiency by 1990%, and reduces MSE by up to 44%.

Conclusion: The method adapts to diverse patterns, generalizes well, and captures meaningful time series properties.

Abstract: Existing time series tokenization methods predominantly encode a constant number of samples into individual tokens. This inflexible approach can generate excessive tokens for even simple patterns like extended constant values, resulting in substantial computational overhead. Inspired by the success of byte pair encoding, we propose the first pattern-centric tokenization scheme for time series analysis. Based on a discrete vocabulary of frequent motifs, our method merges samples with underlying patterns into tokens, compressing time series adaptively. Exploiting our finite set of motifs and the continuous properties of time series, we further introduce conditional decoding as a lightweight yet powerful post-hoc optimization method, which requires no gradient computation and adds no computational overhead. On recent time series foundation models, our motif-based tokenization improves forecasting performance by 36% and boosts efficiency by 1990% on average. Conditional decoding further reduces MSE by up to 44%. In an extensive analysis, we demonstrate the adaptiveness of our tokenization to diverse temporal patterns, its generalization to unseen data, and its meaningful token representations capturing distinct time series properties, including statistical moments and trends.

[526] Enhancing Certified Robustness via Block Reflector Orthogonal Layers and Logit Annealing Loss

Bo-Han Lai, Pin-Han Huang, Bo-Han Kung, Shang-Tse Chen

Main category: cs.LG

TL;DR: The paper introduces a Block Reflector Orthogonal (BRO) layer and a new loss function to enhance Lipschitz neural networks, achieving state-of-the-art certified robustness.

DetailsMotivation: To improve the expressiveness and certified robustness of Lipschitz neural networks.

Method: Proposes a BRO layer for better orthogonal layer construction and a new loss function with an annealing mechanism.

Result: BRONet outperforms existing baselines on CIFAR-10/100, Tiny-ImageNet, and ImageNet.

Conclusion: The BRO layer and new loss function significantly enhance Lipschitz networks, validated by extensive experiments.

Abstract: Lipschitz neural networks are well-known for providing certified robustness in deep learning. In this paper, we present a novel, efficient Block Reflector Orthogonal (BRO) layer that enhances the capability of orthogonal layers on constructing more expressive Lipschitz neural architectures. In addition, by theoretically analyzing the nature of Lipschitz neural networks, we introduce a new loss function that employs an annealing mechanism to increase margin for most data points. This enables Lipschitz models to provide better certified robustness. By employing our BRO layer and loss function, we design BRONet - a simple yet effective Lipschitz neural network that achieves state-of-the-art certified robustness. Extensive experiments and empirical analysis on CIFAR-10/100, Tiny-ImageNet, and ImageNet validate that our method outperforms existing baselines. The implementation is available at https://github.com/ntuaislab/BRONet.

[527] Towards Revealing the Effectiveness of Small-Scale Fine-tuning in R1-style Reinforcement Learning

Yutong Chen, Jiandong Gao, Ji Wu

Main category: cs.LG

TL;DR: Re-distillation, a technique sampling from RL-trained policies, improves small-scale SFT efficiency, matching RL performance with fewer samples and computation.

DetailsMotivation: To clarify the unclear mechanism behind rule-based RL and improve the efficiency of small-scale SFT.

Method: Propose an analytical framework to compare SFT and RL efficiency, then introduce Re-distillation to enhance SFT by leveraging RL-trained policies.

Result: Re-distillation achieves RL performance with fewer samples (e.g., surpassing DeepSeek-V3-0324 with 1K SFT samples on K&K dataset).

Conclusion: Re-distillation efficiently balances RL goals, explains R1-style RL phenomena, and offers a practical improvement for SFT efficiency.

Abstract: R1-style Reinforcement Learning (RL) significantly enhances Large Language Models’ reasoning capabilities, yet the mechanism behind rule-based RL remains unclear. We found that small-scale SFT has substantial influence on RL but shows poor efficiency. To explain our observations, we propose an analytical framework and compare the efficiency of SFT and RL by measuring \textbf{sample effect}. Our hypothetical analysis shows the potential to improve SFT efficiency. Guided by our analysis, we propose \textbf{Re-distillation}, a technique that aims to boost the effectiveness of small-scale distillation by sampling from the RL-trained policy. Re-distillation shows consistent surprising efficiency on three datasets and both Qwen&Llama models: Re-distilled models matched RL performance with far fewer samples and less computation. As a result, on K&K dataset, our re-distilled Qwen-2.5-1.5B model surpasses DeepSeek-V3-0324 with only 1K SFT samples. We demonstrate that re-distillation can be used to efficiently balance multiple goals in RL. Our work explains several interesting phenomena in R1-style RL, shedding light on the mechanisms behind its empirical success. Code is available at: https://github.com/on1262/deep-reasoning.

[528] Learning Fluid-Structure Interaction Dynamics with Physics-Informed Neural Networks and Immersed Boundary Methods

Afrah Farea, Saiful Khan, Reza Daryani, Emre Cenk Ersan, Mustafa Serdar Celebi

Main category: cs.LG

TL;DR: Combining PINNs with IBM for FSI problems, two architectures (Single-FSI and Eulerian-Lagrangian) are tested. Eulerian-Lagrangian with adaptive B-spline activation performs best, though pressure recovery remains a challenge.

DetailsMotivation: To improve fluid-structure interaction (FSI) problem-solving by integrating physics-informed neural networks (PINNs) with the immersed boundary method (IBM).

Method: Two architectures: Single-FSI (unified parameter space) and Eulerian-Lagrangian (separate parameter spaces). Tested with Tanh and adaptive B-spline activation functions on a 2D cavity flow problem.

Result: Eulerian-Lagrangian architecture outperforms, especially with adaptive B-spline activation. Velocity field prediction is accurate, but pressure recovery is challenging.

Conclusion: Domain-specific design and adaptive activation functions are crucial for FSI modeling in PINNs, though further work is needed for pressure recovery.

Abstract: We introduce neural network architectures that combine physics-informed neural networks (PINNs) with the immersed boundary method (IBM) to solve fluid-structure interaction (FSI) problems. Our approach features two distinct architectures: a Single-FSI network with a unified parameter space, and an innovative Eulerian-Lagrangian network that maintains separate parameter spaces for fluid and structure domains. We study each architecture using standard Tanh and adaptive B-spline activation functions. Empirical studies on a 2D cavity flow problem involving a moving solid structure show that the Eulerian-Lagrangian architecture performs significantly better. The adaptive B-spline activation further enhances accuracy by providing locality-aware representation near boundaries. While our methodology shows promising results in predicting the velocity field, pressure recovery remains challenging due to the absence of explicit force-coupling constraints in the current formulation. Our findings underscore the importance of domain-specific architectural design and adaptive activation functions for modeling FSI problems within the PINN framework.

[529] SALAD: Systematic Assessment of Machine Unlearning on LLM-Aided Hardware Design

Zeng Wang, Minghao Shao, Rupesh Karn, Likhitha Mankali, Jitendra Bhandari, Ramesh Karri, Ozgur Sinanoglu, Muhammad Shafique, Johann Knechtel

Main category: cs.LG

TL;DR: SALAD uses machine unlearning to address data security risks in LLM-aided hardware design, such as contamination, IP leakage, and malicious code, without full retraining.

DetailsMotivation: LLMs in hardware design automation pose security risks like data contamination, IP leakage, and malicious code generation.

Method: Introduces SALAD, leveraging machine unlearning to selectively remove contaminated or sensitive data from pre-trained LLMs.

Result: Case studies show SALAD effectively mitigates security risks in LLM-aided hardware design.

Conclusion: Machine unlearning is a viable solution for enhancing data security in LLM-driven hardware design automation.

Abstract: Large Language Models (LLMs) offer transformative capabilities for hardware design automation, particularly in Verilog code generation. However, they also pose significant data security challenges, including Verilog evaluation data contamination, intellectual property (IP) design leakage, and the risk of malicious Verilog generation. We introduce SALAD, a comprehensive assessment that leverages machine unlearning to mitigate these threats. Our approach enables the selective removal of contaminated benchmarks, sensitive IP and design artifacts, or malicious code patterns from pre-trained LLMs, all without requiring full retraining. Through detailed case studies, we demonstrate how machine unlearning techniques effectively reduce data security risks in LLM-aided hardware design.

[530] NeuralOM: Neural Ocean Model for Subseasonal-to-Seasonal Simulation

Yuan Gao, Ruiqi Shu, Hao Wu, Fan Xu, Yanfei Xiang, Ruijian Gou, Qingsong Wen, Xian Wu, Kun Wang, Xiaomeng Huang

Main category: cs.LG

TL;DR: NeuralOM is a neural operator framework for simulating slow-changing physical systems, addressing error accumulation with progressive refinement and physics-guided modeling, outperforming baselines in accuracy and stability.

DetailsMotivation: Traditional autoregressive models fail in long-term simulations due to error accumulation, necessitating a new approach for slow-changing systems like oceans and climate.

Method: NeuralOM uses a Progressive Residual Correction Framework for error suppression and a Physics-Guided Graph Network for multi-scale physical interactions.

Result: NeuralOM achieves 13.3% lower RMSE at 60-day lead time and excels in simulating extreme events, outperforming state-of-the-art models.

Conclusion: NeuralOM provides a stable, efficient, and physically-aware solution for data-driven scientific computing in slow-changing systems.

Abstract: Long-term, high-fidelity simulation of slow-changing physical systems, such as the ocean and climate, presents a fundamental challenge in scientific computing. Traditional autoregressive machine learning models often fail in these tasks as minor errors accumulate and lead to rapid forecast degradation. To address this problem, we propose NeuralOM, a general neural operator framework designed for simulating complex, slow-changing dynamics. NeuralOM’s core consists of two key innovations: (1) a Progressive Residual Correction Framework that decomposes the forecasting task into a series of fine-grained refinement steps, effectively suppressing long-term error accumulation; and (2) a Physics-Guided Graph Network whose built-in adaptive messaging mechanism explicitly models multi-scale physical interactions, such as gradient-driven flows and multiplicative couplings, thereby enhancing physical consistency while maintaining computational efficiency. We validate NeuralOM on the challenging task of global Subseasonal-to-Seasonal (S2S) ocean simulation. Extensive experiments demonstrate that NeuralOM not only surpasses state-of-the-art models in forecast accuracy and long-term stability, but also excels in simulating extreme events. For instance, at a 60-day lead time, NeuralOM achieves a 13.3% lower RMSE compared to the best-performing baseline, offering a stable, efficient, and physically-aware paradigm for data-driven scientific computing. Code link: https://github.com/YuanGao-YG/NeuralOM.

[531] Comprehensive Attribute Encoding and Dynamic LSTM HyperModels for Outcome Oriented Predictive Business Process Monitoring

Fang Wang, Paolo Ceravolo, Ernesto Damiani

Main category: cs.LG

TL;DR: The paper proposes dynamic LSTM HyperModels for Predictive Business Process Monitoring (PBPM) to address challenges like simultaneous events, class imbalance, and multi-level attributes, achieving high accuracy and F1 scores.

DetailsMotivation: Existing PBPM methods lack flexibility for real-world challenges like simultaneous events, class imbalance, and multi-level attributes, limiting their adaptability and generalization.

Method: The authors introduce dynamic LSTM HyperModels with hierarchical encoding, character-based decomposition, pseudo-embedding techniques, and specialized LSTM variants for simultaneous events.

Result: Experiments on four datasets show up to 100% accuracy on balanced datasets and F1 scores over 86% on imbalanced ones.

Conclusion: The approach advances PBPM with modular, interpretable models and contributes to AI by improving temporal prediction, handling data heterogeneity, and promoting explainable frameworks.

Abstract: Predictive Business Process Monitoring (PBPM) aims to forecast future outcomes of ongoing business processes. However, existing methods often lack flexibility to handle real-world challenges such as simultaneous events, class imbalance, and multi-level attributes. While prior work has explored static encoding schemes and fixed LSTM architectures, they struggle to support adaptive representations and generalize across heterogeneous datasets. To address these limitations, we propose a suite of dynamic LSTM HyperModels that integrate two-level hierarchical encoding for event and sequence attributes, character-based decomposition of event labels, and novel pseudo-embedding techniques for durations and attribute correlations. We further introduce specialized LSTM variants for simultaneous event modeling, leveraging multidimensional embeddings and time-difference flag augmentation. Experimental validation on four public and real-world datasets demonstrates up to 100% accuracy on balanced datasets and F1 scores exceeding 86% on imbalanced ones. Our approach advances PBPM by offering modular and interpretable models better suited for deployment in complex settings. Beyond PBPM, it contributes to the broader AI community by improving temporal outcome prediction, supporting data heterogeneity, and promoting explainable process intelligence frameworks.

[532] Reinforcing VLMs to Use Tools for Detailed Visual Reasoning Under Resource Constraints

Sunil Kumar, Bowen Zhao, Leo Dirac, Paulina Varshavskaya

Main category: cs.LG

TL;DR: Smaller VLMs trained with GRPO and external tools like zoom outperform baselines in VQA tasks by leveraging detailed visual reasoning.

DetailsMotivation: Addressing the limitation of VLMs in detailed visual reasoning under constrained compute resources.

Method: Training smaller models with GRPO, a simple reward structure, a simplified tool-calling interface, and a data mix favoring visually difficult examples.

Result: Improved performance on VQA tasks due to better visual information from external tools.

Conclusion: Combining GRPO with external tools enhances VLM performance in detailed visual reasoning.

Abstract: Despite tremendous recent advances in large model reasoning ability, vision-language models (VLMs) still struggle with detailed visual reasoning, especially when compute resources are limited. To address this challenge, we draw inspiration from methods like Deepseek-r1 for VLMs and train smaller-scale models with Group Relative Policy Optimization (GRPO) to use external tools such as zoom. The greatest benefit is obtained with a combination of GRPO learning, a simple reward structure, a simplified tool-calling interface, allocating additional tokens to the result of the tool call, and a training data mix that over-represents visually difficult examples. Compared to similarly-sized baseline models, our method achieves better performance on some visual question-answering (VQA) tasks, thanks to the detailed visual information gathered from the external tool.

[533] ProARD: progressive adversarial robustness distillation: provide wide range of robust students

Seyedhamidreza Mousavi, Seyedali Mousavi, Masoud Daneshtalab

Main category: cs.LG

TL;DR: ProARD enables efficient one-time training of a dynamic network for diverse robust student networks, avoiding retraining costs.

DetailsMotivation: Current ARD methods require retraining for each student, leading to high computational costs and CO2 emissions.

Method: ProARD uses a dynamic network with weight-sharing and a sampling mechanism to optimize a teacher and internal students.

Result: Random sampling fails; ProARD aims to efficiently produce accurate and robust students.

Conclusion: ProARD offers a scalable solution for training diverse robust networks without retraining.

Abstract: Adversarial Robustness Distillation (ARD) has emerged as an effective method to enhance the robustness of lightweight deep neural networks against adversarial attacks. Current ARD approaches have leveraged a large robust teacher network to train one robust lightweight student. However, due to the diverse range of edge devices and resource constraints, current approaches require training a new student network from scratch to meet specific constraints, leading to substantial computational costs and increased CO2 emissions. This paper proposes Progressive Adversarial Robustness Distillation (ProARD), enabling the efficient one-time training of a dynamic network that supports a diverse range of accurate and robust student networks without requiring retraining. We first make a dynamic deep neural network based on dynamic layers by encompassing variations in width, depth, and expansion in each design stage to support a wide range of architectures. Then, we consider the student network with the largest size as the dynamic teacher network. ProARD trains this dynamic network using a weight-sharing mechanism to jointly optimize the dynamic teacher network and its internal student networks. However, due to the high computational cost of calculating exact gradients for all the students within the dynamic network, a sampling mechanism is required to select a subset of students. We show that random student sampling in each iteration fails to produce accurate and robust students.

[534] Enhancing Spectral Graph Neural Networks with LLM-Predicted Homophily

Kangkang Lu, Yanhua Yu, Zhiyong Huang, Tat-Seng Chua

Main category: cs.LG

TL;DR: A framework using LLMs to estimate graph homophily and guide spectral filter construction in SGNNs, improving performance with minimal cost.

DetailsMotivation: SGNNs struggle with limited labeled data and heterophilic graphs. LLMs offer a way to enhance graph learning without structural changes.

Method: LLMs predict graph homophily from labeled node pairs formatted as prompts, guiding spectral filter construction in SGNNs.

Result: The framework improves SGNN performance on benchmark datasets, especially for heterophilic graphs, with negligible cost.

Conclusion: LLM-assisted spectral filters enhance SGNN adaptability, offering a practical solution for real-world graph tasks.

Abstract: Spectral Graph Neural Networks (SGNNs) have achieved remarkable performance in tasks such as node classification due to their ability to learn flexible filters. Typically, these filters are learned under the supervision of downstream tasks, enabling SGNNs to adapt to diverse structural patterns. However, in scenarios with limited labeled data, SGNNs often struggle to capture the optimal filter shapes, resulting in degraded performance, especially on graphs with heterophily. Meanwhile, the rapid progress of Large Language Models (LLMs) has opened new possibilities for enhancing graph learning without modifying graph structure or requiring task-specific training. In this work, we propose a novel framework that leverages LLMs to estimate the homophily level of a graph and uses this global structural prior to guide the construction of spectral filters. Specifically, we design a lightweight and plug-and-play pipeline where a small set of labeled node pairs is formatted as natural language prompts for the LLM, which then predicts the graph’s homophily ratio. This estimated value informs the spectral filter basis, enabling SGNNs to adapt more effectively to both homophilic and heterophilic structures. Extensive experiments on multiple benchmark datasets demonstrate that our LLM-assisted spectral framework consistently improves performance over strong SGNN baselines. Importantly, this enhancement incurs negligible computational and monetary cost, making it a practical solution for real-world graph applications.

[535] S2FGL: Spatial Spectral Federated Graph Learning

Zihan Tan, Suyuan Huang, Guancheng Wan, Wenke Huang, He Li, Mang Ye

Main category: cs.LG

TL;DR: S2FGL addresses spatial and spectral challenges in Federated Graph Learning by combining a global knowledge repository and frequency alignment.

DetailsMotivation: Current subgraph-FL methods neglect signal propagation in spatial and spectral domains, leading to degraded global GNN performance and spectral client drift.

Method: Proposes a global knowledge repository for semantic knowledge and frequency alignment to mitigate spectral heterogeneity.

Result: S2FGL outperforms on multiple datasets, demonstrating effectiveness.

Conclusion: S2FGL successfully tackles spatial and spectral challenges in FGL, improving global generalizability.

Abstract: Federated Graph Learning (FGL) combines the privacy-preserving capabilities of federated learning (FL) with the strong graph modeling capability of Graph Neural Networks (GNNs). Current research addresses subgraph-FL from the structural perspective, neglecting the propagation of graph signals on spatial and spectral domains of the structure. From a spatial perspective, subgraph-FL introduces edge disconnections between clients, leading to disruptions in label signals and a degradation in the semantic knowledge of the global GNN. From a spectral perspective, spectral heterogeneity causes inconsistencies in signal frequencies across subgraphs, which makes local GNNs overfit the local signal propagation schemes. As a result, spectral client drift occurs, undermining global generalizability. To tackle the challenges, we propose a global knowledge repository to mitigate the challenge of poor semantic knowledge caused by label signal disruption. Furthermore, we design a frequency alignment to address spectral client drift. The combination of Spatial and Spectral strategies forms our framework S2FGL. Extensive experiments on multiple datasets demonstrate the superiority of S2FGL. The code is available at https://github.com/Wonder7racer/S2FGL.git.

[536] Unraveling the Black-box Magic: An Analysis of Neural Networks’ Dynamic Extrema

Shengjian Chen

Main category: cs.LG

TL;DR: Neural networks are not black boxes; their generalization comes from dynamic mapping to model extrema. A new algorithm solves linear equations for parameters, addressing gradient vanishing and overfitting.

DetailsMotivation: To clarify neural networks' generalization and propose a non-backpropagation method to handle issues like gradient vanishing and overfitting.

Method: Proposes a new algorithm solving linear equations for parameter values, differing from backpropagation.

Result: Shows a positive correlation between extrema and parameters, and effectively addresses gradient vanishing and overfitting.

Conclusion: The new algorithm provides a viable alternative to backpropagation, with better handling of common neural network issues.

Abstract: We point out that neural networks are not black boxes, and their generalization stems from the ability to dynamically map a dataset to the extrema of the model function. We further prove that the number of extrema in a neural network is positively correlated with the number of its parameters. We then propose a new algorithm that is significantly different from back-propagation algorithm, which mainly obtains the values of parameters by solving a system of linear equations. Some difficult situations, such as gradient vanishing and overfitting, can be reasonably explained and dealt with in this framework.

[537] Detection of Intelligent Tampering in Wireless Electrocardiogram Signals Using Hybrid Machine Learning

Siddhant Deshpande, Yalemzerf Getnet, Waltenegus Dargie

Main category: cs.LG

TL;DR: The paper evaluates CNN, ResNet, and hybrid Transformer-CNN models for ECG tamper detection and Siamese networks for identity verification, achieving high accuracy (up to 100%) in various scenarios.

DetailsMotivation: To protect ECG signal integrity against tampering and improve identity verification in wireless health monitoring systems.

Method: Uses CNN, ResNet, and hybrid Transformer-CNN models for tamper detection and Siamese networks for verification. ECG signals are transformed into 2D time-frequency representations using CWT.

Result: Models achieved over 99.5% accuracy for fragmented manipulations and 98-100% for subtle manipulations and identity verification.

Conclusion: Hybrid models, especially FeatCNN-TranCNN and CNN-Transformer Siamese, are highly effective for ECG tamper detection and identity verification.

Abstract: With the proliferation of wireless electrocardiogram (ECG) systems for health monitoring and authentication, protecting signal integrity against tampering is becoming increasingly important. This paper analyzes the performance of CNN, ResNet, and hybrid Transformer-CNN models for tamper detection. It also evaluates the performance of a Siamese network for ECG based identity verification. Six tampering strategies, including structured segment substitutions and random insertions, are emulated to mimic real world attacks. The one-dimensional ECG signals are transformed into a two dimensional representation in the time frequency domain using the continuous wavelet transform (CWT). The models are trained and evaluated using ECG data from 54 subjects recorded in four sessions 2019 to 2025 outside of clinical settings while the subjects performed seven different daily activities. Experimental results show that in highly fragmented manipulation scenarios, CNN, FeatCNN-TranCNN, FeatCNN-Tran and ResNet models achieved an accuracy exceeding 99.5 percent . Similarly, for subtle manipulations (for example, 50 percent from A and 50 percent from B and, 75 percent from A and 25 percent from B substitutions) our FeatCNN-TranCNN model demonstrated consistently reliable performance, achieving an average accuracy of 98 percent . For identity verification, the pure Transformer-Siamese network achieved an average accuracy of 98.30 percent . In contrast, the hybrid CNN-Transformer Siamese model delivered perfect verification performance with 100 percent accuracy.

[538] Divide-Then-Rule: A Cluster-Driven Hierarchical Interpolator for Attribute-Missing Graphs

Yaowen Hu, Wenxuan Tu, Yue Liu, Miaomiao Li, Wenpeng Lu, Zhigang Luo, Xinwang Liu, Ping Chen

Main category: cs.LG

TL;DR: DTRGC is a novel method for deep graph clustering in attribute-missing graphs, addressing imputation challenges by iteratively refining node attributes and leveraging clustering information for better results.

DetailsMotivation: Existing imputation methods for attribute-missing graphs often ignore varying neighborhood information, leading to unreliable clustering results, especially for nodes with insufficient known data.

Method: DTRGC uses Dynamic Cluster-Aware Feature Propagation (DCFP) for initial imputation, Hierarchical Neighborhood-aware Imputation (HNAI) for grouping nodes, and Hop-wise Representation Enhancement (HRE) to enrich node representations.

Result: Experiments on six datasets show DTRGC significantly improves clustering performance for attribute-missing graphs.

Conclusion: DTRGC effectively addresses imputation challenges in attribute-missing graphs, enhancing clustering accuracy by leveraging iterative refinement and clustering-aware techniques.

Abstract: Deep graph clustering (DGC) for attribute-missing graphs is an unsupervised task aimed at partitioning nodes with incomplete attributes into distinct clusters. Addressing this challenging issue is vital for practical applications. However, research in this area remains underexplored. Existing imputation methods for attribute-missing graphs often fail to account for the varying amounts of information available across node neighborhoods, leading to unreliable results, especially for nodes with insufficient known neighborhood. To address this issue, we propose a novel method named Divide-Then-Rule Graph Completion (DTRGC). This method first addresses nodes with sufficient known neighborhood information and treats the imputed results as new knowledge to iteratively impute more challenging nodes, while leveraging clustering information to correct imputation errors. Specifically, Dynamic Cluster-Aware Feature Propagation (DCFP) initializes missing node attributes by adjusting propagation weights based on the clustering structure. Subsequently, Hierarchical Neighborhood-aware Imputation (HNAI) categorizes attribute-missing nodes into three groups based on the completeness of their neighborhood attributes. The imputation is performed hierarchically, prioritizing the groups with nodes that have the most available neighborhood information. The cluster structure is then used to refine the imputation and correct potential errors. Finally, Hop-wise Representation Enhancement (HRE) integrates information across multiple hops, thereby enriching the expressiveness of node representations. Experimental results on six widely used graph datasets show that DTRGC significantly improves the clustering performance of various DGC methods under attribute-missing graphs.

[539] Leveraging Distribution Matching to Make Approximate Machine Unlearning Faster

Junaid Iqbal Khan

Main category: cs.LG

TL;DR: The paper introduces two methods, Blend and A-AMU, to speed up approximate machine unlearning (AMU) by reducing retained dataset size and accelerating convergence, while maintaining model utility and privacy.

DetailsMotivation: Current AMU methods are computationally expensive due to large retained datasets and slow convergence, prompting the need for efficiency improvements.

Method: 1. Blend: A dataset condensation technique merging similar images to reduce retained set size. 2. A-AMU: A loss-centric method combining steepened loss for faster forgetting and a regularizer matching loss distributions.

Result: The dual approach significantly reduces unlearning latency in single and multi-round scenarios without compromising utility or privacy.

Conclusion: This work is the first to systematically improve unlearning efficiency through combined dataset condensation and accelerated loss function design.

Abstract: Approximate machine unlearning (AMU) enables models to `forget’ specific training data through specialized fine-tuning on a retained (and forget) subset of training set. However, processing this large retained subset still dominates computational runtime, while reductions of unlearning epochs also remain a challenge. In this paper, we propose two complementary methods to accelerate arbitrary classification-oriented AMU method. First, \textbf{Blend}, a novel distribution-matching dataset condensation (DC), merges visually similar images with shared blend-weights to significantly reduce the retained set size. It operates with minimal pre-processing overhead and is orders of magnitude faster than state-of-the-art DC methods. Second, our loss-centric method, \textbf{Accelerated-AMU (A-AMU)}, augments the AMU objective to quicken convergence. A-AMU achieves this by combining a steepened primary loss to expedite forgetting with a differentiable regularizer that matches the loss distributions of forgotten and in-distribution unseen data. Our extensive experiments demonstrate that this dual approach of data and loss-centric optimization dramatically reduces end-to-end unlearning latency across both single and multi-round scenarios, all while preserving model utility and privacy. To our knowledge, this is the first work to systematically tackle unlearning efficiency by jointly designing a specialized dataset condensation technique with a dedicated accelerated loss function. Code is available at https://github.com/algebraicdianuj/DC_Unlearning.

[540] AdaBrain-Bench: Benchmarking Brain Foundation Models for Brain-Computer Interface Applications

Jiamin Wu, Zichen Ren, Junyu Wang, Pengyu Zhu, Yonghao Song, Mianxin Liu, Qihao Zheng, Lei Bai, Wanli Ouyang, Chunfeng Song

Main category: cs.LG

TL;DR: AdaBrain-Bench is introduced as a standardized benchmark to evaluate brain foundation models in non-invasive BCI tasks, addressing the lack of comprehensive benchmarks in the field.

DetailsMotivation: The high noise and limited task-specific data in non-invasive BCI signals constrain decoding capabilities, and current brain foundation models lack practical benchmarks for widespread adoption.

Method: AdaBrain-Bench includes diverse BCI datasets, a task adaptation pipeline, multi-dimensional metrics, and adaptation tools to assess model generalizability across transfer settings.

Result: The benchmark evaluates public brain foundation models, providing insights for model selection in various scenarios.

Conclusion: AdaBrain-Bench offers a reproducible and evolving platform to advance robust and generalized neural decoding solutions.

Abstract: Non-invasive Brain-Computer Interfaces (BCI) offer a safe and accessible means of connecting the human brain to external devices, with broad applications in home and clinical settings to enhance human capabilities. However, the high noise level and limited task-specific data in non-invasive signals constrain decoding capabilities. Recently, the adoption of self-supervised pre-training is transforming the landscape of non-invasive BCI research, enabling the development of brain foundation models to capture generic neural representations from large-scale unlabeled electroencephalography (EEG) signals with substantial noises. However, despite these advances, the field currently lacks comprehensive, practical and extensible benchmarks to assess the utility of the public foundation models across diverse BCI tasks, hindering their widespread adoption. To address this challenge, we present AdaBrain-Bench, a large-scale standardized benchmark to systematically evaluate brain foundation models in widespread non-invasive BCI tasks. AdaBrain-Bench encompasses a diverse collection of representative BCI decoding datasets spanning 7 key applications. It introduces a streamlined task adaptation pipeline integrated with multi-dimensional evaluation metrics and a set of adaptation tools. The benchmark delivers an inclusive framework for assessing generalizability of brain foundation models across key transfer settings, including cross-subject, multi-subject, and few-shot scenarios. We leverage AdaBrain-Bench to evaluate a suite of publicly available brain foundation models and offer insights into practices for selecting appropriate models in various scenarios. We make our benchmark pipeline available to enable reproducible research and external use, offering a continuously evolving platform to foster progress toward robust and generalized neural decoding solutions.

[541] P3SL: Personalized Privacy-Preserving Split Learning on Heterogeneous Edge Devices

Wei Fan, JinYi Yoon, Xiaochang Li, Huajie Shao, Bo Ji

Main category: cs.LG

TL;DR: P3SL is a framework for personalized, privacy-preserving split learning in heterogeneous edge environments, addressing resource constraints and privacy needs without sharing sensitive data.

DetailsMotivation: Existing split learning frameworks neglect personalized privacy and local model customization in heterogeneous edge environments.

Method: P3SL introduces a personalized sequential split learning pipeline and bi-level optimization for clients to determine optimal split points privately.

Result: The framework balances energy consumption, privacy risks, and model accuracy, validated on a diverse testbed.

Conclusion: P3SL effectively addresses heterogeneity and privacy in split learning for edge devices.

Abstract: Split Learning (SL) is an emerging privacy-preserving machine learning technique that enables resource constrained edge devices to participate in model training by partitioning a model into client-side and server-side sub-models. While SL reduces computational overhead on edge devices, it encounters significant challenges in heterogeneous environments where devices vary in computing resources, communication capabilities, environmental conditions, and privacy requirements. Although recent studies have explored heterogeneous SL frameworks that optimize split points for devices with varying resource constraints, they often neglect personalized privacy requirements and local model customization under varying environmental conditions. To address these limitations, we propose P3SL, a Personalized Privacy-Preserving Split Learning framework designed for heterogeneous, resource-constrained edge device systems. The key contributions of this work are twofold. First, we design a personalized sequential split learning pipeline that allows each client to achieve customized privacy protection and maintain personalized local models tailored to their computational resources, environmental conditions, and privacy needs. Second, we adopt a bi-level optimization technique that empowers clients to determine their own optimal personalized split points without sharing private sensitive information (i.e., computational resources, environmental conditions, privacy requirements) with the server. This approach balances energy consumption and privacy leakage risks while maintaining high model accuracy. We implement and evaluate P3SL on a testbed consisting of 7 devices including 4 Jetson Nano P3450 devices, 2 Raspberry Pis, and 1 laptop, using diverse model architectures and datasets under varying environmental conditions.

[542] Large Learning Rates Simultaneously Achieve Robustness to Spurious Correlations and Compressibility

Melih Barsbey, Lucas Prieto, Stefanos Zafeiriou, Tolga Birdal

Main category: cs.LG

TL;DR: High learning rates enable robustness to spurious correlations and network compressibility, while improving feature utilization and activation sparsity.

DetailsMotivation: Achieving joint robustness and resource-efficiency in machine learning models is challenging.

Method: Investigate the role of high learning rates in achieving robustness and compressibility, analyzing feature utilization and activation patterns.

Result: Large learning rates outperform other hyperparameters in robustness, compressibility, and feature properties.

Conclusion: High learning rates effectively address spurious correlations and improve model performance, linked to confident mispredictions of bias-conflicting samples.

Abstract: Robustness and resource-efficiency are two highly desirable properties for modern machine learning models. However, achieving them jointly remains a challenge. In this paper, we identify high learning rates as a facilitator for simultaneously achieving robustness to spurious correlations and network compressibility. We demonstrate that large learning rates also produce desirable representation properties such as invariant feature utilization, class separation, and activation sparsity. Our findings indicate that large learning rates compare favorably to other hyperparameters and regularization methods, in consistently satisfying these properties in tandem. In addition to demonstrating the positive effect of large learning rates across diverse spurious correlation datasets, models, and optimizers, we also present strong evidence that the previously documented success of large learning rates in standard classification tasks is related to addressing hidden/rare spurious correlations in the training dataset. Our investigation of the mechanisms underlying this phenomenon reveals the importance of confident mispredictions of bias-conflicting samples under large learning rates.

[543] FedSA-GCL: A Semi-Asynchronous Federated Graph Learning Framework with Personalized Aggregation and Cluster-Aware Broadcasting

Zhongzheng Yuan, Lianshuai Guo, Xunkai Li, Yinlin Zhu, Wenyu Wang, Meixia Qu

Main category: cs.LG

TL;DR: FedSA-GCL is a semi-asynchronous federated framework for graph learning that addresses inefficiencies of synchronous methods and semantic drift in asynchronous approaches by leveraging inter-client label divergence and graph topology.

DetailsMotivation: Existing federated graph learning (FGL) methods rely on synchronous communication, which is inefficient and impractical. Asynchronous federated learning (AFL) methods ignore graph topology, risking semantic drift.

Method: Proposes FedSA-GCL, a semi-asynchronous framework using ClusterCast to account for label divergence and graph topology. Evaluated on real-world datasets with Louvain and Metis splits.

Result: Outperforms 9 baselines, achieving 2.92% and 3.4% improvements with Louvain and Metis splits, respectively, demonstrating robustness and efficiency.

Conclusion: FedSA-GCL effectively addresses inefficiencies and semantic drift in FGL, offering a robust and efficient solution for distributed graph learning.

Abstract: Federated Graph Learning (FGL) is a distributed learning paradigm that enables collaborative training over large-scale subgraphs located on multiple local systems. However, most existing FGL approaches rely on synchronous communication, which leads to inefficiencies and is often impractical in real-world deployments. Meanwhile, current asynchronous federated learning (AFL) methods are primarily designed for conventional tasks such as image classification and natural language processing, without accounting for the unique topological properties of graph data. Directly applying these methods to graph learning can possibly result in semantic drift and representational inconsistency in the global model. To address these challenges, we propose FedSA-GCL, a semi-asynchronous federated framework that leverages both inter-client label distribution divergence and graph topological characteristics through a novel ClusterCast mechanism for efficient training. We evaluate FedSA-GCL on multiple real-world graph datasets using the Louvain and Metis split algorithms, and compare it against 9 baselines. Extensive experiments demonstrate that our method achieves strong robustness and outstanding efficiency, outperforming the baselines by an average of 2.92% with the Louvain and by 3.4% with the Metis.

[544] A Comprehensive Review of Diffusion Models in Smart Agriculture: Progress, Applications, and Challenges

Xing Hu, Haodong Chen, Qianqian Duan, Choon Ki Ahn, Huiliang Shang, Dawei Zhang Zhang

Main category: cs.LG

TL;DR: Diffusion models outperform GANs in agricultural AI tasks like crop monitoring and pest detection, offering better stability and image quality. They enhance downstream model performance but face challenges in computational efficiency and domain generalization.

DetailsMotivation: The need for sustainable agriculture amid limited arable land and a growing population drives the adoption of AI, particularly diffusion models, for tasks like crop monitoring and pest detection.

Method: The paper reviews diffusion models’ applications in agriculture, focusing on crop disease detection, remote sensing, and resource management, comparing them to traditional GANs.

Result: Diffusion models improve accuracy, robustness, and generalization in agricultural tasks, though challenges like computational efficiency remain.

Conclusion: Diffusion models hold promise for intelligent agriculture, potentially addressing global food security and sustainability issues as the technology evolves.

Abstract: With the global population increasing and arable land resources becoming increasingly limited, smart and precision agriculture have emerged as essential directions for sustainable agricultural development. Artificial intelligence (AI), particularly deep learning models, has been widely adopted in applications such as crop monitoring, pest detection, and yield prediction. Among recent generative models, diffusion models have demonstrated considerable potential in agricultural image processing, data augmentation, and remote sensing analysis. Compared to traditional generative adversarial networks (GANs), diffusion models exhibit greater training stability and superior image generation quality, effectively addressing challenges such as limited annotated datasets and imbalanced sample distributions in agricultural scenarios. This paper reviews recent advancements in the application of diffusion models within agriculture, focusing on their roles in crop disease and pest detection, remote sensing image enhancement, crop growth prediction, and agricultural resource management. Empirical studies show that diffusion models significantly enhance the performance of downstream models by improving accuracy, robustness, and generalization in tasks involving image synthesis, augmentation, and denoising under complex environmental conditions. Despite ongoing challenges in computational efficiency and domain generalization, diffusion models are expected to play an increasingly important role in the future of intelligent agriculture. As the technology continues to evolve, it holds substantial promise for addressing pressing global issues in food security and environmental sustainability.

[545] From Entanglement to Alignment: Representation Space Decomposition for Unsupervised Time Series Domain Adaptation

Rongyao Cai, Ming Jin, Qingsong Wen, Kexin Zhang

Main category: cs.LG

TL;DR: DARSD is a novel UDA framework addressing domain shift in time series by decomposing representations into transferable and non-transferable components, outperforming 12 UDA methods on four benchmarks.

DetailsMotivation: Domain shift in time series causes models trained on source domains to fail in target domains. Current UDA methods treat features as indivisible, ignoring their intrinsic compositions.

Method: DARSD decomposes representations into domain-invariant and domain-specific parts using an adversarial invariant basis, pseudo-labeling, and hybrid contrastive optimization.

Result: DARSD outperforms 12 UDA algorithms, achieving top performance in 35 out of 53 scenarios across four benchmarks.

Conclusion: DARSD effectively addresses domain shift by disentangling transferable knowledge, demonstrating superior performance in time series UDA tasks.

Abstract: Domain shift poses a fundamental challenge in time series analysis, where models trained on source domain often fail dramatically when applied in target domain with different yet similar distributions. While current unsupervised domain adaptation (UDA) methods attempt to align cross-domain feature distributions, they typically treat features as indivisible entities, ignoring their intrinsic compositions that govern domain adaptation. We introduce DARSD, a novel UDA framework with theoretical explainability that explicitly realizes UDA tasks from the perspective of representation space decomposition. Our core insight is that effective domain adaptation requires not just alignment, but principled disentanglement of transferable knowledge from mixed representations. DARSD consists of three synergistic components: (I) An adversarial learnable common invariant basis that projects original features into a domain-invariant subspace while preserving semantic content; (II) A prototypical pseudo-labeling mechanism that dynamically separates target features based on confidence, hindering error accumulation; (III) A hybrid contrastive optimization strategy that simultaneously enforces feature clustering and consistency while mitigating emerging distribution gaps. Comprehensive experiments conducted on four benchmarks (WISDM, HAR, HHAR, and MFD) demonstrate DARSD’s superiority against 12 UDA algorithms, achieving optimal performance in 35 out of 53 scenarios and ranking first across all benchmarks.

[546] HGCN(O): A Self-Tuning GCN HyperModel Toolkit for Outcome Prediction in Event-Sequence Data

Fang Wang, Paolo Ceravolo, Ernesto Damiani

Main category: cs.LG

TL;DR: HGCN(O) is a self-tuning GCN toolkit for event sequence prediction, outperforming traditional methods, especially on unbalanced data.

DetailsMotivation: To improve prediction accuracy and stability in event sequence tasks, especially for unbalanced datasets, using diverse GCN architectures.

Method: Four GCN architectures (O-GCN, T-GCN, TP-GCN, TE-GCN) with varied node/graph attributes and temporal dependencies via edge weights.

Result: GCNConv models excel on unbalanced data; all models perform well on balanced data. HGCN(O) outperforms traditional methods.

Conclusion: HGCN(O) is effective for event sequence prediction, with applications like PBPM, and shows robustness across data types.

Abstract: We propose HGCN(O), a self-tuning toolkit using Graph Convolutional Network (GCN) models for event sequence prediction. Featuring four GCN architectures (O-GCN, T-GCN, TP-GCN, TE-GCN) across the GCNConv and GraphConv layers, our toolkit integrates multiple graph representations of event sequences with different choices of node- and graph-level attributes and in temporal dependencies via edge weights, optimising prediction accuracy and stability for balanced and unbalanced datasets. Extensive experiments show that GCNConv models excel on unbalanced data, while all models perform consistently on balanced data. Experiments also confirm the superior performance of HGCN(O) over traditional approaches. Applications include Predictive Business Process Monitoring (PBPM), which predicts future events or states of a business process based on event logs.

[547] Phase-Locked SNR Band Selection for Weak Mineral Signal Detection in Hyperspectral Imagery

Judy X Yang

Main category: cs.LG

TL;DR: A two-stage framework improves mineral detection in hyperspectral imaging by filtering noisy bands and refining data for better unmixing accuracy.

DetailsMotivation: Weak mineral signatures in hyperspectral imaging are often obscured by noise and redundant bands, limiting detection performance.

Method: The method involves SNR-based band selection, spectral smoothing, KMeans clustering for endmember extraction, and NNLS for abundance unmixing.

Result: The pipeline enhances unmixing accuracy and detection of weak mineral zones, validated by cosine similarity and RMSE metrics.

Conclusion: The two-stage strategy is practical and reproducible for spectral dimensionality reduction and unmixing in geological HSI.

Abstract: Hyperspectral imaging offers detailed spectral information for mineral mapping; however, weak mineral signatures are often masked by noisy and redundant bands, limiting detection performance. To address this, we propose a two-stage integrated framework for enhanced mineral detection in the Cuprite mining district. In the first stage, we compute the signal-to-noise ratio (SNR) for each spectral band and apply a phase-locked thresholding technique to discard low-SNR bands, effectively removing redundancy and suppressing background noise. Savitzky-Golay filtering is then employed for spectral smoothing, serving a dual role first to stabilize trends during band selection, and second to preserve fine-grained spectral features during preprocessing. In the second stage, the refined HSI data is reintroduced into the model, where KMeans clustering is used to extract 12 endmember spectra (W1 custom), followed by non negative least squares (NNLS) for abundance unmixing. The resulting endmembers are quantitatively compared with laboratory spectra (W1 raw) using cosine similarity and RMSE metrics. Experimental results confirm that our proposed pipeline improves unmixing accuracy and enhances the detection of weak mineral zones. This two-pass strategy demonstrates a practical and reproducible solution for spectral dimensionality reduction and unmixing in geological HSI applications.

[548] FGBench: A Dataset and Benchmark for Molecular Property Reasoning at Functional Group-Level in Large Language Models

Xuan Liu, Siru Ouyang, Xianrui Zhong, Jiawei Han, Huimin Zhao

Main category: cs.LG

TL;DR: FGBench introduces a dataset with 625K molecular property reasoning problems, incorporating fine-grained functional group (FG) information to enhance LLMs’ interpretability and reasoning in chemistry tasks.

DetailsMotivation: Existing datasets focus on molecular-level property prediction but lack FG-level data, which is crucial for linking molecular structures with textual descriptions and improving LLMs' reasoning in chemistry.

Method: FGBench provides a dataset with precise FG annotations and localized information, covering 245 functional groups across three reasoning tasks: single FG impacts, multiple FG interactions, and direct molecular comparisons.

Result: Benchmarking on 7K curated data shows current LLMs struggle with FG-level reasoning, indicating a need for improved capabilities.

Conclusion: FGBench’s methodology offers a framework for generating FG-level datasets, advancing LLMs’ understanding of molecular structure-property relationships in chemistry.

Abstract: Large language models (LLMs) have gained significant attention in chemistry. However, most existing datasets center on molecular-level property prediction and overlook the role of fine-grained functional group (FG) information. Incorporating FG-level data can provide valuable prior knowledge that links molecular structures with textual descriptions, which can be used to build more interpretable, structure-aware LLMs for reasoning on molecule-related tasks. Moreover, LLMs can learn from such fine-grained information to uncover hidden relationships between specific functional groups and molecular properties, thereby advancing molecular design and drug discovery. Here, we introduce FGBench, a dataset comprising 625K molecular property reasoning problems with functional group information. Functional groups are precisely annotated and localized within the molecule, which ensures the dataset’s interoperability thereby facilitating further multimodal applications. FGBench includes both regression and classification tasks on 245 different functional groups across three categories for molecular property reasoning: (1) single functional group impacts, (2) multiple functional group interactions, and (3) direct molecular comparisons. In the benchmark of state-of-the-art LLMs on 7K curated data, the results indicate that current LLMs struggle with FG-level property reasoning, highlighting the need to enhance reasoning capabilities in LLMs for chemistry tasks. We anticipate that the methodology employed in FGBench to construct datasets with functional group-level information will serve as a foundational framework for generating new question-answer pairs, enabling LLMs to better understand fine-grained molecular structure-property relationships. The dataset and evaluation code are available at https://github.com/xuanliugit/FGBench.

[549] SpectrumWorld: Artificial Intelligence Foundation for Spectroscopy

Zhuo Yang, Jiaqing Xie, Shuaike Shen, Daolang Wang, Yeyun Chen, Ben Gao, Shuzhou Sun, Biqing Qi, Dongzhan Zhou, Lei Bai, Linjiang Chen, Shufei Zhang, Jun Jiang, Tianfan Fu, Yuqiang Li

Main category: cs.LG

TL;DR: SpectrumLab is a unified platform for deep learning in spectroscopy, offering tools, benchmarks, and empirical insights to standardize research.

DetailsMotivation: To address the lack of standardized formulations in deep learning research for spectroscopy.

Method: Introduces SpectrumLab with three components: a Python library, SpectrumAnnotator for benchmarks, and SpectrumBench for diverse tasks.

Result: Empirical studies on SpectrumBench reveal limitations of current multimodal LLMs.

Conclusion: SpectrumLab aims to be a foundational tool for future deep learning advancements in spectroscopy.

Abstract: Deep learning holds immense promise for spectroscopy, yet research and evaluation in this emerging field often lack standardized formulations. To address this issue, we introduce SpectrumLab, a pioneering unified platform designed to systematize and accelerate deep learning research in spectroscopy. SpectrumLab integrates three core components: a comprehensive Python library featuring essential data processing and evaluation tools, along with leaderboards; an innovative SpectrumAnnotator module that generates high-quality benchmarks from limited seed data; and SpectrumBench, a multi-layered benchmark suite covering 14 spectroscopic tasks and over 10 spectrum types, featuring spectra curated from over 1.2 million distinct chemical substances. Thorough empirical studies on SpectrumBench with 18 cutting-edge multimodal LLMs reveal critical limitations of current approaches. We hope SpectrumLab will serve as a crucial foundation for future advancements in deep learning-driven spectroscopy.

[550] Imbalance-Robust and Sampling-Efficient Continuous Conditional GANs via Adaptive Vicinity and Auxiliary Regularization

Xin Ding, Yun Chen, Yongwei Wang, Kao Zhang, Sen Zhang, Peibei Cao, Xiangxue Wang

Main category: cs.LG

TL;DR: CcGAN-AVAR improves CcGAN by addressing data imbalance and computational inefficiency, offering faster inference and better generation quality.

DetailsMotivation: Existing methods like CcGAN and CCDM have limitations: CcGAN suffers from data imbalance, and CCDM is computationally expensive.

Method: CcGAN-AVAR introduces adaptive vicinity and a multi-task discriminator for better training, leveraging GAN’s one-step generation.

Result: Achieves 300x-2000x faster inference and state-of-the-art generation quality on benchmark datasets.

Conclusion: CcGAN-AVAR outperforms existing methods in efficiency and quality, addressing key challenges in conditional generative modeling.

Abstract: Recent advances in conditional generative modeling have introduced Continuous conditional Generative Adversarial Network (CcGAN) and Continuous Conditional Diffusion Model (CCDM) for estimating high-dimensional data distributions conditioned on scalar, continuous regression labels (e.g., angles, ages, or temperatures). However, these approaches face fundamental limitations: CcGAN suffers from data imbalance due to fixed-size vicinity constraints, while CCDM requires computationally expensive iterative sampling. We present CcGAN-AVAR, an enhanced CcGAN framework that addresses both challenges: (1) leveraging the GAN framework’s native one-step generation to overcome CCDMs’ sampling bottleneck (achieving 300x-2000x faster inference), while (2) two novel components specifically target data imbalance - an adaptive vicinity mechanism that dynamically adjusts vicinity’s size, and a multi-task discriminator that constructs two regularization terms (through auxiliary regression and density ratio estimation) to significantly improve generator training. Extensive experiments on four benchmark datasets (64x64 to 192x192 resolution) across eight challenging imbalanced settings demonstrate that CcGAN-AVAR achieves state-of-the-art generation quality while maintaining sampling efficiency.

[551] Stochastic Encodings for Active Feature Acquisition

Alexander Norcliffe, Changhee Lee, Fergus Imrie, Mihaela van der Schaar, Pietro Lio

Main category: cs.LG

TL;DR: The paper introduces a latent variable model for Active Feature Acquisition, outperforming existing methods like Reinforcement Learning and greedy mutual information maximization.

DetailsMotivation: Existing methods for Active Feature Acquisition, such as Reinforcement Learning and greedy mutual information maximization, have training difficulties or myopic acquisitions.

Method: A latent variable model is proposed, trained supervisedly, reasoning about features across unobserved realizations in a stochastic latent space.

Result: The approach reliably outperforms diverse baselines on synthetic and real datasets.

Conclusion: The introduced latent variable model effectively addresses shortcomings of prior methods in Active Feature Acquisition.

Abstract: Active Feature Acquisition is an instance-wise, sequential decision making problem. The aim is to dynamically select which feature to measure based on current observations, independently for each test instance. Common approaches either use Reinforcement Learning, which experiences training difficulties, or greedily maximize the conditional mutual information of the label and unobserved features, which makes myopic acquisitions. To address these shortcomings, we introduce a latent variable model, trained in a supervised manner. Acquisitions are made by reasoning about the features across many possible unobserved realizations in a stochastic latent space. Extensive evaluation on a large range of synthetic and real datasets demonstrates that our approach reliably outperforms a diverse set of baselines.

[552] A Compression Based Classification Framework Using Symbolic Dynamics of Chaotic Maps

Parth Naik, Harikrishnan N B

Main category: cs.LG

TL;DR: A novel classification framework, ChaosComp, uses symbolic dynamics and chaotic maps for efficient data encoding and classification, showing competitive results on real-world datasets.

DetailsMotivation: To reinterpret classification through dynamical systems and compression, foundational perspectives in learning theory and information processing.

Method: Symbolic sequences from thresholded training data are evolved via chaotic maps, forming class-specific probabilistic models. Testing involves symbolic encoding and compression-based prediction.

Result: Competitive performance on datasets like Breast Cancer Wisconsin (F1=0.9531), Seeds (F1=0.9475), and Iris (F1=0.8469).

Conclusion: ChaosComp offers a unique fusion of dynamical systems and compression, providing a fresh perspective on classification without aiming for state-of-the-art performance.

Abstract: We propose a novel classification framework grounded in symbolic dynamics and data compression using chaotic maps. The core idea is to model each class by generating symbolic sequences from thresholded real-valued training data, which are then evolved through a one-dimensional chaotic map. For each class, we compute the transition probabilities of symbolic patterns (e.g., 00', 01’, 10', and 11’ for the second return map) and aggregate these statistics to form a class-specific probabilistic model. During testing phase, the test data are thresholded and symbolized, and then encoded using the class-wise symbolic statistics via back iteration, a dynamical reconstruction technique. The predicted label corresponds to the class yielding the shortest compressed representation, signifying the most efficient symbolic encoding under its respective chaotic model. This approach fuses concepts from dynamical systems, symbolic representations, and compression-based learning. We evaluate the proposed method: \emph{ChaosComp} on both synthetic and real-world datasets, demonstrating competitive performance compared to traditional machine learning algorithms (e.g., macro F1-scores for the proposed method on Breast Cancer Wisconsin = 0.9531, Seeds = 0.9475, Iris = 0.8469 etc.). Rather than aiming for state-of-the-art performance, the goal of this research is to reinterpret the classification problem through the lens of dynamical systems and compression, which are foundational perspectives in learning theory and information processing.

[553] Clinical Expert Uncertainty Guided Generalized Label Smoothing for Medical Noisy Label Learning

Kunyu Zhang, Lin Gu, Liangchen Liu, Yingke Chen, Binyang Wang, Jin Yan, Yingying Zhu

Main category: cs.LG

TL;DR: The paper addresses label noise in medical image datasets caused by expert uncertainty in clinical notes, proposing an uncertainty-aware benchmark and label smoothing method to improve performance.

DetailsMotivation: Existing methods for creating medical image datasets from clinical notes overlook expert uncertainty, leading to noisy labels. This work aims to incorporate expert-driven uncertainty to improve label quality.

Method: The study examines the impact of expert uncertainty on label noise and introduces a clinical expert uncertainty-aware benchmark and label smoothing method.

Result: The proposed method significantly outperforms current state-of-the-art approaches in handling noisy labels.

Conclusion: Incorporating expert uncertainty into medical image analysis improves label quality and performance, addressing a critical gap in existing methods.

Abstract: Many previous studies have proposed extracting image labels from clinical notes to create large-scale medical image datasets at a low cost. However, these approaches inherently suffer from label noise due to uncertainty from the clinical experts. When radiologists and physicians analyze medical images to make diagnoses, they often include uncertainty-aware notes such as maybe'' or not excluded’’. Unfortunately, current text-mining methods overlook these nuances, resulting in the creation of noisy labels. Existing methods for handling noisy labels in medical image analysis, which typically address the problem through post-processing techniques, have largely ignored the important issue of expert-driven uncertainty contributing to label noise. To better incorporate the expert-written uncertainty in clinical notes into medical image analysis and address the label noise issue, we first examine the impact of clinical expert uncertainty on label noise. We then propose a clinical expert uncertainty-aware benchmark, along with a label smoothing method, which significantly improves performance compared to current state-of-the-art approaches.

cs.MA

[554] Frequency Point Game Environment for UAVs via Expert Knowledge and Large Language Model

Jingpu Yang

Main category: cs.MA

TL;DR: UAV-FPG is a game-theoretic model for UAV communication, integrating expert knowledge and large language models to improve anti-jamming strategies and path planning.

DetailsMotivation: Address challenges in spectrum competition modeling, expert knowledge integration, and opponent behavior prediction in UAV communication.

Method: Propose UAV-FPG, a game-theoretic model simulating dynamic interactions between interference and anti-interference strategies, using expert knowledge and large language models for optimization.

Result: Integration of expert knowledge and large language models improves path planning and outperforms fixed-path strategies.

Conclusion: UAV-FPG enhances anti-jamming strategies and intelligent decision-making in UAV communication systems.

Abstract: Unmanned Aerial Vehicles (UAVs) have made significant advancements in communication stability and security through techniques such as frequency hopping, signal spreading, and adaptive interference suppression. However, challenges remain in modeling spectrum competition, integrating expert knowledge, and predicting opponent behavior. To address these issues, we propose UAV-FPG (Unmanned Aerial Vehicle - Frequency Point Game), a game-theoretic environment model that simulates the dynamic interaction between interference and anti-interference strategies of opponent and ally UAVs in communication frequency bands. The model incorporates a prior expert knowledge base to optimize frequency selection and employs large language models for path planning, simulating a “strong adversary”. Experimental results highlight the effectiveness of integrating the expert knowledge base and the large language model, with the latter significantly improving path planning in dynamic scenarios through iterative interactions, outperforming fixed-path strategies. UAV-FPG provides a robust platform for advancing anti-jamming strategies and intelligent decision-making in UAV communication systems.

[555] TransAM: Transformer-Based Agent Modeling for Multi-Agent Systems via Local Trajectory Encoding

Conor Wallace, Umer Siddique, Yongcan Cao

Main category: cs.MA

TL;DR: The paper introduces TransAM, a transformer-based method for agent modeling using local trajectories, improving policy representations and performance in multi-agent systems.

DetailsMotivation: Existing agent modeling methods often rely on unrealistic access to other agents' episodic trajectories, limiting real-world applicability.

Method: Proposes TransAM, a transformer-based approach to encode local trajectories into an embedding space to capture other agents’ policies.

Result: Demonstrates strong policy representations, improved agent modeling, and higher episodic returns in cooperative, competitive, and mixed environments.

Conclusion: TransAM is effective for robust agent modeling using only local trajectories, enhancing performance in multi-agent systems.

Abstract: Agent modeling is a critical component in developing effective policies within multi-agent systems, as it enables agents to form beliefs about the behaviors, intentions, and competencies of others. Many existing approaches assume access to other agents’ episodic trajectories, a condition often unrealistic in real-world applications. Consequently, a practical agent modeling approach must learn a robust representation of the policies of the other agents based only on the local trajectory of the controlled agent. In this paper, we propose \texttt{TransAM}, a novel transformer-based agent modeling approach to encode local trajectories into an embedding space that effectively captures the policies of other agents. We evaluate the performance of the proposed method in cooperative, competitive, and mixed multi-agent environments. Extensive experimental results demonstrate that our approach generates strong policy representations, improves agent modeling, and leads to higher episodic returns.

[556] Engineered over Emergent Communication in MARL for Scalable and Sample-Efficient Cooperative Task Allocation in a Partially Observable Grid

Brennen A. Hill, Mant Koh En Wei, Thangavel Jishnuanandh

Main category: cs.MA

TL;DR: Comparison of learned (LDC) and engineered (Intention Communication) strategies in MARL shows engineered methods outperform in complex environments.

DetailsMotivation: To evaluate the effectiveness of learned versus engineered communication in cooperative MARL tasks.

Method: Learned Direct Communication (LDC) and Intention Communication (ITGM + MGN) are tested in fully and partially observable conditions.

Result: Engineered approach (Intention Communication) performs better, especially in complex environments.

Conclusion: Engineered communication strategies are more scalable and effective than learned ones in complex MARL tasks.

Abstract: We compare the efficacy of learned versus engineered communication strategies in a cooperative multi-agent reinforcement learning (MARL) environment. For the learned approach, we introduce Learned Direct Communication (LDC), where agents generate messages and actions concurrently via a neural network. Our engineered approach, Intention Communication, employs an Imagined Trajectory Generation Module (ITGM) and a Message Generation Network (MGN) to formulate messages based on predicted future states. Both strategies are evaluated on their success rates in cooperative tasks under fully and partially observable conditions. Our findings indicate that while emergent communication is viable, the engineered approach demonstrates superior performance and scalability, particularly as environmental complexity increases.

[557] Distributionally Robust Markov Games with Average Reward

Zachary Roch, Yue Wang

Main category: cs.MA

TL;DR: The paper introduces a distributionally robust Markov game (DR-MG) with average rewards for multi-agent decision-making under uncertainty, focusing on long-term performance. It connects multi-agent and single-agent settings, proves the existence of a robust Nash Equilibrium, and develops an algorithm for computing equilibria.

DetailsMotivation: To address multi-agent decision-making under uncertainty over extended horizons, where sustained reliability is critical, by optimizing worst-case average rewards.

Method: Formulates DR-MG with average rewards, connects multi-agent and single-agent settings, derives solvability of the robust Bellman equation, proves existence of robust Nash Equilibrium, and develops the robust Nash-Iteration algorithm.

Result: Establishes theoretical guarantees for system stability, proves existence of robust Nash Equilibrium, and provides an algorithm for computing equilibria. Also shows the connection between average-reward and discounted Nash Equilibria.

Conclusion: The work offers a comprehensive theoretical and algorithmic foundation for optimal strategies in uncertain, long-running multi-agent environments, bridging robust single-agent problems to multi-agent settings.

Abstract: This paper introduces the formulation of a distributionally robust Markov game (DR-MG) with average rewards, a crucial framework for multi-agent decision-making under uncertainty over extended horizons. Unlike finite-horizon or discounted models, the average-reward criterion naturally captures long-term performance for systems designed for continuous operation, where sustained reliability is paramount. We account for uncertainty in transition kernels, with players aiming to optimize their worst-case average reward. We first establish a connection between the multi-agent and single agent settings, and derive the solvability of the robust Bellman equation under the average-reward formulation. We then rigorously prove the existence of a robust Nash Equilibrium (NE), offering essential theoretical guarantees for system stability. We further develop and analyze an algorithm named robust Nash-Iteration to compute the robust Nash Equilibria among all agents, providing practical tools for identifying optimal strategies in complex, uncertain, and long-running multi-player environments. Finally, we demonstrate the connection between the average-reward NE and the well-studied discounted NEs, showing that the former can be approximated as the discount factor approaches one. Together, these contributions provide a comprehensive theoretical and algorithmic foundation for identifying optimal strategies in complex, uncertain, and long-running multi-player environments, which allow for the future extension of robust average-reward single-agent problems to the multi-agent setting.

[558] TVDO: Tchebycheff Value-Decomposition Optimization for Multi-Agent Reinforcement Learning

Xiaoliang Hu, Pengcheng Guo, Yadong Li, Guanyu Li, Zhen Cui, Jian Yang

Main category: cs.MA

TL;DR: The paper proposes a factorized Tchebycheff value-decomposition optimization (TVDO) method to address inconsistency in cooperative multiagent reinforcement learning (MARL) under centralized training with decentralized execution (CTDE).

DetailsMotivation: The inconsistency between jointly-trained policies and individually-executed actions in CTDE is a key challenge in MARL.

Method: A nonlinear Tchebycheff aggregation function is formulated to tightly constrain the upper bound of individual action-value bias, ensuring global optimum.

Result: Theoretical proof shows TVDO satisfies Individual-Global-Max (IGM) conditions, and empirical tests in climb/penalty games and SMAC benchmark confirm its superiority over state-of-the-art MARL baselines.

Conclusion: TVDO effectively ensures policy consistency and outperforms existing methods in cooperative MARL.

Abstract: In cooperative multiagent reinforcement learning (MARL), centralized training with decentralized execution (CTDE) has recently attracted more attention due to the physical demand. However, the most dilemma therein is the inconsistency between jointly-trained policies and individually-executed actions. In this article, we propose a factorized Tchebycheff value-decomposition optimization (TVDO) method to overcome the trouble of inconsistency. In particular, a nonlinear Tchebycheff aggregation function is formulated to realize the global optimum by tightly constraining the upper bound of individual action-value bias, which is inspired by the Tchebycheff method of multi-objective optimization. We theoretically prove that, under no extra limitations, the factorized value decomposition with Tchebycheff aggregation satisfies the sufficiency and necessity of Individual-Global-Max (IGM), which guarantees the consistency between the global and individual optimal action-value function. Empirically, in the climb and penalty game, we verify that TVDO precisely expresses the global-to-individual value decomposition with a guarantee of policy consistency. Meanwhile, we evaluate TVDO in the SMAC benchmark, and extensive experiments demonstrate that TVDO achieves a significant performance superiority over some SOTA MARL baselines.

cs.MM

[559] VisAug: Facilitating Speech-Rich Web Video Navigation and Engagement with Auto-Generated Visual Augmentations

Baoquan Zhao, Xiaofan Ma, Qianshi Pang, Ruomei Wang, Fan Zhou, Shujin Lin

Main category: cs.MM

TL;DR: VisAug is an interactive system that enhances speech-rich video navigation by generating visual augmentations based on speech content, addressing limitations of visual-based summarization.

DetailsMotivation: The rise of speech-rich video content lacks effective summarization tools, as most systems rely on visual cues, not audio.

Method: VisAug automatically generates visual augmentations from speech content to improve navigation and engagement.

Result: The system shows potential to significantly enhance video content consumption and engagement.

Conclusion: VisAug offers a promising solution for navigating speech-rich videos in a digital landscape dominated by video content.

Abstract: The widespread adoption of digital technology has ushered in a new era of digital transformation across all aspects of our lives. Online learning, social, and work activities, such as distance education, videoconferencing, interviews, and talks, have led to a dramatic increase in speech-rich video content. In contrast to other video types, such as surveillance footage, which typically contain abundant visual cues, speech-rich videos convey most of their meaningful information through the audio channel. This poses challenges for improving content consumption using existing visual-based video summarization, navigation, and exploration systems. In this paper, we present VisAug, a novel interactive system designed to enhance speech-rich video navigation and engagement by automatically generating informative and expressive visual augmentations based on the speech content of videos. Our findings suggest that this system has the potential to significantly enhance the consumption and engagement of information in an increasingly video-driven digital landscape.

[560] OpenLifelogQA: An Open-Ended Multi-Modal Lifelog Question-Answering Dataset

Quang-Linh Tran, Binh Nguyen, Gareth J. F. Jones, Cathal Gurrin

Main category: cs.MM

TL;DR: The paper introduces OpenLifelogQA, a novel lifelog QA dataset built on an 18-month lifelog dataset, featuring 14,187 Q&A pairs for practical applications in memory enhancement.

DetailsMotivation: To address the lack of large-scale, real-world QA datasets for lifelog data, hindering research in memory preservation and enhancement applications.

Method: Constructed OpenLifelogQA, a diverse and practical QA dataset, and evaluated it using baseline experiments with metrics like BERT Score, ROUGE-L, and LLM Score.

Result: Achieved competitive performance (89.7% BERT Score, 25.87% ROUGE-L, 3.9665 LLM Score) with the LLaVA-NeXT-Interleave 7B model.

Conclusion: OpenLifelogQA supports lifelog research, enabling advancements like personal chat-based assistants for lifelog data.

Abstract: Lifelogging refers to the process of passively collecting, storing, and analysing personal daily life data using wearable devices. This data can support applications in memory preservation and enhancement. For example, using an ask-and-answer strategy, question-answering (QA) on lifelog data opens an interactive and interesting way to explore memorable events and insights into daily life. However, research resources for QA on lifelog data are limited to small-sized or synthetic QA datasets. In this paper, we present a novel lifelog QA dataset called OpenLifelogQA, building upon an 18-month lifelog dataset. Our dataset focuses on an open-ended and practical QA with real-world application in daily lifelog usage. We construct 14,187 pairs of Q&A with diverse types and difficulty levels. A baseline experiment is reported for this dataset with competitive average performance of 89.7% BERT Score, 25.87% ROUGE-L and 3.9665 LLM Score from LLaVA-NeXT-Interleave 7B model. We release this Q&A dataset to the research community to support new research into lifelog technologies, such as enabling personal chat-based assistants for lifelog data to become a reality.

eess.AS

[561] SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec

Chunyu Qiang, Haoyu Wang, Cheng Gong, Tianrui Wang, Ruibo Fu, Tao Wang, Ruilong Chen, Jiangyan Yi, Zhengqi Wen, Chen Zhang, Longbiao Wang, Jianwu Dang, Jianhua Tao

Main category: eess.AS

TL;DR: SecoustiCodec is a novel speech codec addressing challenges like residual paralinguistic info and semantic incompleteness, achieving SOTA reconstruction quality at low bitrates.

DetailsMotivation: Existing speech codecs struggle with semantic encoding due to residual paralinguistic info, incomplete semantics, and lack of streaming support.

Method: Proposes a cross-modal aligned low-bitrate streaming codec using VAE and FSQ for quantization, contrastive learning for disentanglement, and multi-stage optimization.

Result: Achieves PESQ scores of 1.77/2.58 at 0.27/1 kbps, outperforming existing methods.

Conclusion: SecoustiCodec effectively disentangles semantic and paralinguistic info, offering high-quality reconstruction and open-sourcing for broader use.

Abstract: Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing codec methods face several challenges in semantic encoding, such as residual paralinguistic information (e.g., timbre, emotion), insufficient semantic completeness, limited reconstruction capability, and lack of support for streaming. To address these challenges, we propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codec that disentangles semantic and paralinguistic information in a single-codebook space. To ensure semantic completeness and reconstruction fidelity, paralinguistic encoding is introduced to bridge the information gap between semantic and acoustic encoding. A semantic-only efficient quantization method based on VAE (Variational Autoencoder) and FSQ (Finite Scalar Quantization) is proposed. This approach alleviates the long-tail distribution problem of tokens while maintaining high codebook utilization. A semantic disentanglement method based on contrastive learning is proposed, which aligns text and speech in a joint multimodal frame-level space, effectively removing paralinguistic information from semantic encoding. An acoustic-constrained multi-stage optimization strategy is proposed to ensure robust and stable convergence. Figure~\ref{fig:pesq_kbps_below_2kbps} shows SecoustiCodec achieves SOTA (state-of-the-art) reconstruction quality (PESQ) of 1.77/2.58 at 0.27/1 kbps. The code and model weights for SecoustiCodec will be open-sourced upon the completion of the peer-review process. We’ve open-sourced SecoustiCodec’s demo, code, and model weights.

[562] Real-time speech enhancement in noise for throat microphone using neural audio codec as foundation model

Julien Hauret, Thomas Joubaud, Éric Bavu

Main category: eess.AS

TL;DR: A real-time speech enhancement demo using a throat microphone, fine-tuning Kyutai’s Mimi neural audio codec on the Vibravox dataset, outperforming state-of-the-art models.

DetailsMotivation: To enhance speech captured in noisy environments using a throat microphone, which naturally attenuates noise but reduces audio bandwidth.

Method: Fine-tune Kyutai’s Mimi neural audio codec on the Vibravox dataset (paired air-conducted and throat microphone recordings) for real-time inference.

Result: Demonstrates superior performance compared to state-of-the-art models, with an interactive interface for users.

Conclusion: The approach effectively enhances throat microphone speech in noisy environments, offering real-time performance and user-friendly features.

Abstract: We present a real-time speech enhancement demo using speech captured with a throat microphone. This demo aims to showcase the complete pipeline, from recording to deep learning-based post-processing, for speech captured in noisy environments with a body-conducted microphone. The throat microphone records skin vibrations, which naturally attenuate external noise, but this robustness comes at the cost of reduced audio bandwidth. To address this challenge, we fine-tune Kyutai’s Mimi–a neural audio codec supporting real-time inference–on Vibravox, a dataset containing paired air-conducted and throat microphone recordings. We compare this enhancement strategy against state-of-the-art models and demonstrate its superior performance. The inference runs in an interactive interface that allows users to toggle enhancement, visualize spectrograms, and monitor processing latency.

[563] Fast Algorithm for Moving Sound Source

Dong Yang

Main category: eess.AS

TL;DR: Proposes Yang’s motion spatio-temporal sampling theory to simulate motion-induced reverberation, improving speech enhancement models in dynamic scenarios.

DetailsMotivation: Existing methods lack physics-compliant motion data simulation, limiting training for speech enhancement in moving scenarios.

Method: Decomposes moving image source’s impulse response into linear time-invariant modulation and fractional delay, using hierarchical sampling for efficiency.

Result: Accurately restores amplitude and phase changes in motion, outperforming GSound, and enhances speech enhancement model robustness.

Conclusion: Provides high-quality dynamic training data and improves multi-channel voice tracking, solving industry challenges.

Abstract: Modern neural network-based speech processing systems need reverberation resistance, relying on large amounts of reverberation data for training. Existing methods simulate dynamic scenarios by sampling static systems or supplement with measured data, but struggle to simulate motion data conforming to physical laws. To address insufficient training data for speech enhancement models in moving scenarios, this paper proposes Yang’s motion spatio-temporal sampling reconstruction theory, enabling efficient simulation of motion-induced continuous time-varying reverberation. It breaks through the limitations of traditional static Image-Source Method (ISM) in time-varying systems by decomposing the moving image source’s impulse response into linear time-invariant modulation and discrete time-varying fractional delay, establishing a physics-compliant moving sound field model. Based on the band-limited nature of motion displacement, a hierarchical sampling strategy is adopted: high sampling rates for low-order images to retain details, and low rates for high-order ones to reduce complexity, combined with a fast synthesis architecture for real-time simulation. Experiments show that compared to open-source model GSound, the theory more accurately restores amplitude and phase changes in moving scenarios, solving the industry challenge of motion sound source data simulation. It provides high-quality dynamic training data for speech enhancement models and improves the robustness of multi-channel end-to-end voice tracking algorithms.

[564] Kernel ridge regression based sound field estimation using a rigid spherical microphone array

Ryo Matsuda, Juliano G. C. Ribeiro, Hitoshi Akiyama, Jorge Trevino

Main category: eess.AS

TL;DR: A kernel ridge regression method for sound field estimation using a rigid spherical microphone array, incorporating boundary constraints of scatterers.

DetailsMotivation: Existing methods assume open-sphere configurations or ignore scatterer boundary conditions. This work leverages the rigid sphere's known properties for better estimation.

Method: Uses kernel ridge regression with physically constrained and adapted kernel functions, incorporating rigid sphere boundary conditions into the sound field representation.

Result: Demonstrated effectiveness through numerical simulations and real-world experiments with a new spherical microphone array.

Conclusion: The proposed method improves sound field estimation by explicitly accounting for rigid scatterer boundary conditions.

Abstract: We propose a sound field estimation method based on kernel ridge regression using a rigid spherical microphone array. Kernel ridge regression with physically constrained kernel functions, and further with kernel functions adapted to observed sound fields, have proven to be powerful tools. However, such methods generally assume an open-sphere microphone array configuration, i.e., no scatterers exist within the observation or estimation region. Alternatively, some approaches assume the presence of scatterers and attempt to eliminate their influence through a least-squares formulation. Even then, these methods typically do not incorporate the boundary conditions of the scatterers, which are not presumed to be known. In contrast, we exploit the fact the scatterer here is a rigid sphere. Meaning, both the virtual scattering source locations and the boundary conditions are well-defined. Based on this, we formulate the scattered sound field within the kernel ridge regression framework and propose a novel sound field representation incorporating a boundary constraint. The effectiveness of the proposed method is demonstrated through numerical simulations and real-world experiments using a newly developed spherical microphone array.

[565] PatchDSU: Uncertainty Modeling for Out of Distribution Generalization in Keyword Spotting

Bronya Roni Chernyak, Yael Segal, Yosi Shrem, Joseph Keshet

Main category: eess.AS

TL;DR: PatchDSU improves out-of-domain generalization in speech systems by splitting inputs into patches and augmenting each independently, outperforming other methods.

DetailsMotivation: Addressing distribution shifts in speech systems due to varying environments and speaker diversity, which challenge deep learning models.

Method: Extends Domain Shifts with Uncertainty (DSU) by splitting input into patches and augmenting each patch independently. Evaluated on Google Speech Commands, Librispeech, and TED-LIUM under noise conditions.

Result: PatchDSU and DSU outperform other methods, with PatchDSU showing more consistent improvements across scenarios.

Conclusion: PatchDSU effectively tackles out-of-distribution issues in speech, offering robust generalization.

Abstract: Deep learning models excel at many tasks but rely on the assumption that training and test data follow the same distribution. This assumption often does not hold in real-world speech systems, where distribution shifts are common due to varying environments, recording conditions, and speaker diversity. The method of Domain Shifts with Uncertainty (DSU) augments the input of each neural network layer based on the input feature statistics. It addresses the problem of out-of-domain generalization by assuming feature statistics follow a multivariate Gaussian distribution and substitutes the input with sampled features from this distribution. While effective for computer vision, applying DSU to speech presents challenges due to the nature of the data. Unlike static visual data, speech is a temporal signal commonly represented by a spectrogram

  • the change of frequency over time. This representation cannot be treated as a simple image, and the resulting sparsity can lead to skewed feature statistics when applied to the entire input. To tackle out-of-distribution issues in keyword spotting, we propose PatchDSU, which extends DSU by splitting the input into patches and independently augmenting each patch. We evaluated PatchDSU and DSU alongside other methods on the Google Speech Commands, Librispeech, and TED-LIUM. Additionally, we evaluated performance under white Gaussian and MUSAN music noise conditions. We also explored out-of-domain generalization by analyzing model performance on datasets they were not trained on. Overall, in most cases, both PatchDSU and DSU outperform other methods. Notably, PatchDSU demonstrates more consistent improvements across the evaluated scenarios compared to other approaches.

[566] Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis

Yifan Yang, Shujie Liu, Jinyu Li, Yuxuan Hu, Haibin Wu, Hui Wang, Jianwei Yu, Lingwei Meng, Haiyang Sun, Yanqing Liu, Yan Lu, Kai Yu, Xie Chen

Main category: eess.AS

TL;DR: PALLE introduces a pseudo-autoregressive (PAR) codec language model for TTS, combining AR and NAR strengths for faster, high-quality speech synthesis.

DetailsMotivation: Address the trade-offs between slow AR models and less controllable NAR models in zero-shot TTS.

Method: Two-stage system: PAR for initial generation with dynamic-length spans, followed by NAR refinement of low-confidence tokens.

Result: Outperforms state-of-the-art systems in quality, similarity, and intelligibility, with 10x faster inference.

Conclusion: PALLE unifies AR and NAR advantages, offering efficient, high-quality zero-shot TTS.

Abstract: Recent zero-shot text-to-speech (TTS) systems face a common dilemma: autoregressive (AR) models suffer from slow generation and lack duration controllability, while non-autoregressive (NAR) models lack temporal modeling and typically require complex designs. In this paper, we introduce a novel pseudo-autoregressive (PAR) codec language modeling approach that unifies AR and NAR modeling. Combining explicit temporal modeling from AR with parallel generation from NAR, PAR generates dynamic-length spans at fixed time steps. Building on PAR, we propose PALLE, a two-stage TTS system that leverages PAR for initial generation followed by NAR refinement. In the first stage, PAR progressively generates speech tokens along the time dimension, with each step predicting all positions in parallel but only retaining the left-most span. In the second stage, low-confidence tokens are iteratively refined in parallel, leveraging the global contextual information. Experiments demonstrate that PALLE, trained on LibriTTS, outperforms state-of-the-art systems trained on large-scale data, including F5-TTS, E2-TTS, and MaskGCT, on the LibriSpeech test-clean set in terms of speech quality, speaker similarity, and intelligibility, while achieving up to ten times faster inference speed. Audio samples are available at https://microsoft.com/research/project/vall-e-x/palle.

eess.IV

[567] MPCA-based Domain Adaptation for Transfer Learning in Ultrasonic Guided Waves

Lucio Pinello, Francesco Cadini, Luca Lomazzi

Main category: eess.IV

TL;DR: A transfer learning framework using MPCA and CNN improves damage localization in UGW-based SHM, reducing errors across materials and sensor setups.

DetailsMotivation: Data scarcity and limited generalization hinder large-scale UGW-based ML methods for SHM.

Method: Combines MPCA for shared feature extraction and fine-tuning to adapt a pre-trained CNN to new domains.

Result: Tested on 12 cases, showing reduced localization error and better domain alignment.

Conclusion: The MPCA-based TL framework is robust, data-efficient, and effective for UGW-based SHM.

Abstract: Ultrasonic Guided Waves (UGWs) represent a promising diagnostic tool for Structural Health Monitoring (SHM) in thin-walled structures, and their integration with machine learning (ML) algorithms is increasingly being adopted to enable real-time monitoring capabilities. However, the large-scale deployment of UGW-based ML methods is constrained by data scarcity and limited generalisation across different materials and sensor configurations. To address these limitations, this work proposes a novel transfer learning (TL) framework based on Multilinear Principal Component Analysis (MPCA). First, a Convolutional Neural Network (CNN) for regression is trained to perform damage localisation for a plated structure. Then, MPCA and fine-tuning are combined to have the CNN work for a different plate. By jointly applying MPCA to the source and target domains, the method extracts shared latent features, enabling effective domain adaptation without requiring prior assumptions about dimensionality. Following MPCA, fine-tuning enables adapting the pre-trained CNN to a new domain without the need for a large training dataset. The proposed MPCA-based TL method was tested against 12 case studies involving different composite materials and sensor arrays. Statistical metrics were used to assess domains alignment both before and after MPCA, and the results demonstrate a substantial reduction in localisation error compared to standard TL techniques. Hence, the proposed approach emerges as a robust, data-efficient, and statistically based TL framework for UGW-based SHM.

[568] Spatial-Temporal-Spectral Mamba with Sparse Deformable Token Sequence for Enhanced MODIS Time Series Classification

Zack Dewis, Zhengsen Xu, Yimin Zhu, Motasem Alkayid, Mabel Heffring, Lincoln Linlin Xu

Main category: eess.IV

TL;DR: A novel STSMamba method with deformable token sequence is proposed for MODIS time series classification, addressing challenges like high dimensionality and mixed pixels. It introduces TGS, SDMS, and three Mamba modules (SDSpaM, SDSpeM, SDTM) for improved accuracy and efficiency.

DetailsMotivation: MODIS time series data is vital for land cover classification but suffers from high dimensionality, mixed pixels, and coupling effects, making classification difficult.

Method: The STSMamba method includes a TGS module for initial feature learning, SDMS for efficient Mamba sequencing, and three Mamba modules (SDSpaM, SDSpeM, SDTM) for spatial-temporal-spectral feature learning.

Result: The proposed method outperforms state-of-the-art approaches in classification accuracy while reducing computational complexity.

Conclusion: STSMamba effectively addresses MODIS classification challenges, offering higher accuracy and efficiency through innovative feature learning and sequencing techniques.

Abstract: Although MODIS time series data are critical for supporting dynamic, large-scale land cover land use classification, it is a challenging task to capture the subtle class signature information due to key MODIS difficulties, e.g., high temporal dimensionality, mixed pixels, and spatial-temporal-spectral coupling effect. This paper presents a novel spatial-temporal-spectral Mamba (STSMamba) with deformable token sequence for enhanced MODIS time series classification, with the following key contributions. First, to disentangle temporal-spectral feature coupling, a temporal grouped stem (TGS) module is designed for initial feature learning. Second, to improve Mamba modeling efficiency and accuracy, a sparse, deformable Mamba sequencing (SDMS) approach is designed, which can reduce the potential information redundancy in Mamba sequence and improve the adaptability and learnability of the Mamba sequencing. Third, based on SDMS, to improve feature learning, a novel spatial-temporal-spectral Mamba architecture is designed, leading to three modules, i.e., a sparse deformable spatial Mamba module (SDSpaM), a sparse deformable spectral Mamba module (SDSpeM), and a sparse deformable temporal Mamba module (SDTM) to explicitly learn key information sources in MODIS. The proposed approach is tested on MODIS time series data in comparison with many state-of-the-art approaches, and the results demonstrate that the proposed approach can achieve higher classification accuracy with reduced computational complexity.

[569] Evaluation of 3D Counterfactual Brain MRI Generation

Pengwei Sun, Wei Peng, Lun Yu Li, Yixin Wang, Kilian M. Pohl

Main category: eess.IV

TL;DR: The paper introduces a method to generate realistic 3D brain MRIs using counterfactual approaches with anatomy-guided conditioning, evaluated on ADNI and NCANDA datasets.

DetailsMotivation: To address challenges in generating realistic 3D brain MRIs that respect anatomical and causal constraints, aiming for better disease understanding and data generation.

Method: Six generative models are adapted into 3D counterfactual approaches using an anatomy-guided framework with regional brain volumes as conditioning inputs.

Result: Anatomically grounded conditioning modifies targeted regions effectively but struggles with preserving non-targeted structures.

Conclusion: The work sets a foundation for interpretable generative modeling of brain MRIs but calls for improved architectures to better capture anatomical interdependencies.

Abstract: Counterfactual generation offers a principled framework for simulating hypothetical changes in medical imaging, with potential applications in understanding disease mechanisms and generating physiologically plausible data. However, generating realistic structural 3D brain MRIs that respect anatomical and causal constraints remains challenging due to data scarcity, structural complexity, and the lack of standardized evaluation protocols. In this work, we convert six generative models into 3D counterfactual approaches by incorporating an anatomy-guided framework based on a causal graph, in which regional brain volumes serve as direct conditioning inputs. Each model is evaluated with respect to composition, reversibility, realism, effectiveness and minimality on T1-weighted brain MRIs (T1w MRIs) from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). In addition, we test the generalizability of each model with respect to T1w MRIs of the National Consortium on Alcohol and Neurodevelopment in Adolescence (NCANDA). Our results indicate that anatomically grounded conditioning successfully modifies the targeted anatomical regions; however, it exhibits limitations in preserving non-targeted structures. Beyond laying the groundwork for more interpretable and clinically relevant generative modeling of brain MRIs, this benchmark highlights the need for novel architectures that more accurately capture anatomical interdependencies.

[570] REFLECT: Rectified Flows for Efficient Brain Anomaly Correction Transport

Farzad Beizaee, Sina Hajimiri, Ismail Ben Ayed, Gregory Lodygensky, Christian Desrosiers, Jose Dolz

Main category: eess.IV

TL;DR: REFLECT introduces a novel framework using rectified flows for unsupervised anomaly detection in brain MR images, enabling efficient correction and precise anomaly localization in a single step.

DetailsMotivation: Accurate anomaly localization in brain imaging is challenging due to complex anatomy and lack of labeled abnormal data.

Method: REFLECT leverages rectified flows to create a direct, linear correction trajectory for abnormal MR images, enabling single-step inference.

Result: REFLECT outperforms state-of-the-art methods on UAD benchmarks, offering efficient correction and precise anomaly detection.

Conclusion: REFLECT provides a robust, efficient solution for unsupervised anomaly detection in brain imaging, with potential for clinical applications.

Abstract: Unsupervised anomaly detection (UAD) in brain imaging is crucial for identifying pathologies without the need for labeled data. However, accurately localizing anomalies remains challenging due to the intricate structure of brain anatomy and the scarcity of abnormal examples. In this work, we introduce REFLECT, a novel framework that leverages rectified flows to establish a direct, linear trajectory for correcting abnormal MR images toward a normal distribution. By learning a straight, one-step correction transport map, our method efficiently corrects brain anomalies and can precisely localize anomalies by detecting discrepancies between anomalous input and corrected counterpart. In contrast to the diffusion-based UAD models, which require iterative stochastic sampling, rectified flows provide a direct transport map, enabling single-step inference. Extensive experiments on popular UAD brain segmentation benchmarks demonstrate that REFLECT significantly outperforms state-of-the-art unsupervised anomaly detection methods. The code is available at https://github.com/farzad-bz/REFLECT.

[571] AMD-Mamba: A Phenotype-Aware Multi-Modal Framework for Robust AMD Prognosis

Puzhen Wu, Mingquan Lin, Qingyu Chen, Emily Y. Chew, Zhiyong Lu, Yifan Peng, Hexin Dong

Main category: eess.IV

TL;DR: AMD-Mamba is a multi-modal framework for AMD prognosis, integrating fundus images, genetic variants, and socio-demographic data. It uses metric learning and Vision Mamba for improved feature representation and outperforms existing methods.

DetailsMotivation: AMD is a leading cause of vision loss, necessitating better prognosis tools for early intervention.

Method: Combines color fundus images, genetic variants, and socio-demographic variables with metric learning and Vision Mamba for local and global feature fusion.

Result: Outperforms conventional methods, identifies a significant AMD biomarker, and improves early detection of high-risk patients.

Conclusion: AMD-Mamba enhances AMD prognosis precision, aiding proactive management.

Abstract: Age-related macular degeneration (AMD) is a leading cause of irreversible vision loss, making effective prognosis crucial for timely intervention. In this work, we propose AMD-Mamba, a novel multi-modal framework for AMD prognosis, and further develop a new AMD biomarker. This framework integrates color fundus images with genetic variants and socio-demographic variables. At its core, AMD-Mamba introduces an innovative metric learning strategy that leverages AMD severity scale score as prior knowledge. This strategy allows the model to learn richer feature representations by aligning learned features with clinical phenotypes, thereby improving the capability of conventional prognosis methods in capturing disease progression patterns. In addition, unlike existing models that use traditional CNN backbones and focus primarily on local information, such as the presence of drusen, AMD-Mamba applies Vision Mamba and simultaneously fuses local and long-range global information, such as vascular changes. Furthermore, we enhance prediction performance through multi-scale fusion, combining image information with clinical variables at different resolutions. We evaluate AMD-Mamba on the AREDS dataset, which includes 45,818 color fundus photographs, 52 genetic variants, and 3 socio-demographic variables from 2,741 subjects. Our experimental results demonstrate that our proposed biomarker is one of the most significant biomarkers for the progression of AMD. Notably, combining this biomarker with other existing variables yields promising improvements in detecting high-risk AMD patients at early stages. These findings highlight the potential of our multi-modal framework to facilitate more precise and proactive management of AMD.

[572] ClinicalFMamba: Advancing Clinical Assessment using Mamba-based Multimodal Neuroimaging Fusion

Meng Zhou, Farzad Khalvati

Main category: eess.IV

TL;DR: Proposes ClinicalFMamba, a CNN-Mamba hybrid for efficient 2D/3D medical image fusion, excelling in local and global feature modeling with real-time performance.

DetailsMotivation: Existing methods (CNNs, Transformers) have limitations in global context modeling or computational efficiency, hindering clinical deployment.

Method: Introduces ClinicalFMamba, combining CNN and Mamba for local-global feature synergy, with a tri-plane scanning strategy for 3D data.

Result: Outperforms baselines in fusion metrics and downstream tasks like brain tumor classification, with real-time efficiency.

Conclusion: Sets a new standard for efficient, clinically viable multimodal medical image fusion.

Abstract: Multimodal medical image fusion integrates complementary information from different imaging modalities to enhance diagnostic accuracy and treatment planning. While deep learning methods have advanced performance, existing approaches face critical limitations: Convolutional Neural Networks (CNNs) excel at local feature extraction but struggle to model global context effectively, while Transformers achieve superior long-range modeling at the cost of quadratic computational complexity, limiting clinical deployment. Recent State Space Models (SSMs) offer a promising alternative, enabling efficient long-range dependency modeling in linear time through selective scan mechanisms. Despite these advances, the extension to 3D volumetric data and the clinical validation of fused images remains underexplored. In this work, we propose ClinicalFMamba, a novel end-to-end CNN-Mamba hybrid architecture that synergistically combines local and global feature modeling for 2D and 3D images. We further design a tri-plane scanning strategy for effectively learning volumetric dependencies in 3D images. Comprehensive evaluations on three datasets demonstrate the superior fusion performance across multiple quantitative metrics while achieving real-time fusion. We further validate the clinical utility of our approach on downstream 2D/3D brain tumor classification tasks, achieving superior performance over baseline methods. Our method establishes a new paradigm for efficient multimodal medical image fusion suitable for real-time clinical deployment.

[573] A Survey of Medical Point Cloud Shape Learning: Registration, Reconstruction and Variation

Tongxu Zhang, Zhiming Liang, Bei Wang

Main category: eess.IV

TL;DR: A survey of learning-based shape analysis for medical point clouds, covering registration, reconstruction, and variation modeling, with trends like hybrid representations and generative techniques.

DetailsMotivation: Point clouds offer a compact, surface-preserving alternative for 3D medical imaging, but challenges like data scarcity and clinical robustness persist.

Method: Review of literature (2021-2025), summarizing methods, datasets, metrics, and clinical applications.

Result: Highlights trends like self-supervised models and generative techniques, alongside limitations like inter-patient variability.

Conclusion: Future directions focus on advancing point cloud-based shape learning for medical imaging, addressing current challenges.

Abstract: Point clouds have become an increasingly important representation for 3D medical imaging, offering a compact, surface-preserving alternative to traditional voxel or mesh-based approaches. Recent advances in deep learning have enabled rapid progress in extracting, modeling, and analyzing anatomical shapes directly from point cloud data. This paper provides a comprehensive and systematic survey of learning-based shape analysis for medical point clouds, focusing on three fundamental tasks: registration, reconstruction, and variation modeling. We review recent literature from 2021 to 2025, summarize representative methods, datasets, and evaluation metrics, and highlight clinical applications and unique challenges in the medical domain. Key trends include the integration of hybrid representations, large-scale self-supervised models, and generative techniques. We also discuss current limitations, such as data scarcity, inter-patient variability, and the need for interpretable and robust solutions for clinical deployment. Finally, future directions are outlined for advancing point cloud-based shape learning in medical imaging.

[574] Nexus-INR: Diverse Knowledge-guided Arbitrary-Scale Multimodal Medical Image Super-Resolution

Bo Zhang, JianFei Huo, Zheng Zhang, Wufan Wang, Hui Gao, Xiangyang Gong, Wendong Wang

Main category: eess.IV

TL;DR: Nexus-INR is a novel framework for adaptive-resolution medical image super-resolution, combining dual-branch encoding, cross-modal knowledge distillation, and integrated segmentation to outperform existing methods.

DetailsMotivation: Traditional CNN-based methods lack flexibility for arbitrary-resolution super-resolution (ARSR), and INR-based methods struggle with multi-modal images. Nexus-INR addresses these gaps.

Method: Nexus-INR uses a dual-branch encoder, cross-modal knowledge distillation, and an integrated segmentation module to enhance reconstruction and segmentation.

Result: Experiments on BraTS2020 show Nexus-INR outperforms state-of-the-art methods in super-resolution and segmentation tasks.

Conclusion: Nexus-INR effectively combines diverse knowledge and tasks to achieve high-quality ARSR, improving both reconstruction and downstream performance.

Abstract: Arbitrary-resolution super-resolution (ARSR) provides crucial flexibility for medical image analysis by adapting to diverse spatial resolutions. However, traditional CNN-based methods are inherently ill-suited for ARSR, as they are typically designed for fixed upsampling factors. While INR-based methods overcome this limitation, they still struggle to effectively process and leverage multi-modal images with varying resolutions and details. In this paper, we propose Nexus-INR, a Diverse Knowledge-guided ARSR framework, which employs varied information and downstream tasks to achieve high-quality, adaptive-resolution medical image super-resolution. Specifically, Nexus-INR contains three key components. A dual-branch encoder with an auxiliary classification task to effectively disentangle shared anatomical structures and modality-specific features; a knowledge distillation module using cross-modal attention that guides low-resolution modality reconstruction with high-resolution reference, enhanced by self-supervised consistency loss; an integrated segmentation module that embeds anatomical semantics to improve both reconstruction quality and downstream segmentation performance. Experiments on the BraTS2020 dataset for both super-resolution and downstream segmentation demonstrate that Nexus-INR outperforms state-of-the-art methods across various metrics.

[575] GL-LCM: Global-Local Latent Consistency Models for Fast High-Resolution Bone Suppression in Chest X-Ray Images

Yifei Sun, Zhanghao Chen, Hao Zheng, Yuqing Lu, Lixin Duan, Fenglei Fan, Ahmed Elazab, Xiang Wan, Changmiao Wang, Ruiquan Ge

Main category: eess.IV

TL;DR: The paper introduces GL-LCM, a deep learning model for fast and high-resolution bone suppression in CXR images, addressing limitations of existing diffusion-based methods.

DetailsMotivation: Bone structures in CXR images obscure critical details, hindering diagnosis. Existing methods struggle with balancing bone suppression and detail preservation, along with high computational costs.

Method: Proposes GL-LCM, combining lung segmentation, dual-path sampling, and global-local fusion, with Local-Enhanced Guidance to address boundary artifacts.

Result: GL-LCM outperforms existing methods in bone suppression and computational efficiency on datasets SZCH-X-Rays and JSRT.

Conclusion: GL-LCM offers a practical solution for improving CXR image clarity and diagnostic accuracy with superior performance and efficiency.

Abstract: Chest X-Ray (CXR) imaging for pulmonary diagnosis raises significant challenges, primarily because bone structures can obscure critical details necessary for accurate diagnosis. Recent advances in deep learning, particularly with diffusion models, offer significant promise for effectively minimizing the visibility of bone structures in CXR images, thereby improving clarity and diagnostic accuracy. Nevertheless, existing diffusion-based methods for bone suppression in CXR imaging struggle to balance the complete suppression of bones with preserving local texture details. Additionally, their high computational demand and extended processing time hinder their practical use in clinical settings. To address these limitations, we introduce a Global-Local Latent Consistency Model (GL-LCM) architecture. This model combines lung segmentation, dual-path sampling, and global-local fusion, enabling fast high-resolution bone suppression in CXR images. To tackle potential boundary artifacts and detail blurring in local-path sampling, we further propose Local-Enhanced Guidance, which addresses these issues without additional training. Comprehensive experiments on a self-collected dataset SZCH-X-Rays, and the public dataset JSRT, reveal that our GL-LCM delivers superior bone suppression and remarkable computational efficiency, significantly outperforming several competitive methods. Our code is available at https://github.com/diaoquesang/GL-LCM.

[576] Evaluating the Predictive Value of Preoperative MRI for Erectile Dysfunction Following Radical Prostatectomy

Gideon N. L. Rouwendaal, Daniël Boeke, Inge L. Cox, Henk G. van der Poel, Margriet C. van Dijk-de Haan, Regina G. H. Beets-Tan, Thierry N. Boellaard, Wilson Silva

Main category: eess.IV

TL;DR: MRI-based models for predicting post-prostatectomy ED showed slight improvements over anatomical features but did not surpass clinical-only models. Fusion models offered minimal gains, with clinical features remaining the strongest predictors.

DetailsMotivation: To evaluate whether preoperative MRI adds predictive value for ED post-radical prostatectomy compared to clinical features alone.

Method: Four modeling strategies were tested: clinical-only baseline, classical models with handcrafted MRI features, deep learning on MRI slices, and multimodal fusion of imaging and clinical data.

Result: MRI models (AUC 0.569) slightly outperformed anatomical approaches (AUC 0.554) but were inferior to clinical baseline (AUC 0.663). Fusion models showed marginal improvement (AUC 0.586). Clinical features were the top contributors.

Conclusion: MRI did not outperform clinical predictors but may complement them in future multimodal approaches, as imaging models focused on relevant anatomical regions.

Abstract: Accurate preoperative prediction of erectile dysfunction (ED) is important for counseling patients undergoing radical prostatectomy. While clinical features are established predictors, the added value of preoperative MRI remains underexplored. We investigate whether MRI provides additional predictive value for ED at 12 months post-surgery, evaluating four modeling strategies: (1) a clinical-only baseline, representing current state-of-the-art; (2) classical models using handcrafted anatomical features derived from MRI; (3) deep learning models trained directly on MRI slices; and (4) multimodal fusion of imaging and clinical inputs. Imaging-based models (maximum AUC 0.569) slightly outperformed handcrafted anatomical approaches (AUC 0.554) but fell short of the clinical baseline (AUC 0.663). Fusion models offered marginal gains (AUC 0.586) but did not exceed clinical-only performance. SHAP analysis confirmed that clinical features contributed most to predictive performance. Saliency maps from the best-performing imaging model suggested a predominant focus on anatomically plausible regions, such as the prostate and neurovascular bundles. While MRI-based models did not improve predictive performance over clinical features, our findings suggest that they try to capture patterns in relevant anatomical structures and may complement clinical predictors in future multimodal approaches.

[577] CADD: Context aware disease deviations via restoration of brain images using normative conditional diffusion models

Ana Lawry Aguila, Ayodeji Ijishakin, Juan Eugenio Iglesias, Tomomi Takenaga, Yukihiro Nomura, Takeharu Yoshikawa, Osamu Abe, Shouhei Hanaoka

Main category: eess.IV

TL;DR: CADD is a conditional diffusion model for normative modeling in 3D brain images, improving anomaly detection by incorporating clinical context and a novel inpainting strategy.

DetailsMotivation: Detecting pathology in heterogeneous medical cohorts is challenging, and existing methods lack clinical context or perform poorly in restoring healthy regions.

Method: CADD uses a conditional diffusion model with an inference inpainting strategy to balance anomaly removal and feature retention.

Result: CADD achieves state-of-the-art performance in detecting neurological abnormalities across diverse clinical datasets.

Conclusion: CADD advances normative modeling by integrating clinical guidance and robust restoration, enhancing disease detection in real-world medical data.

Abstract: Applying machine learning to real-world medical data, e.g. from hospital archives, has the potential to revolutionize disease detection in brain images. However, detecting pathology in such heterogeneous cohorts is a difficult challenge. Normative modeling, a form of unsupervised anomaly detection, offers a promising approach to studying such cohorts where the ``normal’’ behavior is modeled and can be used at subject level to detect deviations relating to disease pathology. Diffusion models have emerged as powerful tools for anomaly detection due to their ability to capture complex data distributions and generate high-quality images. Their performance relies on image restoration; differences between the original and restored images highlight potential abnormalities. However, unlike normative models, these diffusion model approaches do not incorporate clinical information which provides important context to guide the disease detection process. Furthermore, standard approaches often poorly restore healthy regions, resulting in poor reconstructions and suboptimal detection performance. We present CADD, the first conditional diffusion model for normative modeling in 3D images. To guide the healthy restoration process, we propose a novel inference inpainting strategy which balances anomaly removal with retention of subject-specific features. Evaluated on three challenging datasets, including clinical scans, which may have lower contrast, thicker slices, and motion artifacts, CADD achieves state-of-the-art performance in detecting neurological abnormalities in heterogeneous cohorts.

[578] FCDM: A Physics-Guided Bidirectional Frequency Aware Convolution and Diffusion-Based Model for Sinogram Inpainting

Jiaze E, Srutarshi Banerjee, Tekin Bicer, Guannan Wang, Yanfu Zhang, Bin Ren

Main category: eess.IV

TL;DR: A diffusion-based framework for sparse-view CT sinogram restoration, addressing structured signal loss with physics-guided constraints and outperforming baselines.

DetailsMotivation: Sparse-view CT reduces radiation and scan time but causes incomplete sinograms with structured signal loss, which standard inpainting models fail to address due to neglecting angular dependencies and physical consistency.

Method: Proposes a diffusion-based framework with bidirectional frequency reasoning, angular-aware masking, physics-guided constraints, and frequency-adaptive noise control.

Result: Achieves SSIM over 0.93 and PSNR above 31 dB, outperforming baselines in synthetic and real-world datasets.

Conclusion: The proposed method effectively restores sinograms, addressing the limitations of standard inpainting models and improving reconstruction accuracy.

Abstract: Computed tomography (CT) is widely used in scientific and medical imaging, but acquiring full-view sinograms requires high radiation dose and long scan times. Sparse-view CT alleviates this burden but yields incomplete sinograms with structured signal loss, hampering accurate reconstruction. Unlike RGB images, sinograms encode overlapping features along projection paths and exhibit directional spectral patterns. Standard inpainting models overlook these properties, treating missing data as local holes and neglecting angular dependencies and physical consistency. We propose~\modelname, a diffusion-based framework tailored for sinograms, which restores global structure through bidirectional frequency reasoning and angular-aware masking, while enforcing physical plausibility via physics-guided constraints and frequency-adaptive noise control. Experiments on synthetic and real-world datasets show that~\modelname~consistently outperforms baselines, achieving SSIM over 0.93 and PSNR above 31 dB across diverse sparse-view scenarios.

[579] Physical Degradation Model-Guided Interferometric Hyperspectral Reconstruction with Unfolding Transformer

Yuansheng Li, Yunhao Zou, Linwei Chen, Ying Fu

Main category: eess.IV

TL;DR: A novel IHI reconstruction pipeline addresses training data scarcity and degradation challenges using a physics-based degradation model and the IHRUT transformer, outperforming existing methods.

DetailsMotivation: IHI's potential is limited by complex errors and lack of training data, hindering performance enhancement.

Method: Proposes a simplified IHI degradation model for dataset synthesis and the IHRUT transformer for spectral correction and detail restoration.

Result: Demonstrates superior performance and generalization, validated experimentally.

Conclusion: The method effectively bridges IHI reconstruction with deep learning, offering a practical solution for remote sensing.

Abstract: Interferometric Hyperspectral Imaging (IHI) is a critical technique for large-scale remote sensing tasks due to its advantages in flux and spectral resolution. However, IHI is susceptible to complex errors arising from imaging steps, and its quality is limited by existing signal processing-based reconstruction algorithms. Two key challenges hinder performance enhancement:

  1. the lack of training datasets. 2) the difficulty in eliminating IHI-specific degradation components through learning-based methods. To address these challenges, we propose a novel IHI reconstruction pipeline. First, based on imaging physics and radiometric calibration data, we establish a simplified yet accurate IHI degradation model and a parameter estimation method. This model enables the synthesis of realistic IHI training datasets from hyperspectral images (HSIs), bridging the gap between IHI reconstruction and deep learning. Second, we design the Interferometric Hyperspectral Reconstruction Unfolding Transformer (IHRUT), which achieves effective spectral correction and detail restoration through a stripe-pattern enhancement mechanism and a spatial-spectral transformer architecture. Experimental results demonstrate the superior performance and generalization capability of our method.The code and are available at https://github.com/bit1120203554/IHRUT.

[580] Topology Optimization in Medical Image Segmentation with Fast Euler Characteristic

Liu Li, Qiang Ma, Cheng Ouyang, Johannes C. Paetzold, Daniel Rueckert, Bernhard Kainz

Main category: eess.IV

TL;DR: A novel topology-aware segmentation method using Euler Characteristic (χ) improves topological correctness in medical images while maintaining pixel-wise accuracy.

DetailsMotivation: Conventional metrics like Dice score fail to ensure clinically acceptable topological accuracy, such as continuous boundaries or closed surfaces, which is critical in medical imaging.

Method: Proposes a fast χ computation for 2D/3D data, uses χ error as a metric, identifies topological violations via a map, and refines segmentation with a correction network.

Result: Experiments on 2D/3D datasets show significant improvement in topological correctness without compromising pixel-wise accuracy.

Conclusion: The method effectively addresses topological constraints in segmentation, offering a practical solution for clinical applications.

Abstract: Deep learning-based medical image segmentation techniques have shown promising results when evaluated based on conventional metrics such as the Dice score or Intersection-over-Union. However, these fully automatic methods often fail to meet clinically acceptable accuracy, especially when topological constraints should be observed, e.g., continuous boundaries or closed surfaces. In medical image segmentation, the correctness of a segmentation in terms of the required topological genus sometimes is even more important than the pixel-wise accuracy. Existing topology-aware approaches commonly estimate and constrain the topological structure via the concept of persistent homology (PH). However, these methods are difficult to implement for high dimensional data due to their polynomial computational complexity. To overcome this problem, we propose a novel and fast approach for topology-aware segmentation based on the Euler Characteristic ($\chi$). First, we propose a fast formulation for $\chi$ computation in both 2D and 3D. The scalar $\chi$ error between the prediction and ground-truth serves as the topological evaluation metric. Then we estimate the spatial topology correctness of any segmentation network via a so-called topological violation map, i.e., a detailed map that highlights regions with $\chi$ errors. Finally, the segmentation results from the arbitrary network are refined based on the topological violation maps by a topology-aware correction network. Our experiments are conducted on both 2D and 3D datasets and show that our method can significantly improve topological correctness while preserving pixel-wise segmentation accuracy.

[581] Predicting EGFR Mutation in LUAD from Histopathological Whole-Slide Images Using Pretrained Foundation Model and Transfer Learning: An Indian Cohort Study

Sagar Singh Gwal, Rajan, Suyash Devgan, Shraddhanjali Satapathy, Abhishek Goyal, Nuruddin Mohammad Iqbal, Vivaan Jain, Prabhat Singh Mallik, Deepali Jain, Ishaan Gupta

Main category: eess.IV

TL;DR: A deep learning framework using vision transformers and attention-based multiple instance learning predicts EGFR mutation status in lung adenocarcinoma from H&E-stained slides with high accuracy.

DetailsMotivation: Predicting EGFR mutation status is crucial for clinical decision-making in lung adenocarcinoma, especially in Southeast Asian populations with higher mutation incidence.

Method: A DL framework combining vision transformers (ViT) and attention-based multiple instance learning (ABMIL) was trained on an Indian cohort (170 WSI) and tested on internal (30 WSI) and external (TCGA, 86 WSI) datasets.

Result: The model achieved AUCs of 0.933 (internal) and 0.965 (external), outperforming prior studies and demonstrating feasibility in resource-limited settings.

Conclusion: The study shows that routine pathology slides can accurately predict EGFR mutations using foundation models and ABMIL, aiding clinical decisions.

Abstract: Lung adenocarcinoma (LUAD) is a subtype of non-small cell lung cancer (NSCLC). LUAD with mutation in the EGFR gene accounts for approximately 46% of LUAD cases. Patients carrying EGFR mutations can be treated with specific tyrosine kinase inhibitors (TKIs). Hence, predicting EGFR mutation status can help in clinical decision making. H&E-stained whole slide imaging (WSI) is a routinely performed screening procedure for cancer staging and subtyping, especially affecting the Southeast Asian populations with significantly higher incidence of the mutation when compared to Caucasians (39-64% vs 7-22%). Recent progress in AI models has shown promising results in cancer detection and classification. In this study, we propose a deep learning (DL) framework built on vision transformers (ViT) based pathology foundation model and attention-based multiple instance learning (ABMIL) architecture to predict EGFR mutation status from H&E WSI. The developed pipeline was trained using data from an Indian cohort (170 WSI) and evaluated across two independent datasets: Internal test (30 WSI from Indian cohort) set, and an external test set from TCGA (86 WSI). The model shows consistent performance across both datasets, with AUCs of 0.933 (+/-0.010), and 0.965 (+/-0.015) for the internal and external test sets respectively. This proposed framework can be efficiently trained on small datasets, achieving superior performance as compared to several prior studies irrespective of training domain. The current study demonstrates the feasibility of accurately predicting EGFR mutation status using routine pathology slides, particularly in resource-limited settings using foundation models and attention-based multiple instance learning.

[582] Identifying actionable driver mutations in lung cancer using an efficient Asymmetric Transformer Decoder

Biagio Brattoli, Jack Shi, Jongchan Park, Taebum Lee, Donggeun Yoo, Sergio Pereira

Main category: eess.IV

TL;DR: The paper evaluates MIL techniques to detect six NSCLC driver mutations and introduces an Asymmetric Transformer Decoder for improved performance, outperforming existing models by 3-4%.

DetailsMotivation: Limited genetic testing availability and lengthy turnaround times hinder NSCLC treatment decisions, prompting the need for ML-based CPath solutions.

Method: Uses Multiple Instance Learning (MIL) and introduces an Asymmetric Transformer Decoder with varying query and key-value dimensions to minimize overfitting. Incorporates tissue type directly into the model.

Result: Outperforms top MIL models by 3% on average and over 4% for rare mutations like ERBB2 and BRAF.

Conclusion: The proposed ML method advances practical alternatives to standard genetic testing for NSCLC.

Abstract: Identifying actionable driver mutations in non-small cell lung cancer (NSCLC) can impact treatment decisions and significantly improve patient outcomes. Despite guideline recommendations, broader adoption of genetic testing remains challenging due to limited availability and lengthy turnaround times. Machine Learning (ML) methods for Computational Pathology (CPath) offer a potential solution; however, research often focuses on only one or two common mutations, limiting the clinical value of these tools and the pool of patients who can benefit from them. This study evaluates various Multiple Instance Learning (MIL) techniques to detect six key actionable NSCLC driver mutations: ALK, BRAF, EGFR, ERBB2, KRAS, and MET ex14. Additionally, we introduce an Asymmetric Transformer Decoder model that employs queries and key-values of varying dimensions to maintain a low query dimensionality. This approach efficiently extracts information from patch embeddings and minimizes overfitting risks, proving highly adaptable to the MIL setting. Moreover, we present a method to directly utilize tissue type in the model, addressing a typical MIL limitation where either all regions or only some specific regions are analyzed, neglecting biological relevance. Our method outperforms top MIL models by an average of 3%, and over 4% when predicting rare mutations such as ERBB2 and BRAF, moving ML-based tests closer to being practical alternatives to standard genetic testing.

Last updated: 2025-08-22
Built with Hugo, theme modified on Stack