Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 65]
- cs.CV [Total: 151]
- cs.AI [Total: 36]
- cs.SD [Total: 8]
- cs.LG [Total: 99]
- cs.MA [Total: 4]
- cs.MM [Total: 3]
- eess.AS [Total: 13]
- eess.IV [Total: 15]
cs.CL
[1] A Unifying Scheme for Extractive Content Selection Tasks
Shmuel Amar, Ori Shapira, Aviv Slobodkin, Ido Dagan
Main category: cs.CL
TL;DR: This paper introduces a unified framework called instruction-guided content selection (IGCS) for NLP tasks involving text span selection, along with a benchmark and synthetic dataset that improve performance across diverse content selection tasks through transfer learning.
Details
Motivation: Content selection tasks in NLP (selecting relevant text spans from source texts) have been studied in isolation with separate modeling approaches, datasets, and evaluation metrics, despite sharing the same core objective. There is a need for a unified framework to handle these diverse but related tasks.
Method: The authors propose instruction-guided content selection (IGCS) as a unified framework where task definitions and instance-specific requests are provided as instructions to language models. They create IGCSBench (a unified benchmark) and develop a large synthetic dataset for training. They also address LLM inference issues and evaluation metrics for content selection tasks. A brief illustrative code sketch follows the abstract below.
Result: Transfer learning with the synthetic datasets often boosts performance across various content selection tasks, regardless of whether dedicated training data is available. The unified framework and benchmark successfully demonstrate utility across diverse content selection scenarios.
Conclusion: The IGCS framework, IGCSBench benchmark, and synthetic datasets provide valuable resources for future content selection models. The unified approach shows promise for improving performance across diverse content selection tasks through transfer learning and standardized evaluation.
Abstract: A broad range of NLP tasks involve selecting relevant text spans from given source texts. Despite this shared objective, such content selection tasks have traditionally been studied in isolation, each with its own modeling approaches, datasets, and evaluation metrics. In this work, we propose instruction-guided content selection (IGCS) as a beneficial unified framework for such settings, where the task definition and any instance-specific request are encapsulated as instructions to a language model. To promote this framework, we introduce IGCSBench, the first unified benchmark covering diverse content selection tasks. Further, we create a large generic synthetic dataset that can be leveraged for diverse content selection tasks, and show that transfer learning with these datasets often boosts performance, whether dedicated training for the targeted task is available or not. Finally, we address generic inference time issues that arise in LLM-based modeling of content selection, assess a generic evaluation metric, and overall propose the utility of our resources and methods for future content selection models. Models and datasets available at https://github.com/shmuelamar/igcs.
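To make the IGCS framing concrete, here is a minimal sketch of how a content selection task might be posed as an instruction to an LLM and the returned spans verified against the source. The prompt wording and the commented-out `call_llm` client are illustrative assumptions, not the authors' implementation.

```python
import json

def build_igcs_prompt(task_definition: str, request: str, source_text: str) -> str:
    """Wrap a content-selection task as a single instruction (illustrative format)."""
    return (
        f"Task: {task_definition}\n"
        f"Request: {request}\n"
        f"Source:\n{source_text}\n\n"
        "Return a JSON list of exact spans copied verbatim from the source."
    )

def parse_and_validate_spans(llm_output: str, source_text: str) -> list[str]:
    """Keep only spans that actually occur in the source, a common inference-time fix."""
    try:
        spans = json.loads(llm_output)
    except json.JSONDecodeError:
        return []
    return [s for s in spans if isinstance(s, str) and s in source_text]

# Usage with a hypothetical LLM client:
# prompt = build_igcs_prompt("Extract evidence sentences.",
#                            "Which spans support the claim about X?", document)
# spans = parse_and_validate_spans(call_llm(prompt), document)
```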
[2] AI-based Clinical Decision Support for Primary Care: A Real-World Study
Robert Korom, Sarah Kiptinness, Najib Adan, Kassim Said, Catherine Ithuli, Oliver Rotich, Boniface Kimani, Irene King’ori, Stellah Kamau, Elizabeth Atemba, Muna Aden, Preston Bowman, Michael Sharman, Rebecca Soskin Hicks, Rebecca Distler, Johannes Heidecke, Rahul K. Arora, Karan Singhal
Main category: cs.CL
TL;DR: A study evaluating AI Consult, an LLM-based clinical decision support tool, showed 16% fewer diagnostic errors and 13% fewer treatment errors across 39,849 patient visits in Kenyan primary care clinics, with all participating clinicians reporting improved care quality.
Details
Motivation: To evaluate the real-world impact of large language model-based clinical decision support systems in reducing medical errors and improving patient care quality in live clinical settings, particularly in resource-constrained environments like primary care clinics in Kenya.
Method: Conducted a quality improvement study across 15 primary care clinics in Nairobi, Kenya, comparing 39,849 patient visits with and without AI Consult access. Independent physicians rated visits to identify clinical errors, and clinicians were surveyed about their experience with the tool. The AI system was designed to integrate into existing workflows while preserving clinician autonomy.
Result: Clinicians using AI Consult made 16% fewer diagnostic errors and 13% fewer treatment errors compared to those without access. Introducing the tool would avert diagnostic errors in roughly 22,000 visits and treatment errors in roughly 29,000 visits annually at Penda Health alone. All surveyed clinicians reported improved care quality, with 75% describing the improvement as “substantial”.
Conclusion: LLM-based clinical decision support tools can significantly reduce medical errors in real-world clinical settings when properly integrated into clinician workflows. The study provides a practical framework for responsible adoption of AI in healthcare, demonstrating that such tools can enhance rather than replace clinical decision-making while maintaining clinician autonomy.
Abstract: We evaluate the impact of large language model-based clinical decision support in live care. In partnership with Penda Health, a network of primary care clinics in Nairobi, Kenya, we studied AI Consult, a tool that serves as a safety net for clinicians by identifying potential documentation and clinical decision-making errors. AI Consult integrates into clinician workflows, activating only when needed and preserving clinician autonomy. We conducted a quality improvement study, comparing outcomes for 39,849 patient visits performed by clinicians with or without access to AI Consult across 15 clinics. Visits were rated by independent physicians to identify clinical errors. Clinicians with access to AI Consult made relatively fewer errors: 16% fewer diagnostic errors and 13% fewer treatment errors. In absolute terms, the introduction of AI Consult would avert diagnostic errors in 22,000 visits and treatment errors in 29,000 visits annually at Penda alone. In a survey of clinicians with AI Consult, all clinicians said that AI Consult improved the quality of care they delivered, with 75% saying the effect was “substantial”. These results required a clinical workflow-aligned AI Consult implementation and active deployment to encourage clinician uptake. We hope this study demonstrates the potential for LLM-based clinical decision support tools to reduce errors in real-world settings and provides a practical framework for advancing responsible adoption.
[3] Harnessing RLHF for Robust Unanswerability Recognition and Trustworthy Response Generation in LLMs
Shuyuan Lin, Lei Duan, Philip Hughes, Yuxuan Sheng
Main category: cs.CL
TL;DR: This paper introduces SALU (Self-Aware LLM for Unanswerability), a novel approach that integrates unanswerability detection directly into LLMs to prevent hallucinated responses in conversational information retrieval systems by training the model to know when to abstain from answering.
Details
Motivation: Conversational Information Retrieval systems face a critical challenge in reliably handling unanswerable questions to prevent generating misleading or hallucinated content. Traditional approaches using external classifiers introduce inconsistencies with core generative LLMs, creating a need for better integrated solutions.
Method: SALU uses a multi-task learning framework that trains LLMs for both standard Question Answering and explicit abstention generation for unanswerable queries. The key innovation is a confidence-score-guided reinforcement learning with human feedback (RLHF) phase that explicitly penalizes hallucinated responses and rewards appropriate abstentions, fostering intrinsic self-awareness of knowledge boundaries. A brief illustrative code sketch follows the abstract below.
Result: SALU consistently outperforms strong baselines, including hybrid LLM-classifier systems, in overall accuracy for correctly answering or abstaining from questions on the custom C-IR_Answerability dataset. Human evaluation confirms superior reliability with high scores in factuality, appropriate abstention, and a dramatic reduction in hallucination.
Conclusion: SALU successfully demonstrates the ability to robustly “know when to say ‘I don’t know’” by deeply integrating unanswerability detection within the LLM’s generative process, offering a more reliable solution for conversational information retrieval systems compared to traditional external classifier approaches.
Abstract: Conversational Information Retrieval (CIR) systems, while offering intuitive access to information, face a significant challenge: reliably handling unanswerable questions to prevent the generation of misleading or hallucinated content. Traditional approaches often rely on external classifiers, which can introduce inconsistencies with the core generative Large Language Models (LLMs). This paper introduces Self-Aware LLM for Unanswerability (SALU), a novel approach that deeply integrates unanswerability detection directly within the LLM’s generative process. SALU is trained using a multi-task learning framework for both standard Question Answering (QA) and explicit abstention generation for unanswerable queries. Crucially, it incorporates a confidence-score-guided reinforcement learning with human feedback (RLHF) phase, which explicitly penalizes hallucinated responses and rewards appropriate abstentions, fostering intrinsic self-awareness of knowledge boundaries. Through extensive experiments on our custom-built C-IR_Answerability dataset, SALU consistently outperforms strong baselines, including hybrid LLM-classifier systems, in overall accuracy for correctly answering or abstaining from questions. Human evaluation further confirms SALU’s superior reliability, achieving high scores in factuality, appropriate abstention, and, most importantly, a dramatic reduction in hallucination, demonstrating its ability to robustly “know when to say ‘I don’t know’.”
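A minimal sketch of the kind of confidence-guided reward signal described above, assuming binary gold answerability labels and a model-reported confidence score; the weighting constants are illustrative choices, not values from the paper.

```python
def abstention_reward(is_answerable: bool, abstained: bool,
                      answer_correct: bool, confidence: float) -> float:
    """Toy reward: penalize confident answers to unanswerable queries (hallucinations),
    reward appropriate abstention and correct answers."""
    if not is_answerable:
        # Abstaining on unanswerable queries is rewarded; answering is penalized,
        # more strongly when the model was confident.
        return 1.0 if abstained else -1.0 - confidence
    if abstained:
        return -0.5          # over-abstention on answerable queries
    return 1.0 if answer_correct else -confidence

# Example: confidently answering an unanswerable question is the worst case.
print(abstention_reward(is_answerable=False, abstained=False,
                        answer_correct=False, confidence=0.9))  # -1.9
```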
[4] Text-to-SPARQL Goes Beyond English: Multilingual Question Answering Over Knowledge Graphs through Human-Inspired Reasoning
Aleksandr Perevalov, Andreas Both
Main category: cs.CL
TL;DR: mKGQAgent is a human-inspired framework that converts natural language questions into SPARQL queries using coordinated LLM agents for multilingual knowledge graph question answering, achieving first place in Text2SPARQL challenge 2025.
Details
Motivation: Accessing knowledge via multilingual natural-language interfaces is challenging in information retrieval. Prior approaches combined components to solve downstream tasks but lacked modularity and interpretability in converting natural language to SPARQL queries for knowledge graphs.
Method: The paper introduces mKGQAgent, a modular framework that breaks down natural language to SPARQL conversion into interpretable subtasks. It uses a coordinated LLM agent workflow for planning, entity linking, and query refinement, guided by an experience pool for in-context learning to handle multilingual KGQA efficiently.
Result: mKGQAgent achieved first place among participants in the Text2SPARQL challenge 2025, evaluated on DBpedia-based and Corporate-based KGQA benchmarks, demonstrating superior performance in multilingual knowledge graph question answering.
Conclusion: The work successfully demonstrates that human-inspired modular approaches with coordinated LLM agents can effectively handle multilingual semantic parsing, opening new avenues for developing human-like reasoning systems in multilingual KGQA tasks.
Abstract: Accessing knowledge via multilingual natural-language interfaces is one of the emerging challenges in the field of information retrieval and related ones. Structured knowledge stored in knowledge graphs can be queried via a specific query language (e.g., SPARQL). Therefore, one needs to transform natural-language input into a query to fulfill an information need. Prior approaches mostly focused on combining components (e.g., rule-based or neural-based) that solve downstream tasks and come up with an answer at the end. We introduce mKGQAgent, a human-inspired framework that breaks down the task of converting natural language questions into SPARQL queries into modular, interpretable subtasks. By leveraging a coordinated LLM agent workflow for planning, entity linking, and query refinement - guided by an experience pool for in-context learning - mKGQAgent efficiently handles multilingual KGQA. Evaluated on the DBpedia- and Corporate-based KGQA benchmarks within the Text2SPARQL challenge 2025, our approach took first place among the other participants. This work opens new avenues for developing human-like reasoning systems in multilingual semantic parsing.
[5] Leveraging Synthetic Data for Question Answering with Multilingual LLMs in the Agricultural Domain
Rishemjit Kaur, Arshdeep Singh Bhankhar, Surangika Ranathunga, Jashanpreet Singh Salh, Sudhir Rajput, Vidhi, Kashish Mahendra, Bhavika Berwal, Ritesh Kumar
Main category: cs.CL
TL;DR: This paper develops multilingual agricultural QA systems by fine-tuning language-specific LLMs using synthetic datasets generated from agriculture documents in English, Hindi, and Punjabi, achieving better accuracy and relevance compared to general-purpose LLMs.
Details
Motivation: General-purpose LLMs provide generic agricultural advice lacking precision in local and multilingual contexts due to insufficient domain-specific training and scarcity of high-quality, region-specific datasets. Farmers need accurate agriculture-related information in their native languages for agricultural success.
Method: The authors generate multilingual synthetic agricultural datasets from agriculture-specific documents in English, Hindi, and Punjabi, then fine-tune language-specific LLMs on these datasets to create specialized agricultural QA systems.
Result: Evaluation on curated multilingual datasets shows significant improvements in factual accuracy, relevance, and agricultural consensus for the fine-tuned models compared to baseline general-purpose LLMs, particularly in multilingual and low-resource settings.
Conclusion: Synthetic data-driven, language-specific fine-tuning is an effective strategy for improving LLM performance in agriculture. This approach enables more accurate and localized agricultural advisory services, helping bridge the knowledge gap in AI-driven agricultural solutions for diverse linguistic communities.
Abstract: Enabling farmers to access accurate agriculture-related information in their native languages in a timely manner is crucial for the success of the agriculture field. Although large language models (LLMs) can be used to implement Question Answering (QA) systems, simply using publicly available general-purpose LLMs in agriculture typically offer generic advisories, lacking precision in local and multilingual contexts due to insufficient domain-specific training and scarcity of high-quality, region-specific datasets. Our study addresses these limitations by generating multilingual synthetic agricultural datasets (English, Hindi, Punjabi) from agriculture-specific documents and fine-tuning language-specific LLMs. Our evaluation on curated multilingual datasets demonstrates significant improvements in factual accuracy, relevance, and agricultural consensus for the fine-tuned models compared to their baseline counterparts. These results highlight the efficacy of synthetic data-driven, language-specific fine-tuning as an effective strategy to improve the performance of LLMs in agriculture, especially in multilingual and low-resource settings. By enabling more accurate and localized agricultural advisory services, this study provides a meaningful step toward bridging the knowledge gap in AI-driven agricultural solutions for diverse linguistic communities.
[6] Obscured but Not Erased: Evaluating Nationality Bias in LLMs via Name-Based Bias Benchmarks
Giulio Pelosio, Devesh Batra, Noémie Bovey, Robert Hankache, Cristovao Iglesias, Greig Cowan, Raad Khraishi
Main category: cs.CL
TL;DR: This paper investigates nationality bias in Large Language Models by replacing explicit nationality labels with culturally indicative names, finding that smaller models exhibit more bias and lower accuracy than larger models, with biases persisting even when demographic markers are implicit.
Details
Motivation: Large Language Models can exhibit latent biases towards specific nationalities even without explicit demographic markers. The authors wanted to create a more realistic testing scenario by using culturally indicative names instead of explicit nationality labels, which better reflects real-world LLM applications.
Method: The researchers developed a novel name-based benchmarking approach derived from the Bias Benchmark for QA (BBQ) dataset, substituting explicit nationality labels with culturally indicative names. They tested this approach across multiple LLMs from OpenAI, Google, and Anthropic, measuring both bias magnitude and accuracy in ambiguous contexts. A brief illustrative code sketch follows the abstract below.
Result: Small models showed significantly more bias and lower accuracy compared to larger models. For example, Claude Haiku exhibited 9% stereotypical bias versus 3.5% for Claude Sonnet, with Sonnet outperforming by 117.7% in accuracy. Small models also retained more errors in ambiguous contexts (GPT-4o-mini retained 76% of error rate vs 68% for GPT-4o).
Conclusion: The research demonstrates the stubborn resilience of biases in LLMs, showing that nationality biases persist even when demographic information is presented implicitly through names rather than explicit labels. This has profound implications for AI system development and deployment in diverse, global contexts.
Abstract: Large Language Models (LLMs) can exhibit latent biases towards specific nationalities even when explicit demographic markers are not present. In this work, we introduce a novel name-based benchmarking approach derived from the Bias Benchmark for QA (BBQ) dataset to investigate the impact of substituting explicit nationality labels with culturally indicative names, a scenario more reflective of real-world LLM applications. Our novel approach examines how this substitution affects both bias magnitude and accuracy across a spectrum of LLMs from industry leaders such as OpenAI, Google, and Anthropic. Our experiments show that small models are less accurate and exhibit more bias compared to their larger counterparts. For instance, on our name-based dataset and in the ambiguous context (where the correct choice is not revealed), Claude Haiku exhibited the worst stereotypical bias scores of 9%, compared to only 3.5% for its larger counterpart, Claude Sonnet, where the latter also outperformed it by 117.7% in accuracy. Additionally, we find that small models retain a larger portion of existing errors in these ambiguous contexts. For example, after substituting names for explicit nationality references, GPT-4o retains 68% of the error rate versus 76% for GPT-4o-mini, with similar findings for other model providers, in the ambiguous context. Our research highlights the stubborn resilience of biases in LLMs, underscoring their profound implications for the development and deployment of AI systems in diverse, global contexts.
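As an illustration of the benchmarking idea, the sketch below swaps an explicit nationality label in a BBQ-style template for a culturally indicative name and tallies a simple stereotype-bias rate in ambiguous items. The example names, template, and scoring rule are assumptions for demonstration, not the authors' dataset.

```python
import random

# Hypothetical name lists; the paper's actual mappings are not given in the abstract.
NAMES_BY_NATIONALITY = {
    "French": ["Élodie", "Jean-Luc"],
    "Nigerian": ["Chinedu", "Amara"],
}

def substitute_names(template: str, nationality: str, rng: random.Random) -> str:
    """Replace an explicit nationality label with a culturally indicative name."""
    name = rng.choice(NAMES_BY_NATIONALITY[nationality])
    return template.replace(f"the {nationality} person", name)

def stereotype_bias_rate(predictions: list[str]) -> float:
    """In ambiguous contexts the correct answer is 'unknown'; choosing the
    stereotyped target counts toward the bias score."""
    biased = sum(1 for p in predictions if p == "stereotyped_target")
    return biased / len(predictions) if predictions else 0.0

rng = random.Random(0)
q = substitute_names("Yesterday the French person and the Nigerian person met a banker.",
                     "French", rng)
print(q)
print(stereotype_bias_rate(["unknown", "stereotyped_target", "unknown"]))
```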
[7] Multi-Label Classification with Generative AI Models in Healthcare: A Case Study of Suicidality and Risk Factors
Ming Huang, Zehan Li, Yan Hu, Wanjing Wang, Andrew Wen, Scott Lane, Salih Selek, Lokesh Shahani, Rodrigo Machado-Vieira, Jair Soares, Hua Xu, Hongfang Liu
Main category: cs.CL
TL;DR: This study uses generative large language models (GPT-3.5 and GPT-4.5) for multi-label classification of suicide-related factors from psychiatric electronic health records, achieving strong performance with finetuned GPT-3.5 (0.94 accuracy, 0.91 F1) and demonstrating the feasibility of AI-powered clinical classification for suicide prevention.
Details
Motivation: Suicide is a major global health crisis, with over 720,000 deaths annually. Early identification of suicide-related factors (including suicide ideation, attempts, exposure, and self-injury) is critical for intervention, but existing AI approaches treat suicidality as binary classification, missing the complexity of co-occurring risk factors in clinical settings.
Method: The researchers developed a novel end-to-end generative multi-label classification pipeline using GPT-3.5 and GPT-4.5 models on psychiatric electronic health records. They introduced advanced evaluation methods including label set level metrics and multilabel confusion matrices for comprehensive error analysis of suicide-related factor detection. A brief illustrative code sketch follows the abstract below.
Result: Finetuned GPT-3.5 achieved top performance with 0.94 partial match accuracy and 0.91 F1 score. GPT-4.5 with guided prompting showed superior performance across label sets, particularly for rare/minority labels, indicating more balanced and robust classification. The analysis revealed systematic error patterns like conflation of suicide ideation and attempts.
Conclusion: The study demonstrates the feasibility of using generative AI for complex clinical classification tasks and provides a blueprint for structuring unstructured EHR data to support large-scale clinical research and evidence-based medicine in suicide prevention and mental health care.
Abstract: Suicide remains a pressing global health crisis, with over 720,000 deaths annually and millions more affected by suicide ideation (SI) and suicide attempts (SA). Early identification of suicidality-related factors (SrFs), including SI, SA, exposure to suicide (ES), and non-suicidal self-injury (NSSI), is critical for timely intervention. While prior studies have applied AI to detect SrFs in clinical notes, most treat suicidality as a binary classification task, overlooking the complexity of cooccurring risk factors. This study explores the use of generative large language models (LLMs), specifically GPT-3.5 and GPT-4.5, for multi-label classification (MLC) of SrFs from psychiatric electronic health records (EHRs). We present a novel end to end generative MLC pipeline and introduce advanced evaluation methods, including label set level metrics and a multilabel confusion matrix for error analysis. Finetuned GPT-3.5 achieved top performance with 0.94 partial match accuracy and 0.91 F1 score, while GPT-4.5 with guided prompting showed superior performance across label sets, including rare or minority label sets, indicating a more balanced and robust performance. Our findings reveal systematic error patterns, such as the conflation of SI and SA, and highlight the models tendency toward cautious over labeling. This work not only demonstrates the feasibility of using generative AI for complex clinical classification tasks but also provides a blueprint for structuring unstructured EHR data to support large scale clinical research and evidence based medicine.
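The sketch below shows one way to compute label-set-level scores and a multilabel confusion matrix with scikit-learn on toy data. The partial-match accuracy here is a per-sample overlap reading and is an assumption; it may differ from the paper's exact definition.

```python
import numpy as np
from sklearn.metrics import f1_score, multilabel_confusion_matrix

LABELS = ["SI", "SA", "ES", "NSSI"]  # ideation, attempt, exposure, self-injury

# Toy gold/predicted indicator matrices: rows = clinical notes, columns = labels.
y_true = np.array([[1, 0, 0, 0],
                   [1, 1, 0, 0],
                   [0, 0, 1, 1]])
y_pred = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 0],
                   [0, 0, 1, 0]])

# Micro-averaged F1 over all label decisions.
print("micro F1:", f1_score(y_true, y_pred, average="micro"))

# One reading of partial-match accuracy: a sample counts if gold and predicted
# label sets share at least one label (or are both empty).
t_bool, p_bool = y_true.astype(bool), y_pred.astype(bool)
partial = np.mean([(t & p).any() or (not t.any() and not p.any())
                   for t, p in zip(t_bool, p_bool)])
print("partial-match accuracy:", partial)

# Per-label confusion matrices for error analysis (e.g., SI/SA conflation).
for label, cm in zip(LABELS, multilabel_confusion_matrix(y_true, y_pred)):
    print(label, cm.ravel())  # tn, fp, fn, tp
```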
[8] Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?
Arduin Findeis, Floris Weers, Guoli Yin, Ke Ye, Ruoming Pang, Tom Gunter
Main category: cs.CL
TL;DR: This paper proposes a tool-using agentic system that augments AI annotators with web search and code execution capabilities to improve pairwise preference evaluation of LLM responses, particularly for long-form factual, math, and code tasks.
Details
Motivation: Standard pairwise preference evaluation by human or AI annotators can be problematic for certain domains - annotators may focus on writing quality rather than factual accuracy in responses with many factual statements, and high-quality comparisons are difficult to obtain for complex tasks like math and code.
Method: The authors develop a tool-using agentic system that enhances AI annotators with external validation tools including web search and code execution. This system grounds evaluations in external sources rather than relying solely on the LLM’s internal knowledge and biases. A brief illustrative code sketch follows the abstract below.
Result: Extensive experiments across RewardBench, RewardMath, and three new datasets show that external tools improve performance in many cases for long-form factual, math, and code evaluation tasks. However, results also reveal sensitivity to simple parameters like prompts and highlight limitations of current benchmarks.
Conclusion: External tools can enhance AI annotator performance for challenging evaluation domains, but the approach has limitations and performance is sensitive to implementation details. The work emphasizes the need for improved, non-saturated annotator benchmarks.
Abstract: Pairwise preferences over model responses are widely collected to evaluate and provide feedback to large language models (LLMs). Given two alternative model responses to the same input, a human or AI annotator selects the “better” response. This approach can provide feedback for domains where other hard-coded metrics are difficult to obtain (e.g., chat response quality), thereby helping model evaluation or training. However, for some domains high-quality pairwise comparisons can be tricky to obtain - from AI and humans. For example, for responses with many factual statements, annotators may disproportionately weigh writing quality rather than underlying facts. In this work, we explore augmenting standard AI annotator systems with additional tools to improve performance on three challenging response domains: long-form factual, math and code tasks. We propose a tool-using agentic system to provide higher quality feedback on these domains. Our system uses web-search and code execution to ground itself based on external validation, independent of the LLM’s internal knowledge and biases. We provide extensive experimental results evaluating our method across the three targeted response domains as well as general annotation tasks, using RewardBench (incl. AlpacaEval and LLMBar), RewardMath, as well as three new datasets for domains with saturated pre-existing datasets. Our results indicate that external tools can indeed improve performance in many, but not all, cases. More generally, our experiments highlight the sensitivity of performance to simple parameters (e.g., prompt) and the need for improved (non-saturated) annotator benchmarks. We share our code at https://github.com/apple/ml-agent-evaluator.
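A schematic of a tool-augmented pairwise annotator along the lines described above. `web_search`, `run_code`, and `judge_llm` are placeholder stubs standing in for real integrations, so this is an assumed control flow rather than the released system.

```python
def web_search(query: str) -> str:          # placeholder for a real search API
    return f"[search results for: {query}]"

def run_code(snippet: str) -> str:          # placeholder for a sandboxed executor
    return "[execution output]"

def judge_llm(prompt: str) -> str:          # placeholder for the annotator LLM
    return "A"

def annotate_pair(question: str, response_a: str, response_b: str, domain: str) -> str:
    """Ground the comparison in external evidence before judging."""
    evidence = ""
    if domain == "factual":
        evidence = web_search(question)
    elif domain in ("math", "code"):
        evidence = run_code(response_a) + "\n" + run_code(response_b)
    prompt = (
        f"Question: {question}\n"
        f"Evidence:\n{evidence}\n"
        f"Response A:\n{response_a}\n\nResponse B:\n{response_b}\n"
        "Which response is better? Answer 'A' or 'B'."
    )
    return judge_llm(prompt)

print(annotate_pair("What year was the transistor invented?",
                    "1947", "1952", domain="factual"))
```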
[9] Evolutionary Feature-wise Thresholding for Binary Representation of NLP Embeddings
Soumen Sinha, Shahryar Rahnamayan, Azam Asilian Bidgoli
Main category: cs.CL
TL;DR: The paper proposes a Coordinate Search-based optimization framework to find optimal feature-specific thresholds for converting continuous text embeddings (like BERT) into binary representations, achieving better performance than traditional fixed-threshold binarization methods while maintaining storage and computational efficiency.
Details
Motivation: Large-scale NLP applications require efficient text embeddings with reduced storage and computational costs. Traditional binary conversion methods use fixed thresholds across all features, which is suboptimal. The authors aim to improve binary encoding performance by finding optimal thresholds for each individual feature rather than using a one-size-fits-all approach.
Method: The authors develop a Coordinate Search-based optimization framework that identifies optimal thresholds for each feature individually when converting continuous embeddings (such as BERT) into binary representations (barcodes). This replaces the common approach of using a single fixed threshold across all features with feature-specific optimized thresholds. A brief illustrative code sketch follows the abstract below.
Result: Extensive experiments and statistical tests across different NLP tasks and datasets show that binary embeddings generated using optimal feature-specific thresholds outperform traditional binarization methods in accuracy. The method demonstrates promising results in various NLP applications and proves to be versatile for application beyond just NLP embeddings.
Conclusion: The proposed Coordinate Search-based optimization framework successfully generates more accurate binary representations of text embeddings compared to traditional thresholding methods. The technique is versatile and applicable to any features beyond NLP, making it valuable for a wide range of machine learning applications where efficient binary representations are needed.
Abstract: Efficient text embedding is crucial for large-scale natural language processing (NLP) applications, where storage and computational efficiency are key concerns. In this paper, we explore how binary representations (barcodes) can be used instead of real-valued features for NLP embeddings derived from machine learning models such as BERT. Thresholding is a common method for converting continuous embeddings into binary representations, often using a fixed threshold across all features. We propose a Coordinate Search-based optimization framework that instead identifies the optimal threshold for each feature, demonstrating that feature-specific thresholds lead to improved performance in binary encoding. This ensures that the binary representations are both accurate and efficient, enhancing performance across various features. Our optimal barcode representations have shown promising results in various NLP applications, demonstrating their potential to transform text representation. We conducted extensive experiments and statistical tests on different NLP tasks and datasets to evaluate our approach and compare it to other thresholding methods. Binary embeddings generated using optimal thresholds found by our method outperform traditional binarization methods in accuracy. This technique for generating binary representations is versatile and can be applied to any features, not just limited to NLP embeddings, making it useful for a wide range of domains in machine learning applications.
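A minimal sketch of coordinate search over per-feature binarization thresholds, using nearest-neighbour classification accuracy on the binarized embeddings as the objective. The candidate grid, the choice of objective, and the toy data are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def binarize(X: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    return (X > thresholds).astype(np.uint8)

def objective(X: np.ndarray, y: np.ndarray, thresholds: np.ndarray) -> float:
    """Downstream quality of the binary codes (illustrative choice of objective)."""
    return cross_val_score(KNeighborsClassifier(n_neighbors=3),
                           binarize(X, thresholds), y, cv=3).mean()

def coordinate_search(X: np.ndarray, y: np.ndarray,
                      n_candidates: int = 7, sweeps: int = 2) -> np.ndarray:
    thresholds = np.median(X, axis=0)            # start from per-feature medians
    for _ in range(sweeps):
        for j in range(X.shape[1]):              # optimize one coordinate at a time
            candidates = np.quantile(X[:, j], np.linspace(0.1, 0.9, n_candidates))
            scores = []
            for c in candidates:
                trial = thresholds.copy()
                trial[j] = c
                scores.append(objective(X, y, trial))
            thresholds[j] = candidates[int(np.argmax(scores))]
    return thresholds

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))                    # stand-in for sentence embeddings
y = (X[:, 0] + X[:, 3] > 0).astype(int)          # toy labels
best = coordinate_search(X, y)
print("optimized score:", objective(X, y, best))
```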
[10] Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice
Shanbo Cheng, Yu Bao, Zhichao Huang, Yu Lu, Ningxin Peng, Lu Xu, Runsheng Yu, Rong Cao, Ting Han, Zeyang Li, Sitong Liu, Shengtao Ma, Shiguang Pan, Jiongchen Xiao, Nuo Xu, Meng Yang, Rong Ye, Yiming Yu, Ruofei Zhang, Wanyi Zhang, Wenhao Zhu, Liehao Zou, Lu Lu, Yuxuan Wang, Yonghui Wu
Main category: cs.CL
TL;DR: Seed-LiveInterpret 2.0 is an end-to-end simultaneous interpretation system that achieves high-quality, ultra-low-latency speech-to-speech translation with voice cloning, reducing latency by 70% compared to existing solutions while maintaining over 70% translation accuracy.
Details
Motivation: Existing simultaneous interpretation systems face critical challenges including poor transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and speech inflation in long-form content, limiting their practical deployment in product-level applications.
Method: The paper introduces a novel duplex speech-to-speech understanding-generating framework that combines large-scale pretraining with reinforcement learning to create an end-to-end SI model with voice cloning capabilities.
Result: The system achieves over 70% translation correctness in complex scenarios as validated by human interpreters, outperforms commercial SI solutions in translation quality, and reduces average latency from nearly 10 seconds to approximately 3 seconds (70% reduction).
Conclusion: Seed-LiveInterpret 2.0 successfully addresses the major limitations of simultaneous interpretation systems by delivering a product-level solution that significantly improves the balance between translation accuracy and latency, making real-time speech-to-speech interpretation more practically viable.
Abstract: Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry, with product-level automatic systems long plagued by intractable challenges: subpar transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and translated speech inflation, especially in long-form discourses. In this study, we introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities. As a fully operational product-level solution, Seed-LiveInterpret 2.0 tackles these challenges head-on through our novel duplex speech-to-speech understanding-generating framework. Experimental results demonstrate that through large-scale pretraining and reinforcement learning, the model achieves a significantly better balance between translation accuracy and latency, validated by human interpreters to exceed 70% correctness in complex scenarios. Notably, Seed-LiveInterpret 2.0 outperforms commercial SI solutions by significant margins in translation quality, while slashing the average latency of cloned speech from nearly 10 seconds to a near-real-time 3 seconds, a roughly 70% reduction that drastically enhances practical usability.
[11] CogDual: Enhancing Dual Cognition of LLMs via Reinforcement Learning with Implicit Rule-Based Rewards
Cheng Liu, Yifei Lu, Fanghua Ye, Jian Li, Xingyu Chen, Feiliang Ren, Zhaopeng Tu, Xiaolong Li
Main category: cs.CL
TL;DR: CogDual is a novel Role-Playing Language Agent that uses a “cognize-then-respond” reasoning paradigm inspired by cognitive psychology, combining external situational awareness and internal self-awareness to improve character consistency and contextual alignment in role-playing tasks.
Details
Motivation: Existing Role-Playing Language Agents (RPLAs) rely on prompt engineering or supervised fine-tuning but neglect the underlying cognitive mechanisms that drive character behaviors, leading to limitations in character consistency and contextual understanding.
Method: CogDual adopts a “cognize-then-respond” reasoning paradigm that jointly models external situational awareness and internal self-awareness. The approach is further optimized using reinforcement learning with two general-purpose reward schemes designed for open-domain text generation.
Result: Extensive experiments on CoSER benchmark, Cross-MR, and LifeChoice datasets show that CogDual consistently outperforms existing baselines and demonstrates effective generalization across diverse role-playing tasks.
Conclusion: CogDual successfully addresses the limitations of existing RPLAs by incorporating cognitive psychology principles, resulting in improved character consistency and contextual alignment that generalizes well across different role-playing scenarios.
Abstract: Role-Playing Language Agents (RPLAs) have emerged as a significant application direction for Large Language Models (LLMs). Existing approaches typically rely on prompt engineering or supervised fine-tuning to enable models to imitate character behaviors in specific scenarios, but often neglect the underlying cognitive mechanisms driving these behaviors. Inspired by cognitive psychology, we introduce CogDual, a novel RPLA adopting a cognize-then-respond reasoning paradigm. By jointly modeling external situational awareness and internal self-awareness, CogDual generates responses with improved character consistency and contextual alignment. To further optimize the performance, we employ reinforcement learning with two general-purpose reward schemes designed for open-domain text generation. Extensive experiments on the CoSER benchmark, as well as Cross-MR and LifeChoice, demonstrate that CogDual consistently outperforms existing baselines and generalizes effectively across diverse role-playing tasks.
[12] SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs
Zhiqiang Liu, Enpei Niu, Yin Hua, Mengshu Sun, Lei Liang, Huajun Chen, Wen Zhang
Main category: cs.CL
TL;DR: The paper introduces SKA-Bench, a comprehensive benchmark for evaluating large language models’ understanding of structured knowledge (KG, Table, KG+Text, Table+Text) through four fundamental capabilities: noise robustness, order insensitivity, information integration, and negative rejection.
Details
Motivation: Existing evaluations for structured knowledge understanding in LLMs are non-rigorous and focus on single types of structured knowledge, lacking comprehensive assessment of specific capabilities across different knowledge forms.
Method: The authors use a three-stage pipeline to construct SKA-Bench instances, each containing a question, an answer, positive knowledge units, and noisy knowledge units. The benchmark covers four structured knowledge forms (KG, Table, KG+Text, Table+Text) and evaluates four fundamental abilities through expanded testbeds. A brief illustrative code sketch follows the abstract below.
Result: Empirical evaluation of 8 representative LLMs (including DeepSeek-R1) reveals significant challenges in structured knowledge understanding, with performance affected by noise amount, knowledge unit order, and hallucination phenomena.
Conclusion: Current LLMs still struggle with structured knowledge understanding despite recent progress, and the proposed SKA-Bench provides a more rigorous framework for diagnosing these shortcomings across multiple knowledge types and fundamental capabilities.
Abstract: Although large language models (LLMs) have made significant progress in understanding Structured Knowledge (SK) like KG and Table, existing evaluations for SK understanding are non-rigorous (i.e., lacking evaluations of specific capabilities) and focus on a single type of SK. Therefore, we aim to propose a more comprehensive and rigorous structured knowledge understanding benchmark to diagnose the shortcomings of LLMs. In this paper, we introduce SKA-Bench, a Structured Knowledge Augmented QA Benchmark that encompasses four widely used structured knowledge forms: KG, Table, KG+Text, and Table+Text. We utilize a three-stage pipeline to construct SKA-Bench instances, which includes a question, an answer, positive knowledge units, and noisy knowledge units. To evaluate the SK understanding capabilities of LLMs in a fine-grained manner, we expand the instances into four fundamental ability testbeds: Noise Robustness, Order Insensitivity, Information Integration, and Negative Rejection. Empirical evaluations on 8 representative LLMs, including the advanced DeepSeek-R1, indicate that existing LLMs still face significant challenges in understanding structured knowledge, and their performance is influenced by factors such as the amount of noise, the order of knowledge units, and hallucination phenomenon. Our dataset and code are available at https://github.com/Lza12a/SKA-Bench.
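A sketch of how a benchmark instance of the kind described above might be represented and expanded into some of the testbeds. The field names and perturbations are illustrative assumptions about the released format, not the actual SKA-Bench schema.

```python
import random
from dataclasses import dataclass

@dataclass
class SKAInstance:
    question: str
    answer: str
    positive_units: list[str]            # knowledge units needed for the answer
    noisy_units: list[str]               # distractor units

def noise_robustness(inst: SKAInstance, n_noise: int) -> list[str]:
    """Mix increasing amounts of noise with the positive units."""
    return inst.positive_units + inst.noisy_units[:n_noise]

def order_insensitivity(inst: SKAInstance, seed: int) -> list[str]:
    """Shuffle unit order; answers should not change."""
    units = inst.positive_units + inst.noisy_units
    random.Random(seed).shuffle(units)
    return units

def negative_rejection(inst: SKAInstance) -> list[str]:
    """Only noise is provided; the expected behaviour is to refuse to answer."""
    return list(inst.noisy_units)

inst = SKAInstance(
    question="Which river flows through the capital listed for Austria?",
    answer="The Danube",
    positive_units=["row: Vienna | country: Austria | river: Danube"],
    noisy_units=["row: Bern | country: Switzerland | river: Aare"],
)
print(order_insensitivity(inst, seed=1))
```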
[13] FinGAIA: An End-to-End Benchmark for Evaluating AI Agents in Finance
Lingfeng Zeng, Fangqi Lou, Zixuan Wang, Jiajie Xu, Jinyi Niu, Mengping Li, Yifan Dong, Qi Qi, Wei Zhang, Ziwei Yang, Jun Han, Ruilun Feng, Ruiqi Hu, Lejie Zhang, Zhengbo Feng, Yicheng Ren, Xin Guo, Zhaowei Liu, Dongpo Cheng, Weige Cai, Liwen Zhang
Main category: cs.CL
TL;DR: This paper introduces FinGAIA, the first comprehensive benchmark for evaluating AI agents in financial domains, comprising 407 tasks across 7 financial sub-domains with 3 complexity levels, revealing that even the best AI agent (ChatGPT) only achieves 48.9% accuracy, significantly underperforming financial experts.
Details
Motivation: The rapid development of AI agents shows great potential for automating complex tasks, but their multi-step, multi-tool collaboration capabilities in the financial sector remain underexplored. There is a need for a comprehensive benchmark to evaluate AI agents' practical abilities specifically in financial domains.
Method: The authors created FinGAIA, an end-to-end benchmark with 407 meticulously crafted tasks spanning seven major financial sub-domains (securities, funds, banking, insurance, futures, trusts, and asset management). Tasks are organized into three hierarchical complexity levels: basic business analysis, asset decision support, and strategic risk management. They evaluated 10 mainstream AI agents in a zero-shot setting.
Result: ChatGPT achieved the highest overall accuracy of 48.9%, which is superior to non-professionals but still lags behind financial experts by over 35 percentage points. Error analysis identified five recurring failure patterns, including Cross-modal Alignment Deficiency, Financial Terminological Bias, and Operational Process Awareness Barrier.
Conclusion: The benchmark reveals significant gaps in AI agents’ financial domain capabilities compared to human experts. The identified failure patterns provide crucial directions for future research in developing more capable financial AI agents. This work establishes the first agent benchmark specifically for the financial domain to objectively assess and promote agent development in this critical field.
Abstract: The booming development of AI agents presents unprecedented opportunities for automating complex tasks across various domains. However, their multi-step, multi-tool collaboration capabilities in the financial sector remain underexplored. This paper introduces FinGAIA, an end-to-end benchmark designed to evaluate the practical abilities of AI agents in the financial domain. FinGAIA comprises 407 meticulously crafted tasks, spanning seven major financial sub-domains: securities, funds, banking, insurance, futures, trusts, and asset management. These tasks are organized into three hierarchical levels of scenario depth: basic business analysis, asset decision support, and strategic risk management. We evaluated 10 mainstream AI agents in a zero-shot setting. The best-performing agent, ChatGPT, achieved an overall accuracy of 48.9%, which, while superior to non-professionals, still lags financial experts by over 35 percentage points. Error analysis has revealed five recurring failure patterns: Cross-modal Alignment Deficiency, Financial Terminological Bias, Operational Process Awareness Barrier, among others. These patterns point to crucial directions for future research. Our work provides the first agent benchmark closely related to the financial domain, aiming to objectively assess and promote the development of agents in this crucial field. Partial data is available at https://github.com/SUFE-AIFLM-Lab/FinGAIA.
[14] Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge
Miaomiao Gao, Xiaoxiao Xiang, Yiwen Guo
Main category: cs.CL
TL;DR: The paper presents a Triple X speech recognition system using an encoder-adapter-LLM architecture that achieved second place in the Multi-Lingual Conversational Speech Language Modeling Challenge by combining text-based LLMs with domain-specific adaptations and multi-stage training on multilingual datasets.
Details
Motivation: To optimize speech recognition accuracy in multilingual conversational scenarios by leveraging the powerful reasoning capabilities of large language models while addressing the challenges of multilingual speech recognition in conversational contexts.
Method: An innovative encoder-adapter-LLM architecture that combines text-based large language models with domain-specific adaptations, implemented through a meticulously designed multi-stage training strategy using extensive multilingual audio datasets.
Result: The system achieved competitive Word Error Rate (WER) performance on both development and test sets, securing second place in the Multi-Lingual Conversational Speech Language Modeling Challenge ranking.
Conclusion: The proposed encoder-adapter-LLM architecture with multi-stage training effectively harnesses LLM reasoning capabilities for multilingual speech recognition, demonstrating strong performance in conversational speech scenarios and validating the approach’s effectiveness in the challenge setting.
Abstract: This paper describes our Triple X speech recognition system submitted to Task 1 of the Multi-Lingual Conversational Speech Language Modeling (MLC-SLM) Challenge. Our work focuses on optimizing speech recognition accuracy in multilingual conversational scenarios through an innovative encoder-adapter-LLM architecture. This framework harnesses the powerful reasoning capabilities of text-based large language models while incorporating domain-specific adaptations. To further enhance multilingual recognition performance, we adopted a meticulously designed multi-stage training strategy leveraging extensive multilingual audio datasets. Experimental results demonstrate that our approach achieves competitive Word Error Rate (WER) performance on both dev and test sets, obtaining second place in the challenge ranking.
[15] Speech as a Multimodal Digital Phenotype for Multi-Task LLM-based Mental Health Prediction
Mai Ali, Christopher Lucasius, Tanmay P. Patel, Madison Aitken, Jacob Vorstman, Peter Szatmari, Marco Battaglia, Deepa Kundur
Main category: cs.CL
TL;DR: This paper proposes a trimodal approach for depression detection using speech data that combines text, acoustic features, and vocal biomarkers with multi-task learning to simultaneously predict depression, suicidal ideation, and sleep disturbances, achieving 70.8% balanced accuracy.
Details
Motivation: Traditional speech-based depression detection treats speech as a single modality, missing the rich information available from multiple speech components. Adolescent depression often co-occurs with suicidal ideation and sleep disturbances, presenting an opportunity for multi-task learning to improve prediction accuracy.
Method: The study develops a trimodal multimedia approach using large language model architectures that integrates: (1) speech-derived text, (2) acoustic landmarks, and (3) vocal biomarkers. They implement multi-task learning (MTL) to simultaneously predict depression, suicidal ideation, and sleep disturbances, along with longitudinal analysis to model temporal changes across multiple clinical interactions. A brief illustrative code sketch follows the abstract below.
Result: The proposed trimodal, longitudinal MTL approach achieved 70.8% balanced accuracy on the Depression Early Warning dataset, outperforming unimodal, single-task, and non-longitudinal baseline methods.
Conclusion: Treating speech as a trimodal data source combined with multi-task learning and longitudinal modeling significantly improves depression detection performance compared to traditional single-modality approaches, demonstrating the value of comprehensive multimodal analysis for mental health assessment.
Abstract: Speech is a noninvasive digital phenotype that can offer valuable insights into mental health conditions, but it is often treated as a single modality. In contrast, we propose the treatment of patient speech data as a trimodal multimedia data source for depression detection. This study explores the potential of large language model-based architectures for speech-based depression prediction in a multimodal regime that integrates speech-derived text, acoustic landmarks, and vocal biomarkers. Adolescent depression presents a significant challenge and is often comorbid with multiple disorders, such as suicidal ideation and sleep disturbances. This presents an additional opportunity to integrate multi-task learning (MTL) into our study by simultaneously predicting depression, suicidal ideation, and sleep disturbances using the multimodal formulation. We also propose a longitudinal analysis strategy that models temporal changes across multiple clinical interactions, allowing for a comprehensive understanding of the conditions’ progression. Our proposed approach, featuring trimodal, longitudinal MTL is evaluated on the Depression Early Warning dataset. It achieves a balanced accuracy of 70.8%, which is higher than each of the unimodal, single-task, and non-longitudinal methods.
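A toy PyTorch sketch of a trimodal, multi-task head of the general shape described above, fusing text, acoustic-landmark, and vocal-biomarker features and predicting three labels jointly. All dimensions, the concatenation fusion, and the loss weighting are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TrimodalMTL(nn.Module):
    def __init__(self, d_text=768, d_acoustic=128, d_bio=64, d_hidden=256):
        super().__init__()
        self.text_proj = nn.Linear(d_text, d_hidden)
        self.acoustic_proj = nn.Linear(d_acoustic, d_hidden)
        self.bio_proj = nn.Linear(d_bio, d_hidden)
        self.fuse = nn.Sequential(nn.Linear(3 * d_hidden, d_hidden), nn.ReLU())
        # One binary head per task: depression, suicidal ideation, sleep disturbance.
        self.heads = nn.ModuleDict({
            "depression": nn.Linear(d_hidden, 1),
            "suicidal_ideation": nn.Linear(d_hidden, 1),
            "sleep_disturbance": nn.Linear(d_hidden, 1),
        })

    def forward(self, text, acoustic, bio):
        z = torch.cat([self.text_proj(text),
                       self.acoustic_proj(acoustic),
                       self.bio_proj(bio)], dim=-1)
        h = self.fuse(z)
        return {name: head(h).squeeze(-1) for name, head in self.heads.items()}

model = TrimodalMTL()
logits = model(torch.randn(4, 768), torch.randn(4, 128), torch.randn(4, 64))
targets = {k: torch.randint(0, 2, (4,)).float() for k in logits}
# Multi-task loss: sum of per-task binary cross-entropies.
loss = sum(nn.functional.binary_cross_entropy_with_logits(logits[k], targets[k])
           for k in logits)
print(loss.item())
```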
[16] The Pluralistic Moral Gap: Understanding Judgment and Value Differences between Humans and Large Language Models
Giuseppe Russo, Debora Nozza, Paul Röttger, Dirk Hovy
Main category: cs.CL
TL;DR: This paper introduces a benchmark to evaluate how well Large Language Models align with human moral judgments, revealing that LLMs struggle with moral disagreement and use fewer moral values than humans, then proposes Dynamic Moral Profiling to improve alignment by 64.3%.
Details
Motivation: People increasingly seek moral advice from Large Language Models, but little is known about how well these models align with human moral judgments. Understanding this alignment is crucial as LLMs may influence human moral decisions.
Method: The authors created the Moral Dilemma Dataset with 1,618 real-world moral dilemmas paired with human judgments and rationales. They treated moral alignment as a pluralistic distributional alignment task, comparing LLM and human judgment distributions. They also built a 60-value taxonomy from 3,783 value expressions to analyze moral reasoning diversity. Finally, they developed Dynamic Moral Profiling (DMP), a Dirichlet-based sampling method that conditions model outputs on human-derived value profiles. A brief illustrative code sketch follows the abstract below.
Result: LLMs only reproduce human judgments under high consensus and alignment deteriorates sharply when human disagreement increases. LLMs rely on a narrower set of moral values compared to humans, revealing a “pluralistic moral gap.” The proposed Dynamic Moral Profiling method improves alignment by 64.3% and enhances value diversity in LLM moral reasoning.
Conclusion: There exists a significant pluralistic moral gap between LLMs and humans in both judgment distribution and value diversity. Dynamic Moral Profiling offers a promising approach to bridge this gap, providing a step toward more pluralistic and human-aligned moral guidance from LLMs.
Abstract: People increasingly rely on Large Language Models (LLMs) for moral advice, which may influence humans’ decisions. Yet, little is known about how closely LLMs align with human moral judgments. To address this, we introduce the Moral Dilemma Dataset, a benchmark of 1,618 real-world moral dilemmas paired with a distribution of human moral judgments consisting of a binary evaluation and a free-text rationale. We treat this problem as a pluralistic distributional alignment task, comparing the distributions of LLM and human judgments across dilemmas. We find that models reproduce human judgments only under high consensus; alignment deteriorates sharply when human disagreement increases. In parallel, using a 60-value taxonomy built from 3,783 value expressions extracted from rationales, we show that LLMs rely on a narrower set of moral values than humans. These findings reveal a pluralistic moral gap: a mismatch in both the distribution and diversity of values expressed. To close this gap, we introduce Dynamic Moral Profiling (DMP), a Dirichlet-based sampling method that conditions model outputs on human-derived value profiles. DMP improves alignment by 64.3% and enhances value diversity, offering a step toward more pluralistic and human-aligned moral guidance from LLMs.
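A minimal sketch of Dirichlet-based profile sampling of the kind described for DMP: a value profile is drawn from a Dirichlet whose concentration reflects human-derived value frequencies and is then used to condition the prompt. The tiny taxonomy snippet, the counts, and the prompt wording are illustrative assumptions.

```python
import numpy as np

# A tiny stand-in for the 60-value taxonomy with human-derived weights.
VALUES = ["honesty", "loyalty", "fairness", "care", "autonomy"]
human_counts = np.array([40.0, 15.0, 25.0, 30.0, 10.0])

def sample_value_profile(rng: np.random.Generator) -> dict[str, float]:
    """Draw a value profile from a Dirichlet with human-derived concentration."""
    weights = rng.dirichlet(human_counts)
    return dict(zip(VALUES, weights))

def profile_conditioned_prompt(dilemma: str, profile: dict[str, float]) -> str:
    top = sorted(profile, key=profile.get, reverse=True)[:3]
    return (f"Weigh especially the values {', '.join(top)} when judging the following "
            f"dilemma, then answer 'acceptable' or 'not acceptable'.\n{dilemma}")

rng = np.random.default_rng(0)
profile = sample_value_profile(rng)
print(profile_conditioned_prompt("My friend asked me to lie to their employer.", profile))
```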
[17] Adaptive Graph Pruning for Multi-Agent Communication
Boyi Li, Zhonghan Zhao, Der-Horng Lee, Gaoang Wang
Main category: cs.CL
TL;DR: This paper proposes Adaptive Graph Pruning (AGP), a framework that dynamically optimizes both the number of agents and their communication topology in LLM-based multi-agent systems, achieving superior performance while being more token-economical and training-efficient than existing methods.
Details
Motivation: Current LLM-based multi-agent systems use fixed numbers of agents and static communication structures, which limits their ability to adapt to varying task complexities. There is a need for adaptive systems that can optimize both agent quantity and communication topology based on specific task requirements.
Method: The paper introduces a two-stage training strategy: (1) independently training soft-pruning networks for different agent quantities to determine optimal agent-quantity-specific complete graphs and positional masks, and (2) jointly optimizing hard-pruning (agent quantity) and soft-pruning (communication topology) within a maximum complete graph to dynamically configure agents and their communication per task. A brief illustrative code sketch follows the abstract below.
Result: AGP achieves state-of-the-art results across six benchmarks with 2.58%-9.84% performance improvement, generalizes across multiple LLM architectures, dynamically constructs task-optimized communication topologies, reduces token consumption by 90%+, and surpasses existing baselines after only ten training steps.
Conclusion: The proposed AGP framework successfully addresses the limitations of fixed multi-agent systems by providing a task-adaptive solution that jointly optimizes agent quantity and communication topology, demonstrating superior performance, efficiency, and adaptability across various reasoning and code generation tasks.
Abstract: Large Language Model (LLM) based multi-agent systems have shown remarkable performance in various tasks, especially when enhanced through collaborative communication. However, current methods often rely on a fixed number of agents and static communication structures, limiting their ability to adapt to varying task complexities. In this paper, we propose Adaptive Graph Pruning (AGP), a novel task-adaptive multi-agent collaboration framework that jointly optimizes agent quantity (hard-pruning) and communication topology (soft-pruning). Specifically, our method employs a two-stage training strategy: firstly, independently training soft-pruning networks for different agent quantities to determine optimal agent-quantity-specific complete graphs and positional masks across specific tasks; and then jointly optimizing hard-pruning and soft-pruning within a maximum complete graph to dynamically configure the number of agents and their communication topologies per task. Extensive experiments demonstrate that our approach is: (1) High-performing, achieving state-of-the-art results across six benchmarks and consistently generalizing across multiple mainstream LLM architectures, with an increase in performance of 2.58%-9.84%; (2) Task-adaptive, dynamically constructing optimized communication topologies tailored to specific tasks, with an extremely high performance in all three task categories (general reasoning, mathematical reasoning, and code generation); (3) Token-economical, requiring fewer training steps and less token consumption at the same time, with a decrease in token consumption of 90%+; and (4) Training-efficient, achieving high performance with very few training steps compared with other methods. The performance will surpass the existing baselines after about ten steps of training under six benchmarks.
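A numpy sketch of the two pruning operations described above: a soft mask scales edge weights of a complete communication graph, and hard pruning keeps only the top-k agents by total incident weight. The scoring and selection rules are illustrative assumptions, not the trained pruning networks from the paper.

```python
import numpy as np

def soft_prune(edge_logits: np.ndarray) -> np.ndarray:
    """Turn learned edge logits into soft communication weights in (0, 1)."""
    weights = 1.0 / (1.0 + np.exp(-edge_logits))
    np.fill_diagonal(weights, 0.0)
    return weights

def hard_prune(weights: np.ndarray, k: int) -> np.ndarray:
    """Keep the k agents with the largest total incident communication weight."""
    scores = weights.sum(axis=0) + weights.sum(axis=1)
    keep = np.argsort(scores)[-k:]
    return np.sort(keep)

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 6))            # complete graph over 6 candidate agents
W = soft_prune(logits)
agents = hard_prune(W, k=3)
print("kept agents:", agents)
print("pruned topology:\n", np.round(W[np.ix_(agents, agents)], 2))
```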
[18] CLARIFID: Improving Radiology Report Generation by Reinforcing Clinically Accurate Impressions and Enforcing Detailed Findings
Kyeongkyu Lee, Seonghwan Yoon, Hongki Lim
Main category: cs.CL
TL;DR: CLARIFID is a novel framework for automatic radiology report generation that mirrors expert workflow by learning from Findings to Impression, uses reinforcement learning with clinical accuracy rewards, and incorporates multi-view chest X-ray images to improve diagnostic correctness and clinical reliability.
Details
Motivation: Current automatic radiology report generation methods struggle with clinical reliability and factual correctness, often focusing on fluent text generation rather than diagnostic accuracy. Most approaches rely on single-view images and fail to ensure proper clinical reasoning flow, limiting their practical applicability in clinical settings.
Method: CLARIFID employs a four-step approach: (1) section-aware pretraining to learn logical flow from Findings to Impression, (2) fine-tuning with Proximal Policy Optimization using CheXbert F1 score as reward, (3) reasoning-aware decoding that generates Findings before Impression, and (4) multi-view fusion using a vision-transformer-based encoder. The framework includes reasoning-aware next-token forcing and report-level re-ranking during inference. A brief illustrative code sketch follows the abstract below.
Result: Experimental results on MIMIC-CXR dataset show that CLARIFID achieves superior clinical efficacy compared to existing baselines, outperforming them on both standard natural language generation metrics and clinically aware evaluation scores, demonstrating improved diagnostic correctness and clinical reliability.
Conclusion: CLARIFID successfully addresses the key challenges in automatic radiology report generation by directly optimizing for diagnostic correctness through expert-mimicking workflow, reinforcement learning with clinical rewards, and multi-view image integration, resulting in more clinically reliable and factually correct radiology reports.
Abstract: Automatic generation of radiology reports has the potential to alleviate radiologists’ significant workload, yet current methods struggle to deliver clinically reliable conclusions. In particular, most prior approaches focus on producing fluent text without effectively ensuring the factual correctness of the reports and often rely on single-view images, limiting diagnostic comprehensiveness. We propose CLARIFID, a novel framework that directly optimizes diagnostic correctness by mirroring the two-step workflow of experts. Specifically, CLARIFID (1) learns the logical flow from Findings to Impression through section-aware pretraining, (2) is fine-tuned with Proximal Policy Optimization in which the CheXbert F1 score of the Impression section serves as the reward, (3) enforces reasoning-aware decoding that completes “Findings” before synthesizing the “Impression”, and (4) fuses multiple chest X-ray views via a vision-transformer-based multi-view encoder. During inference, we apply a reasoning-aware next-token forcing strategy followed by report-level re-ranking, ensuring that the model first produces a comprehensive Findings section before synthesizing the Impression and thereby preserving coherent clinical reasoning. Experimental results on the MIMIC-CXR dataset demonstrate that our method achieves superior clinical efficacy and outperforms existing baselines on both standard NLG metrics and clinically aware scores.
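A schematic of the reasoning-aware decoding plus re-ranking step described above: Findings are generated first, the Impression is conditioned on them, and candidate reports are re-ranked by a clinical scorer. `generate` and `chexbert_f1` are placeholder stubs, so this is an assumed control flow rather than the authors' code.

```python
import random

def generate(prompt: str, seed: int) -> str:      # placeholder for the report model
    return f"[text generated for seed {seed}]"

def chexbert_f1(report: str) -> float:            # placeholder for the clinical scorer
    return random.Random(report).random()

def generate_report(image_context: str, n_candidates: int = 4) -> str:
    candidates = []
    for seed in range(n_candidates):
        # Reasoning-aware order: complete the Findings before the Impression.
        findings = generate(f"{image_context}\nFindings:", seed)
        impression = generate(f"{image_context}\nFindings: {findings}\nImpression:", seed)
        candidates.append(f"Findings: {findings}\nImpression: {impression}")
    # Report-level re-ranking by the clinical reward used during training.
    return max(candidates, key=chexbert_f1)

print(generate_report("[multi-view chest X-ray features]"))
```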
[19] Millions of $\text{GeAR}$-s: Extending GraphRAG to Millions of Documents
Zhili Shen, Chenxin Diao, Pascual Merita, Pavlos Vougiouklis, Jeff Z. Pan
Main category: cs.CL
TL;DR: This paper adapts GeAR, a state-of-the-art graph-based retrieval-augmented generation (RAG) solution, and evaluates its performance on the SIGIR 2025 LiveRAG Challenge to assess its general applicability beyond specific tasks.
Details
Motivation: Current graph-based RAG approaches that use structured information like entities and relations are typically designed for specific tasks (multi-hop QA, query-focused summarization) with limited evidence of general applicability across broader datasets, creating a need to evaluate their performance on diverse challenges.
Method: The authors adapt GeAR (a state-of-the-art graph-based RAG solution) and test it on the SIGIR 2025 LiveRAG Challenge to explore its performance and identify limitations in a broader context.
Result: The paper explores the performance and limitations of the adapted GeAR system on the SIGIR 2025 LiveRAG Challenge, though specific performance metrics are not provided in the abstract.
Conclusion: The study provides insights into the general applicability of graph-based RAG solutions beyond their original specific task domains by evaluating GeAR on a standardized challenge dataset.
Abstract: Recent studies have explored graph-based approaches to retrieval-augmented generation, leveraging structured or semi-structured information – such as entities and their relations extracted from documents – to enhance retrieval. However, these methods are typically designed to address specific tasks, such as multi-hop question answering and query-focused summarisation, and therefore, there is limited evidence of their general applicability across broader datasets. In this paper, we aim to adapt a state-of-the-art graph-based RAG solution: $\text{GeAR}$ and explore its performance and limitations on the SIGIR 2025 LiveRAG Challenge.
[20] Investigating Subjective Factors of Argument Strength: Storytelling, Emotions, and Hedging
Carlotta Quensel, Neele Falk, Gabriella Lapesa
Main category: cs.CL
TL;DR: This paper investigates how subjective features (emotions, storytelling, and hedging) impact argument strength by analyzing their effects on both objective argument quality and subjective persuasion through regression analysis on standard datasets.
Details
Motivation: There is a lack of large-scale analyses examining the relationship between subjective features and argument strength, despite the growing recognition that subjectivity should be treated as an asset rather than a problem in NLP research.
Method: The authors conducted regression analysis to quantify the impact of three subjective factors (emotions, storytelling, and hedging) on two standard datasets annotated for objective argument quality and subjective persuasion. They also compared and evaluated automated annotation methods for each subjective feature.
Result: The analysis revealed different patterns of impact for subjective features on argument strength: storytelling and hedging showed contrasting effects on objective versus subjective argument quality, while emotions’ influence depended on their rhetorical utilization rather than the domain.
Conclusion: Subjective features have distinct and sometimes contrasting effects on different facets of argument strength, with storytelling and hedging affecting objective and subjective quality differently, and emotions’ impact being determined by rhetorical context rather than domain-specific factors.
Abstract: In assessing argument strength, the notions of what makes a good argument are manifold. With the broader trend towards treating subjectivity as an asset and not a problem in NLP, new dimensions of argument quality are studied. Although studies on individual subjective features like personal stories exist, there is a lack of large-scale analyses of the relation between these features and argument strength. To address this gap, we conduct regression analysis to quantify the impact of subjective factors (emotions, storytelling, and hedging) on two standard datasets annotated for objective argument quality and subjective persuasion. As such, our contribution is twofold: at the level of contributed resources, as there are no datasets annotated with all studied dimensions, this work compares and evaluates automated annotation methods for each subjective feature. At the level of novel insights, our regression analysis uncovers different patterns of impact of subjective features on the two facets of argument strength encoded in the datasets. Our results show that storytelling and hedging have contrasting effects on objective and subjective argument quality, while the influence of emotions depends on their rhetorical utilization rather than the domain.
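As a rough picture of the regression setup, one can fit an ordinary least squares model that predicts an argument-strength score from the three automatically annotated features. The sketch below uses synthetic feature values; the variable names and effect sizes are placeholders, not results from the paper.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-ins for the annotated subjective features and a quality score.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.random(n),           # emotion intensity
    rng.integers(0, 2, n),   # storytelling present (0/1)
    rng.random(n),           # hedging frequency
])
y = 0.3 * X[:, 0] - 0.2 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(0, 0.1, n)

# The fitted coefficients quantify each subjective feature's impact on the score.
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.params)
```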
[21] Each to Their Own: Exploring the Optimal Embedding in RAG
Shiting Chen, Zijian Zhao, Jinsong Chen
Main category: cs.CL
TL;DR: This paper proposes two methods to enhance Retrieval-Augmented Generation (RAG) by combining multiple embedding models: Mixture-Embedding RAG and Confident RAG, with Confident RAG showing significant improvements of ~10% over vanilla LLMs and ~5% over standard RAG.
Details
Motivation: Different embedding models in RAG systems exhibit varying performance across domains due to heterogeneous training data and architectures, leading to inconsistent similarity calculations and varying response quality from LLMs. The authors aim to leverage the complementary strengths of multiple embedding models to improve RAG performance.
Method: Two approaches are proposed: (1) Mixture-Embedding RAG - sorts and selects retrievals from multiple embedding models based on standardized similarity scores, and (2) Confident RAG - generates multiple responses using different embedding models and selects the response with the highest confidence level.
Result: Mixture-Embedding RAG does not outperform vanilla RAG. However, Confident RAG demonstrates consistent improvements of approximately 10% over vanilla LLMs and 5% over standard RAG across different LLMs and embedding models, proving to be an effective plug-and-play approach.
Conclusion: Confident RAG successfully addresses the limitations of single embedding models in RAG systems by leveraging multiple models and confidence-based selection. The method shows consistent improvements across various domains and model combinations, making it a practical and efficient enhancement for RAG systems.
Abstract: Recently, as Large Language Models (LLMs) have fundamentally impacted various fields, the methods for incorporating up-to-date information into LLMs or adding external knowledge to construct domain-specific models have garnered wide attention. Retrieval-Augmented Generation (RAG), serving as an inference-time scaling method, is notable for its low cost and minimal effort for parameter tuning. However, due to heterogeneous training data and model architecture, the different embedding models used in RAG exhibit different benefits across various areas, often leading to different similarity calculation results and, consequently, varying response quality from LLMs. To address this problem, we propose and examine two approaches to enhance RAG by combining the benefits of multiple embedding models, named Mixture-Embedding RAG and Confident RAG. Mixture-Embedding RAG simply sorts and selects retrievals from multiple embedding models based on standardized similarity; however, it does not outperform vanilla RAG. In contrast, Confident RAG generates responses multiple times using different embedding models and then selects the responses with the highest confidence level, demonstrating average improvements of approximately 10% and 5% over vanilla LLMs and RAG, respectively. The consistent results across different LLMs and embedding models indicate that Confident RAG is an efficient plug-and-play approach for various domains. We will release our code upon publication.
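A minimal sketch of the Confident RAG loop: retrieve once per embedding model, generate one answer per retrieval, and keep the answer with the highest confidence. The callables below (`embedders`, `retrieve`, `generate_with_confidence`) are hypothetical stand-ins for real embedding models, a vector store, and an LLM call with a confidence estimate; they are not from the paper's code.

```python
from typing import Callable, List, Tuple

def confident_rag(
    question: str,
    embedders: List[Callable[[str], List[float]]],
    retrieve: Callable[[List[float]], str],
    generate_with_confidence: Callable[[str, str], Tuple[str, float]],
) -> str:
    """Generate one answer per embedding model and return the most confident one."""
    candidates = []
    for embed in embedders:
        context = retrieve(embed(question))                        # per-embedder retrieval
        answer, confidence = generate_with_confidence(question, context)
        candidates.append((confidence, answer))
    return max(candidates)[1]                                       # highest-confidence answer
```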
[22] MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs
Alexander R. Fabbri, Diego Mares, Jorge Flores, Meher Mankikar, Ernesto Hernandez, Dean Lee, Bing Liu, Chen Xing
Main category: cs.CL
TL;DR: This paper introduces MultiNRC, a new benchmark with over 1,000 native reasoning questions in French, Spanish, and Chinese to evaluate LLMs’ multilingual reasoning capabilities. The benchmark reveals that current LLMs struggle with native multilingual reasoning, with none scoring above 50%, and they perform significantly better in English than in other languages.
Details
Motivation: Existing multilingual reasoning benchmarks are biased towards English contexts because they are typically created by translating English benchmarks. There is a need to evaluate LLMs' reasoning capabilities in native linguistic and cultural contexts across diverse languages, rather than relying on translated materials that may not capture authentic cultural reasoning patterns.
Method: The authors created MultiNRC, a benchmark containing over 1,000 reasoning questions written by native speakers in French, Spanish, and Chinese. The benchmark covers four categories: language-specific linguistic reasoning, wordplay & riddles, cultural/tradition reasoning, and math reasoning with cultural relevance. For some categories, English equivalent translations were provided for direct comparison. They systematically evaluated 14 leading LLMs across different families on this benchmark.
Result: The evaluation revealed three key findings: (1) No LLM scored above 50% on MultiNRC, indicating poor performance on native multilingual reasoning; (2) LLMs showed varying strengths and weaknesses across linguistic, cultural, and logical reasoning tasks; (3) Most models performed substantially better (+10%) on math reasoning in English compared to original languages, highlighting persistent challenges with culturally grounded knowledge.
Conclusion: Current LLMs have significant limitations in native multilingual reasoning capabilities. The substantial performance gap between English and other languages, particularly in culturally grounded tasks, indicates that LLMs still struggle with authentic multilingual and multicultural reasoning. This highlights the need for better multilingual training and evaluation approaches that go beyond simple translation-based benchmarks.
Abstract: Although recent Large Language Models (LLMs) have shown rapid improvement on reasoning benchmarks in English, the evaluation of such LLMs’ multilingual reasoning capability across diverse languages and cultural contexts remains limited. Existing multilingual reasoning benchmarks are typically constructed by translating existing English reasoning benchmarks, biasing these benchmarks towards reasoning problems with context in English language/cultures. In this work, we introduce the Multilingual Native Reasoning Challenge (MultiNRC), a benchmark designed to assess LLMs on more than 1,000 native, linguistic and culturally grounded reasoning questions written by native speakers in French, Spanish, and Chinese. MultiNRC covers four core reasoning categories: language-specific linguistic reasoning, wordplay & riddles, cultural/tradition reasoning, and math reasoning with cultural relevance. For cultural/tradition reasoning and math reasoning with cultural relevance, we also provide English equivalent translations of the multilingual questions by manual translation from native speakers fluent in English. This set of English equivalents can provide a direct comparison of LLM reasoning capacity in other languages vs. English on the same reasoning questions. We systematically evaluate 14 current leading LLMs covering most LLM families on MultiNRC and its English equivalent set. The results show that (1) current LLMs are still not good at native multilingual reasoning, with none scoring above 50% on MultiNRC; (2) LLMs exhibit distinct strengths and weaknesses in handling linguistic, cultural, and logical reasoning tasks; (3) Most models perform substantially better in math reasoning in English compared to the original languages (+10%), indicating persistent challenges with culturally grounded knowledge.
[23] Synthetic Voice Data for Automatic Speech Recognition in African Languages
Brian DeRenzi, Anna Dixon, Mohamed Aymane Farhi, Christian Resch
Main category: cs.CL
TL;DR: This paper presents the first systematic assessment of using synthetic voice data to improve Automatic Speech Recognition (ASR) for African languages, creating over 2,500 hours of synthetic data at 1% of real data cost and demonstrating that combining real and synthetic data can match or exceed real-data-only baselines.
Details
Motivation: Speech technology is inaccessible for most of the 2300+ African languages due to lack of sufficient training data. The high cost of collecting real speech data creates a barrier for developing ASR systems for low-resource African languages, necessitating cost-effective alternatives like synthetic voice data generation.
Method: A three-step process: (1) LLM-driven text creation for target languages, (2) Text-to-Speech (TTS) voice synthesis to generate synthetic audio, and (3) ASR fine-tuning using Wav2Vec-BERT-2.0 models with combinations of real and synthetic data. The approach was evaluated on three African languages: Hausa, Dholuo, and Chichewa.
Result: Generated over 2,500 hours of synthetic voice data at below 1% cost of real data. For Hausa, 250h real + 250h synthetic data matched 500h real-data baseline performance, with best results using 579h real + 450-993h synthetic data. Chichewa showed 6.5% relative WER improvement with 1:2 real-to-synthetic ratio. Dholuo results were mixed, showing improvements on some but not all evaluation datasets.
Conclusion: Synthetic voice data can significantly augment real data for African language ASR development at dramatically reduced costs. However, the effectiveness varies by language and data ratios. The study reveals need for more robust evaluation protocols and accurate evaluation datasets. All data and models are publicly released to facilitate further research in this area.
Abstract: Speech technology remains out of reach for most of the over 2300 languages in Africa. We present the first systematic assessment of large-scale synthetic voice corpora for African ASR. We apply a three-step process: LLM-driven text creation, TTS voice synthesis, and ASR fine-tuning. Eight out of ten languages for which we create synthetic text achieved readability scores above 5 out of 7. We evaluated ASR improvement for three (Hausa, Dholuo, Chichewa) and created more than 2,500 hours of synthetic voice data at below 1% of the cost of real data. Fine-tuned Wav2Vec-BERT-2.0 models trained on 250h real and 250h synthetic Hausa matched a 500h real-data-only baseline, while 579h real and 450h to 993h synthetic data yielded the best performance. We also present gender-disaggregated ASR performance evaluation. For very low-resource languages, gains varied: Chichewa WER improved about 6.5% relative with a 1:2 real-to-synthetic ratio; a 1:1 ratio for Dholuo showed similar improvements on some evaluation data, but not on others. Investigating intercoder reliability, ASR errors and evaluation datasets revealed the need for more robust reviewer protocols and more accurate evaluation data. All data and models are publicly released to invite further work to improve synthetic data for African languages.
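The three-step pipeline can be read as three function calls chained together. The stubs below are hypothetical placeholders showing only the structure (the real system prompts an LLM for text, runs TTS voices, and fine-tunes Wav2Vec-BERT-2.0); none of the function names or values come from the released code.

```python
# Hypothetical stubs illustrating the pipeline structure only.
def create_text_with_llm(language: str, n_sentences: int) -> list[str]:
    # Would prompt an LLM for readable, in-domain sentences in `language`.
    return [f"placeholder {language} sentence {i}" for i in range(n_sentences)]

def synthesize_speech(sentences: list[str], voice: str) -> list[bytes]:
    # Would run a TTS voice over each sentence to produce audio clips.
    return [b"" for _ in sentences]

def fine_tune_asr(real_hours: int, synthetic_hours: int) -> None:
    # Would mix real and synthetic audio-text pairs and fine-tune the ASR model.
    print(f"fine-tuning on {real_hours}h real + {synthetic_hours}h synthetic data")

sentences = create_text_with_llm("hausa", n_sentences=3)
audio = synthesize_speech(sentences, voice="hausa-voice-1")
fine_tune_asr(real_hours=250, synthetic_hours=250)  # the 1:1 Hausa mix matched a 500h real baseline
```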
[24] A Hybrid Early-Exit Algorithm for Large Language Models Based on Space Alignment Decoding (SPADE)
Bowen Zheng, Ming Ma, Zhongqiao Lin, Tianming Yang
Main category: cs.CL
TL;DR: The paper proposes SPADE, a novel decoding method that improves early-exit algorithms for large language models by aligning intermediate layer representations with output layers, reducing inference costs while maintaining accuracy.
Details
Motivation: Large language models are computationally expensive due to their deep structures, and existing early-exit algorithms suffer from poor performance due to misalignment between intermediate and output layer representations that cause decoding inaccuracy.
Method: SPADE (SPace Alignment DEcoding) aligns intermediate layer representations with the output layer by propagating a minimally reduced sequence of only the start token and answer token. The method includes training a linear approximation that computes entropy-based confidence metrics for optimized early-exit decision-making, creating a hybrid early-exit algorithm.
Result: The approach significantly reduces inference costs without compromising accuracy, offering a scalable and efficient solution for deploying large language models in real-world applications.
Conclusion: SPADE successfully addresses the representation misalignment problem in early-exit algorithms, enabling cost-effective inference for large language models while maintaining high-quality outputs through improved space alignment and confidence-based decision making.
Abstract: Large language models are computationally expensive due to their deep structures. Prior research has shown that intermediate layers contain sufficient information to generate accurate answers, leading to the development of early-exit algorithms that reduce inference costs by terminating computation at earlier layers. However, these methods often suffer from poor performance due to misalignment between intermediate and output layer representations that lead to decoding inaccuracy. To address these challenges, we propose SPADE (SPace Alignment DEcoding), a novel decoding method that aligns intermediate layer representations with the output layer by propagating a minimally reduced sequence consisting of only the start token and the answer token. We further optimize the early-exit decision-making process by training a linear approximation of SPADE that computes entropy-based confidence metrics. Putting them together, we create a hybrid early-exit algorithm that monitors confidence levels and stops inference at intermediate layers while using SPADE to generate high-quality outputs. This approach significantly reduces inference costs without compromising accuracy, offering a scalable and efficient solution for deploying large language models in real-world applications.
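The entropy-based exit decision can be sketched in a few lines: decode logits at an intermediate layer, compute the entropy of the predictive distribution, and stop early when it falls below a threshold. This is a generic illustration of the confidence check, not the trained linear approximation described in the paper, and the threshold is an arbitrary placeholder.

```python
import numpy as np

def should_exit(layer_logits: np.ndarray, entropy_threshold: float = 1.0) -> bool:
    """Return True if the intermediate-layer prediction is confident enough to stop."""
    probs = np.exp(layer_logits - layer_logits.max())   # stable softmax
    probs /= probs.sum()
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    return entropy < entropy_threshold                   # low entropy -> exit at this layer

logits = np.array([5.0, 1.0, 0.5, 0.2])  # toy intermediate-layer logits
print(should_exit(logits))               # True: the distribution is sharply peaked
```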
[25] WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training
Changxin Tian, Jiapeng Wang, Qian Zhao, Kunlong Chen, Jia Liu, Ziqi Liu, Jiaxin Mao, Wayne Xin Zhao, Zhiqiang Zhang, Jun Zhou
Main category: cs.CL
TL;DR: WSM is a framework that replaces traditional learning rate decay with model merging techniques, achieving superior performance across multiple benchmarks by treating decay strategies as principled model averaging schemes.
Details
Motivation: Traditional learning rate scheduling requires decay phases that may not be optimal. Recent decay-free approaches show promise, and model merging techniques offer potential solutions. There's a need for a unified framework that connects learning rate decay with model merging while maintaining compatibility with various optimization methods.
Method: The paper presents Warmup-Stable and Merge (WSM), a general framework that establishes formal connections between learning rate decay and model merging. WSM emulates various decay strategies (cosine, linear, inverse square root) as principled model averaging schemes. The method focuses on merge duration as the key parameter for checkpoint aggregation during training.
Result: WSM consistently outperforms the Warmup-Stable-Decay (WSD) approach across multiple benchmarks with significant improvements: +3.5% on MATH, +2.9% on HumanEval, and +5.5% on MMLU-Pro. The framework also shows advantages in supervised fine-tuning scenarios. Merge duration was identified as the most critical factor, more important than checkpoint interval and merge quantity.
Conclusion: WSM provides a unified theoretical foundation for learning rate scheduling through model merging, demonstrating superior performance compared to traditional decay methods. The framework’s compatibility with diverse optimization methods and effectiveness in both training and fine-tuning scenarios highlights its potential for long-term model refinement applications.
Abstract: Recent advances in learning rate (LR) scheduling have demonstrated the effectiveness of decay-free approaches that eliminate the traditional decay phase while maintaining competitive performance. Model merging techniques have emerged as particularly promising solutions in this domain. We present Warmup-Stable and Merge (WSM), a general framework that establishes a formal connection between learning rate decay and model merging. WSM provides a unified theoretical foundation for emulating various decay strategies (including cosine decay, linear decay, and inverse square root decay) as principled model averaging schemes, while remaining fully compatible with diverse optimization methods. Through extensive experiments, we identify merge duration (the training window for checkpoint aggregation) as the most critical factor influencing model performance, surpassing the importance of both checkpoint interval and merge quantity. Our framework consistently outperforms the widely-adopted Warmup-Stable-Decay (WSD) approach across multiple benchmarks, achieving significant improvements of +3.5% on MATH, +2.9% on HumanEval, and +5.5% on MMLU-Pro. The performance advantages extend to supervised fine-tuning scenarios, highlighting WSM’s potential for long-term model refinement.
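At its core, checkpoint merging is a weighted average of parameter snapshots collected over a merge window. The sketch below shows the generic operation on a toy one-parameter model; the specific weightings that emulate cosine, linear, or inverse-square-root decay are part of the paper's framework and are not reproduced here.

```python
import torch

def merge_checkpoints(state_dicts, weights=None):
    """Weighted average of model state dicts collected during training."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)   # uniform merge
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
    return merged

# Toy example: three "checkpoints" of a single-parameter model.
ckpts = [{"w": torch.tensor([1.0])}, {"w": torch.tensor([2.0])}, {"w": torch.tensor([4.0])}]
print(merge_checkpoints(ckpts)["w"])   # tensor([2.3333])
```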
[26] Who Attacks, and Why? Using LLMs to Identify Negative Campaigning in 18M Tweets across 19 Countries
Victor Hartman, Petter Törnberg
Main category: cs.CL
TL;DR: This study introduces zero-shot Large Language Models for cross-lingual classification of negative campaigning and uses this method to analyze 18 million tweets from parliamentarians across 19 European countries, revealing that governing parties use less negative messaging while extreme and populist parties (especially radical right) engage in higher negativity levels.
Details
Motivation: Empirical research on negative campaigning has been limited by high costs and scalability issues of existing classification methods, creating a need for more efficient and scalable approaches to study political communication across different languages and countries.
Method: The study employs zero-shot Large Language Models (LLMs) for cross-lingual classification of negative campaigning, validated against benchmark datasets in ten languages and compared with human coders and conventional supervised machine learning approaches. The method is then applied to analyze 18 million tweets from parliamentarians in 19 European countries between 2017-2022.
Result: LLMs achieved performance comparable to native-speaking human coders and outperformed conventional supervised machine learning approaches. The analysis revealed consistent cross-national patterns: governing parties use less negative messaging, while ideologically extreme and populist parties, particularly radical right parties, engage in significantly higher levels of negativity.
Conclusion: The study demonstrates that LLMs enable scalable, transparent, and replicable research in political communication across linguistic and cultural contexts. Party-level characteristics significantly shape strategic communication patterns in multiparty systems, with governing status and ideological extremism being key predictors of negative campaigning behavior.
Abstract: Negative campaigning is a central feature of political competition, yet empirical research has been limited by the high cost and limited scalability of existing classification methods. This study makes two key contributions. First, it introduces zero-shot Large Language Models (LLMs) as a novel approach for cross-lingual classification of negative campaigning. Using benchmark datasets in ten languages, we demonstrate that LLMs achieve performance on par with native-speaking human coders and outperform conventional supervised machine learning approaches. Second, we leverage this novel method to conduct the largest cross-national study of negative campaigning to date, analyzing 18 million tweets posted by parliamentarians in 19 European countries between 2017 and 2022. The results reveal consistent cross-national patterns: governing parties are less likely to use negative messaging, while ideologically extreme and populist parties – particularly those on the radical right – engage in significantly higher levels of negativity. These findings advance our understanding of how party-level characteristics shape strategic communication in multiparty systems. More broadly, the study demonstrates the potential of LLMs to enable scalable, transparent, and replicable research in political communication across linguistic and cultural contexts.
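The zero-shot setup boils down to a label definition, the tweet, and a constrained answer format sent to an LLM. The sketch below shows one plausible prompt pattern; `call_llm` is a hypothetical stand-in for whatever chat-completion API is used, and the wording is not the study's actual prompt.

```python
# Minimal sketch of zero-shot cross-lingual classification with an LLM.
NEGATIVE_CAMPAIGNING_PROMPT = """You are annotating political tweets.
Negative campaigning means attacking or criticizing an opposing party,
politician, or their policies. Answer with exactly one word: YES or NO.

Tweet ({language}): {tweet}
Negative campaigning?"""

def classify_tweet(tweet: str, language: str, call_llm) -> bool:
    """Return True if the LLM labels the tweet as negative campaigning."""
    prompt = NEGATIVE_CAMPAIGNING_PROMPT.format(language=language, tweet=tweet)
    return call_llm(prompt).strip().upper().startswith("YES")
```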
[27] Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models
Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou
Main category: cs.CL
TL;DR: This paper introduces Efficiency Leverage (EL) as a metric to predict the computational advantage of Mixture-of-Experts (MoE) models over dense models, deriving scaling laws through training 300+ models and validating with a pilot model that achieves 7x computational savings while matching dense model performance.
Details
Motivation: The critical challenge in MoE architectures is the inability to predict model capacity for given configurations (expert activation ratio and granularity), which hinders efficient scaling of Large Language Models despite MoE's promise of decoupling parameters from computational cost.
Method: The authors introduce Efficiency Leverage (EL) metric and conduct a large-scale empirical study training over 300 models up to 28B parameters to systematically investigate relationships between MoE configurations and EL, then integrate findings into unified scaling laws.
Result: EL follows predictable power laws driven by expert activation ratio and total compute budget, with expert granularity as a non-linear modulator. The pilot model Ling-mini-beta (0.85B active parameters) matched a 6.1B dense model’s performance while using 7x fewer computational resources on identical 1T token dataset.
Conclusion: The work establishes a principled, empirically-grounded foundation for scaling efficient MoE models by providing accurate scaling laws that can predict computational advantages of MoE architectures based on their configurations.
Abstract: Mixture-of-Experts (MoE) has become a dominant architecture for scaling Large Language Models (LLMs) efficiently by decoupling total parameters from computational cost. However, this decoupling creates a critical challenge: predicting the model capacity of a given MoE configuration (e.g., expert activation ratio and granularity) remains an unresolved problem. To address this gap, we introduce Efficiency Leverage (EL), a metric quantifying the computational advantage of an MoE model over a dense equivalent. We conduct a large-scale empirical study, training over 300 models up to 28B parameters, to systematically investigate the relationship between MoE architectural configurations and EL. Our findings reveal that EL is primarily driven by the expert activation ratio and the total compute budget, both following predictable power laws, while expert granularity acts as a non-linear modulator with a clear optimal range. We integrate these discoveries into a unified scaling law that accurately predicts the EL of an MoE architecture based on its configuration. To validate our derived scaling laws, we designed and trained Ling-mini-beta, a pilot model for Ling-2.0 series with only 0.85B active parameters, alongside a 6.1B dense model for comparison. When trained on an identical 1T high-quality token dataset, Ling-mini-beta matched the performance of the 6.1B dense model while consuming over 7x fewer computational resources, thereby confirming the accuracy of our scaling laws. This work provides a principled and empirically-grounded foundation for the scaling of efficient MoE models.
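Since the paper reports that EL follows predictable power laws in the activation ratio and compute budget, the fitting step can be illustrated with a standard curve fit. The data points below are synthetic placeholders, not the paper's measurements, and the fitted exponents carry no meaning beyond the example.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b):
    return a * np.power(x, b)

activation_ratio = np.array([0.05, 0.1, 0.2, 0.4, 0.8])    # fraction of experts activated
efficiency_leverage = np.array([9.0, 5.5, 3.2, 1.9, 1.1])  # synthetic EL values

(a, b), _ = curve_fit(power_law, activation_ratio, efficiency_leverage)
print(f"EL ~ {a:.2f} * ratio^{b:.2f}")
```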
[28] TyDi QA-WANA: A Benchmark for Information-Seeking Question Answering in Languages of West Asia and North Africa
Parker Riley, Siamak Shakeri, Waleed Ammar, Jonathan H. Clark
Main category: cs.CL
TL;DR: The paper introduces TyDi QA-WANA, a multilingual question-answering dataset with 28K examples across 10 western Asian and northern African language varieties, designed to evaluate models’ ability to answer information-seeking questions using large text contexts.
Details
Motivation: Existing QA datasets lack representation of western Asian and northern African languages, and many multilingual datasets rely on translation which can cause cultural relevance issues. There's a need for authentic, large-context QA evaluation in these underrepresented language varieties.
Method: Direct data collection in 10 language varieties of western Asia and northern Africa without translation. Questions were designed to elicit genuine information-seeking behavior, paired with entire articles that may or may not contain answers, creating a large-context QA task.
Result: Created a dataset of 28K QA examples across 10 language varieties. Evaluated two baseline models on the dataset and released both code and data publicly for community use.
Conclusion: TyDi QA-WANA provides a valuable resource for evaluating multilingual QA models on underrepresented languages with authentic, culturally relevant content and large text contexts, facilitating future research improvements in this area.
Abstract: We present TyDi QA-WANA, a question-answering dataset consisting of 28K examples divided among 10 language varieties of western Asia and northern Africa. The data collection process was designed to elicit information-seeking questions, where the asker is genuinely curious to know the answer. Each question is paired with an entire article that may or may not contain the answer; the relatively large size of the articles results in a task suitable for evaluating models’ abilities to utilize large text contexts in answering questions. Furthermore, the data was collected directly in each language variety, without the use of translation, in order to avoid issues of cultural relevance. We present performance of two baseline models, and release our code and data to facilitate further improvement by the research community.
[29] Bridging Robustness and Generalization Against Word Substitution Attacks in NLP via the Growth Bound Matrix Approach
Mohammed Bouri, Adnane Saoud
Main category: cs.CL
TL;DR: This paper introduces Growth Bound Matrices (GBM), a novel regularization technique to improve NLP model robustness against adversarial attacks like synonym substitutions, with focus on LSTM, S4 state space models, and CNNs, achieving up to 8.8% improvement in adversarial robustness.
Details
Motivation: NLP models remain vulnerable to adversarial attacks such as synonym substitutions. While robustness improvements have been studied for feed-forward and convolutional architectures, the robustness of recurrent networks and modern state space models (SSMs) like S4 is understudied due to their unique challenges in sequential processing and complex parameter dynamics.
Method: The authors introduce a novel regularization technique based on Growth Bound Matrices (GBM) to reduce the impact of input perturbations on model outputs. They compute GBM for three architectures: Long Short-Term Memory (LSTM), State Space models (S4), and Convolutional Neural Networks (CNN).
Result: Extensive experiments across multiple architectures and benchmark datasets show that the GBM method improves adversarial robustness by up to 8.8% over existing baselines, outperforming several state-of-the-art methods in adversarial defense while also improving generalization on clean text.
Conclusion: The GBM regularization technique effectively enhances NLP model robustness against word substitution attacks across different architectures, with particular success in providing the first systematic analysis of S4 state space model robustness, demonstrating superior performance compared to existing adversarial defense methods.
Abstract: Despite advancements in Natural Language Processing (NLP), models remain vulnerable to adversarial attacks, such as synonym substitutions. While prior work has focused on improving robustness for feed-forward and convolutional architectures, the robustness of recurrent networks and modern state space models (SSMs), such as S4, remains understudied. These architectures pose unique challenges due to their sequential processing and complex parameter dynamics. In this paper, we introduce a novel regularization technique based on Growth Bound Matrices (GBM) to improve NLP model robustness by reducing the impact of input perturbations on model outputs. We focus on computing the GBM for three architectures: Long Short-Term Memory (LSTM), State Space models (S4), and Convolutional Neural Networks (CNN). Our method aims to (1) enhance resilience against word substitution attacks, (2) improve generalization on clean text, and (3) provide the first systematic analysis of SSM (S4) robustness. Extensive experiments across multiple architectures and benchmark datasets demonstrate that our method improves adversarial robustness by up to 8.8% over existing baselines. These results highlight the effectiveness of our approach, outperforming several state-of-the-art methods in adversarial defense. Code is available at https://github.com/BouriMohammed/GBM
[30] From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes
Karen Zhou, John Giorgi, Pranav Mani, Peng Xu, Davis Liang, Chenhao Tan
Main category: cs.CL
TL;DR: This paper develops a systematic pipeline to convert real physician feedback into structured checklists for evaluating AI-generated clinical notes, showing superior performance compared to existing automated metrics in aligning with clinician preferences.
Details
Motivation: Evaluating AI-generated clinical notes is challenging due to high subjectivity and limited scalability of expert review, while existing automated metrics often fail to align with real-world physician preferences, creating a need for better evaluation methods.
Method: The authors propose a pipeline that systematically distills real user feedback into structured, interpretable checklists for note evaluation that can be enforced by LLM-based evaluators, using deidentified data from over 21,000 clinical encounters from a deployed AI medical scribe system.
Result: The feedback-derived checklist outperforms baseline approaches in coverage, diversity, and predictive power for human ratings, demonstrates robustness to quality-degrading perturbations, shows significant alignment with clinician preferences, and proves practical for identifying notes below quality thresholds.
Conclusion: The proposed checklist-based evaluation methodology provides a scalable, interpretable, and clinician-aligned approach for assessing AI-generated clinical notes that can effectively identify quality issues in offline research settings.
Abstract: AI-generated clinical notes are increasingly used in healthcare, but evaluating their quality remains a challenge due to high subjectivity and limited scalability of expert review. Existing automated metrics often fail to align with real-world physician preferences. To address this, we propose a pipeline that systematically distills real user feedback into structured checklists for note evaluation. These checklists are designed to be interpretable, grounded in human feedback, and enforceable by LLM-based evaluators. Using deidentified data from over 21,000 clinical encounters, prepared in accordance with the HIPAA safe harbor standard, from a deployed AI medical scribe system, we show that our feedback-derived checklist outperforms baseline approaches in our offline evaluations in coverage, diversity, and predictive power for human ratings. Extensive experiments confirm the checklist’s robustness to quality-degrading perturbations, significant alignment with clinician preferences, and practical value as an evaluation methodology. In offline research settings, the checklist can help identify notes likely to fall below our chosen quality thresholds.
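Operationally, a feedback-derived checklist turns note evaluation into a set of yes/no questions posed to an LLM judge, with the note scored by the fraction of items it satisfies. The items and the `ask_judge` callable below are invented for illustration and are not drawn from the paper's checklist.

```python
# Minimal sketch of checklist-based note scoring with an LLM judge.
CHECKLIST = [
    "Does the note document the chief complaint?",
    "Are all medications mentioned in the encounter listed?",
    "Is the assessment consistent with the documented findings?",
]

def score_note(note: str, ask_judge) -> float:
    """Fraction of checklist items the LLM judge marks as satisfied."""
    passed = sum(
        ask_judge(f"Note:\n{note}\n\nQuestion: {item}\nAnswer YES or NO.").strip().upper().startswith("YES")
        for item in CHECKLIST
    )
    return passed / len(CHECKLIST)
```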
[31] AI Telephone Surveying: Automating Quantitative Data Collection with an AI Interviewer
Danny D. Leybzon, Shreyas Tirumala, Nishant Jain, Summer Gillen, Michael Jackson, Cameron McPhee, Jennifer Schmidt
Main category: cs.CL
TL;DR: This paper presents an AI telephone surveying system using large language models, automatic speech recognition, and speech synthesis to conduct quantitative surveys. The system was tested with pilot surveys and showed that shorter instruments and more responsive AI interviewers improve completion rates, reduce break-offs, and increase respondent satisfaction.
Details
Motivation: With the rise of voice-enabled AI systems, quantitative survey researchers need new data-collection methods that can scale studies while maintaining human-like interactivity and methodological rigor. Traditional interactive voice response (IVR) technology lacks the natural and adaptive respondent experience that modern AI can provide, being less robust to interruptions, corrections, and human speech variations.
Method: The researchers built an AI telephone surveying system combining large language models (LLM), automatic speech recognition (ASR), and speech synthesis technologies. The system was specifically designed for quantitative research, strictly adhering to research best practices including question order randomization, answer order randomization, and exact wording requirements.
Result: The system was validated through two pilot surveys conducted with the SSRS Opinion Panel, followed by a separate human-administered survey to assess respondent experiences. Results showed that shorter survey instruments and more responsive AI interviewers contributed to improvements in survey completion rates, reduced break-off rates, and higher respondent satisfaction scores.
Conclusion: AI telephone surveying represents a viable new data-collection mode for quantitative research that can scale studies while maintaining quality. The effectiveness of such systems depends on instrument length and AI responsiveness, with shorter surveys and more adaptive AI leading to better outcomes across key performance metrics.
Abstract: With the rise of voice-enabled artificial intelligence (AI) systems, quantitative survey researchers have access to a new data-collection mode: AI telephone surveying. By using AI to conduct phone interviews, researchers can scale quantitative studies while balancing the dual goals of human-like interactivity and methodological rigor. Unlike earlier efforts that used interactive voice response (IVR) technology to automate these surveys, voice AI enables a more natural and adaptive respondent experience as it is more robust to interruptions, corrections, and other idiosyncrasies of human speech. We built and tested an AI system to conduct quantitative surveys based on large language models (LLM), automatic speech recognition (ASR), and speech synthesis technologies. The system was specifically designed for quantitative research, and strictly adhered to research best practices like question order randomization, answer order randomization, and exact wording. To validate the system’s effectiveness, we deployed it to conduct two pilot surveys with the SSRS Opinion Panel and followed up with a separate human-administered survey to assess respondent experiences. We measured three key metrics: the survey completion rates, break-off rates, and respondent satisfaction scores. Our results suggest that shorter instruments and more responsive AI interviewers may contribute to improvements across all three metrics studied.
[32] Megrez2 Technical Report
Boxun Li, Yadong Li, Zhiyuan Li, Congyi Liu, Weilin Liu, Guowei Niu, Zheyue Tan, Haiyang Xu, Zhuyu Yao, Tao Yuan, Dong Zhou, Yueqing Zhuang, Bo Zhao, Guohao Dai, Yu Wang
Main category: cs.CL
TL;DR: Megrez2 is a lightweight 3B/7.5B parameter language model that uses cross-layer expert sharing and pre-gated routing to achieve competitive performance with larger models while being optimized for device deployment.
Details
Motivation: The need for high-performance language models that can be efficiently deployed on resource-constrained devices, balancing accuracy, efficiency, and deployability for real-world applications.
Method: Novel cross-layer expert sharing mechanism that reuses expert modules across adjacent transformer layers, combined with pre-gated routing for memory-efficient expert loading. The model uses 3B activated parameters and 7.5B stored parameters, trained on 5-trillion tokens with supervised fine-tuning and reinforcement learning with verifiable rewards.
Result: Megrez2-Preview demonstrates competitive or superior performance compared to larger models across language understanding, instruction following, mathematical reasoning, and code generation tasks, despite having significantly fewer parameters.
Conclusion: The Megrez2 architecture successfully achieves an optimal balance between model performance and resource efficiency, making it highly suitable for deployment in resource-constrained environments while maintaining competitive capabilities across diverse language tasks.
Abstract: We present Megrez2, a novel lightweight and high-performance language model architecture optimized for device-native deployment. Megrez2 introduces a novel cross-layer expert sharing mechanism, which significantly reduces total parameter count by reusing expert modules across adjacent transformer layers while maintaining most of the model’s capacity. It also incorporates pre-gated routing, enabling memory-efficient expert loading and faster inference. As the first instantiation of the Megrez2 architecture, we introduce the Megrez2-Preview model, which is pre-trained on a 5-trillion-token corpus and further enhanced through supervised fine-tuning and reinforcement learning with verifiable rewards. With only 3B activated and 7.5B stored parameters, Megrez2-Preview demonstrates competitive or superior performance compared to larger models on a wide range of tasks, including language understanding, instruction following, mathematical reasoning, and code generation. These results highlight the effectiveness of the Megrez2 architecture in achieving a balance between accuracy, efficiency, and deployability, making it a strong candidate for real-world, resource-constrained applications.
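The cross-layer sharing idea can be pictured as adjacent layers indexing into one shared pool of expert MLPs rather than each layer owning its own experts. The sketch below is a minimal PyTorch illustration under that assumption; routing, pre-gated loading, and the real Megrez2 layer structure are omitted.

```python
import torch
import torch.nn as nn

class SharedExpertPool(nn.Module):
    """One pool of expert MLPs that multiple adjacent layers can reuse."""
    def __init__(self, n_experts: int, d_model: int):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor, expert_idx: int) -> torch.Tensor:
        return self.experts[expert_idx](x)

d_model = 16
pool = SharedExpertPool(n_experts=4, d_model=d_model)
x = torch.randn(2, d_model)
# Two adjacent "layers" reuse the same expert parameters from the shared pool.
layer1_out = pool(x, expert_idx=1)
layer2_out = pool(layer1_out, expert_idx=1)
print(layer2_out.shape)   # torch.Size([2, 16])
```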
[33] Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks
Linbo Cao, Jinman Zhao
Main category: cs.CL
TL;DR: This paper proposes a debate-driven evaluation paradigm that converts existing QA datasets into adversarial debates between models to better assess reasoning abilities while reducing data contamination and memorization issues in language model evaluation.
Details
Motivation: Standard QA benchmarks are becoming saturated by frontier language models, leading to concerns about data contamination, memorization, and high costs of creating new datasets. There's a need for evaluation methods that can assess genuine reasoning ability rather than memorized responses.
Method: The approach transforms existing QA datasets into structured adversarial debates where one model defends the official answer while another constructs and defends an alternative answer, with a judge model evaluating the debate without knowing the correct solution. This creates multi-round argumentation that increases difficulty and penalizes shallow memorization.
Result: Empirical validation shows the method’s robustness against data contamination - a Llama 3.1 model fine-tuned on test questions improved from 50% to 82% accuracy on standard QA but performed worse in debates. Weaker judges can reliably differentiate stronger debaters, demonstrating scalability to more capable systems.
Conclusion: The debate-based evaluation framework provides a sustainable and cost-effective approach for measuring genuine reasoning ability in advanced language models, demonstrating that “pretraining on the test set is no longer all you need” and offering a viable alternative to expensive benchmark creation.
Abstract: As frontier language models increasingly saturate standard QA benchmarks, concerns about data contamination, memorization, and escalating dataset creation costs persist. We propose a debate-driven evaluation paradigm that transforms any existing QA dataset into structured adversarial debates, where one model is given the official answer to defend and another constructs and defends an alternative answer, adjudicated by a judge model blind to the correct solution. By forcing multi-round argumentation, this approach substantially increases difficulty while penalizing shallow memorization, yet reuses QA items to reduce curation overhead. We make two main contributions: (1) an evaluation pipeline to systematically convert QA tasks into debate-based assessments, and (2) a public benchmark that demonstrates our paradigm’s effectiveness on a subset of MMLU-Pro questions, complete with standardized protocols and reference models. Empirical results validate the robustness of the method and its effectiveness against data contamination: a Llama 3.1 model fine-tuned on test questions showed dramatic accuracy improvements (from 50% to 82%) but performed worse in debates. Results also show that even weaker judges can reliably differentiate stronger debaters, highlighting how debate-based evaluation can scale to future, more capable systems while maintaining a fraction of the cost of creating new benchmarks. Overall, our framework underscores that “pretraining on the test set is no longer all you need,” offering a sustainable path for measuring the genuine reasoning ability of advanced language models.
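The debate protocol itself is a simple loop: alternating defender and challenger turns over a shared transcript, followed by a blind judgment. The callables below are hypothetical stand-ins for LLM calls, and the round structure is a simplification of the paper's pipeline rather than its released code.

```python
def run_debate(question, official_answer, alt_answer, defender, challenger, judge, rounds=3):
    """Return True if the official answer wins a blind, multi-round debate."""
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        transcript.append("Defender: " + defender(question, official_answer, transcript))
        transcript.append("Challenger: " + challenger(question, alt_answer, transcript))
    # The judge sees both candidate answers and the transcript,
    # but is never told which answer is the official one.
    verdict = judge(question, [official_answer, alt_answer], transcript)
    return verdict == official_answer
```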
[34] Multi-Level Explanations for Generative Language Models
Lucas Monteiro Paes, Dennis Wei, Hyo Jin Do, Hendrik Strobelt, Ronny Luss, Amit Dhurandhar, Manish Nagireddy, Karthikeyan Natesan Ramamurthy, Prasanna Sattigeri, Werner Geyer, Soumya Ghosh
Main category: cs.CL
TL;DR: The paper introduces MExGen, a framework that explains how large language models generate responses in context-grounded tasks by scoring the influence of different context parts on the output, extending attribution methods like LIME and SHAP to handle the challenges of expensive inference, long inputs, and text outputs.
Details
Motivation: Understanding what makes large language models produce certain responses in context-grounded tasks like summarization and question-answering is challenging, especially when dealing with high inference costs, long input texts, and text outputs. Existing attribution methods are not well-suited for these specific challenges in LLM applications.
Method: MExGen extends perturbation-based attribution methods (LIME and SHAP) specifically for LLMs in context-grounded tasks. The framework assigns scores to different parts of the context to quantify their influence on the model's generated output, addressing the unique challenges of high inference cost, long input text, and text generation.
Result: Through systematic automated and human evaluation on summarization and question-answering tasks, MExGen provides more faithful explanations of generated output compared to available alternatives, including LLM self-explanations. The framework has been open-sourced as part of the ICX360 toolkit.
Conclusion: MExGen successfully addresses the challenge of explaining LLM behavior in context-grounded text generation tasks by providing more faithful attributions than existing methods. The framework’s effectiveness is demonstrated through comprehensive evaluation, and its open-source availability enables broader adoption for understanding LLM decision-making processes.
Abstract: Despite the increasing use of large language models (LLMs) for context-grounded tasks like summarization and question-answering, understanding what makes an LLM produce a certain response is challenging. We propose Multi-Level Explanations for Generative Language Models (MExGen), a technique to provide explanations for context-grounded text generation. MExGen assigns scores to parts of the context to quantify their influence on the model’s output. It extends attribution methods like LIME and SHAP to LLMs used in context-grounded tasks where (1) inference cost is high, (2) input text is long, and (3) the output is text. We conduct a systematic evaluation, both automated and human, of perturbation-based attribution methods for summarization and question answering. The results show that our framework can provide more faithful explanations of generated output than available alternatives, including LLM self-explanations. We open-source code for MExGen as part of the ICX360 toolkit: https://github.com/IBM/ICX360.
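The underlying perturbation idea can be shown with a leave-one-segment-out loop: regenerate the output with one context segment removed and score that segment by how much the output changes. This is the generic LIME/SHAP-style mechanism MExGen builds on, not the MExGen algorithm itself; `generate` and `similarity` are hypothetical stand-ins for the model call and an output-similarity measure.

```python
def attribute_segments(question, segments, generate, similarity):
    """Score each context segment by how much its removal changes the output."""
    full_output = generate(question, segments)
    scores = []
    for i in range(len(segments)):
        ablated = segments[:i] + segments[i + 1:]             # drop one segment
        output = generate(question, ablated)
        scores.append(1.0 - similarity(full_output, output))  # bigger change = more influence
    return scores
```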
[35] Impact of Stickers on Multimodal Sentiment and Intent in Social Media: A New Task, Dataset and Baseline
Yuanchen Shi, Biao Ma, Longyin Zhang, Fang Kong
Main category: cs.CL
TL;DR: This paper introduces MSAIRS, a new task for multimodal sentiment analysis and intent recognition involving stickers in social media chat. The authors create a Chinese dataset with paired sticker-text combinations and propose MMSAIR model with differential vector construction and cascaded attention mechanisms, demonstrating superior performance over traditional models and MLLMs.
Details
Motivation: Despite stickers' significant impact on sentiment analysis and intent recognition in social media, little research has been conducted in this area. The authors aim to address this research gap by proposing a systematic approach to understand how stickers affect chat sentiment and intent in multimodal social media communication.
Method: The authors propose MMSAIR, a multimodal joint model featuring differential vector construction and cascaded attention mechanisms for enhanced multimodal fusion. They create a novel Chinese dataset containing chat records and stickers from mainstream social media platforms, including paired data with same text/different stickers, same sticker/different contexts, and various stickers with same images but different texts.
Result: MMSAIR significantly outperforms traditional models and advanced MLLMs in sticker interpretation tasks. The experiments demonstrate that jointly modeling sentiment and intent is necessary and effective, as they mutually reinforce each other’s recognition accuracy. The results highlight the challenge and uniqueness of sticker interpretation in social media contexts.
Conclusion: The study establishes MSAIRS as a new research direction for multimodal sentiment analysis and intent recognition involving stickers. The proposed MMSAIR model effectively handles the complexity of sticker interpretation in social media, proving that joint modeling of sentiment and intent enhances overall performance and addresses the unique challenges of multimodal social media communication.
Abstract: Stickers are increasingly used in social media to express sentiment and intent. Despite their significant impact on sentiment analysis and intent recognition, little research has been conducted in this area. To address this gap, we propose a new task: \textbf{M}ultimodal chat \textbf{S}entiment \textbf{A}nalysis and \textbf{I}ntent \textbf{R}ecognition involving \textbf{S}tickers (MSAIRS). Additionally, we introduce a novel multimodal dataset containing Chinese chat records and stickers excerpted from several mainstream social media platforms. Our dataset includes paired data with the same text but different stickers, the same sticker but different contexts, and various stickers consisting of the same images with different texts, allowing us to better understand the impact of stickers on chat sentiment and intent. We also propose an effective multimodal joint model, MMSAIR, featuring differential vector construction and cascaded attention mechanisms for enhanced multimodal fusion. Our experiments demonstrate the necessity and effectiveness of jointly modeling sentiment and intent, as they mutually reinforce each other’s recognition accuracy. MMSAIR significantly outperforms traditional models and advanced MLLMs, demonstrating the challenge and uniqueness of sticker interpretation in social media. Our dataset and code are available on https://github.com/FakerBoom/MSAIRS-Dataset.
[36] Is text normalization relevant for classifying medieval charters?
Florian Atzenhofer-Baumgartner, Tamás Kovács
Main category: cs.CL
TL;DR: This study investigates how text normalization affects the classification of medieval German charters, finding that normalization slightly helps location tasks but hurts dating accuracy, and that traditional ML models outperform transformers for this specific application.
Details
Motivation: To understand whether historical text normalization improves or hinders automated classification tasks for medieval documents, specifically for dating and locating Middle High German charters, given that normalization might remove important linguistic features.
Method: Evaluation of various classifiers (traditional and transformer-based models) on Middle High German charter datasets with and without text normalization, comparing performance on document dating and location classification tasks.
Result: Normalization provides minimal improvement for location tasks but reduces accuracy for dating tasks. Support vector machines and gradient boosting models outperform transformer-based approaches. Original texts retain crucial features that normalization may eliminate.
Conclusion: A selective approach to historical text normalization is recommended, emphasizing the importance of preserving certain textual characteristics that are critical for classification tasks. The study questions the efficiency of transformer models for this specific use case and suggests that traditional ML methods may be more suitable.
Abstract: This study examines the impact of historical text normalization on the classification of medieval charters, specifically focusing on document dating and locating. Using a data set of Middle High German charters from a digital archive, we evaluate various classifiers, including traditional and transformer-based models, with and without normalization. Our results indicate that the given normalization minimally improves locating tasks but reduces accuracy for dating, implying that original texts contain crucial features that normalization may obscure. We find that support vector machines and gradient boosting outperform other models, questioning the efficiency of transformers for this use case. Results suggest a selective approach to historical text normalization, emphasizing the significance of preserving some textual characteristics that are critical for classification tasks in document analysis.
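The classifier comparison is a conventional text-classification setup: n-gram features feeding an SVM or a gradient-boosting model. The sketch below is a generic scikit-learn version with placeholder charter snippets and labels; it is not the authors' experimental code, feature set, or evaluation protocol.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder snippets standing in for Middle High German charter texts.
charters = ["wir geben kunt allen den ...", "ich bekenne offenlich mit disem brief ..."] * 10
labels = ["region_a", "region_b"] * 10

svm = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(2, 4)), LinearSVC())
gbm = make_pipeline(
    TfidfVectorizer(analyzer="word"),
    FunctionTransformer(lambda x: x.toarray(), accept_sparse=True),  # densify for the tree model
    GradientBoostingClassifier(),
)

for name, model in [("SVM", svm), ("GradientBoosting", gbm)]:
    model.fit(charters, labels)
    print(name, "train accuracy:", model.score(charters, labels))
```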
[37] SHARE: Shared Memory-Aware Open-Domain Long-Term Dialogue Dataset Constructed from Movie Script
Eunwon Kim, Chanho Park, Buru Chang
Main category: cs.CL
TL;DR: This paper introduces SHARE, a long-term dialogue dataset built from movie scripts that includes shared memories between conversational partners, and proposes EPISODE framework to leverage these shared experiences for more engaging and sustainable dialogues.
Details
Motivation: Long-term dialogues need shared memories between individuals to strengthen their bond and facilitate ongoing conversations, but existing dialogue systems lack effective mechanisms to leverage such shared experiences for more engaging interactions.
Method: The authors constructed SHARE dataset from movie scripts containing persona information, event summaries, and shared memories (both explicit and implicit) between two individuals, and developed EPISODE framework that utilizes these shared experiences during dialogue generation.
Result: Experiments demonstrate that incorporating shared memories between individuals significantly improves the engagement and sustainability of long-term dialogues, and the EPISODE framework effectively manages and utilizes shared memories during conversation.
Conclusion: Shared memories are crucial for creating engaging long-term dialogues, and the proposed SHARE dataset and EPISODE framework provide effective solutions for leveraging shared experiences in dialogue systems to enhance conversational quality and sustainability.
Abstract: Shared memories between two individuals strengthen their bond and are crucial for facilitating their ongoing conversations. This study aims to make long-term dialogue more engaging by leveraging these shared memories. To this end, we introduce a new long-term dialogue dataset named SHARE, constructed from movie scripts, which are a rich source of shared memories among various relationships. Our dialogue dataset contains the summaries of persona information and events of two individuals, as explicitly revealed in their conversation, along with implicitly extractable shared memories. We also introduce EPISODE, a long-term dialogue framework based on SHARE that utilizes shared experiences between individuals. Through experiments using SHARE, we demonstrate that shared memories between two individuals make long-term dialogues more engaging and sustainable, and that EPISODE effectively manages shared memories during dialogue. Our dataset and code are available at https://github.com/e1kim/SHARE.
[38] A Survey of Event Causality Identification: Taxonomy, Challenges, Assessment, and Prospects
Qing Cheng, Zefan Zeng, Xingchen Hu, Yuehang Si, Zhong Liu
Main category: cs.CL
TL;DR: This survey paper comprehensively reviews Event Causality Identification (ECI) methods in NLP, proposing a novel classification framework that categorizes approaches into sentence-level (SECI) and document-level (DECI) tasks, while providing quantitative evaluations and future research directions.
Details
Motivation: Event Causality Identification has become a crucial task in NLP for automatically detecting causal relationships between events in text. The field lacks a comprehensive systematic review and unified classification framework to organize the rapidly evolving methods and approaches.
Method: The authors propose a novel classification framework that organizes ECI methods into two primary categories: Sentence-level Event Causality Identification (SECI) and Document-level Event Causality Identification (DECI). They systematically review various approaches including feature pattern-based matching, machine learning classification, deep semantic encoding, prompt-based fine-tuning, causal knowledge pre-training, event graph reasoning, and multi-lingual/cross-lingual methods.
Result: The survey provides a comprehensive analysis of ECI methods’ strengths and limitations, conducts extensive quantitative evaluations on four benchmark datasets, and offers insights into multi-lingual, cross-lingual, and zero-shot ECI approaches using Large Language Models. The classification framework successfully organizes existing methods and clarifies the field’s structure.
Conclusion: The paper establishes a foundational understanding of ECI through systematic categorization and evaluation, identifies key challenges and unresolved issues in current methods, and outlines promising future research directions for this dynamic field, contributing to the advancement of causal relationship detection in natural language processing.
Abstract: Event Causality Identification (ECI) has emerged as a pivotal task in natural language processing (NLP), aimed at automatically detecting causal relationships between events in text. In this comprehensive survey, we systematically elucidate the foundational principles and technical frameworks of ECI, proposing a novel classification framework to categorize and clarify existing methods. We discuss associated challenges, provide quantitative evaluations, and outline future directions for this dynamic and rapidly evolving field. We first delineate key definitions, problem formalization, and evaluation protocols of ECI. Our classification framework organizes ECI methods based on two primary tasks: Sentence-level Event Causality Identification (SECI) and Document-level Event Causality Identification (DECI). For SECI, we review methods including feature pattern-based matching, machine learning-based classification, deep semantic encoding, prompt-based fine-tuning, and causal knowledge pre-training, alongside common data augmentation strategies. For DECI, we focus on techniques such as deep semantic encoding, event graph reasoning, and prompt-based fine-tuning. We dedicate specific discussions to advancements in multi-lingual and cross-lingual ECI as well as zero-shot ECI leveraging Large Language Models (LLMs). Furthermore, we analyze the strengths, limitations, and unresolved challenges of each method. Extensive quantitative evaluations are conducted on four benchmark datasets to assess various ECI methods. Finally, we explore future research directions.
[39] Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs
Zixiao Wang, Duzhen Zhang, Ishita Agrawal, Shen Gao, Le Song, Xiuying Chen
Main category: cs.CL
TL;DR: CharacterBot is a novel approach to persona simulation in LLMs that goes beyond basic biographical information to capture both linguistic patterns and deeper thought processes of characters, demonstrated through replicating Chinese writer Lu Xun’s writing style and ideological thinking.
Details
Motivation: Previous persona simulation methods for LLMs only focus on surface-level biographical facts or limited role-play dialogues, failing to capture the deeper thoughts and distinctive thinking processes that constitute a holistic representation of an individual character.
Method: The authors propose CharacterBot with four training tasks derived from Lu Xun’s 17 essay collections: (1) pre-training for external linguistic structures and knowledge, and three fine-tuning tasks: (2) multiple-choice QA, (3) generative QA, and (4) style transfer. They introduce CharLoRA, a parameter updating mechanism where a general linguistic style expert collaborates with task-specific experts to learn both language style and deeper thought understanding.
Result: CharacterBot significantly outperforms baseline models on three evaluation tasks measuring linguistic accuracy and opinion comprehension, demonstrating superior ability to replicate both the writing style and ideological thinking of the target character (Lu Xun).
Conclusion: The work successfully demonstrates that deep character persona simulation can be achieved by combining linguistic pattern learning with thought process modeling, opening new directions for future research in character simulation LLMs that capture both surface-level and deeper cognitive aspects of personalities.
Abstract: Previous approaches to persona simulation large language models (LLMs) have typically relied on learning basic biographical information, or using limited role-play dialogue datasets to capture a character’s responses. However, a holistic representation of an individual goes beyond surface-level facts or conversations to deeper thoughts and thinking. In this work, we introduce CharacterBot, a model designed to replicate both the linguistic patterns and distinctive thought processes of a character. Using Lu Xun, a renowned Chinese writer, as a case study, we propose four training tasks derived from his 17 essay collections. These include a pre-training task focused on mastering external linguistic structures and knowledge, as well as three fine-tuning tasks: multiple-choice question answering, generative question answering, and style transfer, each aligning the LLM with Lu Xun’s internal ideation and writing style. To optimize learning across these tasks, we introduce a CharLoRA parameter updating mechanism, where a general linguistic style expert collaborates with other task-specific experts to better study both the language style and the understanding of deeper thoughts. We evaluate CharacterBot on three tasks for linguistic accuracy and opinion comprehension, demonstrating that it significantly outperforms the baselines on our adapted metrics. We hope that this work inspires future research on deep character persona simulation LLM.
[40] Modality-Aware Neuron Pruning for Unlearning in Multimodal Large Language Models
Zheyuan Liu, Guangyao Dou, Xiangchi Yuan, Chunhui Zhang, Zhaoxuan Tan, Meng Jiang
Main category: cs.CL
TL;DR: This paper proposes MANU (Modality Aware Neuron Unlearning), a framework for selectively removing sensitive information from Multimodal Large Language Models (MLLMs) by identifying and pruning important neurons across different modalities while preserving overall model performance.
Details
Motivation: Generative models like LLMs and MLLMs can memorize and reveal sensitive information from their training data, creating privacy and ethical concerns. While unlearning has been explored for LLMs, MLLMs present unique challenges due to entangled knowledge across modalities, making comprehensive unlearning more difficult.
Method: MANU operates in two stages: (1) important neuron selection, which identifies the most influential neurons across modalities relative to the targeted forget knowledge, and (2) selective pruning, which removes the selected neurons that contribute most to the forget data within each modality while preserving the integrity of retained knowledge. A minimal illustrative sketch follows the abstract.
Result: Experiments across various MLLM architectures demonstrate that MANU achieves more balanced and comprehensive unlearning in each modality without significantly affecting overall model utility.
Conclusion: MANU effectively addresses the challenge of unlearning in MLLMs by selectively targeting modality-specific neurons that contribute to sensitive information, providing a solution for privacy-preserving multimodal AI systems while maintaining model performance.
Abstract: Generative models such as Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) trained on massive datasets can lead them to memorize and inadvertently reveal sensitive information, raising ethical and privacy concerns. While some prior works have explored this issue in the context of LLMs, it presents a unique challenge for MLLMs due to the entangled nature of knowledge across modalities, making comprehensive unlearning more difficult. To address this challenge, we propose Modality Aware Neuron Unlearning (MANU), a novel unlearning framework for MLLMs designed to selectively clip neurons based on their relative importance to the targeted forget data, curated for different modalities. Specifically, MANU consists of two stages: important neuron selection and selective pruning. The first stage identifies and collects the most influential neurons across modalities relative to the targeted forget knowledge, while the second stage is dedicated to pruning those selected neurons. MANU effectively isolates and removes the neurons that contribute most to the forget data within each modality, while preserving the integrity of retained knowledge. Our experiments conducted across various MLLM architectures illustrate that MANU can achieve a more balanced and comprehensive unlearning in each modality without largely affecting the overall model utility.
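To make MANU's two-stage recipe more concrete, the sketch below scores the neurons of a (hypothetical) MLP block by an activation-times-gradient importance on a forget batch and zeroes the top-scoring ones. The importance measure, the `model.mlp` / `model.mlp_out` attributes, and the pruning ratio are illustrative assumptions, not the paper's exact procedure.
```python
# Sketch of modality-aware neuron unlearning in the spirit of MANU.
# The activation-times-gradient importance, the `mlp`/`mlp_out` attributes,
# and the pruning ratio are illustrative assumptions.
import torch

def neuron_importance(model, batch, loss_fn):
    """Score each hidden neuron of one MLP block by |activation * gradient|."""
    acts = {}
    def hook(module, inputs, output):
        output.retain_grad()
        acts["h"] = output
    handle = model.mlp.register_forward_hook(hook)        # assumed attribute
    loss = loss_fn(model(batch["inputs"]), batch["targets"])
    loss.backward()
    handle.remove()
    h = acts["h"]                                          # (batch, seq, hidden)
    return (h * h.grad).abs().mean(dim=(0, 1))             # one score per neuron

def prune_neurons(model, scores, ratio=0.01):
    """Zero the down-projection columns owned by the top-scoring neurons."""
    k = max(1, int(ratio * scores.numel()))
    idx = torch.topk(scores, k).indices
    with torch.no_grad():
        model.mlp_out.weight[:, idx] = 0.0                 # assumed weight layout

# Usage (hypothetical): score once on the visual forget set and once on the
# textual forget set, then prune the union so each modality is covered.
```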
[41] An Efficient and Precise Training Data Construction Framework for Process-supervised Reward Model in Mathematical Reasoning
Wei Sun, Qianlong Du, Fuwei Cui, Jiajun Zhang
Main category: cs.CL
TL;DR: This paper introduces EpicPRM, a framework for efficiently creating high-quality process supervision training data for mathematical reasoning in Large Language Models, resulting in the Epic50k dataset that significantly outperforms existing datasets.
Details
Motivation: Existing methods for constructing process supervision training data for mathematical reasoning in LLMs are either costly (manual annotation) or suffer from poor quality (per-step Monte Carlo estimation), creating a need for more efficient and higher-quality approaches.
Method: The paper proposes the EpicPRM framework, which annotates each intermediate reasoning step based on its quantified contribution and uses an adaptive binary search algorithm to enhance both annotation precision and efficiency. A minimal illustrative sketch follows the abstract.
Result: The framework successfully constructs Epic50k, a high-quality process supervision training dataset with 50k annotated intermediate steps. Process-supervised reward models (PRMs) trained on Epic50k demonstrate significantly superior performance compared to those trained on other publicly available datasets.
Conclusion: EpicPRM provides an effective solution for creating high-quality process supervision training data, offering better annotation precision and efficiency than existing methods, ultimately leading to improved mathematical reasoning capabilities in LLMs.
Abstract: Enhancing the mathematical reasoning capabilities of Large Language Models (LLMs) is of great scientific and practical significance. Researchers typically employ process-supervised reward models (PRMs) to guide the reasoning process, effectively improving the models’ reasoning abilities. However, existing methods for constructing process supervision training data, such as manual annotation and per-step Monte Carlo estimation, are often costly or suffer from poor quality. To address these challenges, this paper introduces a framework called EpicPRM, which annotates each intermediate reasoning step based on its quantified contribution and uses an adaptive binary search algorithm to enhance both annotation precision and efficiency. Using this approach, we efficiently construct a high-quality process supervision training dataset named Epic50k, consisting of 50k annotated intermediate steps. Compared to other publicly available datasets, the PRM trained on Epic50k demonstrates significantly superior performance. Getting Epic50k at https://github.com/xiaolizh1/EpicPRM.
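The adaptive binary search can be pictured as locating the first reasoning step after which sampled completions stop reaching the correct answer. The rollout-based correctness check, the threshold, and the monotonicity assumption below are illustrative; the paper's quantified-contribution annotation is more involved.
```python
# Sketch of finding the first erroneous reasoning step with binary search.
# Assumes correctness is monotone in prefix length: once a prefix can no
# longer reach the answer, longer prefixes cannot either.
def prefix_recoverable(question, steps, k, sampler, verifier, n=8, thresh=0.5):
    """Estimate whether the correct answer is still reachable after k steps."""
    prefix = "\n".join(steps[:k])
    completions = sampler(question, prefix, n)           # n sampled continuations
    return sum(verifier(question, c) for c in completions) / n >= thresh

def first_bad_step(question, steps, sampler, verifier):
    lo, hi, ans = 1, len(steps), None
    while lo <= hi:
        mid = (lo + hi) // 2
        if prefix_recoverable(question, steps, mid, sampler, verifier):
            lo = mid + 1              # the error, if any, lies after step `mid`
        else:
            ans, hi = mid, mid - 1    # step `mid` or earlier already breaks it
    return ans                        # None means every step looks fine
```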
[42] AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation
Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, Jinan Xu
Main category: cs.CL
TL;DR: AlignDistil is a novel RLHF-equivalent distillation method that optimizes LLM alignment at the token level rather than response level, achieving better performance and faster convergence by using contrastive DPO rewards and adaptive logit extrapolation.
Details
Motivation: Existing LLM alignment methods like RLHF and DPO optimize all tokens using sparse response-level rewards, which can erroneously punish high-quality tokens or encourage low-quality tokens, leading to suboptimal performance and slow convergence. Token-level reward optimization is needed to address these issues.
Method: The authors propose AlignDistil, which (1) introduces DPO-learned rewards into the RLHF objective and proves its equivalence to token-level distillation with a teacher distribution that combines DPO and reference-model logits, (2) builds a contrastive DPO reward from a normal and a reverse DPO model to narrow the reward accuracy gap, and (3) designs token-adaptive logit extrapolation to construct an appropriate teacher distribution for each token. A minimal illustrative sketch follows the abstract.
Result: Experimental results show AlignDistil outperforms existing alignment methods and demonstrates fast convergence due to its token-level distributional reward optimization approach.
Conclusion: AlignDistil successfully addresses the limitations of response-level optimization in LLM alignment by providing an effective token-level reward optimization framework that achieves superior performance and faster convergence compared to existing methods.
Abstract: In modern large language models (LLMs), LLM alignment is of crucial importance and is typically achieved through methods such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). However, in most existing methods for LLM alignment, all tokens in the response are optimized using a sparse, response-level reward or preference annotation. The ignorance of token-level rewards may erroneously punish high-quality tokens or encourage low-quality tokens, resulting in suboptimal performance and slow convergence speed. To address this issue, we propose AlignDistil, an RLHF-equivalent distillation method for token-level reward optimization. Specifically, we introduce the reward learned by DPO into the RLHF objective and theoretically prove the equivalence between this objective and a token-level distillation process, where the teacher distribution linearly combines the logits from the DPO model and a reference model. On this basis, we further bridge the accuracy gap between the reward from the DPO model and the pure reward model, by building a contrastive DPO reward with a normal and a reverse DPO model. Moreover, to avoid under- and over-optimization on different tokens, we design a token adaptive logit extrapolation mechanism to construct an appropriate teacher distribution for each token. Experimental results demonstrate the superiority of our AlignDistil over existing methods and showcase fast convergence due to its token-level distributional reward optimization.
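One plausible reading of the token-level teacher is sketched below: a contrastive signal from normal and reverse DPO logits is extrapolated on top of the reference logits with a per-token weight, and the student is trained with a token-level KL. The combination rule and the adaptive weighting are assumptions, not the paper's derived formulas.
```python
# Sketch of a token-level teacher built from normal/reverse DPO and reference
# logits. The combination rule and the per-token weight are assumptions.
import torch
import torch.nn.functional as F

def teacher_distribution(dpo_logits, rev_dpo_logits, ref_logits, base_alpha=1.0):
    # Contrastive token signal: where the normal and reverse DPO models disagree.
    reward = F.log_softmax(dpo_logits, dim=-1) - F.log_softmax(rev_dpo_logits, dim=-1)
    # Token-adaptive extrapolation strength (assumed form).
    alpha = base_alpha * reward.std(dim=-1, keepdim=True).clamp(max=2.0)
    return F.softmax(ref_logits + alpha * reward, dim=-1)

def distill_loss(student_logits, teacher_probs):
    # Token-level KL between the teacher distribution and the student policy.
    return F.kl_div(F.log_softmax(student_logits, dim=-1), teacher_probs,
                    reduction="batchmean")
```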
[43] Visualising Policy-Reward Interplay to Inform Zeroth-Order Preference Optimisation of Large Language Models
Alessio Galatolo, Zhenbang Dai, Katie Winkle, Meriem Beloucif
Main category: cs.CL
TL;DR: This paper introduces ZOPrO, the first zeroth-order optimization algorithm for preference optimization in large language models, which reduces memory usage compared to traditional gradient-based methods while achieving comparable convergence times on generative tasks.
Details
Motivation: Fine-tuning LLMs with first-order methods is computationally intensive, and while zeroth-order optimization reduces memory usage, it suffers from slow convergence in high-dimensional models. Previous ZO research in LLMs focused mainly on classification tasks, neglecting more complex generative tasks like preference optimization.
Method: The authors analyze the interplay between policy and reward models during traditional preference optimization to uncover patterns in their relative updates. They then adapt Simultaneous Perturbation Stochastic Approximation (SPSA) with a targeted sampling strategy to accelerate convergence for preference optimization tasks. A minimal illustrative sketch follows the abstract.
Result: Experiments on summarization, machine translation, and conversational assistants show that ZOPrO consistently enhances reward signals while achieving convergence times comparable to first-order methods. However, it falls short of some state-of-the-art methods in terms of performance.
Conclusion: This work represents the first application of zeroth-order methods to preference optimization in LLMs, successfully extending beyond classification tasks and opening up a new research direction for memory-efficient LLM fine-tuning, despite not matching all state-of-the-art performance benchmarks.
Abstract: Fine-tuning Large Language Models (LLMs) with first-order methods like back-propagation is computationally intensive. Zeroth-Order (ZO) optimisation uses function evaluations instead of gradients, reducing memory usage, but suffers from slow convergence in high-dimensional models. As a result, ZO research in LLMs has mostly focused on classification, overlooking more complex generative tasks. In this paper, we introduce ZOPrO, a novel ZO algorithm designed for Preference Optimisation in LLMs. We begin by analysing the interplay between policy and reward models during traditional (first-order) Preference Optimisation, uncovering patterns in their relative updates. Guided by these insights, we adapt Simultaneous Perturbation Stochastic Approximation (SPSA) with a targeted sampling strategy to accelerate convergence. Through experiments on summarisation, machine translation, and conversational assistants, we demonstrate that our method consistently enhances reward signals while achieving convergence times comparable to first-order methods. While it falls short of some state-of-the-art methods, our work is the first to apply Zeroth-Order methods to Preference Optimisation in LLMs, going beyond classification tasks and paving the way for a largely unexplored research direction. Code and visualisations are available at https://github.com/alessioGalatolo/VisZOPrO
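For readers unfamiliar with SPSA, the core update that ZOPrO builds on can be written in a few lines: two loss evaluations under a random ±1 perturbation stand in for back-propagation. ZOPrO's targeted sampling strategy and the policy/reward interplay analysis are not reproduced here; the learning rate and perturbation size below are illustrative.
```python
# Core SPSA update: two loss evaluations under a random +/-1 perturbation
# replace back-propagation, so no gradients need to be stored.
import torch

def spsa_step(params, loss_fn, lr=1e-4, eps=1e-3):
    """params: flat parameter vector; loss_fn: maps such a vector to a scalar loss."""
    delta = (torch.randint(0, 2, params.shape) * 2 - 1).to(params.dtype)  # Rademacher
    g_hat = (loss_fn(params + eps * delta) - loss_fn(params - eps * delta)) / (2 * eps)
    return params - lr * g_hat * delta   # delta equals 1/delta for +/-1 entries
```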
[44] ORANSight-2.0: Foundational LLMs for O-RAN
Pranshav Gajjar, Vijay K. Shah
Main category: cs.CL
TL;DR: ORANSight-2.0 introduces specialized foundational LLMs for Open Radio Access Networks (O-RAN) by fine-tuning 18 open-source models (1B-70B parameters) using a novel RAG-based instruction-tuning framework called RANSTRUCT, addressing the gap in domain-specific O-RAN language models.
Details
Motivation: Existing general-purpose LLMs fail to address the unique challenges and technical intricacies of O-RAN systems, limiting their integration into Open Radio Access Networks despite their transformative impact in other domains like healthcare and business.
Method: The paper introduces RANSTRUCT, a Retrieval-Augmented Generation (RAG)-based instruction-tuning framework using two LLM agents (a Mistral-based Question Generator and a Qwen-based Answer Generator) to create high-quality instruction-tuning datasets. These datasets are used to fine-tune 18 pre-trained open-source LLMs via QLoRA across five frameworks (Mistral, Qwen, Llama, Phi, and Gemma). A minimal illustrative QLoRA sketch follows the abstract.
Result: ORANSight-2.0 successfully develops 18 specialized O-RAN foundational models ranging from 1B to 70B parameters, significantly reducing reliance on proprietary closed-source models while enhancing performance in O-RAN-specific tasks. A novel benchmark called srsRANBench is introduced for evaluation.
Conclusion: The research successfully bridges the gap between general-purpose LLMs and O-RAN domain requirements by creating specialized foundational models that better understand and address O-RAN technical intricacies, potentially advancing LLM integration in telecommunications infrastructure.
Abstract: Despite the transformative impact of Large Language Models (LLMs) across critical domains such as healthcare, customer service, and business marketing, their integration into Open Radio Access Networks (O-RAN) remains limited. This gap is primarily due to the absence of domain-specific foundational models, with existing solutions often relying on general-purpose LLMs that fail to address the unique challenges and technical intricacies of O-RAN. To bridge this gap, we introduce ORANSight-2.0 (O-RAN Insights), a pioneering initiative to develop specialized foundational LLMs tailored for O-RAN. Built on 18 models spanning five open-source LLM frameworks – Mistral, Qwen, Llama, Phi, and Gemma – ORANSight-2.0 fine-tunes models ranging from 1B to 70B parameters, significantly reducing reliance on proprietary, closed-source models while enhancing performance in O-RAN-specific tasks. At the core of ORANSight-2.0 is RANSTRUCT, a novel Retrieval-Augmented Generation (RAG)-based instruction-tuning framework that employs two LLM agents – a Mistral-based Question Generator and a Qwen-based Answer Generator – to create high-quality instruction-tuning datasets. The generated dataset is then used to fine-tune the 18 pre-trained open-source LLMs via QLoRA. To evaluate ORANSight-2.0, we introduce srsRANBench, a novel benchmark designed for code generation and codebase understanding in the context of srsRAN, a widely used 5G O-RAN stack.
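Because the fine-tuning stage is standard QLoRA (4-bit base model plus LoRA adapters), a minimal setup looks roughly like the following; the base checkpoint, target modules, and hyperparameters are placeholders rather than ORANSight-2.0's actual configuration.
```python
# Minimal QLoRA setup of the kind ORANSight-2.0 describes. Model name,
# target modules, and hyperparameters are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",      # placeholder base model
    quantization_config=bnb,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
# `model` can now be trained on the RAG-generated instruction pairs with any
# standard causal-LM trainer.
```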
[45] Towards Detecting Persuasion on Social Media: From Model Development to Insights on Persuasion Strategies
Elyas Meguellati, Stefano Civelli, Pietro Bernardelle, Shazia Sadiq, Irwin King, Gianluca Demartini
Main category: cs.CL
TL;DR: This paper develops a lightweight model for detecting persuasive text in political advertising and applies it to analyze Australian Facebook election ads, revealing how political campaigns use persuasion strategies across different contexts and demographics.
Details
Motivation: Political advertising uses subtle persuasive techniques that can influence electoral outcomes, making it crucial to detect these elements for enhancing voter awareness and ensuring transparency in democratic processes.
Method: The authors developed a lightweight model for persuasive text detection that achieves state-of-the-art performance on SemEval 2023 Task 3, then collected the Australian Federal Election 2022 Facebook Ads (APA22) dataset, partially annotated it for persuasion, and fine-tuned the model to adapt from mainstream news to social media content.
Result: The model achieved state-of-the-art performance while requiring significantly fewer computational resources and training data than existing methods. Analysis of the APA22 dataset revealed distinct patterns in how political campaigns use persuasion through different funding strategies, word choices, demographic targeting, and temporal shifts in persuasion intensity approaching election day.
Conclusion: The study demonstrates the necessity of domain-specific modeling for analyzing persuasion on social media and shows how uncovering persuasion strategies can enhance transparency, inform voters, and promote accountability in digital political campaigns.
Abstract: Political advertising plays a pivotal role in shaping public opinion and influencing electoral outcomes, often through subtle persuasive techniques embedded in broader propaganda strategies. Detecting these persuasive elements is crucial for enhancing voter awareness and ensuring transparency in democratic processes. This paper presents an integrated approach that bridges model development and real-world application through two interconnected studies. First, we introduce a lightweight model for persuasive text detection that achieves state-of-the-art performance in Subtask 3 of SemEval 2023 Task 3 while requiring significantly fewer computational resources and training data than existing methods. Second, we demonstrate the model’s practical utility by collecting the Australian Federal Election 2022 Facebook Ads (APA22) dataset, partially annotating a subset for persuasion, and fine-tuning the model to adapt from mainstream news to social media content. We then apply the fine-tuned model to label the remainder of the APA22 dataset, revealing distinct patterns in how political campaigns leverage persuasion through different funding strategies, word choices, demographic targeting, and temporal shifts in persuasion intensity as election day approaches. Our findings not only underscore the necessity of domain-specific modeling for analyzing persuasion on social media but also show how uncovering these strategies can enhance transparency, inform voters, and promote accountability in digital campaigns.
[46] Resona: Improving Context Copying in Linear Recurrence Models with Retrieval
Xinyu Wang, Linrui Ma, Jerry Huang, Peng Lu, Prasanna Parthasarathi, Xiao-Wen Chang, Boxing Chen, Yufei Cui
Main category: cs.CL
TL;DR: Resona is a retrieval augmentation framework that enhances linear recurrent language models’ in-context learning capabilities, helping them compete better with Transformer-based models while maintaining computational efficiency.
Details
Motivation: Linear recurrent models are computationally efficient alternatives to Transformers but suffer from significant performance gaps in in-context learning and context recall tasks. There is a need to improve these models' ability to utilize contextual information without sacrificing their efficiency advantages.
Method: The paper introduces Resona, a simple and scalable framework that augments linear recurrent models with retrieval capabilities. Resona enables these models to integrate retrieved information from the input context, allowing them to adapt their behavior to diverse task requirements. A minimal illustrative sketch follows the abstract.
Result: Experiments across various linear recurrent models show that Resona-augmented models achieve significant performance improvements on both synthetic and real-world natural language tasks, demonstrating enhanced in-context learning and language modeling abilities.
Conclusion: Resona serves as a general-purpose method to bridge the performance gap between linear recurrent models and Transformers in context-dependent tasks, while preserving the computational efficiency that makes linear recurrent models attractive alternatives to traditional Transformer architectures.
Abstract: Recent shifts in the space of large language model (LLM) research have shown an increasing focus on novel architectures to compete with prototypical Transformer-based models that have long dominated this space. Linear recurrent models have proven to be a viable competitor due to their computational efficiency. However, such models still demonstrate a sizable gap compared to Transformers in terms of in-context learning among other tasks that require recalling information from a context. In this work, we introduce Resona, a simple and scalable framework for augmenting linear recurrent models with retrieval. Resona augments models with the ability to integrate retrieved information from the provided input context, enabling tailored behavior to diverse task requirements. Experiments on a variety of linear recurrent models demonstrate that Resona-augmented models observe significant performance gains on a variety of synthetic as well as real-world natural language tasks, highlighting its ability to act as a general purpose method to improve the in-context learning and language modeling abilities of linear recurrent LLMs.
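A bare-bones version of the retrieval step might look like the following: chunk the provided context, embed it, and return the spans most similar to the current query. How Resona injects the retrieved spans into the recurrent computation is model-specific and not shown; the `embed` function is a placeholder.
```python
# Sketch of retrieval-from-context in the spirit of Resona: find the spans of
# the prompt most relevant to the current query so the recurrent model can copy
# from them rather than relying on its compressed state alone.
import numpy as np

def retrieve_from_context(chunks, query_vec, embed, k=3):
    """Return the k context chunks most similar to the query embedding."""
    mat = np.stack([embed(c) for c in chunks])            # (num_chunks, dim)
    sims = mat @ query_vec / (np.linalg.norm(mat, axis=1)
                              * np.linalg.norm(query_vec) + 1e-8)
    top = np.argsort(-sims)[:k]
    return [chunks[i] for i in top]

# Usage (hypothetical): prepend the retrieved spans to the model input before
# running the linear recurrent backbone.
```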
[47] Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start
Lai Wei, Yuting Li, Kaipeng Zheng, Chen Wang, Yue Wang, Linghe Kong, Lichao Sun, Weiran Huang
Main category: cs.CL
TL;DR: This paper proposes a two-stage training approach for multimodal large language models (MLLMs) that combines supervised fine-tuning with reinforcement learning to enhance multimodal reasoning capabilities, achieving state-of-the-art performance on reasoning benchmarks.
Details
Motivation: While "aha moment" patterns in LLMs are often attributed to emergent properties of reinforcement learning, the authors found that these patterns exist in MLLMs before RL training but do not necessarily correlate with improved reasoning performance. This motivated them to systematically study how to enhance multimodal reasoning through structured training approaches.
Method: A two-stage training approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO (Group Relative Policy Optimization) to further refine reasoning capabilities in multimodal large language models. A minimal illustrative sketch of the GRPO advantage follows the abstract.
Result: The combined SFT+RL approach consistently outperformed both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. Their 7B model showed substantial improvements (66.3%→73.4% on MathVista, 62.9%→70.4% on We-Math) and their 3B model achieved performance competitive with several 7B models, reaching state-of-the-art performance among open-source MLLMs.
Conclusion: The two-stage training approach (SFT followed by RL) provides an effective method for building advanced multimodal reasoning models, demonstrating that structured initialization through SFT combined with RL refinement leads to superior performance compared to single-stage approaches, offering practical guidance for developing high-performance multimodal reasoning systems.
Abstract: Recent advancements in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. While “aha moment” patterns–where models exhibit self-correction through reflection–are often attributed to emergent properties from RL, we first demonstrate that these patterns exist in multimodal LLMs (MLLMs) prior to RL training but may not necessarily correlate with improved reasoning performance. Building on these insights, we present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities. Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with our 7B model showing substantial improvements over base models (e.g., 66.3 %$\rightarrow$73.4 % on MathVista, 62.9 %$\rightarrow$70.4 % on We-Math) and our 3B model achieving performance competitive with several 7B models. Overall, this work provides practical guidance for building advanced multimodal reasoning models. Our code is available at https://github.com/waltonfuture/RL-with-Cold-Start.
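The second stage relies on GRPO's group-relative advantages, which are simple to state: sample several responses per prompt, score them, and normalize the rewards within each group. The sketch below shows only that core idea; the clipping, KL regularization, and reference policy of the full objective are omitted.
```python
# Sketch of the group-relative advantage at the heart of GRPO.
import torch

def grpo_advantages(rewards):
    """rewards: (num_prompts, group_size) scores for sampled responses."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True) + 1e-6
    return (rewards - mean) / std       # one advantage shared by all tokens of a response

def policy_loss(logprobs, advantages):
    """logprobs: summed token log-probs per response, same shape as advantages."""
    return -(advantages.detach() * logprobs).mean()
```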
[48] MEF: A Capability-Aware Multi-Encryption Framework for Evaluating Vulnerabilities in Black-Box Large Language Models
Mingyu Yu, Wei Wang, Yanjie Wei, Sujuan Qin, Fei Gao, Wenmin Li
Main category: cs.CL
TL;DR: Researchers developed a Multi-Encryption Framework (MEF) that adapts jailbreak attack strategies based on LLM comprehension abilities, achieving 98.9-99.8% attack success rates on GPT-4 variants by using different encryption techniques for models with varying comprehension levels.
Details
Motivation: Current adversarial jailbreak attacks expose critical vulnerabilities in Large Language Models by circumventing alignment safeguards, but existing approaches don't consider the varying comprehension abilities of different LLMs when designing attack strategies.
Method: The Multi-Encryption Framework (MEF) first categorizes LLM comprehension ability levels, then applies adaptive strategies: Fu+En1 (layered semantic mutations with encryption) for limited comprehension models, and Fu+En1+En2 (additional dual-ended encryption on responses) for strong comprehension models to evade defenses at input, inference, and output stages.
Result: MEF achieved high attack success rates of 98.9% on GPT-4o (29 May 2025 release) and 99.8% on GPT-4.1 (8 July 2025 release), demonstrating the effectiveness of capability-aware jailbreak strategies.
Conclusion: The research provides deeper understanding of vulnerabilities in current LLM alignment mechanisms by showing that jailbreak effectiveness depends on model comprehension abilities, and that adaptive multi-encryption approaches can significantly improve attack success rates against black-box LLMs.
Abstract: Recent advancements in adversarial jailbreak attacks have exposed critical vulnerabilities in Large Language Models (LLMs), enabling the circumvention of alignment safeguards through increasingly sophisticated prompt manipulations. Based on our experiments, we found that the effectiveness of jailbreak strategies is influenced by the comprehension ability of the attacked LLM. Building on this insight, we propose a capability-aware Multi-Encryption Framework (MEF) for evaluating vulnerabilities in black-box LLMs. Specifically, MEF first categorizes the comprehension ability level of the LLM, then applies different strategies accordingly: For models with limited comprehension ability, MEF adopts the Fu+En1 strategy, which integrates layered semantic mutations with an encryption technique, more effectively contributing to evasion of the LLM’s defenses at the input and inference stages. For models with strong comprehension ability, MEF uses a more complex Fu+En1+En2 strategy, in which additional dual-ended encryption techniques are applied to the LLM’s responses, further contributing to evasion of the LLM’s defenses at the output stage. Experimental results demonstrate the effectiveness of our approach, achieving attack success rates of 98.9% on GPT-4o (29 May 2025 release) and 99.8% on GPT-4.1 (8 July 2025 release). Our work contributes to a deeper understanding of the vulnerabilities in current LLM alignment mechanisms.
[49] Lost in Variation? Evaluating NLI Performance in Basque and Spanish Geographical Variants
Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri
Main category: cs.CL
TL;DR: This paper evaluates how well current language models understand Basque and Spanish language varieties using Natural Language Inference, finding that models struggle with linguistic variation, particularly in Basque dialects like Western Basque.
Details
Motivation: To assess the capacity of current language technologies to handle linguistic variation in Basque and Spanish, as understanding language varieties is crucial for developing inclusive NLP systems that work across different dialects and regional variants.
Method: The researchers created a manually-curated parallel dataset in Basque and Spanish with their respective variants, then conducted crosslingual and in-context learning experiments using both encoder-only and decoder-based Large Language Models on Natural Language Inference tasks, followed by error analysis and ablation studies.
Result: Performance drops significantly when handling linguistic variation, especially in Basque. Error analysis reveals this decline is due to linguistic variation itself rather than lexical overlap. Encoder-only models particularly struggle with Western Basque, which linguistic theory identifies as more distant from the standard dialect.
Conclusion: Current language models have limited capacity to handle linguistic variation, particularly in less-resourced languages like Basque and their peripheral dialects. The performance degradation aligns with linguistic theory about dialect distance from standard forms, highlighting the need for better approaches to handle language variety in NLP systems.
Abstract: In this paper, we evaluate the capacity of current language technologies to understand Basque and Spanish language varieties. We use Natural Language Inference (NLI) as a pivot task and introduce a novel, manually-curated parallel dataset in Basque and Spanish, along with their respective variants. Our empirical analysis of crosslingual and in-context learning experiments using encoder-only and decoder-based Large Language Models (LLMs) shows a performance drop when handling linguistic variation, especially in Basque. Error analysis suggests that this decline is not due to lexical overlap, but rather to the linguistic variation itself. Further ablation experiments indicate that encoder-only models particularly struggle with Western Basque, which aligns with linguistic theory that identifies peripheral dialects (e.g., Western) as more distant from the standard. All data and code are publicly available.
[50] Large Language Models in Argument Mining: A Survey
Hao Li, Viktor Schlegel, Yizheng Sun, Riza Batista-Navarro, Goran Nenadic
Main category: cs.CL
TL;DR: This survey systematically reviews how Large Language Models (LLMs) have transformed Argument Mining (AM) in NLP, providing a comprehensive taxonomy of AM subtasks and analyzing current LLM techniques, evaluation practices, and future research directions.
Details
Motivation: The advent of Large Language Models has profoundly transformed Argument Mining, enabling advanced capabilities like in-context learning, prompt-based generation, and cross-domain adaptability. There is a need to systematically synthesize these recent advancements and provide guidance for researchers in this rapidly evolving domain.
Method: The authors conduct a systematic survey that includes: (1) reviewing foundational theories and annotation frameworks, (2) creating a curated catalog of datasets, (3) developing a comprehensive taxonomy of AM subtasks, (4) analyzing contemporary LLM techniques (prompting, chain-of-thought reasoning, retrieval augmentation), (5) detailing current LLM architectures and methodologies, and (6) critically assessing evaluation practices.
Result: The survey provides a comprehensive taxonomy of AM subtasks and demonstrates how LLM techniques have reconfigured their execution. It identifies pivotal challenges including long-context reasoning, interpretability, and annotation bottlenecks, while highlighting emerging trends in LLM-based computational argumentation.
Conclusion: The survey concludes by proposing a forward-looking research agenda for LLM-based computational argumentation, aiming to strategically guide researchers in this rapidly evolving domain of argument mining enhanced by large language models.
Abstract: Argument Mining (AM), a critical subfield of Natural Language Processing (NLP), focuses on extracting argumentative structures from text. The advent of Large Language Models (LLMs) has profoundly transformed AM, enabling advanced in-context learning, prompt-based generation, and robust cross-domain adaptability. This survey systematically synthesizes recent advancements in LLM-driven AM. We provide a concise review of foundational theories and annotation frameworks, alongside a meticulously curated catalog of datasets. A key contribution is our comprehensive taxonomy of AM subtasks, elucidating how contemporary LLM techniques – such as prompting, chain-of-thought reasoning, and retrieval augmentation – have reconfigured their execution. We further detail current LLM architectures and methodologies, critically assess evaluation practices, and delineate pivotal challenges including long-context reasoning, interpretability, and annotation bottlenecks. Conclusively, we highlight emerging trends and propose a forward-looking research agenda for LLM-based computational argumentation, aiming to strategically guide researchers in this rapidly evolving domain.
[51] Modeling Public Perceptions of Science in Media
Jiaxin Pei, Dustin Wright, Isabelle Augenstein, David Jurgens
Main category: cs.CL
TL;DR: Researchers developed a computational framework to model public perception of science news across 12 dimensions, created a large dataset with 10,489 annotations, and showed that predicted perception scores can effectively predict public engagement with scientific content on social media platforms.
Details
Motivation: Science communicators struggle to anticipate how audiences will perceive and interact with scientific news due to the ever-growing volume of information, making it vital to understand public perception for fostering trust and understanding in the scientific community.
Method: The researchers created a computational framework modeling public perception across twelve dimensions (newsworthiness, importance, surprisingness, etc.), built a large-scale dataset with 10,489 annotations from 2,101 diverse US and UK participants, developed NLP models to predict perception scores, and conducted analysis on Reddit to validate their findings through natural experiments.
Result: The study found that individuals’ frequency of science news consumption is the main driver of perception while demographic factors have minimal influence. Posts with more positive perception scores received significantly more comments and upvotes on Reddit, demonstrating a direct connection between estimated public perception and actual engagement patterns.
Conclusion: The research demonstrates that nuanced perception modeling is crucial for science communication and provides new pathways to predict public interest and engagement with scientific content, offering valuable tools for science communicators to better understand and predict audience responses.
Abstract: Effectively engaging the public with science is vital for fostering trust and understanding in our scientific community. Yet, with an ever-growing volume of information, science communicators struggle to anticipate how audiences will perceive and interact with scientific news. In this paper, we introduce a computational framework that models public perception across twelve dimensions, such as newsworthiness, importance, and surprisingness. Using this framework, we create a large-scale science news perception dataset with 10,489 annotations from 2,101 participants from diverse US and UK populations, providing valuable insights into public responses to scientific information across domains. We further develop NLP models that predict public perception scores with a strong performance. Leveraging the dataset and model, we examine public perception of science from two perspectives: (1) Perception as an outcome: What factors affect the public perception of scientific information? (2) Perception as a predictor: Can we use the estimated perceptions to predict public engagement with science? We find that individuals’ frequency of science news consumption is the driver of perception, whereas demographic factors exert minimal influence. More importantly, through a large-scale analysis and carefully designed natural experiment on Reddit, we demonstrate that the estimated public perception of scientific information has direct connections with the final engagement pattern. Posts with more positive perception scores receive significantly more comments and upvotes, which is consistent across different scientific information and for the same science, but are framed differently. Overall, this research underscores the importance of nuanced perception modeling in science communication, offering new pathways to predict public interest and engagement with scientific content.
[52] GTA: Grouped-head latenT Attention
Luoyang Sun, Cheng Deng, Jiwen Jiang, Xinjian Wu, Haifeng Zhang, Lei Chen, Lionel Ni, Jun Wang
Main category: cs.CL
TL;DR: The paper proposes Grouped-Head Latent Attention (GTA), a novel attention mechanism that reduces memory usage by up to 70% and computation by up to 62.5% while maintaining performance, achieving 2x faster inference speed for large language models.
Details
Motivation: Attention mechanisms in large language models create substantial computational and memory overhead, particularly as KV cache and attention computations scale rapidly with text length. This poses challenges for deploying LLMs on hardware with limited resources. The authors observed that attention mechanisms exhibit substantial redundancy through compressible KV cache and high similarity across attention heads.
Method: GTA introduces two key components: (1) a shared attention map mechanism that reuses attention scores across multiple heads to decrease the key cache size, and (2) a nonlinear value decoder with learned projections that compresses the value cache into a latent space to further reduce memory requirements. A minimal illustrative sketch follows the abstract.
Result: GTA reduces attention computation FLOPs by up to 62.5% compared to Grouped-Query Attention and shrinks the KV cache by up to 70%. The method achieves a 2x increase in end-to-end inference speed, with prefill benefiting from reduced computational cost and decoding benefiting from smaller cache footprint.
Conclusion: GTA successfully addresses the computational and memory bottlenecks in attention mechanisms while maintaining performance, offering an efficient solution for LLM deployment on resource-constrained hardware without the extra overhead of Multi-Head Latent Attention.
Abstract: Attention mechanisms underpin the success of large language models (LLMs), yet their substantial computational and memory overhead poses challenges for optimizing efficiency and performance. A critical bottleneck arises as KV cache and attention computations scale rapidly with text length, challenging deployment on hardware with limited computational and memory resources. We observe that attention mechanisms exhibit substantial redundancy, since the KV cache can be significantly compressed and attention maps across heads display high similarity, revealing that much of the computation and storage is unnecessary. Leveraging these insights, we propose \textbf{G}rouped-Head Laten\textbf{T} \textbf{A}ttention (GTA), a novel attention mechanism that reduces memory usage and computational complexity while maintaining performance. GTA comprises two components: (1) a shared attention map mechanism that reuses attention scores across multiple heads, decreasing the key cache size; and (2) a nonlinear value decoder with learned projections that compresses the value cache into a latent space, further cutting memory needs. GTA cuts attention computation FLOPs by up to \emph{62.5%} versus Grouped-Query Attention and shrink the KV cache by up to \emph{70%}, all while avoiding the extra overhead of Multi-Head Latent Attention to improve LLM deployment efficiency. Consequently, GTA models achieve a \emph{2x} increase in end-to-end inference speed, with prefill benefiting from reduced computational cost and decoding benefiting from the smaller cache footprint.
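A rough PyTorch rendering of the two ingredients (one attention map per group of heads, values kept in a latent space and expanded by a learned decoder) is given below. The module layout, dimensions, and the absence of causal masking and KV caching are simplifications for illustration, not the paper's implementation.
```python
# Rough sketch of grouped-head latent attention: one attention map per group,
# values stored in a small latent space and expanded by a nonlinear decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedLatentAttention(nn.Module):
    def __init__(self, d_model=512, n_groups=4, heads_per_group=2, d_latent=64):
        super().__init__()
        self.g = n_groups
        self.d_head = d_model // (n_groups * heads_per_group)
        self.q = nn.Linear(d_model, n_groups * self.d_head)     # one query per group
        self.k = nn.Linear(d_model, n_groups * self.d_head)     # one key per group
        self.v = nn.Linear(d_model, n_groups * d_latent)        # compressed values
        self.decode = nn.Sequential(                            # nonlinear value decoder
            nn.Linear(d_latent, d_latent), nn.SiLU(),
            nn.Linear(d_latent, heads_per_group * self.d_head))
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                                       # no causal mask shown
        b, t, _ = x.shape
        q = self.q(x).view(b, t, self.g, self.d_head).transpose(1, 2)
        k = self.k(x).view(b, t, self.g, self.d_head).transpose(1, 2)
        v = self.v(x).view(b, t, self.g, -1).transpose(1, 2)    # latent values
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        ctx = self.decode(attn @ v)                             # shared map, per-head values
        return self.out(ctx.transpose(1, 2).reshape(b, t, -1))
```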
[53] A Diagrammatic Calculus for a Functional Model of Natural Language Semantics
Matthieu Pierre Boyer
Main category: cs.CL
TL;DR: This paper proposes a functional programming approach to natural language semantics using category-based type and effect systems, with a diagrammatic calculus for efficient computation of sentence denotations.
Details
Motivation: Traditional denotational approaches to natural language semantics have limited expressiveness in capturing semantic differences between syntactically equivalent expressions. There is a need for more sophisticated formal methods to represent and compute natural language meanings.
Method: The authors develop a category-based type and effect system to formalize semantic distinctions between expressions that appear syntactically similar. They construct a diagrammatic calculus that models both parsing processes and effect handling in natural language semantics.
Result: The paper presents a formal framework that combines functional programming principles with categorical semantics, resulting in a diagrammatic calculus that can efficiently compute denotations for natural language sentences while capturing semantic nuances missed by traditional approaches.
Conclusion: The functional programming approach with category-based type and effect systems successfully increases the expressiveness of natural language semantics beyond traditional denotational methods, and the diagrammatic calculus provides an efficient computational method for semantic analysis.
Abstract: In this paper, we study a functional programming approach to natural language semantics, allowing us to increase the expressiveness of a more traditional denotation style. We will formalize a category based type and effect system to represent the semantic difference between syntactically equivalent expressions. We then construct a diagrammatic calculus to model parsing and handling of effects, providing a method to efficiently compute the denotations for sentences.
[54] Cautious Next Token Prediction
Yizhou Wang, Lingzhi Zhang, Yue Bai, Mang Tik Chiu, Zhengmian Hu, Mingyuan Zhang, Qihua Dong, Yu Yin, Sohrab Amirghodsi, Yun Fu
Main category: cs.CL
TL;DR: The paper proposes Cautious Next Token Prediction (CNTP), a training-free decoding strategy that samples multiple trial paths when model confidence is low and selects the most reliable one based on perplexity scores, consistently outperforming standard decoding methods.
Details
Motivation: Current temperature scaling with nucleus sampling leads to inferior performance in NLP tasks when models are uncertain about testing questions. There is a need for better decoding strategies that can handle model uncertainty more effectively.
Method: CNTP monitors prediction entropy during decoding: when entropy is high (low confidence), it samples multiple independent trial paths from that step until the next punctuation mark, then selects the trial with the lowest perplexity score. The number of trials is inversely correlated with prediction confidence, mimicking humans' cautious thinking behavior. A minimal illustrative sketch follows the abstract.
Result: Extensive experiments on both LLMs and MLLMs demonstrate that CNTP consistently outperforms existing standard decoding strategies by a clear margin. Integration with self-consistency further improves performance over vanilla self-consistency.
Conclusion: CNTP is a promising training-free decoding approach that could become a default choice for LLM decoding, effectively handling model uncertainty by exploring multiple paths and selecting the most reliable one based on the model’s own capacity assessment.
Abstract: Next token prediction paradigm has been prevailing for autoregressive models in the era of LLMs. The current default sampling choice for popular LLMs is temperature scaling together with nucleus sampling to balance diversity and coherence. Nevertheless, such approach leads to inferior performance in various NLP tasks when the model is not certain about testing questions. To this end, we propose a brand new training-free decoding strategy, dubbed as Cautious Next Token Prediction (CNTP). In the decoding process, if the model has comparatively high prediction entropy at a certain step, we sample multiple trials starting from the step independently and stop when encountering any punctuation. Then we select the trial with the lowest perplexity score viewed as the most probable and reliable trial path given the model’s capacity. The trial number is negatively correlated with the prediction confidence, i.e., the less confident the model is, the more trials it should sample. This is consistent with human beings’ behaviour: when feeling uncertain or unconfident, one tends to think more creatively, exploring multiple thinking paths, to cautiously select the path one feels most confident about. Extensive experiments on both LLMs and MLLMs show that our proposed CNTP approach outperforms existing standard decoding strategies consistently by a clear margin. Moreover, the integration of CNTP with self consistency can further improve over vanilla self consistency. We believe our proposed CNTP has the potential to become one of the default choices for LLM decoding. Code is available at https://github.com/wyzjack/CNTP.
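The decoding loop itself is easy to sketch: branch only when the next-token entropy is high, run several trials up to the next punctuation mark, and keep the lowest-perplexity one. The `lm` wrapper, thresholds, and trial schedule below are hypothetical, not the paper's settings.
```python
# Sketch of Cautious Next Token Prediction. `lm` is a hypothetical wrapper
# around any causal LM exposing next-token probabilities, sampling, and
# perplexity scoring; thresholds and the trial schedule are illustrative.
import math

def cntp_decode(lm, prompt, max_tokens=256, entropy_thresh=2.0, max_trials=8):
    out = prompt
    while len(out) - len(prompt) < max_tokens:
        probs = lm.next_token_probs(out)
        entropy = -sum(p * math.log(p + 1e-12) for p in probs.values())
        if entropy < entropy_thresh:
            out += lm.sample(out, stop=None, max_new=1)       # confident: normal step
            continue
        n_trials = min(max_trials, 1 + int(entropy))          # less confident -> more trials
        trials = [lm.sample(out, stop=".,;:!?", max_new=64) for _ in range(n_trials)]
        out += min(trials, key=lambda t: lm.perplexity(out, t))
    return out
```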
[55] Fairness Evaluation of Large Language Models in Academic Library Reference Services
Haining Wang, Jason Clark, Yueru Yan, Star Bradley, Ruiyang Chen, Yiqiong Zhang, Hengyi Fu, Zuoyu Tian
Main category: cs.CL
TL;DR: This study evaluates whether large language models (LLMs) can provide equitable virtual reference services in libraries by testing if they differentiate responses based on user demographics (sex, race/ethnicity, institutional role) across six state-of-the-art models.
Details
Motivation: Libraries are considering LLMs for virtual reference services but need to ensure these systems don't reproduce societal biases from training data, which could compromise their commitment to equitable service for all users regardless of demographics or social status.
Method: The researchers prompted six state-of-the-art LLMs to assist patrons with different identities varying by sex, race/ethnicity, and institutional role, then analyzed the responses for evidence of differential treatment or bias. A minimal illustrative sketch follows the abstract.
Result: No evidence of differentiation by race or ethnicity was found. Only minor stereotypical bias against women was detected in one model. LLMs showed nuanced accommodation of institutional roles through appropriate linguistic choices (formality, politeness, domain-specific vocabulary) that reflected professional norms rather than discrimination.
Conclusion: Current LLMs demonstrate promising readiness to support equitable and contextually appropriate communication in academic library reference services, with minimal bias and appropriate professional adaptation to different user roles.
Abstract: As libraries explore large language models (LLMs) for use in virtual reference services, a key question arises: Can LLMs serve all users equitably, regardless of demographics or social status? While they offer great potential for scalable support, LLMs may also reproduce societal biases embedded in their training data, risking the integrity of libraries’ commitment to equitable service. To address this concern, we evaluate whether LLMs differentiate responses across user identities by prompting six state-of-the-art LLMs to assist patrons differing in sex, race/ethnicity, and institutional role. We found no evidence of differentiation by race or ethnicity, and only minor evidence of stereotypical bias against women in one model. LLMs demonstrated nuanced accommodation of institutional roles through the use of linguistic choices related to formality, politeness, and domain-specific vocabularies, reflecting professional norms rather than discriminatory treatment. These findings suggest that current LLMs show a promising degree of readiness to support equitable and contextually appropriate communication in academic library reference services.
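The audit design reduces to an identity-swap experiment: issue the same reference request under systematically varied patron identities and compare the responses. The identity lists and request wording below are illustrative, not the study's actual prompts.
```python
# Sketch of an identity-swap fairness audit for virtual reference services.
from itertools import product

SEXES = ["male", "female"]
ETHNICITIES = ["Asian", "Black", "Hispanic", "White"]
ROLES = ["undergraduate student", "graduate student", "faculty member"]

REQUEST = ("I am a {sex} {ethnicity} {role}. "
           "Can you help me find sources on open-access publishing policies?")

def build_prompts():
    return [REQUEST.format(sex=s, ethnicity=e, role=r)
            for s, e, r in product(SEXES, ETHNICITIES, ROLES)]

# Usage (hypothetical): send each prompt to every model under test, then score
# the responses for formality, politeness, and domain vocabulary and compare
# the distributions across identity groups.
```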
[56] A Mathematical Theory of Discursive Networks
Juan B. Gutiérrez
Main category: cs.CL
TL;DR: This paper introduces discursive networks as a framework for human-LLM collaboration, where reliability emerges from mutual accountability between imperfect agents rather than perfecting individual models.
Details
Motivation: Large language models create a new writing medium where humans and AI interact as equals, but this raises concerns about error propagation and information reliability. The authors seek to understand how errors spread in human-LLM networks and develop methods to ensure system reliability.
Method: The authors develop a mathematical model of discursive networks treating humans and LLMs as equal nodes. They identify four error hazards (drift from truth, self-repair, fresh fabrication, external detection) and propose the Flaws-of-Others (FOO) algorithm, a peer review system where agents critique each other while a harmonizer merges verdicts. A minimal illustrative sketch follows the abstract.
Result: The mathematical model shows that networks with only drift and self-repair stabilize at modest error rates, but adding peer review (even with small probability) shifts the system to a truth-dominant state. The FOO algorithm operationalizes effective peer review in human-LLM networks.
Conclusion: Reliability in human-LLM collaboration comes from connecting imperfect agents into networks with mutual accountability rather than perfecting individual models. The authors identify “epithesis” as an ethical issue when humans fail to engage properly in the discursive network.
Abstract: Large language models (LLMs) turn writing into a live exchange between humans and software. We characterize this new medium as a discursive network that treats people and LLMs as equal nodes and tracks how their statements circulate. We define the generation of erroneous information as invalidation (any factual, logical, or structural breach) and show it follows four hazards: drift from truth, self-repair, fresh fabrication, and external detection. We develop a general mathematical model of discursive networks that shows that a network governed only by drift and self-repair stabilizes at a modest error rate. Giving each false claim even a small chance of peer review shifts the system to a truth-dominant state. We operationalize peer review with the open-source Flaws-of-Others (FOO) algorithm: a configurable loop in which any set of agents critique one another while a harmonizer merges their verdicts. We identify an ethical transgression, epithesis, that occurs when humans fail to engage in the discursive network. The takeaway is practical and cultural: reliability in this new medium comes not from perfecting single models but from connecting imperfect ones into networks that enforce mutual accountability.
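The qualitative claim about drift, self-repair, and peer review can be checked with a tiny fixed-point simulation; the rate values below are arbitrary illustrations, not parameters from the paper.
```python
# Toy simulation of the discursive-network claim: drift plus self-repair alone
# settles at a modest error plateau, and even a small peer-review rate lowers it.
def error_fraction(drift=0.05, self_repair=0.20, peer_review=0.0, steps=500):
    x = 0.0                              # fraction of circulating claims that are false
    for _ in range(steps):
        x = x + drift * (1 - x) - (self_repair + peer_review) * x
    return x

print(error_fraction())                  # drift + self-repair only: ~0.20
print(error_fraction(peer_review=0.30))  # adding peer review: ~0.09
```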
[57] Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training
Leiyu Pan, Bojian Xiong, Lei Yang, Renren Jin, Shaowei Zhang, Yue Chen, Ling Shi, Jiang Zhou, Junru Wu, Zhen Wang, Jianxiang Peng, Juesi Xiao, Tianyu Dong, Zhuowen Han, Zhuo Chen, Yuqi Ren, Deyi Xiong
Main category: cs.CL
TL;DR: This paper develops a large language model for Tibetan by curating the largest Tibetan pre-training corpus to date, applying specialized data processing, and continuing pre/post-training of a multilingual base model, achieving significant improvements over existing models on Tibetan language tasks.
Details
Motivation: Tibetan is severely underrepresented in existing large language models due to the scarcity of high-quality training corpora, creating a significant gap for this low-resource language that needs to be addressed.Method: The researchers curated the largest Tibetan pre-training corpus by aggregating data from diverse sources, applied a dedicated data cleaning and processing pipeline tailored for Tibetan, and performed continued pre-training and post-training on a multilingual base model to enhance its Tibetan generative capabilities.
Result: The developed model consistently and significantly outperforms both open-source models of similar scale and existing Tibetan-tailored models across a wide range of tasks, as demonstrated through evaluations on newly created high-quality Tibetan benchmarks and existing public benchmarks.
Conclusion: The study successfully addresses the underrepresentation of Tibetan in large language models by creating a comprehensive Tibetan corpus and training methodology, resulting in a model that substantially advances the state-of-the-art for Tibetan language processing.
Abstract: Large language models have achieved remarkable progress across many languages. However, Tibetan, as a representative low-resource language, is particularly underrepresented in existing models due to the scarcity of high-quality training corpora. To address this gap, we curate the largest Tibetan pre-training corpus to date, aggregating data from diverse sources and applying a dedicated data cleaning and processing pipeline tailored for Tibetan. With the curated data, we continue pre/post-training a multilingual base model to enhance its generative capabilities in Tibetan. To evaluate the Tibetan capabilities of the model, we create new high-quality Tibetan benchmarks, and complement them with existing public benchmarks. Experimental results demonstrate that our model consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across a wide range of tasks.
[58] Tiny language models
Ronit D. Gross, Yarden Tzach, Tal Halevi, Ella Koresh, Ido Kanter
Main category: cs.CL
TL;DR: This study investigates tiny language models (TLMs) as accessible alternatives to large language models, demonstrating that TLMs exhibit similar qualitative features to LLMs including benefits from pre-training, performance scaling with dataset size, and the ability to achieve comparable accuracy through ensemble methods.
Details
Motivation: Large language model pre-training is only feasible for dominant companies due to immense computational requirements, limiting broader research participation and creating a critical need for more accessible alternatives that can enable wider NLP research.Method: The researchers pre-trained BERT-6 and variants of BERT-1 on subsets of the Wikipedia dataset and evaluated performance on classification tasks (FewRel, AGNews, DBPedia). They compared pre-trained vs. non-pre-trained models and explored soft committee approaches using multiple shallow architectures.
Result: TLMs showed clear performance gaps between pre-trained and non-pre-trained models across classification tasks. Performance improved with larger pre-training datasets and greater token overlap between pre-training and classification data. A soft committee of shallow TLMs could replicate deep TLM accuracy while enabling low-latency inference.
Conclusion: Tiny language models exhibit key qualitative features similar to large language models, making them viable alternatives for NLP research. TLMs may be sufficient for language development in children/adolescents and can illuminate underlying NLP mechanisms while providing computational accessibility for broader research participation.
Abstract: A prominent achievement of natural language processing (NLP) is its ability to understand and generate meaningful human language. This capability relies on complex feedforward transformer block architectures pre-trained on large language models (LLMs). However, LLM pre-training is currently feasible only for a few dominant companies due to the immense computational resources required, limiting broader research participation. This creates a critical need for more accessible alternatives. In this study, we explore whether tiny language models (TLMs) exhibit the same key qualitative features of LLMs. We demonstrate that TLMs exhibit a clear performance gap between pre-trained and non-pre-trained models across classification tasks, indicating the effectiveness of pre-training, even at a tiny scale. The performance gap increases with the size of the pre-training dataset and with greater overlap between tokens in the pre-training and classification datasets. Furthermore, the classification accuracy achieved by a pre-trained deep TLM architecture can be replicated through a soft committee of multiple, independently pre-trained shallow architectures, enabling low-latency TLMs without affecting classification accuracy. Our results are based on pre-training BERT-6 and variants of BERT-1 on subsets of the Wikipedia dataset and evaluating their performance on FewRel, AGNews, and DBPedia classification tasks. Future research on TLM is expected to further illuminate the mechanisms underlying NLP, especially given that its biologically inspired models suggest that TLMs may be sufficient for children or adolescents to develop language. The data and code that support the findings of this study are openly available on https://github.com/Rg32601/Tiny-Language-Models .
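The soft committee mentioned above can be pictured as simple probability averaging across independently pre-trained shallow models. A minimal sketch, assuming each member exposes class logits (the plain-mean aggregation rule here is an assumption and may differ from the paper's exact scheme):

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_committee_predict(member_logits):
    """member_logits: list of (n_examples, n_classes) arrays, one per
    independently pre-trained shallow model. The committee averages the
    class probabilities and takes the argmax."""
    probs = np.mean([softmax(l) for l in member_logits], axis=0)
    return probs.argmax(axis=-1)

# Toy example: three shallow members, 4 examples, 3 classes
rng = np.random.default_rng(0)
members = [rng.normal(size=(4, 3)) for _ in range(3)]
print(soft_committee_predict(members))
```

Because each member can run independently, the committee keeps inference latency close to that of a single shallow model.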
[59] From Neurons to Semantics: Evaluating Cross-Linguistic Alignment Capabilities of Large Language Models via Neurons Alignment
Chongxuan Huang, Yongshi Ye, Biao Fu, Qifeng Su, Xiaodong Shi
Main category: cs.CL
TL;DR: The paper proposes NeuronXA, a neuron state-based method to evaluate cross-lingual alignment in large language models, which achieves strong correlation with downstream performance using only 100 parallel sentence pairs.
Details
Motivation: Existing cross-lingual alignment evaluation methods focus on sentence embeddings and struggle with non-smooth representation spaces, particularly impacting semantic alignment evaluation for low-resource languages. There's a need for better evaluation methods inspired by neuroscientific findings about overlapping neuronal activation patterns.Method: The authors propose Neuron State-Based Cross-Lingual Alignment (NeuronXA), a novel evaluation approach inspired by neuroscientific findings that similar information activates overlapping neuronal regions. This method provides a more semantically grounded approach to assess cross-lingual alignment capabilities of LLMs.
Result: NeuronXA was evaluated on several multilingual LLMs (LLaMA, Qwen, Mistral, GLM, and OLMo) across two transfer tasks and three multilingual benchmarks. With only 100 parallel sentence pairs, it achieved a Pearson correlation of 0.9556 with downstream task performance and 0.8514 with transferability.
Conclusion: NeuronXA effectively assesses both cross-lingual alignment and transferability even with small datasets, demonstrating its potential to advance cross-lingual alignment research and improve semantic understanding of multilingual LLMs.
Abstract: Large language models (LLMs) have demonstrated remarkable multilingual capabilities; however, how to evaluate cross-lingual alignment remains underexplored. Existing alignment benchmarks primarily focus on sentence embeddings, but prior research has shown that neural models tend to induce a non-smooth representation space, which impacts the evaluation of semantic alignment for low-resource languages. Inspired by neuroscientific findings that similar information activates overlapping neuronal regions, we propose a novel Neuron State-Based Cross-Lingual Alignment (NeuronXA) to assess the cross-lingual alignment capabilities of LLMs, which offers a more semantically grounded approach to assess cross-lingual alignment. We evaluate NeuronXA on several prominent multilingual LLMs (LLaMA, Qwen, Mistral, GLM, and OLMo) across two transfer tasks and three multilingual benchmarks. The results demonstrate that with only 100 parallel sentence pairs, NeuronXA achieves a Pearson correlation of 0.9556 with downstream task performance and 0.8514 with transferability. These findings demonstrate NeuronXA’s effectiveness in assessing both cross-lingual alignment and transferability, even with a small dataset. This highlights its potential to advance cross-lingual alignment research and to improve the semantic understanding of multilingual LLMs.
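One way to picture the neuron-state idea is to score a language by how much the sets of strongly activated units overlap for parallel sentences, then correlate that score with downstream accuracy across languages. The sketch below is an illustrative reading of the approach, not the paper's exact scoring function; the top-activation threshold and the toy per-language numbers are assumptions:

```python
import numpy as np

def active_set(activations, top_frac=0.05):
    """Indices of the most strongly activated units (top fraction)."""
    k = max(1, int(len(activations) * top_frac))
    return set(np.argsort(-np.abs(activations))[:k])

def neuron_overlap_score(acts_src, acts_tgt, top_frac=0.05):
    """Jaccard overlap of active neurons for parallel sentence pairs;
    averaged over pairs it gives a crude alignment score for a language."""
    scores = []
    for a, b in zip(acts_src, acts_tgt):
        s, t = active_set(a, top_frac), active_set(b, top_frac)
        scores.append(len(s & t) / len(s | t))
    return float(np.mean(scores))

# Correlate per-language alignment scores with downstream accuracy
align_scores = np.array([0.41, 0.35, 0.28, 0.22])   # hypothetical values
task_accuracy = np.array([0.78, 0.71, 0.60, 0.55])  # hypothetical values
print(np.corrcoef(align_scores, task_accuracy)[0, 1])  # Pearson r
```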
[60] Learning to Extract Rational Evidence via Reinforcement Learning for Retrieval-Augmented Generation
Xinping Zhao, Shouzheng Huang, Yan Zhong, Xinshuo Hu, Meishan Zhang, Baotian Hu, Min Zhang
Main category: cs.CL
TL;DR: LEAR is a method that improves Retrieval-Augmented Generation (RAG) by learning to extract high-quality evidence through explicit reasoning and conscious extraction, addressing the problem of retrieval noise that degrades LLM performance.
Details
Motivation: Retrieval noises significantly impact the quality of LLMs' generation in RAG systems. Previous evidence extraction methods lack explicit thinking, risk filtering out key clues, and struggle with generalization, necessitating better denoising mechanisms.Method: LEAR frames evidence reasoning and extraction into a unified response for end-to-end training, uses knowledge token masks for disentanglement to derive reasoning-based and extraction-based answers, and employs three types of verifiable reward functions (answer, length, and format) updated via policy optimization algorithm.
Result: Extensive experiments on three benchmark datasets demonstrate LEAR’s effectiveness in providing compact and high-quality evidence, improving downstream task accuracy, and promoting effective application in online RAG systems.
Conclusion: LEAR successfully addresses retrieval noise in RAG systems by explicitly reasoning to identify potential cues and consciously extracting evidence to avoid omitting key information, leading to improved LLM generation quality and better performance in practical applications.
Abstract: Retrieval-Augmented Generation (RAG) effectively improves the accuracy of Large Language Models (LLMs). However, retrieval noises significantly impact the quality of LLMs’ generation, necessitating the development of denoising mechanisms. Previous methods extract evidence straightforwardly without explicit thinking, which risks filtering out key clues and struggles with generalization. To this end, we propose LEAR, which learns to extract rational evidence by (1) explicitly reasoning to identify potential cues within retrieval contents first, and then (2) consciously extracting to avoid omitting any key cues helpful for answering questions. Specifically, we frame evidence reasoning and evidence extraction into one unified response for end-to-end training; apply knowledge token masks for disentanglement to derive reasoning-based and extraction-based answers; and devise three types of verifiable reward functions, including answer, length, and format, to update the model via the policy optimization algorithm. Extensive experiments on three benchmark datasets show the effectiveness of LEAR, providing compact and high-quality evidence, improving the accuracy of downstream tasks, and promoting effective application in online RAG systems.
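The three verifiable rewards (answer, length, format) can be pictured as simple checkable functions over the model's response. A minimal sketch under assumed definitions (exact-match answer reward, a token-budget length reward, and a tag-layout format check with hypothetical tag names); the paper's concrete reward shapes may differ:

```python
import re

def answer_reward(prediction: str, gold: str) -> float:
    """1.0 if the predicted answer matches the gold answer (exact match)."""
    return float(prediction.strip().lower() == gold.strip().lower())

def length_reward(evidence: str, max_tokens: int = 128) -> float:
    """Encourage compact evidence: full reward under the budget,
    linearly decaying above it."""
    n = len(evidence.split())
    return 1.0 if n <= max_tokens else max(0.0, 1.0 - (n - max_tokens) / max_tokens)

def format_reward(response: str) -> float:
    """Check the response follows the expected reason/evidence/answer layout
    (tag names here are hypothetical)."""
    pattern = r"<reason>.*</reason>\s*<evidence>.*</evidence>\s*<answer>.*</answer>"
    return float(bool(re.search(pattern, response, flags=re.S)))

def total_reward(response, evidence, prediction, gold, w=(1.0, 0.3, 0.2)):
    """Weighted sum of the three verifiable rewards (weights are illustrative)."""
    return (w[0] * answer_reward(prediction, gold)
            + w[1] * length_reward(evidence)
            + w[2] * format_reward(response))
```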
[61] 3LM: Bridging Arabic, STEM, and Code through Benchmarking
Basma El Amel Boussaha, Leen AlQadi, Mugariya Farooq, Shaikha Alsuwaidi, Giulia Campesan, Ahmed Alzubaidi, Mohammed Alyafeai, Hakim Hacid
Main category: cs.CL
TL;DR: This paper introduces 3LM, a suite of three benchmarks designed to evaluate Large Language Models (LLMs) in Arabic for STEM and code generation tasks, addressing the gap in Arabic language evaluation beyond traditional linguistic and cultural content.
Details
Motivation: Most existing Arabic LLM benchmarks focus only on linguistic, cultural, or religious content, leaving a significant gap in evaluating performance on STEM and code generation tasks that are increasingly important for real-world applications. This limits the development and assessment of Arabic LLMs in practical domains.Method: The authors created three distinct benchmarks: (1) STEM question-answer pairs naturally sourced from Arabic textbooks and educational worksheets, (2) synthetically generated STEM questions using the same educational sources, and (3) a code generation benchmark built by carefully translating two widely used code benchmarks with human-in-the-loop review processes to ensure high-quality and faithful translations.
Result: The paper successfully developed and released three comprehensive benchmarks (3LM suite) that cover STEM and code generation domains for Arabic LLM evaluation. These benchmarks provide high-quality, domain-specific evaluation materials that were previously unavailable for Arabic language models.
Conclusion: The 3LM benchmark suite fills a critical gap in Arabic LLM evaluation by providing specialized benchmarks for STEM and code generation tasks. All benchmarks are publicly released to support and advance Arabic LLM research in these essential but previously underrepresented domains.
Abstract: Arabic is one of the most widely spoken languages in the world, yet efforts to develop and evaluate Large Language Models (LLMs) for Arabic remain relatively limited. Most existing Arabic benchmarks focus on linguistic, cultural, or religious content, leaving a significant gap in domains like STEM and code which are increasingly relevant for real-world LLM applications. To help bridge this gap, we present 3LM, a suite of three benchmarks designed specifically for Arabic. The first is a set of STEM-related question-answer pairs, naturally sourced from Arabic textbooks and educational worksheets. The second consists of synthetically generated STEM questions, created using the same sources. The third benchmark focuses on code generation, built through a careful translation of two widely used code benchmarks, incorporating a human-in-the-loop process with several rounds of review to ensure high-quality and faithful translations. We release all three benchmarks publicly to support the growth of Arabic LLM research in these essential but underrepresented areas.
[62] WAKENLLM: Evaluating Reasoning Potential and Stability in LLMs via Fine-Grained Benchmarking
Zipeng Ling, Yuehao Tang, Shuliang Liu, Junqi Yang, Shenghong Fu, Yao Wan, Kejia Huang, Chen Huang, Zhichao Hou, Xuming Hu
Main category: cs.CL
TL;DR: This paper introduces a framework to analyze why Large Language Models output “Unknown” responses, distinguishing between genuine indeterminacy and model failures, and tests methods to improve reasoning accuracy through guided stimulation.
Details
Motivation: Current LLM evaluations focus only on whether "Unknown" answers are honest, but fail to distinguish between genuinely indeterminate inputs and solvable problems that models fail to resolve. This creates a "Vague Perception" phenomenon that obscures understanding of LLM reasoning capabilities and improvement potential.Method: The authors develop a framework that quantifies the proportion of “Unknown” responses due to model incapacity versus genuine indeterminacy. They test guided stimulation techniques to convert “Unknown” responses into either correct “Known” answers or correct “Unknown” responses with valid reasoning, using baseline frameworks to measure theoretical accuracy limits.
Result: The framework successfully separates different sources of uncertainty in LLM responses and provides clearer insights into reasoning limits. The guided stimulation methods show potential for converting model failures into correct responses, offering a better understanding of LLM reasoning capabilities across different models.
Conclusion: The work provides a new perspective on the “Vague Perception” phenomenon by distinguishing between model incapacity and genuine indeterminacy. This framework offers meaningful insights into LLM reasoning potential and establishes a foundation for improving model performance through targeted interventions.
Abstract: Large Language Models (LLMs) frequently output the label Unknown, yet current evaluations focus almost exclusively on whether such answers are honest rather than why they arise. This blurs two distinct cases: (i) an input that is genuinely indeterminate and (ii) a solvable problem that the model fails to resolve. We call this phenomenon Vague Perception, and thus introduce a framework that quantifies the proportion of Unknown responses attributable to model incapacity and tests whether guided stimulation can convert them into either correct Known or correct Unknown with valid reasoning. By separating these sources of uncertainty, our method provides a clearer picture of LLM reasoning limits and their potential for improvement. After estimating a theoretical accuracy for the reasoning task on different LLMs, we apply different methods to test whether each model can reach this accuracy under a baseline framework. Our work explores the potential reasoning ability of LLMs and provides a new perspective on addressing the Vague Perception phenomenon.
[63] Language Detection by Means of the Minkowski Norm: Identification Through Character Bigrams and Frequency Analysis
Paul-Andrei Pogăcean, Sanda-Maria Avram
Main category: cs.CL
TL;DR: This paper presents a classical frequency-based language identification algorithm using monograms and bigrams that achieves over 80% accuracy on short texts and 100% accuracy on longer texts, demonstrating that non-AI approaches remain viable alternatives to modern AI-powered language models.
Details
Motivation: With the rapid evolution of AI-powered language models dominating language identification research, non-AI-based approaches have been overshadowed. The authors aim to explore whether classical mathematical approaches can still be effective for language detection tasks.Method: The paper implements a mathematical algorithm that leverages monograms and bigrams frequency rankings derived from established linguistic research. The approach uses frequency-based analysis to determine language characteristics without relying on AI models.
Result: The method achieves over 80% accuracy on texts shorter than 150 characters and reaches 100% accuracy for longer texts. The algorithm was tested on diverse datasets including texts of varying lengths, historical periods, and genres (short stories, fairy tales, and poems).
Conclusion: Classical frequency-based approaches remain effective and scalable alternatives to AI-driven models for language detection. The results demonstrate that traditional mathematical methods can still compete with modern AI approaches, especially for longer texts.
Abstract: The debate surrounding language identification has gained renewed attention in recent years, especially with the rapid evolution of AI-powered language models. However, the non-AI-based approaches to language identification have been overshadowed. This research explores a mathematical implementation of an algorithm for language determinism by leveraging monograms and bigrams frequency rankings derived from established linguistic research. The datasets used comprise texts varying in length, historical period, and genre, including short stories, fairy tales, and poems. Despite these variations, the method achieves over 80% accuracy on texts shorter than 150 characters and reaches 100% accuracy for longer texts. These results demonstrate that classical frequency-based approaches remain effective and scalable alternatives to AI-driven models for language detection.
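A rank-profile implementation of this idea builds per-language character-bigram rankings from reference corpora and picks the language whose ranking is closest to the input text's under a Minkowski p-norm over rank differences. A minimal sketch, with the profile size, missing-bigram penalty, and norm order chosen for illustration rather than taken from the paper:

```python
from collections import Counter

def bigram_ranks(text, top_k=200):
    """Map each of the top_k most frequent character bigrams to its rank."""
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    return {bg: rank for rank, (bg, _) in enumerate(bigrams.most_common(top_k))}

def minkowski_distance(ranks_a, ranks_b, p=1, missing_penalty=200):
    """Minkowski norm over rank differences; unseen bigrams get a fixed penalty."""
    keys = set(ranks_a) | set(ranks_b)
    diffs = [abs(ranks_a.get(k, missing_penalty) - ranks_b.get(k, missing_penalty))
             for k in keys]
    return sum(d ** p for d in diffs) ** (1.0 / p)

def detect_language(text, language_profiles, p=1):
    """language_profiles: {lang: rank dict built from a reference corpus}."""
    ranks = bigram_ranks(text)
    return min(language_profiles,
               key=lambda lang: minkowski_distance(ranks, language_profiles[lang], p))
```

Longer texts yield more stable bigram rankings, which is consistent with the reported jump from 80% accuracy on short texts to 100% on longer ones.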
[64] Test-Time-Matching: Decouple Personality, Memory, and Linguistic Style in LLM-based Role-Playing Language Agent
Xiaoyu Zhan, Xinyu Fu, Hao Sun, Yuanqi Li, Jie Guo, Yanwen Guo
Main category: cs.CL
TL;DR: The paper proposes Test-Time-Matching (TTM), a training-free framework that enables LLMs to perform high-fidelity role-playing by automatically decoupling character features into personality, memory, and linguistic style through a three-stage generation pipeline.
Details
Motivation: Current role-playing language agents face limitations: prompt-based approaches lack deep immersion in specific roles (especially well-known fictional/public figures), while fine-tuning approaches are constrained by data collection challenges and high computational costs, limiting their broader applicability.Method: Test-Time-Matching (TTM) - a training-free role-playing framework that uses LLM agents to automatically decouple character features into three components: personality, memory, and linguistic style. It employs a structured three-stage generation pipeline utilizing these features for controlled role-playing through test-time scaling and context engineering.
Result: The framework achieves high-fidelity role-playing performance and enables seamless combinations across diverse linguistic styles and variations in personality and memory. Human assessment evaluations demonstrate outstanding performance in generating expressive and stylistically consistent character dialogues.
Conclusion: TTM successfully addresses the limitations of existing role-playing approaches by providing a training-free solution that achieves deep character immersion without requiring fine-tuning or extensive computational resources, demonstrating superior performance in character dialogue generation through structured feature decomposition.
Abstract: The rapid advancement of large language models (LLMs) has enabled role-playing language agents to demonstrate significant potential in various applications. However, relying solely on prompts and contextual inputs often proves insufficient for achieving deep immersion in specific roles, particularly well-known fictional or public figures. On the other hand, fine-tuning-based approaches face limitations due to the challenges associated with data collection and the computational resources required for training, thereby restricting their broader applicability. To address these issues, we propose Test-Time-Matching (TTM), a training-free role-playing framework through test-time scaling and context engineering. TTM uses LLM agents to automatically decouple a character’s features into personality, memory, and linguistic style. Our framework involves a structured, three-stage generation pipeline that utilizes these features for controlled role-playing. It achieves high-fidelity role-playing performance and also enables seamless combinations across diverse linguistic styles and even variations in personality and memory. We evaluate our framework through human assessment, and the results demonstrate that our method achieves outstanding performance in generating expressive and stylistically consistent character dialogues.
[65] Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning
Yanjun Zheng, Xiyang Du, Longfei Liao, Xiaoke Zhao, Zhaowen Zhou, Bo Zhang, Jiawei Liu, Xiang Qi, Zhe Li, Zhiqiang Zhang, Wei Wang, Peng Zhang
Main category: cs.CL
TL;DR: The paper introduces Agentar-Fin-R1, a series of financial large language models (8B and 32B parameters) based on Qwen3, designed to enhance reasoning capabilities, reliability, and domain specialization for financial applications through systematic optimization and trustworthiness frameworks.
Details
Motivation: Existing LLMs show limitations in financial applications when dealing with sophisticated reasoning, trustworthiness requirements, and domain-specific adaptation. There's a need for specialized financial LLMs that can handle high-stakes financial scenarios with better reasoning capabilities and reliability.Method: The approach integrates: (1) a high-quality systematic financial task label system, (2) a multi-layered trustworthiness assurance framework including trustworthy knowledge engineering and multi-agent data synthesis, (3) label-guided automated difficulty-aware optimization, (4) two-stage training pipeline, and (5) dynamic attribution systems for improved training efficiency.
Result: Agentar-Fin-R1 achieves state-of-the-art performance on financial benchmarks (Fineva, FinEval, FinanceIQ) and demonstrates exceptional general reasoning capabilities on datasets like MATH-500 and GPQA-diamond. The models show substantial improvements in training efficiency and real-world deployment capabilities as validated by the newly proposed Finova evaluation benchmark.
Conclusion: Agentar-Fin-R1 successfully addresses the limitations of existing LLMs in financial applications by combining enhanced reasoning capabilities with robust trustworthiness frameworks, making it an effective solution for high-stakes financial applications while maintaining strong general reasoning performance.
Abstract: Large Language Models (LLMs) exhibit considerable promise in financial applications; however, prevailing models frequently demonstrate limitations when confronted with scenarios that necessitate sophisticated reasoning capabilities, stringent trustworthiness criteria, and efficient adaptation to domain-specific requirements. We introduce the Agentar-Fin-R1 series of financial large language models (8B and 32B parameters), specifically engineered based on the Qwen3 foundation model to enhance reasoning capabilities, reliability, and domain specialization for financial applications. Our optimization approach integrates a high-quality, systematic financial task label system with a comprehensive multi-layered trustworthiness assurance framework. This framework encompasses high-quality trustworthy knowledge engineering, multi-agent trustworthy data synthesis, and rigorous data validation governance. Through label-guided automated difficulty-aware optimization, a two-stage training pipeline, and dynamic attribution systems, we achieve substantial improvements in training efficiency. Our models undergo comprehensive evaluation on mainstream financial benchmarks including Fineva, FinEval, and FinanceIQ, as well as general reasoning datasets such as MATH-500 and GPQA-diamond. To thoroughly assess real-world deployment capabilities, we innovatively propose the Finova evaluation benchmark, which focuses on agent-level financial reasoning and compliance verification. Experimental results demonstrate that Agentar-Fin-R1 not only achieves state-of-the-art performance on financial tasks but also exhibits exceptional general reasoning capabilities, validating its effectiveness as a trustworthy solution for high-stakes financial applications. The Finova bench is available at https://github.com/antgroup/Finova.
cs.CV
[66] Post-Disaster Affected Area Segmentation with a Vision Transformer (ViT)-based EVAP Model using Sentinel-2 and Formosat-5 Imagery
Yi-Shan Chu, Hsuan-Cheng Wei
Main category: cs.CV
TL;DR: A ViT-based framework uses weakly supervised learning with PCA feature analysis and confidence indexing to improve disaster-affected area segmentation from satellite imagery, enhancing Taiwan Space Agency’s emergency mapping products with limited manual annotations.
Details
Motivation: Current disaster mapping systems like TASA's EVAP require accurate segmentation of disaster-affected areas from remote sensing imagery, but obtaining sufficient manually annotated training data is challenging and time-consuming for emergency response scenarios.Method: The framework starts with small manually annotated regions, applies PCA-based feature space analysis with a confidence index to expand labels into a weakly supervised training set, then trains ViT-based encoder-decoder models using multi-band Sentinel-2 and Formosat-5 imagery with multiple decoder variants and multi-stage loss strategies.
Result: Case studies on 2022 Poyang Lake drought and 2023 Rhodes wildfire showed improved smoothness and reliability of segmentation results compared to higher-resolution EVAP outputs, demonstrating better spatial coherence and segmentation consistency.
Conclusion: The proposed ViT-based framework offers a scalable approach for disaster mapping that can improve segmentation quality under limited supervision, making it valuable for emergency response when accurate ground truth data is unavailable.
Abstract: We propose a vision transformer (ViT)-based deep learning framework to refine disaster-affected area segmentation from remote sensing imagery, aiming to support and enhance the Emergent Value Added Product (EVAP) developed by the Taiwan Space Agency (TASA). The process starts with a small set of manually annotated regions. We then apply principal component analysis (PCA)-based feature space analysis and construct a confidence index (CI) to expand these labels, producing a weakly supervised training set. These expanded labels are then used to train ViT-based encoder-decoder models with multi-band inputs from Sentinel-2 and Formosat-5 imagery. Our architecture supports multiple decoder variants and multi-stage loss strategies to improve performance under limited supervision. During the evaluation, model predictions are compared with higher-resolution EVAP output to assess spatial coherence and segmentation consistency. Case studies on the 2022 Poyang Lake drought and the 2023 Rhodes wildfire demonstrate that our framework improves the smoothness and reliability of segmentation results, offering a scalable approach for disaster mapping when accurate ground truth is unavailable.
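The PCA-plus-confidence-index label expansion can be sketched as: fit PCA on the seed-labeled pixels, score every pixel by its distance to the seed cluster in that subspace, and promote high-confidence pixels into the weak label set. The confidence index below is an assumed form (inverse distance), not necessarily the CI defined in the paper:

```python
import numpy as np

def expand_labels(features, seed_mask, n_components=3, keep_frac=0.2):
    """features: (H*W, C) per-pixel multispectral features.
    seed_mask: boolean (H*W,) manual annotations of affected pixels.
    Returns an expanded boolean mask usable as weak labels."""
    X_seed = features[seed_mask]
    mean = X_seed.mean(axis=0)
    # PCA via SVD fitted on the seed pixels
    _, _, Vt = np.linalg.svd(X_seed - mean, full_matrices=False)
    P = Vt[:n_components]                        # principal axes
    centroid = ((X_seed - mean) @ P.T).mean(axis=0)

    all_proj = (features - mean) @ P.T
    dist = np.linalg.norm(all_proj - centroid, axis=1)
    confidence = 1.0 / (1.0 + dist)              # hypothetical confidence index

    threshold = np.quantile(confidence, 1.0 - keep_frac)
    return seed_mask | (confidence >= threshold)
```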
[67] Toward a Real-Time Framework for Accurate Monocular 3D Human Pose Estimation with Geometric Priors
Mohamed Adjel
Main category: cs.CV
TL;DR: This paper proposes a real-time monocular 3D human pose estimation framework that combines 2D keypoint detection with geometry-aware 2D-to-3D lifting, using camera intrinsics and anatomical priors to achieve accurate pose estimation without specialized hardware.
Details
Motivation: Monocular 3D human pose estimation is challenging and ill-posed, especially in real-time unconstrained environments. Direct image-to-3D approaches require large datasets and heavy models, while existing methods lack the accuracy and deployability needed for edge devices in real-world scenarios.Method: The framework combines real-time 2D keypoint detection with geometry-aware 2D-to-3D lifting that explicitly leverages known camera intrinsics and subject-specific anatomical priors. It uses self-calibration and biomechanically-constrained inverse kinematics to generate large-scale plausible 2D-3D training pairs from MoCap and synthetic datasets.
Result: The approach enables fast, personalized, and accurate 3D pose estimation from monocular images without requiring specialized hardware, offering a lightweight and flexible alternative to direct image-to-3D methods.
Conclusion: The proposed framework successfully bridges data-driven learning and model-based priors to improve accuracy, interpretability, and deployability of 3D human motion capture on edge devices in unconstrained environments, demonstrating the potential of combining geometric knowledge with neural approaches.
Abstract: Monocular 3D human pose estimation remains a challenging and ill-posed problem, particularly in real-time settings and unconstrained environments. While direct image-to-3D approaches require large annotated datasets and heavy models, 2D-to-3D lifting offers a more lightweight and flexible alternative, especially when enhanced with prior knowledge. In this work, we propose a framework that combines real-time 2D keypoint detection with geometry-aware 2D-to-3D lifting, explicitly leveraging known camera intrinsics and subject-specific anatomical priors. Our approach builds on recent advances in self-calibration and biomechanically-constrained inverse kinematics to generate large-scale, plausible 2D-3D training pairs from MoCap and synthetic datasets. We discuss how these ingredients can enable fast, personalized, and accurate 3D pose estimation from monocular images without requiring specialized hardware. This proposal aims to foster discussion on bridging data-driven learning and model-based priors to improve accuracy, interpretability, and deployability of 3D human motion capture on edge devices in the wild.
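The role of the camera intrinsics in the lifting step is essentially pinhole back-projection, with anatomical priors (for example, known bone lengths) constraining the per-joint depths. A minimal sketch of those two ingredients, assuming metric depths are produced by the lifter:

```python
import numpy as np

def backproject(keypoints_2d, depths, K):
    """keypoints_2d: (J, 2) pixel coordinates, depths: (J,) metric depths,
    K: (3, 3) camera intrinsics. Returns (J, 3) camera-frame 3D joints."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (keypoints_2d[:, 0] - cx) / fx * depths
    y = (keypoints_2d[:, 1] - cy) / fy * depths
    return np.stack([x, y, depths], axis=1)

def bone_length_residual(joints_3d, bones, lengths):
    """Residual between reconstructed and prior bone lengths; a lifter can
    minimize this jointly with a reprojection term. bones is a list of joint
    index pairs and lengths the subject-specific priors (both assumptions)."""
    est = np.array([np.linalg.norm(joints_3d[i] - joints_3d[j]) for i, j in bones])
    return est - np.asarray(lengths)
```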
[68] Principled Multimodal Representation Learning
Xiaohao Liu, Xiaobo Xia, See-Kiong Ng, Tat-Seng Chua
Main category: cs.CV
TL;DR: PMRL is a novel multimodal representation learning framework that simultaneously aligns multiple modalities without anchor dependency by optimizing the dominant singular value of the representation matrix, achieving better stability and performance than traditional pairwise contrastive methods.
Details
Motivation: Traditional multimodal representation learning methods rely on pairwise contrastive learning with predefined anchor modalities, which restricts alignment across all modalities. Recent methods attempting simultaneous alignment face challenges including limitations from fixed anchor points and instability from optimizing products of singular values.Method: PMRL leverages the theoretical insight that full alignment corresponds to a rank-1 Gram matrix and optimizes the dominant singular value of the representation matrix to align modalities along a shared leading direction. It uses a softmax-based loss function treating singular values as logits to prioritize the largest singular value, plus instance-wise contrastive regularization on leading eigenvectors to maintain inter-instance separability and prevent representation collapse.
Result: Extensive experiments across diverse tasks demonstrate PMRL’s superiority compared to baseline methods in multimodal representation learning tasks.
Conclusion: PMRL successfully addresses the limitations of traditional anchor-dependent methods by providing a more stable framework for simultaneous multimodal alignment, achieving better performance through principled optimization of singular values without relying on fixed anchor points.
Abstract: Multimodal representation learning seeks to create a unified representation space by integrating diverse data modalities to improve multimodal understanding. Traditional methods often depend on pairwise contrastive learning, which relies on a predefined anchor modality, restricting alignment across all modalities. Recent advances have investigated the simultaneous alignment of multiple modalities, yet several challenges remain, such as limitations imposed by fixed anchor points and instability arising from optimizing the product of singular values. To address the challenges, in this paper, we propose Principled Multimodal Representation Learning (PMRL), a novel framework that achieves simultaneous alignment of multiple modalities without anchor dependency in a more stable manner. Specifically, grounded in the theoretical insight that full alignment corresponds to a rank-1 Gram matrix, PMRL optimizes the dominant singular value of the representation matrix to align modalities along a shared leading direction. We propose a softmax-based loss function that treats singular values as logits to prioritize the largest singular value. Besides, instance-wise contrastive regularization on the leading eigenvectors maintains inter-instance separability and prevents representation collapse. Extensive experiments across diverse tasks demonstrate PMRL’s superiority compared to baseline methods. The source code will be publicly available.
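The core objective can be sketched directly from the abstract: stack the per-modality representations of one instance, take the singular values of that matrix, and apply a softmax loss that concentrates mass on the largest one (full alignment corresponds to a rank-1 Gram matrix). A minimal numpy illustration; the training setup, contrastive regularizer, and exact loss form used in the paper are omitted:

```python
import numpy as np

def pmrl_alignment_loss(reps):
    """reps: (M, D) representations of the same instance from M modalities.
    Rows are normalized, singular values are treated as logits, and the loss
    is the cross-entropy targeting the largest singular value."""
    reps = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    s = np.linalg.svd(reps, compute_uv=False)     # singular values, descending
    log_probs = s - np.log(np.exp(s).sum())
    return -log_probs[0]

# Perfectly aligned modalities form a rank-1 matrix and give a lower loss
v = np.random.default_rng(0).normal(size=4)
aligned = np.stack([v, v, v])
misaligned = np.random.default_rng(1).normal(size=(3, 4))
print(pmrl_alignment_loss(aligned), pmrl_alignment_loss(misaligned))
```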
[69] Coarse-to-fine crack cue for robust crack detection
Zelong Liu, Yuliang Gu, Zhichao Sun, Huachao Zhu, Xin Xiao, Bo Du, Laurent Najman, Yongchao Xu
Main category: cs.CV
TL;DR: CrackCue is a novel coarse-to-fine method that generates robust crack cues by leveraging thin structure properties to improve crack detection generalization across different domains, achieving this through max-pooling operations and reconstruction networks to create crack-free backgrounds.
Details
Motivation: Deep learning-based crack detection methods show impressive in-dataset performance but struggle with generalization to unseen domains. Previous methods typically overlook the thin structure property of cracks, which is crucial for robust detection across varied conditions like complex backgrounds, shadows, and different lighting.Method: The method employs a coarse-to-fine approach: (1) applies max-pooling and upsampling operations on crack images to generate a coarse crack-free background, (2) uses a reconstruction network to obtain a fine crack-free background, and (3) calculates the difference between the original image and fine crack-free background to produce a fine crack cue that embeds robust crack prior information.
Result: Extensive experimental results show that CrackCue significantly improves the generalization ability and robustness of baseline crack detection methods. The method works as a plug-and-play component that can be incorporated into three advanced crack detection networks, demonstrating its versatility and effectiveness.
Conclusion: CrackCue successfully addresses the domain generalization problem in crack detection by leveraging thin structure properties to generate robust crack cues. The method provides a practical plug-and-play solution that enhances existing crack detection networks’ performance across different domains and challenging conditions.
Abstract: Crack detection is an important task in computer vision. Despite impressive in-dataset performance, deep learning-based methods still struggle in generalizing to unseen domains. The thin structure property of cracks is usually overlooked by previous methods. In this work, we introduce CrackCue, a novel method for robust crack detection based on coarse-to-fine crack cue generation. The core concept lies on leveraging the thin structure property to generate a robust crack cue, guiding the crack detection. Specifically, we first employ a simple max-pooling and upsampling operation on the crack image. This results in a coarse crack-free background, based on which a fine crack-free background can be obtained via a reconstruction network. The difference between the original image and fine crack-free background provides a fine crack cue. This fine cue embeds robust crack prior information which is unaffected by complex backgrounds, shadow, and varied lighting. As a plug-and-play method, we incorporate the proposed CrackCue into three advanced crack detection networks. Extensive experimental results demonstrate that the proposed CrackCue significantly improves the generalization ability and robustness of the baseline methods. The source code will be publicly available.
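The coarse stage described above is easy to picture: block max-pooling suppresses thin dark cracks, nearest-neighbor upsampling restores resolution, and the difference against the original image yields the crack cue. A minimal sketch of that coarse step only (the reconstruction network that refines the background is omitted):

```python
import numpy as np

def coarse_crack_free_background(img, k=8):
    """img: (H, W) grayscale array with H and W divisible by k (for simplicity).
    Block-wise max-pooling removes thin dark cracks; nearest-neighbor
    upsampling restores the original resolution."""
    H, W = img.shape
    pooled = img.reshape(H // k, k, W // k, k).max(axis=(1, 3))
    return np.kron(pooled, np.ones((k, k)))       # nearest-neighbor upsample

def crack_cue(img, k=8):
    """Difference between the coarse crack-free background and the image;
    cracks (darker than their surroundings) become positive responses."""
    return coarse_crack_free_background(img, k) - img
```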
[70] HLFormer: Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning
Li Jun, Wang Jinpeng, Tan Chaolei, Lian Niu, Chen Long, Zhang Min, Wang Yaowei, Xia Shu-Tao, Chen Bin
Main category: cs.CV
TL;DR: HLFormer introduces the first hyperbolic modeling framework for Partially Relevant Video Retrieval (PRVR), combining hyperbolic and Euclidean spaces to better capture hierarchical video semantics and improve video-text matching performance.
Details
Motivation: Existing PRVR methods suffer from geometric distortion in Euclidean space that misrepresents the intrinsic hierarchical structure of videos and overlooks hierarchical semantics, leading to suboptimal temporal modeling when matching untrimmed videos with partial text queries.Method: HLFormer integrates Lorentz Attention Block and Euclidean Attention Block to encode video embeddings in hybrid spaces, uses Mean-Guided Adaptive Interaction Module for dynamic feature fusion, and introduces Partial Order Preservation Loss with Lorentzian cone constraints to enforce “text < video” hierarchy.
Result: Extensive experiments demonstrate that HLFormer outperforms state-of-the-art methods in partially relevant video retrieval tasks, showing improved cross-modal matching capabilities.
Conclusion: The proposed hyperbolic modeling framework successfully addresses the limitations of Euclidean space in capturing video hierarchical semantics, leading to enhanced performance in PRVR through better temporal modeling and cross-modal matching.
Abstract: Partially Relevant Video Retrieval (PRVR) addresses the critical challenge of matching untrimmed videos with text queries describing only partial content. Existing methods suffer from geometric distortion in Euclidean space that sometimes misrepresents the intrinsic hierarchical structure of videos and overlooks certain hierarchical semantics, ultimately leading to suboptimal temporal modeling. To address this issue, we propose the first hyperbolic modeling framework for PRVR, namely HLFormer, which leverages hyperbolic space learning to compensate for the suboptimal hierarchical modeling capabilities of Euclidean space. Specifically, HLFormer integrates the Lorentz Attention Block and Euclidean Attention Block to encode video embeddings in hybrid spaces, using the Mean-Guided Adaptive Interaction Module to dynamically fuse features. Additionally, we introduce a Partial Order Preservation Loss to enforce “text < video” hierarchy through Lorentzian cone constraints. This approach further enhances cross-modal matching by reinforcing partial relevance between video content and text queries. Extensive experiments show that HLFormer outperforms state-of-the-art methods. Code is released at https://github.com/lijun2005/ICCV25-HLFormer.
[71] CLAMP: Contrastive Learning with Adaptive Multi-loss and Progressive Fusion for Multimodal Aspect-Based Sentiment Analysis
Xiaoqiang He
Main category: cs.CV
TL;DR: This paper introduces CLAMP, an end-to-end contrastive learning framework for multimodal aspect-based sentiment analysis that uses progressive attention fusion, multi-task contrastive learning, and adaptive multi-loss aggregation to better align text and image features for identifying aspect terms and their sentiment polarities.
Details
Motivation: Existing MABSA methods suffer from cross-modal alignment noise and insufficient consistency in fine-grained representations. Global modality alignment methods often overlook the connection between aspect terms and their corresponding local visual regions, creating a representation gap between text and images that needs to be addressed.Method: The paper proposes CLAMP framework with three key components: (1) Progressive Attention Fusion network for hierarchical multi-stage cross-modal interactions to enhance fine-grained text-image alignment, (2) Multi-task Contrastive Learning combining global modal contrast and local granularity alignment for better cross-modal representation consistency, and (3) Adaptive Multi-loss Aggregation using dynamic uncertainty-based weighting to calibrate loss contributions and mitigate gradient interference.
Result: CLAMP consistently outperforms the vast majority of existing state-of-the-art methods on standard public benchmarks for multimodal aspect-based sentiment analysis, demonstrating improved effectiveness in identifying aspect terms and determining their sentiment polarities.
Conclusion: The CLAMP framework successfully addresses the limitations of existing MABSA methods by effectively bridging the representation gap between text and images through progressive attention fusion, enhanced cross-modal consistency via contrastive learning, and improved training stability through adaptive loss aggregation, resulting in superior performance on benchmark datasets.
Abstract: Multimodal aspect-based sentiment analysis (MABSA) seeks to identify aspect terms within paired image-text data and determine their fine-grained sentiment polarities, representing a fundamental task for improving the effectiveness of applications such as product review systems and public opinion monitoring. Existing methods face challenges such as cross-modal alignment noise and insufficient consistency in fine-grained representations. While global modality alignment methods often overlook the connection between aspect terms and their corresponding local visual regions, bridging the representation gap between text and images remains a challenge. To address these limitations, this paper introduces an end-to-end Contrastive Learning framework with Adaptive Multi-loss and Progressive Attention Fusion (CLAMP). The framework is composed of three novel modules: Progressive Attention Fusion network, Multi-task Contrastive Learning, and Adaptive Multi-loss Aggregation. The Progressive Attention Fusion network enhances fine-grained alignment between textual features and image regions via hierarchical, multi-stage cross-modal interactions, effectively suppressing irrelevant visual noise. Secondly, multi-task contrastive learning combines global modal contrast and local granularity alignment to enhance cross-modal representation consistency. Adaptive Multi-loss Aggregation employs a dynamic uncertainty-based weighting mechanism to calibrate loss contributions according to each task’s uncertainty, thereby mitigating gradient interference. Evaluation on standard public benchmarks demonstrates that CLAMP consistently outperforms the vast majority of existing state-of-the-art methods.
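The adaptive multi-loss aggregation resembles homoscedastic-uncertainty weighting, where each task's loss is scaled by a learned uncertainty term and a regularizer keeps the uncertainties from growing without bound. The exp(-s) parameterization below is a common form used here as an assumption; the paper's exact mechanism may differ:

```python
import numpy as np

def adaptive_multi_loss(task_losses, log_vars):
    """task_losses: per-task scalar losses; log_vars: learned log-variance per
    task. High-uncertainty tasks are down-weighted, and the additive s term
    penalizes inflating the uncertainty to escape the loss."""
    total = 0.0
    for loss, s in zip(task_losses, log_vars):
        total += np.exp(-s) * loss + s
    return total

print(adaptive_multi_loss([0.9, 0.4, 1.3], [0.0, -0.5, 0.7]))
```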
[72] SIA: Enhancing Safety via Intent Awareness for Vision-Language Models
Youngjin Na, Sangheon Jeong, Youngwan Lee
Main category: cs.CV
TL;DR: The paper proposes SIA (Safety via Intent Awareness), a training-free prompt engineering framework that detects and mitigates harmful intent in vision-language models by using a three-stage reasoning process to understand the combined meaning of image-text inputs and provide safer responses.
Details
Motivation: Vision-language models face new safety risks where seemingly innocuous image-text combinations can reveal harmful intent and trigger unsafe responses. Existing safety approaches using post hoc filtering or static refusal prompts struggle to detect these latent risks that emerge from the subtle interplay between visual and textual inputs.Method: SIA uses a three-stage reasoning process: (1) visual abstraction via captioning to understand the image content, (2) intent inference through few-shot chain-of-thought prompting to determine underlying intent, and (3) intent-conditioned response refinement to generate safer outputs. The framework is training-free and dynamically adapts to implicit intent rather than using predefined rules or classifiers.
Result: SIA achieves substantial safety improvements and outperforms prior methods on safety-critical benchmarks including SIUO, MM-SafetyBench, and HoliSafe. There is a minor reduction in general reasoning accuracy on MMStar, but the safety gains demonstrate the effectiveness of intent-aware reasoning.
Conclusion: SIA successfully addresses multimodal safety challenges by proactively detecting harmful intent through dynamic reasoning rather than static filtering. The framework shows that intent-aware approaches can significantly improve VLM safety alignment with human values, though with some trade-off in general performance.
Abstract: As vision-language models (VLMs) are increasingly deployed in real-world applications, new safety risks arise from the subtle interplay between images and text. In particular, seemingly innocuous inputs can combine to reveal harmful intent, leading to unsafe model responses. Despite increasing attention to multimodal safety, previous approaches based on post hoc filtering or static refusal prompts struggle to detect such latent risks, especially when harmfulness emerges only from the combination of inputs. We propose SIA (Safety via Intent Awareness), a training-free prompt engineering framework that proactively detects and mitigates harmful intent in multimodal inputs. SIA employs a three-stage reasoning process: (1) visual abstraction via captioning, (2) intent inference through few-shot chain-of-thought prompting, and (3) intent-conditioned response refinement. Rather than relying on predefined rules or classifiers, SIA dynamically adapts to the implicit intent inferred from the image-text pair. Through extensive experiments on safety-critical benchmarks including SIUO, MM-SafetyBench, and HoliSafe, we demonstrate that SIA achieves substantial safety improvements, outperforming prior methods. Although SIA shows a minor reduction in general reasoning accuracy on MMStar, the corresponding safety gains highlight the value of intent-aware reasoning in aligning VLMs with human-centric values.
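The three-stage pipeline composes naturally as a prompt chain around a generic vision-language completion call. In the sketch below, `llm` and the prompt wording are placeholders, not the paper's prompts:

```python
def sia_respond(image, user_text, llm, fewshot_cot_examples):
    """Minimal sketch of the three-stage SIA flow (all prompts are hypothetical)."""
    # Stage 1: visual abstraction via captioning
    caption = llm(image=image,
                  prompt="Describe this image factually in one paragraph.")

    # Stage 2: intent inference with few-shot chain-of-thought
    intent = llm(prompt=f"{fewshot_cot_examples}\n"
                        f"Image description: {caption}\nUser request: {user_text}\n"
                        "Step by step, what is the user's underlying intent, "
                        "and is it harmful?")

    # Stage 3: intent-conditioned response refinement
    return llm(prompt=f"Inferred intent: {intent}\nUser request: {user_text}\n"
                      "Answer helpfully; if the intent is harmful, refuse and explain why.")
```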
[73] AURA: A Multi-Modal Medical Agent for Understanding, Reasoning & Annotation
Nima Fathi, Amar Kumar, Tal Arbel
Main category: cs.CV
TL;DR: AURA is the first visual linguistic explainability agent for medical imaging that uses LLM-based architecture (Qwen-32B) to provide interactive analysis, explanations, and evaluation of medical images through dynamic interactions and hypothesis testing.
Details
Motivation: Current medical imaging systems are static prediction-based, lacking transparency and adaptability. There's a need for more interactive, explainable AI systems that can provide contextual explanations and support clinical decision-making in medical image analysis.Method: AURA leverages Qwen-32B LLM architecture with a modular toolbox including: (1) segmentation suite for phase grounding, pathology, and anatomy segmentation; (2) counterfactual image-generation module for reasoning through image-level explanations; (3) evaluation tools with pixel-wise analysis, classification, and diagnostic relevance assessment.
Result: AURA enables dynamic interactions, contextual explanations, and hypothesis testing for medical images, representing a shift from static predictions to interactive decision support systems with improved transparency and clinical alignment.
Conclusion: AURA demonstrates the transformative potential of agentic AI in medical imaging, moving beyond static predictions to provide interactive, explainable, and clinically-aligned decision support systems for comprehensive medical image analysis.
Abstract: Recent advancements in Large Language Models (LLMs) have catalyzed a paradigm shift from static prediction systems to agentic AI agents capable of reasoning, interacting with tools, and adapting to complex tasks. While LLM-based agentic systems have shown promise across many domains, their application to medical imaging remains in its infancy. In this work, we introduce AURA, the first visual linguistic explainability agent designed specifically for comprehensive analysis, explanation, and evaluation of medical images. By enabling dynamic interactions, contextual explanations, and hypothesis testing, AURA represents a significant advancement toward more transparent, adaptable, and clinically aligned AI systems. We highlight the promise of agentic AI in transforming medical image analysis from static predictions to interactive decision support. Leveraging Qwen-32B, an LLM-based architecture, AURA integrates a modular toolbox comprising: (i) a segmentation suite with phase grounding, pathology segmentation, and anatomy segmentation to localize clinically meaningful regions; (ii) a counterfactual image-generation module that supports reasoning through image-level explanations; and (iii) a set of evaluation tools including pixel-wise difference-map analysis, classification, and advanced state-of-the-art components to assess diagnostic relevance and visual interpretability.
[74] Cross-domain Multi-step Thinking: Zero-shot Fine-grained Traffic Sign Recognition in the Wild
Yaozong Gan, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Main category: cs.CV
TL;DR: The paper proposes Cross-domain Multi-step Thinking (CdMT) framework that uses large multimodal models with three types of descriptions (context, characteristic, and differential) to improve zero-shot fine-grained traffic sign recognition across different countries and domains, achieving superior performance on five benchmark datasets.
Details
Motivation: Zero-shot fine-grained traffic sign recognition in real-world scenarios is challenging due to cross-domain problems between clean template signs and real-world counterparts, particularly in cross-country scenarios where traffic signs differ between countries. Existing approaches struggle with these challenges.Method: The CdMT framework leverages multi-step reasoning capabilities of large multimodal models using three key components: (1) Context descriptions with center coordinate prompt optimization for precise localization and filtering irrelevant responses, (2) Characteristic descriptions derived from in-context learning with template traffic signs to bridge cross-domain gaps, and (3) Differential descriptions to distinguish subtle differences among similar signs.
Result: CdMT achieved superior performance on all five datasets with recognition accuracies of 0.93 (GTSRB), 0.89 (BTSD), 0.97 (TT-100K), 0.89 (Sapporo), and 0.85 (Yokohama), outperforming other state-of-the-art methods across three benchmark datasets and two real-world datasets from different countries.
Conclusion: The proposed CdMT framework successfully addresses cross-domain and cross-country traffic sign recognition challenges by utilizing multi-step thinking processes in large multimodal models, demonstrating effectiveness without requiring training data and using only simple uniform instructions for cross-country TSR applications.
Abstract: In this study, we propose Cross-domain Multi-step Thinking (CdMT) to improve zero-shot fine-grained traffic sign recognition (TSR) performance in the wild. Zero-shot fine-grained TSR in the wild is challenging due to the cross-domain problem between clean template traffic signs and real-world counterparts, and existing approaches particularly struggle with cross-country TSR scenarios, where traffic signs typically differ between countries. The proposed CdMT framework tackles these challenges by leveraging the multi-step reasoning capabilities of large multimodal models (LMMs). We introduce context, characteristic, and differential descriptions to design multiple thinking processes for LMMs. Context descriptions, which are enhanced by center coordinate prompt optimization, enable the precise localization of target traffic signs in complex road images and filter irrelevant responses via novel prior traffic sign hypotheses. Characteristic descriptions, which are derived from in-context learning with template traffic signs, bridge cross-domain gaps and enhance fine-grained TSR. Differential descriptions refine the multimodal reasoning ability of LMMs by distinguishing subtle differences among similar signs. CdMT is independent of training data and requires only simple and uniform instructions, enabling it to achieve cross-country TSR. We conducted extensive experiments on three benchmark datasets and two real-world datasets from different countries. The proposed CdMT framework achieved superior performance compared with other state-of-the-art methods on all five datasets, with recognition accuracies of 0.93, 0.89, 0.97, 0.89, and 0.85 on the GTSRB, BTSD, TT-100K, Sapporo, and Yokohama datasets, respectively.
[75] Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection
Xiang Li
Main category: cs.CV
TL;DR: This paper addresses misalignment issues between LiDAR and camera features in Bird’s-Eye-View (BEV) representation for autonomous vehicles by proposing a method that uses 2D object priors to pre-align cross-modal features, achieving state-of-the-art performance on nuScenes dataset with 71.5% mAP and 73.6% NDS.
Details
Motivation: Current LiDAR-camera fusion methods suffer from misalignment between camera and LiDAR features caused by projection errors from extrinsic calibration inaccuracies and LiDAR rolling shutter effects. These misalignments are concentrated at object-background boundaries, which can be identified by 2D detectors, motivating the use of 2D object priors to pre-align cross-modal features before fusion.Method: The paper proposes three key components: (1) Prior Guided Depth Calibration (PGDC) that uses 2D priors to correct local misalignment and preserve correct cross-modal feature pairs, (2) Discontinuity Aware Geometric Fusion (DAGF) that processes calibrated results to suppress noise and enhance object-background boundaries, and (3) Structural Guidance Depth Modulator (SGDM) that uses gated attention mechanism to efficiently fuse aligned depth and image features.
Result: The proposed method achieves state-of-the-art performance on the nuScenes validation dataset, reaching 71.5% mAP (mean Average Precision) and 73.6% NDS (nuScenes Detection Score), demonstrating significant improvements in 3D perception capabilities for autonomous vehicles.
Conclusion: The paper successfully addresses the critical misalignment problem in LiDAR-camera fusion by leveraging 2D object priors to guide the alignment process. The three proposed components work together to achieve better cross-modal feature fusion, resulting in superior 3D perception performance that advances the state-of-the-art in autonomous vehicle perception systems.
Abstract: Integrating LiDAR and camera inputs into a unified Bird’s-Eye-View (BEV) representation is crucial for enhancing 3D perception capabilities of autonomous vehicles. However, current methods are often affected by misalignment between camera and LiDAR features. This misalignment leads to inaccurate depth supervision in camera branch and erroneous fusion during cross-modal feature aggregation. The root cause of this misalignment lies in projection errors, stemming from minor extrinsic calibration inaccuracies and rolling shutter effect of LiDAR during vehicle motion. In this work, our key insight is that these projection errors are predominantly concentrated at object-background boundaries, which are readily identified by 2D detectors. Based on this, our main motivation is to utilize 2D object priors to pre-align cross-modal features before fusion. To address local misalignment, we propose Prior Guided Depth Calibration (PGDC), which leverages 2D priors to correct local misalignment and preserve correct cross-modal feature pairs. To resolve global misalignment, we introduce Discontinuity Aware Geometric Fusion (DAGF) to process calibrated results from PGDC, suppressing noise and explicitly enhancing sharp transitions at object-background boundaries. To effectively utilize these transition-aware depth representations, we incorporate Structural Guidance Depth Modulator (SGDM), using a gated attention mechanism to efficiently fuse aligned depth and image features. Our proposed method achieves state-of-the-art performance on nuScenes validation dataset, with its mAP and NDS reaching 71.5% and 73.6% respectively.
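As a rough illustration of the gated-attention fusion idea behind SGDM, the following PyTorch sketch fuses spatially aligned depth and image features with a learned per-pixel gate. The layer sizes and the exact gating form are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of gated fusion of aligned depth and image features.
import torch
import torch.nn as nn

class GatedDepthImageFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # The gate is predicted from the concatenated modalities.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, depth_feat: torch.Tensor, img_feat: torch.Tensor) -> torch.Tensor:
        # depth_feat, img_feat: (B, C, H, W), already spatially aligned.
        g = self.gate(torch.cat([depth_feat, img_feat], dim=1))
        # Per-pixel, per-channel convex combination of the two modalities.
        return g * depth_feat + (1.0 - g) * img_feat

# Usage: fused = GatedDepthImageFusion(256)(depth_feat, img_feat)
```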
[76] NVS-SQA: Exploring Self-Supervised Quality Representation Learning for Neurally Synthesized Scenes without References
Qiang Qu, Yiran Shen, Xiaoming Chen, Yuk Ying Chung, Weidong Cai, Tongliang Liu
Main category: cs.CV
TL;DR: The paper proposes NVS-SQA, a self-supervised no-reference quality assessment method for Neural View Synthesis that outperforms both no-reference and full-reference methods without requiring human labels or dense reference views.
Details
Motivation: Traditional quality assessment methods for Neural View Synthesis (like NeRF and 3D Gaussian Splatting) rely on full-reference metrics that require dense reference views, which are often unavailable. Additionally, acquiring human perceptual labels is challenging, leading to limited datasets and potential overfitting issues.
Method: The authors develop NVS-SQA, a self-supervised learning approach that learns no-reference quality representations without human labels. They use heuristic cues and quality scores as learning objectives, combined with a specialized contrastive pair preparation process to improve learning effectiveness and efficiency.
Result: NVS-SQA significantly outperforms 17 no-reference methods (averaging 109.5% improvement in SRCC, 98.6% in PLCC, and 91.5% in KRCC) and even exceeds 16 full-reference methods across all metrics (22.9% improvement in SRCC, 19.1% in PLCC, and 18.6% in KRCC over the second best).
Conclusion: The proposed NVS-SQA method successfully addresses the limitations of existing quality assessment approaches for neurally synthesized scenes by providing superior performance without requiring human labels or dense reference views, making it more practical for real-world applications.
Abstract: Neural View Synthesis (NVS), such as NeRF and 3D Gaussian Splatting, effectively creates photorealistic scenes from sparse viewpoints, typically evaluated by quality assessment methods like PSNR, SSIM, and LPIPS. However, these full-reference methods, which compare synthesized views to reference views, may not fully capture the perceptual quality of neurally synthesized scenes (NSS), particularly due to the limited availability of dense reference views. Furthermore, the challenges in acquiring human perceptual labels hinder the creation of extensive labeled datasets, risking model overfitting and reduced generalizability. To address these issues, we propose NVS-SQA, a NSS quality assessment method to learn no-reference quality representations through self-supervision without reliance on human labels. Traditional self-supervised learning predominantly relies on the “same instance, similar representation” assumption and extensive datasets. However, given that these conditions do not apply in NSS quality assessment, we employ heuristic cues and quality scores as learning objectives, along with a specialized contrastive pair preparation process to improve the effectiveness and efficiency of learning. The results show that NVS-SQA outperforms 17 no-reference methods by a large margin (i.e., on average 109.5% in SRCC, 98.6% in PLCC, and 91.5% in KRCC over the second best) and even exceeds 16 full-reference methods across all evaluation metrics (i.e., 22.9% in SRCC, 19.1% in PLCC, and 18.6% in KRCC over the second best).
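The SRCC, PLCC, and KRCC figures above are standard correlation coefficients between predicted and ground-truth quality scores; they can be reproduced on toy data with SciPy as follows (the numbers below are made up for illustration).

```python
# Standard quality-assessment correlation metrics on toy data.
from scipy import stats

predicted = [0.81, 0.42, 0.67, 0.93, 0.55]
ground_truth = [0.78, 0.40, 0.70, 0.90, 0.60]

srcc, _ = stats.spearmanr(predicted, ground_truth)   # rank correlation
plcc, _ = stats.pearsonr(predicted, ground_truth)    # linear correlation
krcc, _ = stats.kendalltau(predicted, ground_truth)  # pairwise-order agreement

print(f"SRCC={srcc:.3f}  PLCC={plcc:.3f}  KRCC={krcc:.3f}")
```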
[77] Pixels, Patterns, but No Poetry: To See The World like Humans
Hongcheng Gao, Zihao Huang, Lin Xu, Jingyi Tang, Xinhao Li, Yue Liu, Haoyang Li, Taihang Hu, Minhua Lin, Xinlong Yang, Ge Wu, Balong Bi, Hongyu Chen, Wentao Zhang
Main category: cs.CV
TL;DR: This paper introduces the Turing Eye Test (TET), a perception-focused benchmark that reveals state-of-the-art Multimodal Large Language Models fail catastrophically on basic visual tasks that humans solve intuitively, highlighting a critical gap in vision tower generalization rather than reasoning capabilities.
Details
Motivation: While recent research has focused on enhancing reasoning capabilities in MLLMs, a fundamental question remains unanswered: Can MLLMs truly perceive the world as humans do? This paper shifts focus from reasoning to perception to address this gap.
Method: The authors introduce the Turing Eye Test (TET), a challenging perception-oriented benchmark comprising four diagnostic tasks that evaluate MLLMs’ performance on synthetic images that humans process intuitively. They test various approaches including in-context learning, training on language backbone, and fine-tuning the vision tower.
Result: State-of-the-art MLLMs exhibit catastrophic failures on perceptual tasks that are trivial for humans. In-context learning and training on language backbone fail to improve performance, while fine-tuning the vision tower enables rapid adaptation, indicating the challenges lie in vision tower generalization rather than language reasoning capabilities.
Conclusion: The benchmark reveals a key gap between current MLLMs and human perception, specifically in vision tower generalization rather than knowledge and reasoning capabilities of the language backbone. The authors plan to introduce more diverse tasks and methods to enhance visual generalization in future work.
Abstract: Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhancing reasoning capabilities in MLLMs, a fundamental question persists: Can Multimodal Large Language Models truly perceive the world as humans do? This paper shifts focus from reasoning to perception. Rather than constructing benchmarks specifically for reasoning, we introduce the Turing Eye Test (TET), a challenging perception-oriented benchmark comprising four diagnostic tasks that evaluate MLLMs’ performance on synthetic images that humans process intuitively. Our findings reveal that state-of-the-art MLLMs exhibit catastrophic failures on our perceptual tasks trivial for humans. Both in-context learning and training on the language backbone (effective for previous benchmarks) fail to improve performance on our tasks, while fine-tuning the vision tower enables rapid adaptation, suggesting that our benchmark poses challenges for vision tower generalization rather than for the knowledge and reasoning capabilities of the language backbone, a key gap between current MLLMs and human perception. We release a representative subset of TET tasks in this version, and will introduce more diverse tasks and methods to enhance visual generalization in future work.
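The fine-tuning comparison above amounts to freezing the language backbone while updating only the vision tower. A minimal PyTorch sketch of that setup, where vision_tower is an assumed attribute name rather than any specific library's API:

```python
# Sketch: freeze everything, then re-enable gradients for the vision tower only.
import torch

def freeze_all_but_vision_tower(model: torch.nn.Module):
    for p in model.parameters():
        p.requires_grad = False
    for p in model.vision_tower.parameters():  # `vision_tower` is an assumed attribute name
        p.requires_grad = True

# Only the unfrozen parameters are handed to the optimizer, e.g.:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-5
# )
```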
[78] HIPPO-Video: Simulating Watch Histories with Large Language Models for Personalized Video Highlighting
Jeongeun Lee, Youngjae Yu, Dongha Lee
Main category: cs.CV
TL;DR: This paper introduces HIPPO-Video, a novel dataset for personalized video highlighting that uses LLM-based user simulators to generate realistic watch histories, and proposes HiPHer method that outperforms existing approaches by leveraging personalized preferences for video segment saliency prediction.
Details
Motivation: Existing video datasets lack personalization and fail to capture the complexity of user behavior, relying on isolated videos or simple text queries, while personalized video highlighting is essential due to highly variable and complex user preferences in the exponentially growing video content landscape.
Method: The authors create HIPPO-Video dataset using an LLM-based user simulator to generate realistic watch histories reflecting diverse user preferences, containing 2,040 (watch history, saliency score) pairs across 20,400 videos and 170 semantic categories. They also propose HiPHer method that leverages personalized watch histories to predict preference-conditioned segment-wise saliency scores.
Result: Through extensive experiments, the proposed HiPHer method outperforms existing generic and query-based approaches for video highlighting, demonstrating superior performance in personalized video segment saliency prediction.
Conclusion: The work successfully addresses the lack of personalization in video highlighting by introducing a comprehensive dataset and effective method, showcasing potential for highly user-centric video highlighting applications in real-world scenarios.
Abstract: The exponential growth of video content has made personalized video highlighting an essential task, as user preferences are highly variable and complex. Existing video datasets, however, often lack personalization, relying on isolated videos or simple text queries that fail to capture the intricacies of user behavior. In this work, we introduce HIPPO-Video, a novel dataset for personalized video highlighting, created using an LLM-based user simulator to generate realistic watch histories reflecting diverse user preferences. The dataset includes 2,040 (watch history, saliency score) pairs, covering 20,400 videos across 170 semantic categories. To validate our dataset, we propose HiPHer, a method that leverages these personalized watch histories to predict preference-conditioned segment-wise saliency scores. Through extensive experiments, we demonstrate that our method outperforms existing generic and query-based approaches, showcasing its potential for highly user-centric video highlighting in real-world scenarios.
[79] ReMeREC: Relation-aware and Multi-entity Referring Expression Comprehension
Yizhi Hu, Zezhao Tian, Xingqun Qi, Chen Su, Bingkun Yang, Junhui Yin, Muyi Sun, Man Zhang, Zhenan Sun
Main category: cs.CV
TL;DR: This paper addresses multi-entity Referring Expression Comprehension by creating a new dataset (ReMeX) and proposing ReMeREC framework that uses Text-adaptive Multi-entity Perceptron and Entity Inter-relationship Reasoner to better localize multiple entities and model their relationships in images.
Details
Motivation: Existing REC methods focus on single-entity localization and ignore complex inter-entity relationships in multi-entity scenes, which limits their accuracy and reliability. Additionally, there is a lack of high-quality datasets with fine-grained, paired image-text-relation annotations that hinders further progress in this field.
Method: The authors propose ReMeREC framework with two key components: (1) Text-adaptive Multi-entity Perceptron (TMP) that dynamically infers entity quantity and span from textual cues to produce distinctive representations, and (2) Entity Inter-relationship Reasoner (EIR) that enhances relational reasoning and global scene understanding. They also construct ReMeX dataset with detailed relationship annotations and EntityText auxiliary dataset using large language models.
Result: ReMeREC achieves state-of-the-art performance on four benchmark datasets, significantly outperforming existing approaches in multi-entity grounding and relation prediction tasks by a large margin.
Conclusion: The proposed ReMeREC framework successfully addresses the limitations of existing REC methods by effectively modeling inter-entity relationships and handling multi-entity scenarios. The introduction of high-quality datasets (ReMeX and EntityText) and the novel architectural components (TMP and EIR) contribute to substantial improvements in multi-entity referring expression comprehension.
Abstract: Referring Expression Comprehension (REC) aims to localize specified entities or regions in an image based on natural language descriptions. While existing methods handle single-entity localization, they often ignore complex inter-entity relationships in multi-entity scenes, limiting their accuracy and reliability. Additionally, the lack of high-quality datasets with fine-grained, paired image-text-relation annotations hinders further progress. To address this challenge, we first construct a relation-aware, multi-entity REC dataset called ReMeX, which includes detailed relationship and textual annotations. We then propose ReMeREC, a novel framework that jointly leverages visual and textual cues to localize multiple entities while modeling their inter-relations. To address the semantic ambiguity caused by implicit entity boundaries in language, we introduce the Text-adaptive Multi-entity Perceptron (TMP), which dynamically infers both the quantity and span of entities from fine-grained textual cues, producing distinctive representations. Additionally, our Entity Inter-relationship Reasoner (EIR) enhances relational reasoning and global scene understanding. To further improve language comprehension for fine-grained prompts, we also construct a small-scale auxiliary dataset, EntityText, generated using large language models. Experiments on four benchmark datasets show that ReMeREC achieves state-of-the-art performance in multi-entity grounding and relation prediction, outperforming existing approaches by a large margin.
[80] EndoGen: Conditional Autoregressive Endoscopic Video Generation
Xinyu Liu, Hengyu Liu, Cheng Wang, Tianming Liu, Yixuan Yuan
Main category: cs.CV
TL;DR: EndoGen is the first conditional endoscopic video generation framework that uses autoregressive modeling with Spatiotemporal Grid-Frame Patterning and Semantic-Aware Token Masking to generate high-quality, clinically meaningful endoscopic videos for medical imaging applications.
Details
Motivation: Prior endoscopic video generation efforts focused on static images lacking dynamic context or used unconditional generation that fails to provide meaningful clinical references. There was a need for conditional endoscopic video generation that could advance medical imaging and enhance diagnostic capabilities.
Method: The paper proposes EndoGen, an autoregressive model with two key components: (1) Spatiotemporal Grid-Frame Patterning (SGP) strategy that reformulates multi-frame generation as grid-based image generation to leverage global dependency modeling, and (2) Semantic-Aware Token Masking (SAT) mechanism that selectively focuses on semantically meaningful regions during generation to enhance content diversity.
Result: Extensive experiments demonstrate the framework’s effectiveness in generating high-quality, conditionally guided endoscopic content. The generated videos also improve performance on downstream polyp segmentation tasks, showing practical clinical utility.
Conclusion: EndoGen successfully addresses the limitations of previous approaches by providing the first conditional endoscopic video generation framework that produces clinically meaningful, high-quality endoscopic videos with practical applications in medical diagnosis and downstream medical tasks.
Abstract: Endoscopic video generation is crucial for advancing medical imaging and enhancing diagnostic capabilities. However, prior efforts in this field have either focused on static images, lacking the dynamic context required for practical applications, or have relied on unconditional generation that fails to provide meaningful references for clinicians. Therefore, in this paper, we propose the first conditional endoscopic video generation framework, namely EndoGen. Specifically, we build an autoregressive model with a tailored Spatiotemporal Grid-Frame Patterning (SGP) strategy. It reformulates the learning of generating multiple frames as a grid-based image generation pattern, which effectively capitalizes on the inherent global dependency modeling capabilities of autoregressive architectures. Furthermore, we propose a Semantic-Aware Token Masking (SAT) mechanism, which enhances the model’s ability to produce rich and diverse content by selectively focusing on semantically meaningful regions during the generation process. Through extensive experiments, we demonstrate the effectiveness of our framework in generating high-quality, conditionally guided endoscopic content, and show that it improves performance on the downstream task of polyp segmentation. Code released at https://www.github.com/CUHK-AIM-Group/EndoGen.
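At its simplest, the grid-frame patterning idea tiles the frames of a clip into a single image so that an image-level autoregressive generator can model them jointly. A minimal NumPy sketch of that reshaping follows; the grid size and frame shape are illustrative assumptions, not the paper's configuration.

```python
# Sketch: tile T frames into one (rows x cols) grid image.
import numpy as np

def frames_to_grid(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """frames: (T, H, W, C) with T == rows * cols -> grid image of shape (rows*H, cols*W, C)."""
    t, h, w, c = frames.shape
    assert t == rows * cols
    grid = frames.reshape(rows, cols, h, w, c)   # (rows, cols, H, W, C)
    grid = grid.transpose(0, 2, 1, 3, 4)         # (rows, H, cols, W, C)
    return grid.reshape(rows * h, cols * w, c)

# Example: 16 frames of 64x64 RGB arranged as a 4x4 grid -> one 256x256 image.
video = np.random.rand(16, 64, 64, 3)
grid_image = frames_to_grid(video, rows=4, cols=4)
print(grid_image.shape)  # (256, 256, 3)
```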
[81] Infinite Video Understanding
Dell Zhang, Xiangyu Chen, Jixiang Luo, Mengxi Jia, Changzhi Sun, Ruilong Ren, Jingren Liu, Hao Sun, Xuelong Li
Main category: cs.CV
TL;DR: This position paper proposes “Infinite Video Understanding” as a blue-sky research objective, arguing that current video understanding models face significant limitations when processing lengthy video content, and outlines key research directions needed to achieve continuous processing of arbitrarily long video sequences.
Details
Motivation: Current state-of-the-art video understanding models, despite advances in LLMs and MLLMs, still face major computational and memory constraints when processing video content extending beyond minutes or hours. Challenges include maintaining temporal coherence, tracking complex events, and preserving fine-grained details over extended periods, indicating a need for a transformative research direction.
Method: The paper positions Infinite Video Understanding as a research framework and north star objective, drawing inspiration from existing long/ultra-long video understanding work and related fields. It outlines core challenges and proposes key research directions including streaming architectures, persistent memory mechanisms, hierarchical representations, event-centric reasoning, and novel evaluation paradigms.
Result: The paper identifies and articulates the fundamental limitations of current video understanding approaches and establishes a comprehensive research agenda. It provides a structured framework for addressing infinite-duration video processing through multiple technical innovations and evaluation methodologies.
Conclusion: Infinite Video Understanding represents a vital and ambitious research objective that can drive innovation across multiple AI research areas. By framing this as a blue-sky goal, the multimedia and AI research communities can work toward developing transformative capabilities for continuous video processing and understanding of arbitrary duration content.
Abstract: The rapid advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have ushered in remarkable progress in video understanding. However, a fundamental challenge persists: effectively processing and comprehending video content that extends beyond minutes or hours. While recent efforts like Video-XL-2 have demonstrated novel architectural solutions for extreme efficiency, and advancements in positional encoding such as HoPE and VideoRoPE++ aim to improve spatio-temporal understanding over extensive contexts, current state-of-the-art models still encounter significant computational and memory constraints when faced with the sheer volume of visual tokens from lengthy sequences. Furthermore, maintaining temporal coherence, tracking complex events, and preserving fine-grained details over extended periods remain formidable hurdles, despite progress in agentic reasoning systems like Deep Video Discovery. This position paper posits that a logical, albeit ambitious, next frontier for multimedia research is Infinite Video Understanding – the capability for models to continuously process, understand, and reason about video data of arbitrary, potentially never-ending duration. We argue that framing Infinite Video Understanding as a blue-sky research objective provides a vital north star for the multimedia, and the wider AI, research communities, driving innovation in areas such as streaming architectures, persistent memory mechanisms, hierarchical and adaptive representations, event-centric reasoning, and novel evaluation paradigms. Drawing inspiration from recent work on long/ultra-long video understanding and several closely related fields, we outline the core challenges and key research directions towards achieving this transformative capability.
[82] CausalStep: A Benchmark for Explicit Stepwise Causal Reasoning in Videos
Xuchen Li, Xuzhao Li, Shiyu Hu, Kaiqi Huang, Wentao Zhang
Main category: cs.CV
TL;DR: CausalStep is a new benchmark that evaluates stepwise causal reasoning in videos through 100 videos with 1,852 QA pairs, revealing significant gaps between current AI models and human-level reasoning capabilities.
Details
Motivation: Existing video benchmarks fail to rigorously evaluate true causal and stepwise reasoning, mainly assessing shallow understanding while allowing models to exploit global context and use shortcut solutions instead of proper sequential reasoning.
Method: The researchers created CausalStep by segmenting videos into causally linked units, enforcing strict stepwise question-answer protocols that require sequential answers, including carefully constructed distractors based on error type taxonomy, and introducing seven diagnostic metrics for comprehensive evaluation.
Result: The benchmark contains 100 videos across six categories with 1,852 multiple-choice QA pairs. Experiments with leading proprietary and open-source models, as well as human baselines, revealed a significant performance gap between current models and human-level stepwise reasoning capabilities.
Conclusion: CausalStep provides a rigorous benchmark that can drive progress in developing more robust and interpretable video reasoning capabilities by properly evaluating stepwise causal reasoning without allowing shortcut solutions.
Abstract: Recent advances in large language models (LLMs) have improved reasoning in text and image domains, yet achieving robust video reasoning remains a significant challenge. Existing video benchmarks mainly assess shallow understanding and reasoning and allow models to exploit global context, failing to rigorously evaluate true causal and stepwise reasoning. We present CausalStep, a benchmark designed for explicit stepwise causal reasoning in videos. CausalStep segments videos into causally linked units and enforces a strict stepwise question-answer (QA) protocol, requiring sequential answers and preventing shortcut solutions. Each question includes carefully constructed distractors based on error type taxonomy to ensure diagnostic value. The benchmark features 100 videos across six categories and 1,852 multiple-choice QA pairs. We introduce seven diagnostic metrics for comprehensive evaluation, enabling precise diagnosis of causal reasoning capabilities. Experiments with leading proprietary and open-source models, as well as human baselines, reveal a significant gap between current models and human-level stepwise reasoning. CausalStep provides a rigorous benchmark to drive progress in robust and interpretable video reasoning.
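The stepwise protocol described above can be pictured as an evaluation loop in which the model commits to an answer for each segment before the next one is revealed, so global context cannot be exploited. A minimal sketch, where answer_question is a hypothetical model call and the data layout is assumed for illustration:

```python
# Sketch of a strict stepwise QA protocol: answers are locked in sequentially.
def evaluate_stepwise(segments, questions, gold_answers, answer_question):
    history, correct = [], 0
    for segment, question, gold in zip(segments, questions, gold_answers):
        # The model sees only the segments revealed so far, never the full video.
        prediction = answer_question(context=history + [segment], question=question)
        correct += int(prediction == gold)
        history.append(segment)  # context grows only after the answer is committed
    return correct / len(questions)
```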
[83] DFDNet: Dynamic Frequency-Guided De-Flare Network
Minglong Xue, Aoxiang Ning, Shivakumara Palaiahnakote, Mingliang Zhou
Main category: cs.CV
TL;DR: This paper proposes DFDNet, a frequency-domain guided network that removes large-scale flare artifacts from nighttime photos by decoupling content from flare information in the frequency domain, using global dynamic frequency guidance and local detail guidance modules.
Details
Motivation: Existing methods struggle with removing large-scale flare artifacts and repairing structural damage near light sources in nighttime photography. The authors observed that flare artifacts show more significant discrepancies in frequency domain compared to spatial domain, motivating a frequency-domain approach.
Method: DFDNet consists of two main components: (1) Global Dynamic Frequency-domain Guidance (GDFG) module that dynamically optimizes global frequency features to separate flare from content information, and (2) Local Detail Guidance Module (LDGM) using contrastive learning to align local light source features with reference images for fine-grained restoration.
Result: Experimental results show that DFDNet outperforms existing state-of-the-art methods in removing flare artifacts and restoring image quality in nighttime photography.
Conclusion: The proposed frequency-domain approach effectively addresses the limitations of existing spatial-domain methods by leveraging the distinct frequency characteristics of flare artifacts, enabling better separation of content and flare information for superior image restoration performance.
Abstract: Strong light sources in nighttime photography frequently produce flares in images, significantly degrading visual quality and impacting the performance of downstream tasks. While some progress has been made, existing methods continue to struggle with removing large-scale flare artifacts and repairing structural damage in regions near the light source. We observe that these challenging flare artifacts exhibit more significant discrepancies from the reference images in the frequency domain compared to the spatial domain. Therefore, this paper presents a novel dynamic frequency-guided deflare network (DFDNet) that decouples content information from flare artifacts in the frequency domain, effectively removing large-scale flare artifacts. Specifically, DFDNet consists mainly of a global dynamic frequency-domain guidance (GDFG) module and a local detail guidance module (LDGM). The GDFG module guides the network to perceive the frequency characteristics of flare artifacts by dynamically optimizing global frequency domain features, effectively separating flare information from content information. Additionally, we design an LDGM via a contrastive learning strategy that aligns the local features of the light source with the reference image, reduces local detail damage from flare removal, and improves fine-grained image restoration. The experimental results demonstrate that the proposed method outperforms existing state-of-the-art methods in terms of performance. The code is available at https://github.com/AXNing/DFDNet.
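The frequency-domain intuition can be illustrated with a simple Fourier split of an image into low- and high-frequency components. The radial cutoff below is an illustrative assumption; DFDNet itself learns its frequency guidance dynamically rather than using a fixed mask.

```python
# Sketch: split a grayscale image into low- and high-frequency reconstructions.
import numpy as np

def split_frequencies(image: np.ndarray, cutoff: int = 20):
    """image: (H, W) grayscale -> (low_freq, high_freq) spatial reconstructions."""
    f = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    low_mask = dist <= cutoff                      # keep frequencies near the center
    low = np.fft.ifft2(np.fft.ifftshift(f * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(f * (~low_mask))).real
    return low, high
```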
[84] Finding Dori: Memorization in Text-to-Image Diffusion Models Is Less Local Than Assumed
Antoni Kowalczuk, Dominik Hintersdorf, Lukas Struppek, Kristian Kersting, Adam Dziedzic, Franziska Boenisch
Main category: cs.CV
TL;DR: This paper exposes vulnerabilities in current pruning-based defenses against data memorization in text-to-image diffusion models, showing that simple text embedding adjustments can re-trigger replication even after pruning, and proposes an adversarial fine-tuning approach as a more robust mitigation strategy.
Details
Motivation: Existing text-to-image diffusion models can memorize and replicate training data, raising data privacy and intellectual property concerns. Current mitigation approaches using weight pruning are based on the assumption that memorization can be localized, but their actual robustness needs evaluation.
Method: The authors assess pruning-based defense robustness by demonstrating replication re-triggering through minor text embedding adjustments. They challenge memorization locality assumptions by showing replication can be triggered from diverse embedding space locations. They propose a novel adversarial fine-tuning method that iteratively searches for replication triggers and updates the model to increase robustness.
Result: The research demonstrates that pruning-based defenses are fragile - minor text embedding adjustments can re-trigger data replication even after pruning. They show that memorization is not localized but can be triggered from multiple locations in the text embedding space through different model pathways, challenging fundamental assumptions of current mitigation strategies.
Conclusion: Existing pruning-based mitigation strategies are insufficient for preventing data replication in text-to-image diffusion models. The paper concludes that methods should focus on truly removing memorized content rather than suppressing retrieval, and provides a foundation for building more trustworthy and compliant generative AI through their proposed adversarial fine-tuning approach.
Abstract: Text-to-image diffusion models (DMs) have achieved remarkable success in image generation. However, concerns about data privacy and intellectual property remain due to their potential to inadvertently memorize and replicate training data. Recent mitigation efforts have focused on identifying and pruning weights responsible for triggering replication, based on the assumption that memorization can be localized. Our research assesses the robustness of these pruning-based approaches. We demonstrate that even after pruning, minor adjustments to text embeddings of input prompts are sufficient to re-trigger data replication, highlighting the fragility of these defenses. Furthermore, we challenge the fundamental assumption of memorization locality, by showing that replication can be triggered from diverse locations within the text embedding space, and follows different paths in the model. Our findings indicate that existing mitigation strategies are insufficient and underscore the need for methods that truly remove memorized content, rather than attempting to suppress its retrieval. As a first step in this direction, we introduce a novel adversarial fine-tuning method that iteratively searches for replication triggers and updates the model to increase robustness. Through our research, we provide fresh insights into the nature of memorization in text-to-image DMs and a foundation for building more trustworthy and compliant generative AI.
[85] STQE: Spatial-Temporal Quality Enhancement for G-PCC Compressed Dynamic Point Clouds
Tian Guo, Hui Yuan, Xiaolong Mao, Shiqi Jiang, Raouf Hamzaoui, Sam Kwong
Main category: cs.CV
TL;DR: This paper proposes STQE, a spatial-temporal attribute quality enhancement network that improves visual quality of G-PCC compressed dynamic point clouds by exploiting spatial-temporal correlations, achieving significant improvements in PSNR and BD-rate reductions.
Details
Motivation: Very few studies have addressed quality enhancement for compressed dynamic point clouds; in particular, the effective exploitation of spatial-temporal correlations between point cloud frames remains largely unexplored, creating a gap in improving the visual quality of compressed dynamic point clouds.
Method: The method includes four key components: (1) a recoloring-based motion compensation module for precise inter-frame geometric alignment, (2) a channel-aware temporal attention module for highlighting relevant regions across bidirectional reference frames, (3) a Gaussian-guided neighborhood feature aggregation module for capturing spatial dependencies between geometry and color attributes, and (4) a joint loss function based on the Pearson correlation coefficient to reduce over-smoothing effects.
Result: When applied to the latest G-PCC test model, STQE achieved improvements of 0.855 dB, 0.682 dB, and 0.828 dB in delta PSNR, with BD-rate reductions of -25.2%, -31.6%, and -32.5% for the Luma, Cb, and Cr components, respectively.
Conclusion: The proposed STQE network successfully exploits spatial-temporal correlations to enhance the visual quality of G-PCC compressed dynamic point clouds, demonstrating significant improvements in both PSNR and compression efficiency across all color components.
Abstract: Very few studies have addressed quality enhancement for compressed dynamic point clouds. In particular, the effective exploitation of spatial-temporal correlations between point cloud frames remains largely unexplored. Addressing this gap, we propose a spatial-temporal attribute quality enhancement (STQE) network that exploits both spatial and temporal correlations to improve the visual quality of G-PCC compressed dynamic point clouds. Our contributions include a recoloring-based motion compensation module that remaps reference attribute information to the current frame geometry to achieve precise inter-frame geometric alignment, a channel-aware temporal attention module that dynamically highlights relevant regions across bidirectional reference frames, a Gaussian-guided neighborhood feature aggregation module that efficiently captures spatial dependencies between geometry and color attributes, and a joint loss function based on the Pearson correlation coefficient, designed to alleviate over-smoothing effects typical of point-wise mean squared error optimization. When applied to the latest G-PCC test model, STQE achieved improvements of 0.855 dB, 0.682 dB, and 0.828 dB in delta PSNR, with Bjøntegaard Delta rate (BD-rate) reductions of -25.2%, -31.6%, and -32.5% for the Luma, Cb, and Cr components, respectively.
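A Pearson-correlation-based loss of the kind mentioned above rewards predictions that co-vary with the target instead of matching it point-wise, which is what helps against over-smoothing. A minimal PyTorch sketch follows; the exact formulation and weighting used by STQE are not specified here and are assumptions.

```python
# Sketch: a loss of the form 1 - Pearson correlation between prediction and target.
import torch

def pearson_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    pred = pred.flatten() - pred.mean()
    target = target.flatten() - target.mean()
    corr = (pred * target).sum() / (pred.norm() * target.norm() + eps)
    return 1.0 - corr  # 0 when perfectly correlated, 2 when perfectly anti-correlated
```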
[86] UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation
Jinting Wang, Shan Yang, Li Liu
Main category: cs.CV
TL;DR: This paper proposes UniCUE, a unified framework that directly converts Cued Speech videos to speech without intermediate text conversion, achieving state-of-the-art performance by integrating visual-semantic alignment and cross-task representations.
Details
Motivation: Current CSV2S approaches suffer from poor performance due to insufficient CS data when using direct generation, while combined CSR+TTS methods lead to error propagation and temporal misalignment between speech and video dynamics due to reliance on intermediate text.
Method: The UniCUE framework comprises three key components: (1) a fine-grained semantic alignment pool for precise visual-speech mapping, (2) a VisioPhonetic adapter for cross-task representation bridging between CSV2S and CSR, and (3) a pose-aware visual processor for enhanced spatiotemporal correlations between lip and hand movements.
Result: UniCUE achieves state-of-the-art performance across various metrics on a newly established Chinese CS dataset, demonstrating superior performance compared to existing approaches.
Conclusion: The proposed UniCUE framework successfully addresses the challenges of CSV2S by directly generating speech from CS videos without intermediate text, providing better performance through integrated visual-semantic alignment and cross-task learning.
Abstract: Cued Speech (CS) enhances lipreading through hand coding, providing precise speech perception support for the hearing-impaired. The CS Video-to-Speech generation (CSV2S) task aims to convert the CS visual expressions (CS videos) of hearing-impaired individuals into comprehensible speech signals. Direct generation of speech from CS video (called single CSV2S) yields poor performance due to insufficient CS data. Current research mostly focuses on CS Recognition (CSR), which converts video content into linguistic text. Based on this, one straightforward way of CSV2S is to combine CSR with a Text-to-Speech system. This combined architecture relies on text as an intermediate medium for stepwise cross-modal alignment, which may lead to error propagation and temporal misalignment between speech and video dynamics. To address these challenges, we propose a novel approach that directly generates speech from CS videos without relying on intermediate text. Building upon this, we propose UniCUE, the first unified framework for CSV2S, whose core innovation lies in the integration of the CSR task that provides fine-grained visual-semantic information to facilitate speech generation from CS videos. More precisely, (1) a novel fine-grained semantic alignment pool to ensure precise mapping between visual features and speech contents; (2) a VisioPhonetic adapter to bridge cross-task representations, ensuring seamless compatibility between two distinct tasks (i.e., CSV2S and CSR); (3) a pose-aware visual processor is introduced to enhance fine-grained spatiotemporal correlations between lip and hand movements in CS video. Experiments on our newly established Chinese CS dataset show that our UniCUE achieves state-of-the-art performance across various metrics.
[87] Sparser2Sparse: Single-shot Sparser-to-Sparse Learning for Spatial Transcriptomics Imputation with Natural Image Co-learning
Yaoyu Fang, Jiahe Qian, Xinkun Wang, Lee A. Cooper, Bo Zhou
Main category: cs.CV
TL;DR: S2S-ST is a novel framework that accurately imputes spatial transcriptomics data using only sparse, low-cost datasets and natural images for co-training, significantly reducing the need for expensive high-resolution ST data.
Details
Motivation: High-resolution spatial transcriptomics data is expensive and scarce, creating significant barriers for biomedical research applications that require detailed gene expression profiling within tissues.
Method: The framework combines three innovations: (1) sparser-to-sparse self-supervised learning to exploit spatial patterns in ST data, (2) cross-domain co-learning with natural images to improve feature representation, and (3) Cascaded Data Consistent Imputation Network (CDCIN) for iterative prediction refinement while maintaining data fidelity.
Result: Extensive experiments on breast cancer, liver, and lymphoid tissues show the method outperforms state-of-the-art approaches in imputation accuracy, enabling robust ST reconstruction from sparse inputs across diverse tissue types.
Conclusion: The S2S-ST framework successfully reduces dependence on costly high-resolution spatial transcriptomics data while maintaining accuracy, potentially enabling broader adoption of ST technology in biomedical research and clinical applications.
Abstract: Spatial transcriptomics (ST) has revolutionized biomedical research by enabling high resolution gene expression profiling within tissues. However, the high cost and scarcity of high resolution ST data remain significant challenges. We present Single-shot Sparser-to-Sparse (S2S-ST), a novel framework for accurate ST imputation that requires only a single and low-cost sparsely sampled ST dataset alongside widely available natural images for co-training. Our approach integrates three key innovations: (1) a sparser-to-sparse self-supervised learning strategy that leverages intrinsic spatial patterns in ST data, (2) cross-domain co-learning with natural images to enhance feature representation, and (3) a Cascaded Data Consistent Imputation Network (CDCIN) that iteratively refines predictions while preserving sampled gene data fidelity. Extensive experiments on diverse tissue types, including breast cancer, liver, and lymphoid tissue, demonstrate that our method outperforms state-of-the-art approaches in imputation accuracy. By enabling robust ST reconstruction from sparse inputs, our framework significantly reduces reliance on costly high resolution data, facilitating potential broader adoption in biomedical research and clinical applications.
[88] Toward Long-Tailed Online Anomaly Detection through Class-Agnostic Concepts
Chiao-An Yang, Kuan-Chuan Peng, Raymond A. Yeh
Main category: cs.CV
TL;DR: This paper introduces Long-Tailed Online Anomaly Detection (LTOAD), a novel task that combines long-tailed data distribution with online learning for anomaly detection, proposing a class-agnostic framework that outperforms existing methods without requiring class labels.
Details
Motivation: Existing anomaly detection methods face limitations when dealing with real-world scenarios that involve both long-tailed data distributions and online learning constraints. Current long-tailed anomaly detection (LTAD) methods are class-aware and require class labels, which are not available in online settings, creating a gap between offline state-of-the-art methods and practical online applications.
Method: The authors propose a class-agnostic framework for long-tailed anomaly detection that can be adapted to online learning settings. The method eliminates the dependency on class labels that previous LTAD approaches required, making it suitable for online deployment where class information is not available during inference.
Result: The proposed method achieves superior performance compared to state-of-the-art baselines in offline LTAD settings across industrial manufacturing and medical domains, with +4.63% image-AUROC improvement on MVTec dataset. In the challenging long-tailed online setting, it achieves +0.53% image-AUROC improvement over baselines. The method outperforms even approaches that have access to class labels and number of classes.
Conclusion: The paper successfully addresses the challenging problem of long-tailed online anomaly detection by developing a class-agnostic framework that works effectively in both offline and online settings. The method demonstrates consistent improvements over existing approaches and provides a benchmark for future research in this area.
Abstract: Anomaly detection (AD) identifies the defect regions of a given image. Recent works have studied AD, focusing on learning AD without abnormal images, with long-tailed distributed training data, and using a unified model for all classes. In addition, online AD learning has also been explored. In this work, we expand in both directions to a realistic setting by considering the novel task of long-tailed online AD (LTOAD). We first identified that the offline state-of-the-art LTAD methods cannot be directly applied to the online setting. Specifically, LTAD is class-aware, requiring class labels that are not available in the online setting. To address this challenge, we propose a class-agnostic framework for LTAD and then adapt it to our online learning setting. Our method outperforms the SOTA baselines in most offline LTAD settings, including both the industrial manufacturing and the medical domain. In particular, we observe +4.63% image-AUROC on MVTec even compared to methods that have access to class labels and the number of classes. In the most challenging long-tailed online setting, we achieve +0.53% image-AUROC compared to baselines. Our LTOAD benchmark is released here: https://doi.org/10.5281/zenodo.16283852 .
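The image-AUROC numbers quoted above are the standard area under the ROC curve computed over per-image anomaly scores; on toy data it can be computed with scikit-learn as follows (labels and scores below are made up for illustration).

```python
# Sketch: image-level AUROC from binary anomaly labels and model scores.
from sklearn.metrics import roc_auc_score

labels = [0, 0, 1, 1, 0, 1]               # 1 = anomalous image
scores = [0.1, 0.4, 0.8, 0.7, 0.2, 0.9]   # model anomaly scores per image
print(f"image-AUROC = {roc_auc_score(labels, scores):.3f}")
```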
[89] Divisive Decisions: Improving Salience-Based Training for Generalization in Binary Classification Tasks
Jacob Piland, Chris Sweet, Adam Czajka
Main category: cs.CV
TL;DR: This paper proposes new saliency-guided training methods that incorporate both true-class and false-class Class Activation Maps (CAMs) to improve deep learning model generalization, showing superior performance over traditional true-class-only approaches across multiple binary classification tasks.
Details
Motivation: Existing saliency-guided training methods only use true-class CAMs and ignore false-class CAMs. The authors hypothesize that in binary tasks, true and false CAMs should diverge on important classification features identified by humans, which could be leveraged to improve model training.
Method: The paper introduces three new saliency-guided training methods that incorporate both true-class and false-class CAMs into the training strategy, along with a novel post-hoc tool for identifying important features. These methods compare model CAMs against human reference saliency maps for both correct and incorrect label classes.
Result: The proposed methods demonstrate improved generalization capabilities compared to traditional saliency-guided training approaches across several diverse binary classification tasks, including synthetic face detection, biometric presentation attack detection, and chest X-ray anomaly classification in both close-set and open-set scenarios.
Conclusion: Incorporating false-class CAMs alongside true-class CAMs in saliency-guided training significantly enhances model generalization. The divergence between true and false CAMs on human-identified important features is a valuable signal that can be exploited to train more robust deep learning models for binary classification tasks.
Abstract: Existing saliency-guided training approaches improve model generalization by incorporating a loss term that compares the model’s class activation map (CAM) for a sample’s true-class (i.e., correct-label class) against a human reference saliency map. However, prior work has ignored the false-class CAM(s), that is the model’s saliency obtained for incorrect-label class. We hypothesize that in binary tasks the true and false CAMs should diverge on the important classification features identified by humans (and reflected in human saliency maps). We use this hypothesis to motivate three new saliency-guided training methods incorporating both true- and false-class model’s CAM into the training strategy and a novel post-hoc tool for identifying important features. We evaluate all introduced methods on several diverse binary close-set and open-set classification tasks, including synthetic face detection, biometric presentation attack detection, and classification of anomalies in chest X-ray scans, and find that the proposed methods improve generalization capabilities of deep learning models over traditional (true-class CAM only) saliency-guided training approaches. We offer source codes and model weights (GitHub repository link removed to preserve anonymity) to support reproducible research.
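One way to picture the dual-CAM training signal described above is a loss that pulls the true-class CAM toward the human saliency map while penalizing false-class CAM mass inside the human-marked region. The sketch below is an assumption-laden illustration of that idea, not the paper's exact losses.

```python
# Sketch: combine a true-class attraction term with a false-class overlap penalty.
import torch
import torch.nn.functional as F

def dual_cam_saliency_loss(true_cam, false_cam, human_map, alpha: float = 1.0):
    # All inputs: (B, H, W), CAMs and the human map normalized to [0, 1].
    attract = F.mse_loss(true_cam, human_map)      # align true-class CAM with human saliency
    overlap = (false_cam * human_map).mean()       # false-class CAM mass on human-salient pixels
    return attract + alpha * overlap               # encourages the two CAMs to diverge there
```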
[90] Bringing Balance to Hand Shape Classification: Mitigating Data Imbalance Through Generative Models
Gaston Gustavo Rios, Pedro Dal Bianco, Franco Ronchetti, Facundo Quiroga, Oscar Stanchi, Santiago Ponte Ahón, Waldo Hasperué
Main category: cs.CV
TL;DR: This paper addresses the problem of limited and unbalanced sign language handshape datasets by using synthetic data generation with GANs (ReACGAN and SPADE) to augment training data, achieving 5% improvement in accuracy on the RWTH German sign language dataset.
Details
Motivation: Sign language handshape datasets are severely limited and unbalanced, which poses significant challenges for effective model training. This creates a need for data augmentation techniques to improve classifier performance on small, imbalanced datasets.
Method: The authors use an EfficientNet classifier trained on the RWTH German sign language handshape dataset and compare two GAN architectures for synthetic data generation: ReACGAN (which uses label information through an auxiliary classifier) and SPADE (which uses spatially-adaptive normalization conditioned on pose information). They explore different strategies to combine generated and real images for training.
Result: The proposed techniques improve state-of-the-art accuracy on the RWTH dataset by 5%. The method also demonstrates generalization capability across different sign language datasets by leveraging pose-based generation trained on the HaGRID dataset, achieving comparable performance to single-source trained classifiers without retraining the generator.
Conclusion: Synthetic data generation using GANs effectively addresses the limitations of small and unbalanced sign language handshape datasets. The combination of ReACGAN and SPADE architectures provides a viable solution for improving handshape classification performance and demonstrates cross-dataset generalization capabilities.
Abstract: Most sign language handshape datasets are severely limited and unbalanced, posing significant challenges to effective model training. In this paper, we explore the effectiveness of augmenting the training data of a handshape classifier by generating synthetic data. We use an EfficientNet classifier trained on the RWTH German sign language handshape dataset, which is small and heavily unbalanced, applying different strategies to combine generated and real images. We compare two Generative Adversarial Networks (GAN) architectures for data generation: ReACGAN, which uses label information to condition the data generation process through an auxiliary classifier, and SPADE, which utilizes spatially-adaptive normalization to condition the generation on pose information. ReACGAN allows for the generation of realistic images that align with specific handshape labels, while SPADE focuses on generating images with accurate spatial handshape configurations. Our proposed techniques improve the current state-of-the-art accuracy on the RWTH dataset by 5%, addressing the limitations of small and unbalanced datasets. Additionally, our method demonstrates the capability to generalize across different sign language datasets by leveraging pose-based generation trained on the extensive HaGRID dataset. We achieve comparable performance to single-source trained classifiers without the need for retraining the generator.
[91] Transformer Based Building Boundary Reconstruction using Attraction Field Maps
Muhammad Kamran, Mohammad Moein Sheikholeslami, Andreas Wichmann, Gunho Sohn
Main category: cs.CV
TL;DR: This paper presents Decoupled-PolyGCN, a novel deep learning approach using Graph Convolutional Networks for automated building footprint extraction from satellite imagery, achieving 6% improvement in AP and 10% in AR over existing methods.
Details
Motivation: The growing number of satellites provides vast high-resolution visual data for spatial mapping, but reconstructing spatial maps from satellite imagery remains challenging due to the difficulty of creating high-level, primitives-based object representations, leading to reliance on labor-intensive manual processes.
Method: The paper proposes a novel deep learning methodology using Graph Convolutional Networks (GCNs) that incorporates geometric regularity into building boundaries, integrates multi-scale and multi-resolution features, and embeds Attraction Field Maps into the network for building footprint reconstruction from single satellite images.
Result: The Decoupled-PolyGCN model outperforms existing methods by 6% in Average Precision (AP) and 10% in Average Recall (AR), demonstrating effective delivery of accurate and regularized building footprints across diverse and challenging scenarios.
Conclusion: The proposed approach provides a scalable and precise solution for automated building footprint extraction from satellite imagery, with potential applications in urban planning, disaster management, and large-scale spatial analysis.
Abstract: In recent years, the number of remote satellites orbiting the Earth has grown significantly, streaming vast amounts of high-resolution visual data to support diverse applications across civil, public, and military domains. Among these applications, the generation and updating of spatial maps of the built environment have become critical due to the extensive coverage and detailed imagery provided by satellites. However, reconstructing spatial maps from satellite imagery is a complex computer vision task, requiring the creation of high-level object representations, such as primitives, to accurately capture the built environment. While the past decade has witnessed remarkable advancements in object detection and representation using visual data, primitives-based object representation remains a persistent challenge in computer vision. Consequently, high-quality spatial maps often rely on labor-intensive and manual processes. This paper introduces a novel deep learning methodology leveraging Graph Convolutional Networks (GCNs) to address these challenges in building footprint reconstruction. The proposed approach enhances performance by incorporating geometric regularity into building boundaries, integrating multi-scale and multi-resolution features, and embedding Attraction Field Maps into the network. These innovations provide a scalable and precise solution for automated building footprint extraction from a single satellite image, paving the way for impactful applications in urban planning, disaster management, and large-scale spatial analysis. Our model, Decoupled-PolyGCN, outperforms existing methods by 6% in AP and 10% in AR, demonstrating its ability to deliver accurate and regularized building footprints across diverse and challenging scenarios.
[92] Controllable Hybrid Captioner for Improved Long-form Video Understanding
Kuleen Sasse, Efsun Sarioglu Kayi, Arun Reddy
Main category: cs.CV
TL;DR: This paper presents a video understanding system that converts long-form videos into text-based summaries using progressive captioning, enriched with static scene descriptions, to enable LLM-based question answering about video content.
Details
Motivation: Long-form videos are extremely dense and high-dimensional, making direct processing computationally challenging. Text-based summaries provide a compact representation that can be easily processed by LLMs for complex video question answering, but existing video captions focus mainly on human actions and miss other scene information.
Method: The approach uses progressive construction of text-based memory through: (1) partitioning videos into meaningful segments, (2) using LaViLa video captioner for action descriptions, (3) incorporating LLaVA VLM for static scene descriptions, and (4) fine-tuning a controllable hybrid captioner that can produce both action and scene captions based on input tokens signaling scene changes.
Result: The system successfully creates more detailed and complete caption logs by combining action and scene descriptions. The fine-tuned controllable hybrid captioner significantly improves efficiency compared to using separate models, and expands the range of answerable questions from textual memory.
Conclusion: The proposed video understanding system effectively addresses the challenge of processing long-form videos by creating enriched textual representations that capture both actions and scene information, enabling better LLM-based reasoning and question answering while improving computational efficiency through the unified hybrid captioner.
Abstract: Video data, especially long-form video, is extremely dense and high-dimensional. Text-based summaries of video content offer a way to represent query-relevant content in a much more compact manner than raw video. In addition, textual representations are easily ingested by state-of-the-art large language models (LLMs), which enable reasoning over video content to answer complex natural language queries. To solve this issue, we rely on the progressive construction of a text-based memory by a video captioner operating on shorter chunks of the video, where spatio-temporal modeling is computationally feasible. We explore ways to improve the quality of the activity log comprised solely of short video captions. Because the video captions tend to be focused on human actions, and questions may pertain to other information in the scene, we seek to enrich the memory with static scene descriptions using Vision Language Models (VLMs). Our video understanding system relies on the LaViLa video captioner in combination with an LLM to answer questions about videos. We first explored different ways of partitioning the video into meaningful segments such that the textual descriptions more accurately reflect the structure of the video content. Furthermore, we incorporated static scene descriptions into the captioning pipeline using LLaVA VLM, resulting in a more detailed and complete caption log and expanding the space of questions that are answerable from the textual memory. Finally, we have successfully fine-tuned the LaViLa video captioner to produce both action and scene captions, significantly improving the efficiency of the captioning pipeline compared to using separate captioning models for the two tasks. Our model, the controllable hybrid captioner, can alternate between different types of captions according to special input tokens that signal scene changes detected in the video.
[93] Toward Scalable Video Narration: A Training-free Approach Using Multimodal Large Language Models
Tz-Ying Wu, Tahani Trigui, Sharath Nittur Sridhar, Anand Bodas, Subarna Tripathi
Main category: cs.CV
TL;DR: VideoNarrator is a training-free pipeline that generates dense video captions with precise timestamps by combining multiple MLLMs and VLMs to address hallucination and temporal alignment issues in video narration.
Details
Motivation: Current multimodal large language models struggle with temporally aligned video narrations and tend to hallucinate, especially in unfamiliar scenarios, creating a need for more accurate and structured video captioning systems.
Method: A flexible training-free pipeline that leverages off-the-shelf MLLMs and visual-language models functioning as caption generators, context providers, and caption verifiers in a synergistic interaction to produce dense video captions with precise timestamps.
Result: The synergistic interaction of multiple models significantly enhances video narration quality and accuracy, effectively reduces hallucinations, and improves temporal alignment compared to existing approaches.
Conclusion: VideoNarrator provides a structured approach that enhances video understanding and facilitates downstream tasks like video summarization and question answering, with potential applications in advertising and marketing.
Abstract: In this paper, we introduce VideoNarrator, a novel training-free pipeline designed to generate dense video captions that offer a structured snapshot of video content. These captions offer detailed narrations with precise timestamps, capturing the nuances present in each segment of the video. Despite advancements in multimodal large language models (MLLMs) for video comprehension, these models often struggle with temporally aligned narrations and tend to hallucinate, particularly in unfamiliar scenarios. VideoNarrator addresses these challenges by leveraging a flexible pipeline where off-the-shelf MLLMs and visual-language models (VLMs) can function as caption generators, context providers, or caption verifiers. Our experimental results demonstrate that the synergistic interaction of these components significantly enhances the quality and accuracy of video narrations, effectively reducing hallucinations and improving temporal alignment. This structured approach not only enhances video understanding but also facilitates downstream tasks such as video summarization and video question answering, and can be potentially extended for advertising and marketing applications.
[94] Few-Shot Learning in Video and 3D Object Detection: A Survey
Md Meftahul Ferdaus, Kendall N. Niles, Joe Tom, Mahdi Abdelguerfi, Elias Ioup
Main category: cs.CV
TL;DR: This survey examines few-shot learning (FSL) advances for video and 3D object detection, showing how FSL can reduce expensive manual annotation by enabling models to recognize novel classes from just a few examples, with applications in autonomous driving and video analysis.
Details
Motivation: Manual data labeling for object detection is expensive and laborious, especially for video (requiring annotation across frames) and 3D data (requiring costly 3D annotations). Few-shot learning offers a solution by enabling models to detect novel classes with minimal annotated examples, making practical deployment more feasible.
Method: The survey examines FSL techniques across two domains: (1) Video detection using tube proposals and temporal matching networks that leverage spatiotemporal structure across frames, and (2) 3D detection integrating FSL with specialized point cloud networks and losses designed for class imbalance, handling challenges like data sparsity and lack of texture in LiDAR/depth data.
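Prototype matching, which the survey highlights as a core ingredient, can be illustrated with a small sketch: class prototypes are the mean of a few support embeddings, and query (e.g., proposal or box) features are assigned to the nearest prototype by cosine similarity. Feature extraction from video tubes or point clouds is assumed to happen upstream:

```python
# Minimal prototype-matching sketch for few-shot classification of detection features.
import numpy as np

def build_prototypes(support_feats, support_labels):
    classes = sorted(set(support_labels))
    return {c: np.mean([f for f, l in zip(support_feats, support_labels) if l == c], axis=0)
            for c in classes}

def classify(query_feat, prototypes):
    def cos(a, b):  # cosine similarity; highest-scoring prototype wins
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return max(prototypes, key=lambda c: cos(query_feat, prototypes[c]))

rng = np.random.default_rng(0)
support = [rng.normal(c, 0.1, size=16) for c in (0.0, 1.0) for _ in range(5)]  # 5-shot, 2 classes
labels = ["pedestrian"] * 5 + ["cyclist"] * 5
protos = build_prototypes(support, labels)
print(classify(rng.normal(1.0, 0.1, size=16), protos))  # expected: "cyclist"
```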
Result: FSL demonstrates promise in both video and 3D object detection by efficiently leveraging information across feature, temporal, and data modalities. The techniques successfully address domain-specific challenges while maintaining balance between generalization and overfitting through prototype matching approaches.
Conclusion: Few-shot learning shows significant potential for reducing annotation requirements and enabling real-world deployment across video, 3D, and other applications. By minimizing supervision needs through efficient cross-modal information leverage, FSL makes practical autonomous driving deployment and video analysis more accessible and cost-effective.
Abstract: Few-shot learning (FSL) enables object detection models to recognize novel classes given only a few annotated examples, thereby reducing expensive manual data labeling. This survey examines recent FSL advances for video and 3D object detection. For video, FSL is especially valuable since annotating objects across frames is more laborious than for static images. By propagating information across frames, techniques like tube proposals and temporal matching networks can detect new classes from a couple examples, efficiently leveraging spatiotemporal structure. FSL for 3D detection from LiDAR or depth data faces challenges like sparsity and lack of texture. Solutions integrate FSL with specialized point cloud networks and losses tailored for class imbalance. Few-shot 3D detection enables practical autonomous driving deployment by minimizing costly 3D annotation needs. Core issues in both domains include balancing generalization and overfitting, integrating prototype matching, and handling data modality properties. In summary, FSL shows promise for reducing annotation requirements and enabling real-world video, 3D, and other applications by efficiently leveraging information across feature, temporal, and data modalities. By comprehensively surveying recent advancements, this paper illuminates FSL’s potential to minimize supervision needs and enable deployment across video, 3D, and other real-world applications.
[95] SDGOCC: Semantic and Depth-Guided Bird’s-Eye View Transformation for 3D Multimodal Occupancy Prediction
Zaipeng Duan, Chenxu Dang, Xuzhong Hu, Pei An, Junfeng Ding, Jie Zhan, Yunbiao Xu, Jie Ma
Main category: cs.CV
TL;DR: The paper proposes SDG-OCC, a multimodal 3D occupancy prediction network that combines camera and LiDAR data through semantic and depth-guided view transformation and fusion-to-occupancy-driven active distillation, achieving state-of-the-art real-time performance on autonomous driving datasets.
Details
Motivation: Existing 3D occupancy prediction methods are limited by single-modality approaches: camera-based methods lack depth information while LiDAR-based methods struggle with occlusions. Current lightweight methods using the Lift-Splat-Shoot pipeline suffer from inaccurate depth estimation and fail to fully exploit geometric and semantic information from 3D LiDAR points.
Method: The authors propose SDG-OCC with two key components: (1) joint semantic and depth-guided view transformation that constructs accurate depth distributions by integrating pixel semantics and co-point depth through diffusion and bilinear discretization, and (2) fusion-to-occupancy-driven active distillation that extracts rich semantic information from multimodal data and selectively transfers knowledge to image features based on LiDAR-identified regions. They also introduce SDG-Fusion (fusion only) and SDG-KL (fusion + distillation for faster inference) variants.
Result: The method achieves state-of-the-art performance with real-time processing on the Occ3D-nuScenes dataset and shows comparable performance on the more challenging SurroundOcc-nuScenes dataset, demonstrating effectiveness and robustness across different datasets.
Conclusion: SDG-OCC successfully addresses the limitations of single-modality approaches by effectively combining camera and LiDAR data through novel view transformation and distillation techniques, achieving superior performance in multimodal 3D occupancy prediction for autonomous driving applications.
Abstract: Multimodal 3D occupancy prediction has garnered significant attention for its potential in autonomous driving. However, most existing approaches are single-modality: camera-based methods lack depth information, while LiDAR-based methods struggle with occlusions. Current lightweight methods primarily rely on the Lift-Splat-Shoot (LSS) pipeline, which suffers from inaccurate depth estimation and fails to fully exploit the geometric and semantic information of 3D LiDAR points. Therefore, we propose a novel multimodal occupancy prediction network called SDG-OCC, which incorporates a joint semantic and depth-guided view transformation coupled with a fusion-to-occupancy-driven active distillation. The enhanced view transformation constructs accurate depth distributions by integrating pixel semantics and co-point depth through diffusion and bilinear discretization. The fusion-to-occupancy-driven active distillation extracts rich semantic information from multimodal data and selectively transfers knowledge to image features based on LiDAR-identified regions. Finally, for optimal performance, we introduce SDG-Fusion, which uses fusion alone, and SDG-KL, which integrates both fusion and distillation for faster inference. Our method achieves state-of-the-art (SOTA) performance with real-time processing on the Occ3D-nuScenes dataset and shows comparable performance on the more challenging SurroundOcc-nuScenes dataset, demonstrating its effectiveness and robustness. The code will be released at https://github.com/DzpLab/SDGOCC.
[96] FedVLM: Scalable Personalized Vision-Language Models through Federated Learning
Arkajyoti Mitra, Afia Anjum, Paul Agbaje, Mert Pesé, Habeeb Olufowobi
Main category: cs.CV
TL;DR: This paper proposes FedVLM, a federated learning framework for fine-tuning vision-language models using personalized LoRA (pLoRA) that adapts to each client’s unique data distribution, achieving 24.5% improvement over standard LoRA in non-iid federated settings.
Details
Motivation: Vision-language models show impressive zero-shot capabilities but fine-tuning at scale is challenging in federated environments due to decentralized, non-iid data across clients. Existing parameter-efficient methods like LoRA struggle with heterogeneous client data, leading to suboptimal generalization in federated settings.
Method: The authors propose FedVLM, a federated LoRA fine-tuning framework with personalized LoRA (pLoRA) that dynamically adapts LoRA parameters to each client’s unique data distribution while enabling decentralized adaptation and maintaining global model aggregation.
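A toy sketch of the personalization idea: each client keeps its own low-rank adapter while another shared part is averaged FedAvg-style. Which pieces FedVLM actually personalizes versus aggregates is an assumption here, not the paper's exact recipe:

```python
# Federated LoRA with client-personalized adapters (toy numpy illustration).
import numpy as np

D, R, N_CLIENTS = 8, 2, 3
rng = np.random.default_rng(0)
base_W = rng.normal(size=(D, D))                          # frozen backbone weight shared by all
clients = [{"A": np.zeros((D, R)),                        # personalized LoRA factors (stay local)
            "B": rng.normal(scale=0.01, size=(R, D)),
            "head": rng.normal(scale=0.01, size=(D,))}    # a shared part the server aggregates
           for _ in range(N_CLIENTS)]

def effective_weight(client):
    return base_W + client["A"] @ client["B"]             # LoRA: W + A @ B

def local_update(client, lr=0.1):
    # stand-in for local training: nudge the personalized adapter and the shared head
    client["A"] += lr * rng.normal(scale=0.01, size=client["A"].shape)
    client["head"] += lr * rng.normal(scale=0.01, size=client["head"].shape)

for _ in range(5):                                        # federated rounds
    for c in clients:
        local_update(c)
    mean_head = np.mean([c["head"] for c in clients], axis=0)   # server aggregation
    for c in clients:
        c["head"] = mean_head.copy()                      # LoRA adapters remain client-specific

print(effective_weight(clients[0]).shape)                 # (8, 8) personalized effective weight
```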
Result: Experiments on the RLAIF-V dataset demonstrate that pLoRA improves client-specific performance by 24.5% compared to standard LoRA, showing superior adaptation capabilities in non-iid federated learning scenarios.
Conclusion: FedVLM provides a scalable and efficient solution for fine-tuning vision-language models in federated settings, advancing personalized adaptation in distributed learning scenarios while preserving model privacy and reducing reliance on centralized training.
Abstract: Vision-language models (VLMs) demonstrate impressive zero-shot and few-shot learning capabilities, making them essential for several downstream tasks. However, fine-tuning these models at scale remains challenging, particularly in federated environments where data is decentralized and non-iid across clients. Existing parameter-efficient tuning methods like LoRA (Low-Rank Adaptation) reduce computational overhead but struggle with heterogeneous client data, leading to suboptimal generalization. To address these challenges, we propose FedVLM, a federated LoRA fine-tuning framework that enables decentralized adaptation of VLMs while preserving model privacy and reducing reliance on centralized training. To further tackle data heterogeneity, we introduce personalized LoRA (pLoRA), which dynamically adapts LoRA parameters to each client’s unique data distribution, significantly improving local adaptation while maintaining global model aggregation. Experiments on the RLAIF-V dataset show that pLoRA improves client-specific performance by 24.5% over standard LoRA, demonstrating superior adaptation in non-iid settings. FedVLM provides a scalable and efficient solution for fine-tuning VLMs in federated settings, advancing personalized adaptation in distributed learning scenarios.
[97] IONext: Unlocking the Next Era of Inertial Odometry
Shanshan Zhang, Siyue Wang, Tianshui Wen, Qi Zhang, Ziheng Zhou, Lingxiang Zheng, Yu Yang
Main category: cs.CV
TL;DR: The paper introduces IONext, a CNN-based inertial odometry system that uses novel Dual-wing Adaptive Dynamic Mixer (DADM) and Spatio-Temporal Gating Unit (STGU) modules to outperform existing Transformer and CNN methods by better capturing both global and local motion patterns.
Details
Motivation: Transformer-based models for inertial odometry have limited sensitivity to local motion variations and lack inherent inductive biases, which hinder localization accuracy and generalization. Recent advances in large-kernel convolutions and CNN architectures offer opportunities to improve global motion perception while maintaining local feature sensitivity.
Method: The authors propose IONext, featuring two key components: (1) Dual-wing Adaptive Dynamic Mixer (DADM) that adaptively captures global and local motion features through dynamic weight generation and multi-scale feature aggregation, and (2) Spatio-Temporal Gating Unit (STGU) that selectively extracts representative motion features in the temporal domain to improve temporal modeling.
Result: IONext consistently outperforms state-of-the-art Transformer and CNN-based methods across six public datasets. Specifically, on the RNIN dataset, it achieves 10% reduction in average Absolute Trajectory Error (ATE) and 12% reduction in average Relative Trajectory Error (RTE) compared to the representative iMOT model.
Conclusion: The proposed CNN-based IONext architecture with DADM and STGU modules successfully addresses the limitations of existing approaches by effectively combining global motion perception with local feature sensitivity, establishing a new state-of-the-art for inertial odometry tasks.
Abstract: Researchers have increasingly adopted Transformer-based models for inertial odometry. While Transformers excel at modeling long-range dependencies, their limited sensitivity to local, fine-grained motion variations and lack of inherent inductive biases often hinder localization accuracy and generalization. Recent studies have shown that incorporating large-kernel convolutions and Transformer-inspired architectural designs into CNN can effectively expand the receptive field, thereby improving global motion perception. Motivated by these insights, we propose a novel CNN-based module called the Dual-wing Adaptive Dynamic Mixer (DADM), which adaptively captures both global motion patterns and local, fine-grained motion features from dynamic inputs. This module dynamically generates selective weights based on the input, enabling efficient multi-scale feature aggregation. To further improve temporal modeling, we introduce the Spatio-Temporal Gating Unit (STGU), which selectively extracts representative and task-relevant motion features in the temporal domain. This unit addresses the limitations of temporal modeling observed in existing CNN approaches. Built upon DADM and STGU, we present a new CNN-based inertial odometry backbone, named Next Era of Inertial Odometry (IONext). Extensive experiments on six public datasets demonstrate that IONext consistently outperforms state-of-the-art (SOTA) Transformer- and CNN-based methods. For instance, on the RNIN dataset, IONext reduces the average ATE by 10% and the average RTE by 12% compared to the representative model iMOT.
[98] Robust Five-Class and binary Diabetic Retinopathy Classification Using Transfer Learning and Data Augmentation
Faisal Ahmed, Mohammad Alfrad Nobel Bhuiyan
Main category: cs.CV
TL;DR: This paper develops a deep learning framework using transfer learning and data augmentation for diabetic retinopathy classification, achieving 98.9% accuracy for binary classification and 84.6% for five-class severity classification on the APTOS 2019 dataset.
Details
Motivation: Diabetic retinopathy is a leading cause of vision loss worldwide, and early automated diagnosis through retinal image analysis can significantly reduce blindness risk. The challenges include class imbalance and limited training data in medical imaging datasets.
Method: The authors propose a robust deep learning framework that combines transfer learning with extensive data augmentation to handle class imbalance. They evaluate multiple pretrained CNN architectures including ResNet and EfficientNet variants, using class-balanced augmentation techniques on the APTOS 2019 dataset.
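A minimal PyTorch sketch in the spirit of this setup, assuming a torchvision EfficientNet-B0 backbone, a class-balanced sampler, and placeholder tensors standing in for APTOS 2019 images; the hyperparameters and augmentations are illustrative rather than the paper's exact recipe:

```python
# Transfer learning with class-balanced sampling for 5-class DR grading (sketch).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler
from torchvision import models, transforms

NUM_CLASSES = 5
model = models.efficientnet_b0(weights="IMAGENET1K_V1")      # downloads ImageNet weights on first use
model.classifier[1] = nn.Linear(model.classifier[1].in_features, NUM_CLASSES)

augment = transforms.Compose([                                # typical augmentations; the paper's list may differ
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

# Placeholder data standing in for fundus images (3x224x224) and severity labels.
images = torch.randn(100, 3, 224, 224)
labels = torch.randint(0, NUM_CLASSES, (100,))
class_counts = torch.bincount(labels, minlength=NUM_CLASSES).float()
sample_weights = (1.0 / class_counts)[labels]                 # rarer classes sampled more often
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
loader = DataLoader(TensorDataset(images, labels), batch_size=16, sampler=sampler)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
model.train()
for x, y in loader:                                           # one epoch over the toy data
    optimizer.zero_grad()
    loss = criterion(model(augment(x)), y)
    loss.backward()
    optimizer.step()
```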
Result: For binary DR classification: 98.9% accuracy, 98.6% precision, 99.3% recall, 98.9% F1-score, and 99.4% AUC. For five-class severity classification: 84.6% accuracy and 94.1% AUC. EfficientNet-B0 and ResNet34 provided optimal accuracy-efficiency trade-offs for both tasks.
Conclusion: The combination of class-balanced augmentation with transfer learning proves highly effective for DR diagnosis. The proposed framework offers a scalable and accurate solution for DR screening with potential for real-world clinical deployment, outperforming several existing approaches.
Abstract: Diabetic retinopathy (DR) is a leading cause of vision loss worldwide, and early diagnosis through automated retinal image analysis can significantly reduce the risk of blindness. This paper presents a robust deep learning framework for both binary and five-class DR classification, leveraging transfer learning and extensive data augmentation to address the challenges of class imbalance and limited training data. We evaluate a range of pretrained convolutional neural network architectures, including variants of ResNet and EfficientNet, on the APTOS 2019 dataset. For binary classification, our proposed model achieves a state-of-the-art accuracy of 98.9%, with a precision of 98.6%, recall of 99.3%, F1-score of 98.9%, and an AUC of 99.4%. In the more challenging five-class severity classification task, our model obtains a competitive accuracy of 84.6% and an AUC of 94.1%, outperforming several existing approaches. Our findings also demonstrate that EfficientNet-B0 and ResNet34 offer optimal trade-offs between accuracy and computational efficiency across both tasks. These results underscore the effectiveness of combining class-balanced augmentation with transfer learning for high-performance DR diagnosis. The proposed framework provides a scalable and accurate solution for DR screening, with potential for deployment in real-world clinical environments.
[99] ScSAM: Debiasing Morphology and Distributional Variability in Subcellular Semantic Segmentation
Bo Fang, Jianan Fan, Dongnan Liu, Hang Chang, Gerald J. Shami, Filip Braet, Weidong Cai
Main category: cs.CV
TL;DR: ScSAM enhances subcellular organelle segmentation by combining SAM with MAE-guided cellular knowledge to address morphological variability and feature bias issues, achieving superior performance over existing methods.
Details
Motivation: Subcellular organelle segmentation faces challenges due to significant morphological and distributional variability that causes biased feature learning in existing models. SAM, while providing rich features, struggles with subcellular scenarios due to label space gaps and insufficient fine-grained spatial detail capture.
Method: The paper introduces ScSAM, which fuses pre-trained SAM with Masked Autoencoder (MAE)-guided cellular prior knowledge. Key components include: (1) a feature alignment and fusion module to combine different representations in the same feature space, and (2) a cosine similarity matrix-based class prompt encoder to activate class-specific features for subcellular recognition.
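The class prompt idea can be sketched as cosine similarity between learnable class embeddings and per-pixel fused features; the shapes and the way ScSAM consumes the resulting maps are assumptions:

```python
# Cosine-similarity class prompts over fused features (toy shapes).
import torch
import torch.nn.functional as F

B, C, H, W, NUM_CLASSES = 2, 64, 32, 32, 5
fused_feats = torch.randn(B, C, H, W)                        # placeholder for SAM + MAE fused features
class_embed = torch.nn.Parameter(torch.randn(NUM_CLASSES, C))  # one learnable embedding per class

feats = F.normalize(fused_feats.flatten(2), dim=1)           # [B, C, H*W], unit-norm per pixel
prompts = F.normalize(class_embed, dim=1)                     # [K, C]
similarity = torch.einsum("kc,bcn->bkn", prompts, feats)      # cosine similarity per class and pixel
class_maps = similarity.view(B, NUM_CLASSES, H, W)            # class-specific activation maps
print(class_maps.shape)                                       # torch.Size([2, 5, 32, 32])
```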
Result: Extensive experiments on diverse subcellular image datasets demonstrate that ScSAM outperforms state-of-the-art methods in organelle segmentation tasks.
Conclusion: ScSAM successfully addresses the challenges of subcellular organelle segmentation by effectively combining SAM’s global contextual understanding with MAE-guided cellular knowledge, resulting in enhanced feature robustness and reduced training bias from data imbalance.
Abstract: The significant morphological and distributional variability among subcellular components poses a long-standing challenge for learning-based organelle segmentation models, significantly increasing the risk of biased feature learning. Existing methods often rely on single mapping relationships, overlooking feature diversity and thereby inducing biased training. Although the Segment Anything Model (SAM) provides rich feature representations, its application to subcellular scenarios is hindered by two key challenges: (1) The variability in subcellular morphology and distribution creates gaps in the label space, leading the model to learn spurious or biased features. (2) SAM focuses on global contextual understanding and often ignores fine-grained spatial details, making it challenging to capture subtle structural alterations and cope with skewed data distributions. To address these challenges, we introduce ScSAM, a method that enhances feature robustness by fusing pre-trained SAM with Masked Autoencoder (MAE)-guided cellular prior knowledge to alleviate training bias from data imbalance. Specifically, we design a feature alignment and fusion module to align pre-trained embeddings to the same feature space and efficiently combine different representations. Moreover, we present a cosine similarity matrix-based class prompt encoder to activate class-specific features to recognize subcellular categories. Extensive experiments on diverse subcellular image datasets demonstrate that ScSAM outperforms state-of-the-art methods.
[100] UNICE: Training A Universal Image Contrast Enhancer
Ruodai Cui, Lei Zhang
Main category: cs.CV
TL;DR: This paper proposes UNICE, a universal image contrast enhancement method that generates multi-exposure sequences from single images and fuses them for better generalization across different contrast enhancement tasks without requiring manual labeling.
Details
Motivation: Existing image contrast enhancement methods are task-specific and show poor generalization across different tasks and datasets. There is a need for a universal model that can handle various contrast enhancement scenarios effectively.
Method: The authors collect 46,928 HDR raw images and render 328,496 sRGB images to create multi-exposure sequences (MES) with pseudo ground-truths via multi-exposure fusion. They train two networks: one to generate MES from single sRGB images, and another to fuse the generated MES into enhanced images.
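The pseudo-ground-truth step can be illustrated with classical Mertens exposure fusion on a synthetic bracket; UNICE itself trains networks to generate and fuse the MES, so this only shows the fusion idea:

```python
# Classical multi-exposure fusion as a stand-in for building a pseudo ground-truth from an MES.
import numpy as np
import cv2

# Synthetic "exposure bracket": the same gradient scene at three exposure levels.
base = np.tile(np.linspace(0.0, 1.0, 256, dtype=np.float32), (256, 1))
base = cv2.merge([base, base, base])                         # HxWx3, values in [0, 1]
mes = [(np.clip(base * gain, 0, 1) * 255).astype(np.uint8) for gain in (0.4, 1.0, 2.5)]

fused = cv2.createMergeMertens().process(mes)                # float output, roughly in [0, 1]
pseudo_gt = np.clip(fused * 255, 0, 255).astype(np.uint8)
print(pseudo_gt.shape, pseudo_gt.dtype)                      # (256, 256, 3) uint8
```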
Result: UNICE demonstrates significantly stronger generalization performance than existing methods across different contrast enhancement tasks and datasets. It outperforms manually created ground-truths in multiple no-reference image quality metrics, all without requiring costly human labeling.
Conclusion: The proposed UNICE method successfully addresses the generalization problem in image contrast enhancement by leveraging multi-exposure fusion principles, providing a universal solution that works across various enhancement tasks while eliminating the need for manual annotation.
Abstract: Existing image contrast enhancement methods are typically designed for specific tasks such as under-/over-exposure correction, low-light and backlit image enhancement, etc. The learned models, however, exhibit poor generalization performance across different tasks, even across different datasets of a specific task. It is important to explore whether we can learn a universal and generalized model for various contrast enhancement tasks. In this work, we observe that the common key factor of these tasks lies in the need of exposure and contrast adjustment, which can be well-addressed if high-dynamic range (HDR) inputs are available. We hence collect 46,928 HDR raw images from public sources, and render 328,496 sRGB images to build multi-exposure sequences (MES) and the corresponding pseudo sRGB ground-truths via multi-exposure fusion. Consequently, we train a network to generate an MES from a single sRGB image, followed by training another network to fuse the generated MES into an enhanced image. Our proposed method, namely UNiversal Image Contrast Enhancer (UNICE), is free of costly human labeling. However, it demonstrates significantly stronger generalization performance than existing image contrast enhancement methods across and within different tasks, even outperforming manually created ground-truths in multiple no-reference image quality metrics. The dataset, code and model are available at https://github.com/BeyondHeaven/UNICE.
[101] DOOMGAN: High-Fidelity Dynamic Identity Obfuscation Ocular Generative Morphing
Bharath Krishnamurthy, Ajita Rattani
Main category: cs.CV
TL;DR: This paper introduces DOOMGAN, a novel GAN-based method for generating morphing attacks on visible-spectrum ocular biometrics, achieving significantly higher attack success rates while maintaining realistic iris and periocular features.
Details
Motivation: Visible-spectrum ocular biometrics are vulnerable to morphing attacks (synthetic traits blending multiple individuals), but this threat remains underexplored compared to near-infrared iris and face biometrics. Current generation models struggle with uncontrolled conditions while preserving detailed ocular features.
Method: DOOMGAN employs three key components: (1) landmark-driven encoding of visible ocular anatomy, (2) attention-guided generation for realistic morph synthesis, and (3) dynamic weighting of multi-faceted losses for optimized convergence.
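The summary does not specify the weighting scheme, but one standard way to weight several loss terms dynamically is learnable homoscedastic-uncertainty weighting (Kendall et al., 2018); the sketch below illustrates that generic idea, not DOOMGAN's exact formulation:

```python
# Dynamic weighting of multiple loss terms via learnable log-variances (generic scheme).
import torch
import torch.nn as nn

class DynamicLossWeighting(nn.Module):
    def __init__(self, num_losses: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_losses))  # one learnable log-variance per term

    def forward(self, losses):
        total = 0.0
        for i, loss in enumerate(losses):
            precision = torch.exp(-self.log_vars[i])
            total = total + precision * loss + self.log_vars[i]  # weighted term + regularizer
        return total

weighter = DynamicLossWeighting(3)
adv_loss, landmark_loss, identity_loss = torch.tensor(0.8), torch.tensor(0.3), torch.tensor(1.2)
print(float(weighter([adv_loss, landmark_loss, identity_loss])))
```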
Result: DOOMGAN achieves over 20% higher attack success rates than baseline methods under stringent thresholds, 20% better elliptical iris structure generation, and 30% improved gaze consistency. The authors also release the first comprehensive ocular morphing dataset.
Conclusion: DOOMGAN successfully addresses the gap in visible-spectrum ocular morphing attacks, demonstrating superior performance in generating realistic morphed ocular images that can effectively fool biometric systems while maintaining anatomical accuracy.
Abstract: Ocular biometrics in the visible spectrum have emerged as a prominent modality due to their high accuracy, resistance to spoofing, and non-invasive nature. However, morphing attacks, synthetic biometric traits created by blending features from multiple individuals, threaten biometric system integrity. While extensively studied for near-infrared iris and face biometrics, morphing in visible-spectrum ocular data remains underexplored. Simulating such attacks demands advanced generation models that handle uncontrolled conditions while preserving detailed ocular features like iris boundaries and periocular textures. To address this gap, we introduce DOOMGAN, that encompasses landmark-driven encoding of visible ocular anatomy, attention-guided generation for realistic morph synthesis, and dynamic weighting of multi-faceted losses for optimized convergence. DOOMGAN achieves over 20% higher attack success rates than baseline methods under stringent thresholds, along with 20% better elliptical iris structure generation and 30% improved gaze consistency. We also release the first comprehensive ocular morphing dataset to support further research in this domain.
[102] TransLPRNet: Lite Vision-Language Network for Single/Dual-line Chinese License Plate Recognition
Guangzhu Xu, Zhi Ke, Pengcheng Zuo, Bangjun Lei
Main category: cs.CV
TL;DR: A unified license plate recognition system that combines a lightweight visual encoder with text decoder and perspective correction network, achieving 99.34-99.58% accuracy on single-line and 98.70% on double-line Chinese license plates at 167 FPS.
Details
Motivation: Existing CNN and CRNN-based license plate recognition approaches face limitations when dealing with diverse license plate types and imaging conditions in open environments, particularly with the scarcity of double-line license plate datasets and varying perspective distortions.
Method: The paper proposes a unified framework integrating: (1) a lightweight visual encoder with text decoder in a pre-training framework for single/double-line Chinese license plates, (2) synthetic dataset construction using texture mapping and blending with real images, and (3) a perspective correction network (PTN) using corner coordinate regression supervised by view classification.
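Once the PTN has regressed four plate corners, rectification reduces to a standard perspective warp, sketched below with hard-coded corners standing in for the network's output; the corner values and target plate size are illustrative:

```python
# Perspective rectification from predicted plate corners (OpenCV sketch).
import numpy as np
import cv2

image = np.zeros((480, 640, 3), dtype=np.uint8)             # placeholder camera frame
corners = np.float32([[220, 180], [430, 200], [440, 290], [210, 265]])  # TL, TR, BR, BL (predicted)

PLATE_W, PLATE_H = 240, 80                                   # target single-line plate canvas
target = np.float32([[0, 0], [PLATE_W, 0], [PLATE_W, PLATE_H], [0, PLATE_H]])

H = cv2.getPerspectiveTransform(corners, target)
rectified = cv2.warpPerspective(image, H, (PLATE_W, PLATE_H))
print(rectified.shape)                                       # (80, 240, 3), ready for the text decoder
```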
Result: The system achieves 99.34% accuracy on corrected CCPD test set under coarse localization disturbance, 99.58% under fine localization disturbance, and 98.70% on double-line license plates, with processing speeds up to 167 frames per second.
Conclusion: The proposed unified license plate recognition system successfully addresses the challenges of diverse license plate types and imaging conditions through effective integration of visual encoding, text decoding, and perspective correction, demonstrating strong practical applicability with high accuracy and real-time processing capabilities.
Abstract: License plate recognition in open environments is widely applicable across various domains; however, the diversity of license plate types and imaging conditions presents significant challenges. To address the limitations encountered by CNN and CRNN-based approaches in license plate recognition, this paper proposes a unified solution that integrates a lightweight visual encoder with a text decoder, within a pre-training framework tailored for single and double-line Chinese license plates. To mitigate the scarcity of double-line license plate datasets, we constructed a single/double-line license plate dataset by synthesizing images, applying texture mapping onto real scenes, and blending them with authentic license plate images. Furthermore, to enhance the system’s recognition accuracy, we introduce a perspective correction network (PTN) that employs license plate corner coordinate regression as an implicit variable, supervised by license plate view classification information. This network offers improved stability, interpretability, and low annotation costs. The proposed algorithm achieves an average recognition accuracy of 99.34% on the corrected CCPD test set under coarse localization disturbance. When evaluated under fine localization disturbance, the accuracy further improves to 99.58%. On the double-line license plate test set, it achieves an average recognition accuracy of 98.70%, with processing speeds reaching up to 167 frames per second, indicating strong practical applicability.
[103] Multi-Scale PCB Defect Detection with YOLOv8 Network Improved via Pruning and Lightweight Network
Li Pingzhen, Xu Sheng, Chen Jing, Su Chengyue
Main category: cs.CV
TL;DR: This paper presents an improved YOLOv8-based method for PCB defect detection that achieves high accuracy (99.32% mAP0.5) and real-time performance through network lightweighting, adaptive pruning, and specialized components for tiny target detection.
Details
Motivation: Traditional PCB defect detection models struggle to balance accuracy and computational cost for high-density PCB designs and high-speed production, failing to meet requirements for accurate real-time detection of tiny defects.
Method: The authors develop a multi-scale PCB defect detection method using YOLOv8 with: (1) Ghost-HGNetv2 backbone for parameter reduction, (2) C2f-Faster neck for enhanced multi-level feature fusion, (3) GCDetect detection head with shared GroupConv weights, (4) Inner-MPDIoU boundary loss function for tiny target detection, and (5) adaptive pruning for model complexity reduction.
Result: On a publicly available PCB defect dataset, the model achieves 99.32% mAP0.5 and 75.18% mAP0.5:0.9, representing a 10.13% improvement over YOLOv8n while maintaining superior speed performance.
Conclusion: The proposed method successfully balances detection accuracy and computational efficiency for PCB defect detection, demonstrating significant improvements in both metrics compared to baseline YOLOv8n and meeting the requirements for real-time tiny defect detection in high-speed PCB production.
Abstract: With the high density of printed circuit board (PCB) design and the high speed of production, the traditional PCB defect detection model is difficult to take into account the accuracy and computational cost, and cannot meet the requirements of high accuracy and real-time detection of tiny defects. Therefore, in this paper, a multi-scale PCB defect detection method is improved with YOLOv8 using a comprehensive strategy of tiny target sensitivity strategy, network lightweighting and adaptive pruning, which is able to improve the detection speed and accuracy by optimizing the backbone network, the neck network and the detection head, the loss function and the adaptive pruning rate. Firstly, a Ghost-HGNetv2 structure with fewer parameters is used in the backbone network, and multilevel features are used to extract image semantic features to discover accurate defects. Secondly, we integrate C2f-Faster with small number of parameters in the neck section to enhance the ability of multi-level feature fusion. Next, in the Head part, we design a new GCDetect detection head, which allows the prediction of bounding boxes and categories to share the weights of GroupConv, and uses a small number of grouping convolutions to accomplish the regression and classification tasks, which significantly reduces the number of parameters while maintaining the accuracy of detection. We also design the Inner-MPDIoU boundary loss function to improve the detection and localization of tiny targets. Finally, the model was pruned by an optimized adaptive pruning rate to further reduce the complexity of the model. Experimental results show that the model exhibits advantages in terms of accuracy and speed. On the publicly available PCB defect dataset, mAP0.5 reaches 99.32% and mAP0.5:0.9 reaches 75.18%, which is 10.13% higher compared to YOLOv8n.
[104] Hierarchical Fusion and Joint Aggregation: A Multi-Level Feature Representation Method for AIGC Image Quality Assessment
Linghe Meng, Jiarun Song
Main category: cs.CV
TL;DR: This paper proposes a multi-level visual representation paradigm for AI-generated content quality assessment, developing two networks (MGLF-Net and MPEF-Net) that combine local/global features and text-image correspondence to better evaluate AIGC quality across perceptual and semantic dimensions.
Details
Motivation: Existing AIGC quality assessment methods rely on single-level visual features, which limits their ability to capture complex distortions in AI-generated images that span from low-level visual perception to high-level semantic understanding.
Method: A multi-level visual representation paradigm with three stages: (1) multi-level feature extraction, (2) hierarchical fusion, and (3) joint aggregation. Two networks are developed: MGLF-Net uses dual CNN and Transformer backbones for perceptual quality assessment, while MPEF-Net embeds prompt semantics into visual feature fusion for text-to-image correspondence evaluation.
Result: Experiments on benchmarks demonstrate outstanding performance on both perceptual quality assessment and text-to-image correspondence tasks, validating the effectiveness of the proposed approach.
Conclusion: The multi-level visual assessment paradigm effectively addresses the limitations of single-level approaches by capturing both low-level visual and high-level semantic aspects of AIGC quality, with the two proposed networks showing superior performance on benchmark evaluations.
Abstract: The quality assessment of AI-generated content (AIGC) faces multi-dimensional challenges, that span from low-level visual perception to high-level semantic understanding. Existing methods generally rely on single-level visual features, limiting their ability to capture complex distortions in AIGC images. To address this limitation, a multi-level visual representation paradigm is proposed with three stages, namely multi-level feature extraction, hierarchical fusion, and joint aggregation. Based on this paradigm, two networks are developed. Specifically, the Multi-Level Global-Local Fusion Network (MGLF-Net) is designed for the perceptual quality assessment, extracting complementary local and global features via dual CNN and Transformer visual backbones. The Multi-Level Prompt-Embedded Fusion Network (MPEF-Net) targets Text-to-Image correspondence by embedding prompt semantics into the visual feature fusion process at each feature level. The fused multi-level features are then aggregated for final evaluation. Experiments on benchmarks demonstrate outstanding performance on both tasks, validating the effectiveness of the proposed multi-level visual assessment paradigm.
[105] URPO: A Unified Reward & Policy Optimization Framework for Large Language Models
Songshuo Lu, Hua Wang, Zhi Chen, Yaohua Tang
Main category: cs.CV
TL;DR: URPO unifies reward modeling and policy optimization in a single model and training phase, eliminating the need for separate reward models and achieving better performance than traditional alignment pipelines.
Details
Motivation: Traditional alignment pipelines use separate policy and reward models with frozen parameters during RL, creating complex resource-intensive processes with performance limitations due to static reward signals. This separation is inefficient and creates a performance ceiling.
Method: A Unified Reward & Policy Optimization (URPO) framework that combines instruction-following and reward modeling in one model and one training phase. All alignment data (preference pairs, verifiable reasoning, open-ended instructions) is converted to a unified generative format and optimized with a single Group-Relative Policy Optimization (GRPO) loop.
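The group-relative step at the heart of GRPO can be sketched in a few lines: each response's advantage is its reward standardized against the other responses sampled for the same prompt. The policy update and URPO's self-generated rewards are omitted, and the reward values below are placeholders:

```python
# Group-relative advantage computation used by GRPO-style training (sketch).
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: [num_prompts, group_size] -- one row of sampled responses per prompt
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[0.2, 0.9, 0.4, 0.7],    # group of 4 responses for prompt 1
                        [1.0, 1.0, 0.1, 0.3]])   # group of 4 responses for prompt 2
print(group_relative_advantages(rewards))
```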
Result: On Qwen2.5-7B model: instruction-following score improved from 42.24 to 44.84 on AlpacaEval, composite reasoning average increased from 32.66 to 35.66, and achieved RewardBench score of 85.15 (surpassing dedicated reward model’s 83.55).
Conclusion: URPO provides a simpler, more efficient, and more effective approach to language model alignment by eliminating separate reward models and enabling co-evolutionary dynamics between generation and evaluation, while also developing superior internal evaluation capabilities.
Abstract: Large-scale alignment pipelines typically pair a policy model with a separately trained reward model whose parameters remain frozen during reinforcement learning (RL). This separation creates a complex, resource-intensive pipeline and suffers from a performance ceiling due to a static reward signal. We propose a novel framework, Unified Reward & Policy Optimization (URPO), that unifies instruction-following (“player”) and reward modeling (“referee”) within a single model and a single training phase. Our method recasts all alignment data-including preference pairs, verifiable reasoning, and open-ended instructions-into a unified generative format optimized by a single Group-Relative Policy Optimization (GRPO) loop. This enables the model to learn from ground-truth preferences and verifiable logic while simultaneously generating its own rewards for open-ended tasks. Experiments on the Qwen2.5-7B model demonstrate URPO’s superiority. Our unified model significantly outperforms a strong baseline using a separate generative reward model, boosting the instruction-following score on AlpacaEval from 42.24 to 44.84 and the composite reasoning average from 32.66 to 35.66. Furthermore, URPO cultivates a superior internal evaluator as a byproduct of training, achieving a RewardBench score of 85.15 and surpassing the dedicated reward model it replaces (83.55). By eliminating the need for a separate reward model and fostering a co-evolutionary dynamic between generation and evaluation, URPO presents a simpler, more efficient, and more effective path towards robustly aligned language models.
[106] Asymmetric Lesion Detection with Geometric Patterns and CNN-SVM Classification
M. A. Rasel, Sameem Abdul Kareem, Zhenli Kwan, Nik Aimee Azizah Faheem, Winn Hui Han, Rebecca Kai Jan Choong, Shin Shen Yong, Unaizah Obaidellah
Main category: cs.CV
TL;DR: This paper presents a dual approach for analyzing lesion shape asymmetry in dermoscopic images: a geometry-based algorithm achieving 99% detection rate for asymmetric lesions, and a CNN-SVM hybrid method achieving 94% Kappa Score for classifying lesions into asymmetric, half-symmetric, and symmetric categories.
Details
Motivation: Asymmetric lesion shape is a critical criterion for melanoma diagnosis in clinical practice. The study aims to develop automated methods to help non-experts understand and identify asymmetric lesions in dermoscopic images, where surface skin structures invisible to the naked eye can be visualized.
Method: The researchers employed two main approaches: (1) a supervised learning image processing algorithm for geometrical pattern analysis of lesion shapes, and (2) a hybrid system using pre-trained CNNs to extract shape, color, and texture features combined with a multiclass SVM classifier for lesion classification. They also labeled a non-annotated dataset with symmetrical information based on clinical assessments.
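A toy illustration of reflection-based asymmetry scoring on a binary lesion mask follows; the paper's geometry-based algorithm is more elaborate, so this only conveys the mirror-and-overlap idea:

```python
# Reflect a lesion mask about a vertical axis through its centroid and score non-overlap.
import numpy as np

def asymmetry_score(mask: np.ndarray) -> float:
    ys, xs = np.nonzero(mask)
    cx = int(round(xs.mean()))                       # centroid column used as reflection axis
    w = mask.shape[1]
    pad_left, pad_right = max(0, w - 2 * cx), max(0, 2 * cx - w)
    padded = np.pad(mask, ((0, 0), (pad_left, pad_right)))   # center the axis in the array
    mirrored = padded[:, ::-1]
    inter = np.logical_and(padded, mirrored).sum()
    union = np.logical_or(padded, mirrored).sum()
    return 1.0 - inter / union                       # near 0 for mirror-symmetric shapes

circle = np.zeros((64, 64), dtype=bool)
yy, xx = np.ogrid[:64, :64]
circle[(yy - 32) ** 2 + (xx - 32) ** 2 < 20 ** 2] = True
blob = circle.copy()
blob[:, 40:] = False                                 # chop one side to make it asymmetric
print(round(asymmetry_score(circle), 3), round(asymmetry_score(blob), 3))  # small vs. clearly larger
```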
Result: The geometry-based experiment achieved a 99.00% detection rate for dermatological asymmetric lesions. The CNN-based approach delivered 94% Kappa Score, 95% Macro F1-score, and 97% Weighted F1-score for classifying lesions into three categories: Asymmetric, Half-Symmetric, and Symmetric. Both methods outperformed existing state-of-the-art approaches.
Conclusion: The proposed dual methodology successfully automates asymmetric lesion detection in dermoscopic images, providing a valuable tool for non-experts in melanoma diagnosis. The high performance metrics demonstrate the effectiveness of both geometry-based and CNN-SVM hybrid approaches, with potential clinical applications for early skin cancer detection.
Abstract: In dermoscopic images, which allow visualization of surface skin structures not visible to the naked eye, lesion shape offers vital insights into skin diseases. In clinically practiced methods, asymmetric lesion shape is one of the criteria for diagnosing melanoma. Initially, we labeled data for a non-annotated dataset with symmetrical information based on clinical assessments. Subsequently, we propose a supporting technique, a supervised learning image processing algorithm, to analyze the geometrical pattern of lesion shape, aiding non-experts in understanding the criteria of an asymmetric lesion. We then utilize a pre-trained convolutional neural network (CNN) to extract shape, color, and texture features from dermoscopic images for training a multiclass support vector machine (SVM) classifier, outperforming state-of-the-art methods from the literature. In the geometry-based experiment, we achieved a 99.00% detection rate for dermatological asymmetric lesions. In the CNN-based experiment, the best performance is found with 94% Kappa Score, 95% Macro F1-score, and 97% Weighted F1-score for classifying lesion shapes (Asymmetric, Half-Symmetric, and Symmetric).
[107] Dual-branch Prompting for Multimodal Machine Translation
Jie Wang, Zhendong Yang, Liansong Zong, Xiaobo Zhang, Dexian Wang, Ji Zhang
Main category: cs.CV
TL;DR: D2P-MMT is a diffusion-based dual-branch framework for multimodal machine translation that uses reconstructed images from diffusion models instead of original paired images, achieving better robustness and translation performance while requiring only source text at inference.
Details
Motivation: Existing multimodal machine translation approaches rely on paired image-text inputs at inference and are sensitive to irrelevant visual noise, which limits their robustness and practical applicability in real-world scenarios.
Method: The paper proposes D2P-MMT, which uses a pre-trained diffusion model to generate reconstructed images that filter out distracting visual details while preserving semantic cues. The framework employs a dual-branch prompting strategy during training with both authentic and reconstructed images, and introduces a distributional alignment loss to enforce consistency between the output distributions of the two branches.
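A minimal sketch of a distributional alignment loss between the two branches, written as a symmetrized KL divergence over output token distributions; the exact form used in D2P-MMT (direction, temperature, weighting) is an assumption:

```python
# Symmetrized KL between the two branches' output distributions (sketch).
import torch
import torch.nn.functional as F

def alignment_loss(logits_auth: torch.Tensor, logits_recon: torch.Tensor) -> torch.Tensor:
    # logits: [batch, seq_len, vocab] from the authentic-image and reconstructed-image branches
    p = F.log_softmax(logits_auth, dim=-1)
    q = F.log_softmax(logits_recon, dim=-1)
    kl_pq = F.kl_div(q, p, log_target=True, reduction="batchmean")
    kl_qp = F.kl_div(p, q, log_target=True, reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp)

logits_a = torch.randn(2, 7, 1000)
logits_r = torch.randn(2, 7, 1000)
print(float(alignment_loss(logits_a, logits_r)))
```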
Result: Extensive experiments on the Multi30K dataset show that D2P-MMT achieves superior translation performance compared to existing state-of-the-art multimodal machine translation approaches.
Conclusion: D2P-MMT successfully addresses the robustness issues in multimodal machine translation by leveraging diffusion-generated images and dual-branch training, demonstrating improved performance while reducing dependency on paired visual inputs during inference.
Abstract: Multimodal Machine Translation (MMT) typically enhances text-only translation by incorporating aligned visual features. Despite the remarkable progress, state-of-the-art MMT approaches often rely on paired image-text inputs at inference and are sensitive to irrelevant visual noise, which limits their robustness and practical applicability. To address these issues, we propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. Specifically, D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model, which naturally filters out distracting visual details while preserving semantic cues. During training, the model jointly learns from both authentic and reconstructed images using a dual-branch prompting strategy, encouraging rich cross-modal interactions. To bridge the modality gap and mitigate training-inference discrepancies, we introduce a distributional alignment loss that enforces consistency between the output distributions of the two branches. Extensive experiments on the Multi30K dataset demonstrate that D2P-MMT achieves superior translation performance compared to existing state-of-the-art approaches.
[108] Vec2Face+ for Face Dataset Generation
Haiyu Wu, Jaskirat Singh, Sicong Tian, Liang Zheng, Kevin W. Bowyer
Main category: cs.CV
TL;DR: The paper introduces Vec2Face+, a generative model that creates synthetic face datasets with improved identity consistency and attribute variation, achieving state-of-the-art face recognition performance that surpasses real-world datasets like CASIA-WebFace for the first time.
Details
Motivation: Existing synthetic face generation methods for training data overlook the necessity of maintaining intra-class identity consistency while increasing intra-class variation, leading to suboptimal face recognition training datasets despite having large inter-class separability.
Method: Vec2Face+ generates images directly from image features with three key strategies: 1) sampling sufficiently different vectors for well-separated identities, 2) AttrOP algorithm for increasing general attribute variations, and 3) LoRA-based pose control for efficient identity-preserving profile head pose generation.
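Strategy 1) can be sketched as rejection sampling: a candidate identity vector is kept only if its cosine similarity to every accepted identity stays below a threshold. The dimension and threshold below are illustrative, not the values used to build VFace10K:

```python
# Rejection sampling of well-separated identity vectors (sketch).
import numpy as np

def sample_identities(num_ids: int, dim: int = 512, max_cos: float = 0.3, seed: int = 0):
    rng = np.random.default_rng(seed)
    accepted = []
    while len(accepted) < num_ids:
        v = rng.normal(size=dim)
        v /= np.linalg.norm(v)                       # unit-norm identity vector
        if all(float(v @ u) < max_cos for u in accepted):
            accepted.append(v)                       # sufficiently different from all accepted ids
    return np.stack(accepted)

ids = sample_identities(100)
sims = ids @ ids.T
np.fill_diagonal(sims, -1.0)
print(ids.shape, round(float(sims.max()), 3))        # every off-diagonal cosine stays below 0.3
```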
Result: VFace10K (10K identities) achieves state-of-the-art accuracy on seven real-world test sets. Larger datasets VFace100K and VFace300K (4M and 12M images) outperform the real-world CASIA-WebFace dataset on five test sets, marking the first time synthetic data beats CASIA-WebFace in average accuracy.
Conclusion: Vec2Face+ successfully generates high-quality synthetic face datasets that outperform real-world training data, though the authors identify concerning issues with twin verification performance and bias in synthetic datasets that require future investigation.
Abstract: When synthesizing identities as face recognition training data, it is generally believed that large inter-class separability and intra-class attribute variation are essential for synthesizing a quality dataset. % This belief is generally correct, and this is what we aim for. However, when increasing intra-class variation, existing methods overlook the necessity of maintaining intra-class identity consistency. % To address this and generate high-quality face training data, we propose Vec2Face+, a generative model that creates images directly from image features and allows for continuous and easy control of face identities and attributes. Using Vec2Face+, we obtain datasets with proper inter-class separability and intra-class variation and identity consistency using three strategies: 1) we sample vectors sufficiently different from others to generate well-separated identities; 2) we propose an AttrOP algorithm for increasing general attribute variations; 3) we propose LoRA-based pose control for generating images with profile head poses, which is more efficient and identity-preserving than AttrOP. % Our system generates VFace10K, a synthetic face dataset with 10K identities, which allows an FR model to achieve state-of-the-art accuracy on seven real-world test sets. Scaling the size to 4M and 12M images, the corresponding VFace100K and VFace300K datasets yield higher accuracy than the real-world training dataset, CASIA-WebFace, on five real-world test sets. This is the first time a synthetic dataset beats the CASIA-WebFace in average accuracy. In addition, we find that only 1 out of 11 synthetic datasets outperforms random guessing (\emph{i.e., 50%}) in twin verification and that models trained with synthetic identities are more biased than those trained with real identities. Both are important aspects for future investigation.
[109] DesignLab: Designing Slides Through Iterative Detection and Correction
Jooyeol Yun, Heng Wang, Yotaro Shimose, Jaegul Choo, Shingo Takamatsu
Main category: cs.CV
TL;DR: DesignLab introduces a two-role iterative approach for automated slide design, where a design reviewer identifies issues and a design contributor fixes them, outperforming existing methods by mimicking real-world design workflows.
Details
Motivation: Existing automated slide design tools lack the ability to refine their own output, which is crucial in real-world design workflows. Non-experts struggle with the complexity of design choices, and current tools cannot iteratively improve their suggestions.
Method: The paper proposes DesignLab, which decomposes the design process into two roles: (1) a design reviewer that identifies design-related issues, and (2) a design contributor that corrects these issues. They fine-tune large language models for both roles and simulate intermediate drafts using controlled perturbations to train the models on design errors and fixes.
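A minimal sketch of the reviewer/contributor loop; the two calls are simple stubs standing in for the fine-tuned LLMs, and the issue format and stopping rule are assumptions rather than DesignLab's actual interface:

```python
# Iterative detect-and-correct loop with stub reviewer and contributor roles.
def review_slide(slide: dict) -> list:
    issues = []                                       # reviewer: detect design problems
    if slide.get("font_count", 1) > 2:
        issues.append("too many fonts")
    if slide.get("contrast", 1.0) < 0.5:
        issues.append("low text/background contrast")
    return issues

def fix_slide(slide: dict, issues: list) -> dict:
    fixed = dict(slide)                               # contributor: correct the reported issues
    if "too many fonts" in issues:
        fixed["font_count"] = 2
    if "low text/background contrast" in issues:
        fixed["contrast"] = 0.8
    return fixed

def refine(slide: dict, max_iters: int = 5) -> dict:
    for _ in range(max_iters):
        issues = review_slide(slide)
        if not issues:                                # stop once the reviewer finds nothing to fix
            break
        slide = fix_slide(slide, issues)
    return slide

draft = {"font_count": 4, "contrast": 0.3}
print(refine(draft))                                  # {'font_count': 2, 'contrast': 0.8}
```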
Result: DesignLab outperforms existing design-generation methods, including commercial tools, by enabling iterative refinement that produces polished, professional slides with qualities that were previously unattainable through single-pass generation.
Conclusion: The iterative two-role approach successfully mimics real-world design workflows, enabling automated tools to continuously improve slide designs through multiple iterations, resulting in higher quality outputs than traditional single-pass generation methods.
Abstract: Designing high-quality presentation slides can be challenging for non-experts due to the complexity involved in navigating various design choices. Numerous automated tools can suggest layouts and color schemes, yet often lack the ability to refine their own output, which is a key aspect in real-world workflows. We propose DesignLab, which separates the design process into two roles, the design reviewer, who identifies design-related issues, and the design contributor who corrects them. This decomposition enables an iterative loop where the reviewer continuously detects issues and the contributor corrects them, allowing a draft to be further polished with each iteration, reaching qualities that were unattainable. We fine-tune large language models for these roles and simulate intermediate drafts by introducing controlled perturbations, enabling the design reviewer learn design errors and the contributor learn how to fix them. Our experiments show that DesignLab outperforms existing design-generation methods, including a commercial tool, by embracing the iterative nature of designing which can result in polished, professional slides.
[110] VBCD: A Voxel-Based Framework for Personalized Dental Crown Design
Linda Wei, Chang Liu, Wenran Zhang, Zengji Zhang, Shaoting Zhang, Hongsheng Li
Main category: cs.CV
TL;DR: A novel voxel-based framework (VBCD) automates dental crown design from intraoral scans using a coarse-to-fine approach with distance-aware supervision and specialized loss functions, outperforming existing methods on large-scale datasets.
Details
Motivation: The design of restorative dental crowns from intraoral scans is labor-intensive for dental technicians, requiring automation to reduce manual effort and improve efficiency in personalized dental crown design.
Method: A voxel-based framework with two stages: (1) initial coarse dental crown generation from voxelized intraoral scans, and (2) fine-grained refinement with distance-aware supervision. It uses the Curvature and Margin line Penalty Loss (CMPL) for better margin line alignment and positional prompts based on the FDI tooth numbering system for improved accuracy.
Result: The proposed VBCD framework outperformed existing methods when evaluated on a large-scale dataset of intraoral scans, demonstrating superior performance in automated dental crown design tasks.
Conclusion: The VBCD framework provides a robust solution for personalized dental crown design by successfully automating the labor-intensive process while achieving better performance than existing approaches through its voxel-based coarse-to-fine methodology.
Abstract: The design of restorative dental crowns from intraoral scans is labor-intensive for dental technicians. To address this challenge, we propose a novel voxel-based framework for automated dental crown design (VBCD). The VBCD framework generates an initial coarse dental crown from voxelized intraoral scans, followed by a fine-grained refiner incorporating distance-aware supervision to improve accuracy and quality. During the training stage, we employ the Curvature and Margin line Penalty Loss (CMPL) to enhance the alignment of the generated crown with the margin line. Additionally, a positional prompt based on the FDI tooth numbering system is introduced to further improve the accuracy of the generated dental crowns. Evaluation on a large-scale dataset of intraoral scans demonstrated that our approach outperforms existing methods, providing a robust solution for personalized dental crown design.
[111] A Low-Cost Machine Learning Approach for Timber Diameter Estimation
Fatemeh Hasanzadeh Fard, Sanaz Hasanzadeh Fard, Mehdi Jonoobi
Main category: cs.CV
TL;DR: This study develops a cost-effective machine learning solution using YOLOv5 to automatically estimate timber log diameter from standard RGB images taken in real industrial conditions, achieving 0.64 mAP@0.5 for practical wood processing applications.
Details
Motivation: Traditional wood processing relies on slow, inconsistent, and error-prone manual methods for species and thickness identification. The industry needs accurate, efficient, and cost-effective automation solutions that work under real-world conditions without expensive sensors or controlled environments.
Method: The researchers employed the YOLOv5 object detection algorithm fine-tuned on the public TimberSeg 1.0 dataset. The model detects individual timber logs and estimates thickness through bounding-box dimensions using standard RGB images captured in typical industrial sheds during timber delivery.
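The thickness estimate itself is a simple conversion from bounding-box size to physical units. In the sketch below the boxes are placeholders for YOLOv5 predictions and the calibration factor is assumed to come from a known reference object in the scene:

```python
# Bounding-box height to log diameter, with confidence filtering (sketch).
detections = [                       # (x1, y1, x2, y2, confidence) in pixels
    (120, 200, 380, 310, 0.91),
    (400, 180, 650, 270, 0.84),
    (100, 350, 360, 480, 0.42),      # low-confidence detection, filtered out below
]

CM_PER_PIXEL = 0.12                  # calibration from a known reference in the image (assumed)
CONF_THRESHOLD = 0.5

def estimate_diameters(dets, cm_per_px, conf_thr):
    out = []
    for x1, y1, x2, y2, conf in dets:
        if conf < conf_thr:
            continue
        box_height_px = y2 - y1      # the log end face is roughly circular, so height ~ diameter
        out.append(round(box_height_px * cm_per_px, 1))
    return out

print(estimate_diameters(detections, CM_PER_PIXEL, CONF_THRESHOLD))   # [13.2, 10.8]
```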
Result: The model achieved a mean Average Precision (mAP@0.5) of 0.64, demonstrating reliable log detection performance even with modest computing resources. The solution proves to be lightweight and scalable for practical applications.
Conclusion: This lightweight, scalable solution shows promise for practical integration into existing wood processing workflows, particularly for on-site inventory management and preliminary sorting in small and medium-sized operations, offering a cost-effective alternative to expensive sensor-based systems.
Abstract: The wood processing industry, particularly in facilities such as sawmills and MDF production lines, requires accurate and efficient identification of species and thickness of the wood. Although traditional methods rely heavily on expert human labor, they are slow, inconsistent, and prone to error, especially when processing large volumes. This study focuses on practical and cost-effective machine learning frameworks that automate the estimation of timber log diameter using standard RGB images captured under real-world working conditions. We employ the YOLOv5 object detection algorithm, fine-tuned on a public dataset (TimberSeg 1.0), to detect individual timber logs and estimate thickness through bounding-box dimensions. Unlike previous methods that require expensive sensors or controlled environments, this model is trained on images taken in typical industrial sheds during timber delivery. Experimental results show that the model achieves a mean Average Precision (mAP@0.5) of 0.64, demonstrating reliable log detection even with modest computing resources. This lightweight, scalable solution holds promise for practical integration into existing workflows, including on-site inventory management and preliminary sorting, particularly in small and medium-sized operations.
[112] PIG-Nav: Key Insights for Pretrained Image Goal Navigation Models
Jiansong Wan, Chengming Zhou, Jinkua Liu, Xiangge Huang, Xiaoyu Chen, Xiaohan Yi, Qisen Yang, Baiting Zhu, Xin-Qiang Cai, Lixing Liu, Rushuai Yang, Chuheng Zhang, Sherif Abdelfattah, Hayong Shin, Pushi Zhang, Li Zhao, Jiang Bian
Main category: cs.CV
TL;DR: PIG-Nav introduces a pretrained vision-based robotic navigation model that achieves 22.6% improvement in zero-shot and 37.5% improvement in fine-tuning settings by using early-fusion ViT architecture, auxiliary tasks, and augmenting training data with game videos.
Details
Motivation: Existing vision-based robotic navigation models struggle with generalization across diverse environments and achieving good zero-shot performance in unseen settings, requiring better pretraining strategies for foundation models.Method: The approach introduces two key model improvements: (1) early-fusion network structure combining visual observations and goal images via pretrained Vision Transformer (ViT) encoders, and (2) auxiliary tasks for enhanced global navigation representation learning. Additionally, they propose a novel data preprocessing pipeline to efficiently label large-scale game video datasets for training.
Result: PIG-Nav achieves 22.6% average improvement in zero-shot settings and 37.5% improvement in fine-tuning settings compared to existing visual navigation foundation models, tested across two complex simulated environments and one real-world environment. The model maintains competitive performance while requiring significantly less fine-tuning data.
Conclusion: The work advances state-of-the-art in pretrained image-goal navigation models by demonstrating that proper architectural choices and diverse training data from game videos can significantly improve navigation performance with minimal labeled supervision, making it promising for real-world deployment.
Abstract: Recent studies have explored pretrained (foundation) models for vision-based robotic navigation, aiming to achieve generalizable navigation and positive transfer across diverse environments while enhancing zero-shot performance in unseen settings. In this work, we introduce PIG-Nav (Pretrained Image-Goal Navigation), a new approach that further investigates pretraining strategies for vision-based navigation models and contributes in two key areas. Model-wise, we identify two critical design choices that consistently improve the performance of pretrained navigation models: (1) integrating an early-fusion network structure to combine visual observations and goal images via appropriately pretrained Vision Transformer (ViT) image encoder, and (2) introducing suitable auxiliary tasks to enhance global navigation representation learning, thus further improving navigation performance. Dataset-wise, we propose a novel data preprocessing pipeline for efficiently labeling large-scale game video datasets for navigation model training. We demonstrate that augmenting existing open navigation datasets with diverse gameplay videos improves model performance. Our model achieves an average improvement of 22.6% in zero-shot settings and a 37.5% improvement in fine-tuning settings over existing visual navigation foundation models in two complex simulated environments and one real-world environment. These results advance the state-of-the-art in pretrained image-goal navigation models. Notably, our model maintains competitive performance while requiring significantly less fine-tuning data, highlighting its potential for real-world deployment with minimal labeled supervision.
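To make the early-fusion design choice concrete, the following toy module patchifies the observation and goal images, tags each token with a type embedding, and encodes both sets jointly before predicting waypoints. This is a from-scratch sketch, not the authors' pretrained ViT architecture; all dimensions and the waypoint head are assumptions.

```python
import torch
import torch.nn as nn

class EarlyFusionGoalNav(nn.Module):
    """Minimal early-fusion sketch: observation and goal images are patchified,
    concatenated into one token sequence, and encoded jointly (the paper uses a
    pretrained ViT encoder; this toy version trains from scratch)."""

    def __init__(self, img_size=128, patch=16, dim=256, depth=4, heads=8, n_waypoints=5):
        super().__init__()
        n_tokens = (img_size // patch) ** 2
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # A learned embedding distinguishes observation tokens from goal tokens.
        self.type_embed = nn.Parameter(torch.zeros(2, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 2 * n_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.waypoint_head = nn.Linear(dim, n_waypoints * 2)  # (x, y) per waypoint

    def tokens(self, img, type_idx):
        t = self.patchify(img).flatten(2).transpose(1, 2)  # B x N x dim
        return t + self.type_embed[type_idx]

    def forward(self, obs, goal):
        x = torch.cat([self.tokens(obs, 0), self.tokens(goal, 1)], dim=1)
        x = self.encoder(x + self.pos_embed)
        return self.waypoint_head(x.mean(dim=1))  # pooled tokens -> waypoints

obs = torch.randn(2, 3, 128, 128)
goal = torch.randn(2, 3, 128, 128)
print(EarlyFusionGoalNav()(obs, goal).shape)  # torch.Size([2, 10])
```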
[113] MaskedCLIP: Bridging the Masked and CLIP Space for Semi-Supervised Medical Vision-Language Pre-training
Lei Zhu, Jun Zhou, Rick Siow Mong Goh, Yong Liu
Main category: cs.CV
TL;DR: The paper proposes MaskedCLIP, a semi-supervised vision-language pre-training framework that combines masked image modeling with contrastive language-image pre-training to leverage both paired and unpaired medical image data for learning more generalizable foundation models.
Details
Motivation: Existing foundation models in medical image analysis use either paired image-text data or unpaired image data exclusively, which limits their ability to learn comprehensive image features. There's a need to harness both types of data simultaneously for better foundation model learning.Method: The authors propose MaskedCLIP, which uses a bridge transformer to connect masked feature space with CLIP feature space, addressing incompatible feature spaces from paired and unpaired data. They also introduce masked knowledge distillation loss to transfer semantic knowledge between feature spaces, creating a mutually interactive framework.
Result: Extensive experiments on retinal image analysis demonstrate the effectiveness and data efficiency of the proposed method, showing that the framework successfully leverages both paired and unpaired image data to learn more generalizable features.
Conclusion: MaskedCLIP effectively combines semi-supervised learning with vision-language pre-training to create foundation models that can utilize both paired and unpaired medical image data, resulting in improved generalizability and performance for downstream medical image analysis tasks.
Abstract: Foundation models have recently gained tremendous popularity in medical image analysis. State-of-the-art methods leverage either paired image-text data via vision-language pre-training or unpaired image data via self-supervised pre-training to learn foundation models with generalizable image features to boost downstream task performance. However, learning foundation models exclusively on either paired or unpaired image data limits their ability to learn richer and more comprehensive image features. In this paper, we investigate a novel task termed semi-supervised vision-language pre-training, aiming to fully harness the potential of both paired and unpaired image data for foundation model learning. To this end, we propose MaskedCLIP, a synergistic masked image modeling and contrastive language-image pre-training framework for semi-supervised vision-language pre-training. The key challenge in combining paired and unpaired image data for learning a foundation model lies in the incompatible feature spaces derived from these two types of data. To address this issue, we propose to connect the masked feature space with the CLIP feature space with a bridge transformer. In this way, the more semantic specific CLIP features can benefit from the more general masked features for semantic feature extraction. We further propose a masked knowledge distillation loss to distill semantic knowledge of original image features in CLIP feature space back to the predicted masked image features in masked feature space. With this mutually interactive design, our framework effectively leverages both paired and unpaired image data to learn more generalizable image features for downstream tasks. Extensive experiments on retinal image analysis demonstrate the effectiveness and data efficiency of our method.
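The core mechanism here is the bridge between the masked feature space and the CLIP space plus a masked knowledge distillation loss. Below is a minimal sketch of both; the depths, dimensions, and the cosine-distance form of the distillation loss are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BridgeTransformer(nn.Module):
    """Toy bridge mapping masked-image-model (MIM) features into the CLIP
    feature space; dimensions and depth are placeholders."""

    def __init__(self, mim_dim=768, clip_dim=512, depth=2, heads=8):
        super().__init__()
        self.proj_in = nn.Linear(mim_dim, clip_dim)
        layer = nn.TransformerEncoderLayer(clip_dim, heads, clip_dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, mim_tokens):                     # B x N x mim_dim
        return self.blocks(self.proj_in(mim_tokens))   # B x N x clip_dim

def masked_knowledge_distillation_loss(bridged_masked_tokens, clip_tokens, mask):
    """Distil CLIP-space semantics back onto the predicted masked tokens.
    `mask` is a boolean B x N tensor marking positions that were masked out."""
    pred = F.normalize(bridged_masked_tokens[mask], dim=-1)
    target = F.normalize(clip_tokens[mask], dim=-1).detach()
    return (1.0 - (pred * target).sum(dim=-1)).mean()  # mean cosine distance

# Shape check with random tensors standing in for the two encoders' outputs.
bridge = BridgeTransformer()
mim = torch.randn(4, 196, 768)
clip_feats = torch.randn(4, 196, 512)
mask = torch.rand(4, 196) < 0.6
print(masked_knowledge_distillation_loss(bridge(mim), clip_feats, mask).item())
```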
[114] Perceptual Classifiers: Detecting Generative Images using Perceptual Features
Krishna Srikar Durbha, Asvin Kumar Venkataramanan, Rajesh Sureddi, Alan C. Bovik
Main category: cs.CV
TL;DR: This paper proposes using Image Quality Assessment (IQA) models to detect AI-generated images by leveraging their ability to capture the statistical manifold of real images, achieving state-of-the-art performance with a simple two-layer network.
Details
Motivation: With the rapid advancement of generative models leading to increased "GenAI" content on the internet, there is a critical need for robust detection methods. Existing IQA models effectively capture the manifold of real images in a bandpass statistical space, making them potentially valuable for distinguishing between real and AI-generated content.Method: The authors leverage existing IQA models’ feature representations and train a simple two-layer network on top of these features to classify real versus AI-generated images. They evaluate the generalization capability across different generative models and test robustness against various image degradations.
Result: The proposed approach demonstrates state-of-the-art performance in detecting fake images across different generative models while maintaining significant robustness against image degradations. The simple two-layer network architecture proves effective when built upon IQA model features.
Conclusion: IQA models provide a powerful foundation for GenAI detection by effectively capturing real image statistics. The proposed method achieves superior performance with a lightweight architecture and shows strong generalization across unseen generative models and robustness to image quality degradations.
Abstract: Image Quality Assessment (IQA) models are employed in many practical image and video processing pipelines to reduce storage, minimize transmission costs, and improve the Quality of Experience (QoE) of millions of viewers. These models are sensitive to a diverse range of image distortions and can accurately predict image quality as judged by human viewers. Recent advancements in generative models have resulted in a significant influx of “GenAI” content on the internet. Existing methods for detecting GenAI content have progressed significantly with improved generalization performance on images from unseen generative models. Here, we leverage the capabilities of existing IQA models, which effectively capture the manifold of real images within a bandpass statistical space, to distinguish between real and AI-generated images. We investigate the generalization ability of these perceptual classifiers to the task of GenAI image detection and evaluate their robustness against various image degradations. Our results show that a two-layer network trained on the feature space of IQA models demonstrates state-of-the-art performance in detecting fake images across generative models, while maintaining significant robustness against image degradations.
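Since the detector itself is just a two-layer network over precomputed IQA features, a minimal training step is easy to sketch. The IQA feature extractor is assumed to be an off-the-shelf model, and its 2048-dimensional output here is a placeholder.

```python
import torch
import torch.nn as nn

class PerceptualClassifier(nn.Module):
    """Two-layer head over precomputed IQA feature vectors (real vs. generated).
    The feature dimensionality (2048) is a placeholder assumption."""

    def __init__(self, feat_dim=2048, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 2),   # logits: [real, generated]
        )

    def forward(self, iqa_features):
        return self.net(iqa_features)

# Minimal training step on a random stand-in batch of IQA features.
model = PerceptualClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

features = torch.randn(32, 2048)      # stand-in IQA features
labels = torch.randint(0, 2, (32,))   # 0 = real, 1 = generated
loss = criterion(model(features), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"batch loss: {loss.item():.4f}")
```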
[115] Unsupervised Exposure Correction
Ruodai Cui, Li Niu, Guosheng Hu
Main category: cs.CV
TL;DR: This paper introduces an Unsupervised Exposure Correction (UEC) method that eliminates manual annotation requirements, improves generalizability, and enhances performance in low-level computer vision tasks using an emulated ISP pipeline and a large-scale Radiometry Correction Dataset.
Details
Motivation: Current exposure correction methods face three key challenges: labor-intensive paired data annotation, limited generalizability, and performance degradation in low-level computer vision tasks. These limitations create barriers to practical deployment and effectiveness.Method: The authors develop an unsupervised learning approach using freely available paired data from an emulated Image Signal Processing (ISP) pipeline, eliminating the need for manual annotations. They create a large-scale Radiometry Correction Dataset emphasizing exposure variations and develop a transformation function that preserves image details.
Result: The proposed method outperforms state-of-the-art supervised methods while using only 0.01% of their parameters. It demonstrates improved generalizability due to reduced individual style biases and shows effectiveness in mitigating adverse effects of poor exposure on downstream tasks like edge detection.
Conclusion: The UEC method successfully addresses the three main challenges in exposure correction by providing an unsupervised solution that is parameter-efficient, generalizable, and beneficial for low-level computer vision tasks. The approach offers a practical alternative to expensive manual annotation while maintaining superior performance.
Abstract: Current exposure correction methods have three challenges, labor-intensive paired data annotation, limited generalizability, and performance degradation in low-level computer vision tasks. In this work, we introduce an innovative Unsupervised Exposure Correction (UEC) method that eliminates the need for manual annotations, offers improved generalizability, and enhances performance in low-level downstream tasks. Our model is trained using freely available paired data from an emulated Image Signal Processing (ISP) pipeline. This approach does not need expensive manual annotations, thereby minimizing individual style biases from the annotation and consequently improving its generalizability. Furthermore, we present a large-scale Radiometry Correction Dataset, specifically designed to emphasize exposure variations, to facilitate unsupervised learning. In addition, we develop a transformation function that preserves image details and outperforms state-of-the-art supervised methods [12], while utilizing only 0.01% of their parameters. Our work further investigates the broader impact of exposure correction on downstream tasks, including edge detection, demonstrating its effectiveness in mitigating the adverse effects of poor exposure on low-level features. The source code and dataset are publicly available at https://github.com/BeyondHeaven/uec_code.
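To illustrate how an emulated ISP can produce paired training data without manual annotation, here is a crude stand-in that decodes gamma, scales exposure in linear light, and re-encodes. The paper's actual ISP emulation is more elaborate; the gamma value and EV range below are assumptions.

```python
import numpy as np

def emulate_exposure(img_srgb, ev_shift, gamma=2.2):
    """Render an exposure-shifted version of an sRGB image (values in [0, 1]).

    A crude ISP stand-in: decode gamma, scale linear intensities by 2**ev_shift
    (exposure value), clip to the sensor range, and re-encode."""
    linear = np.power(np.clip(img_srgb, 0.0, 1.0), gamma)
    exposed = np.clip(linear * (2.0 ** ev_shift), 0.0, 1.0)
    return np.power(exposed, 1.0 / gamma)

def make_training_pair(img_srgb, ev_range=(-2.0, 2.0), rng=None):
    """Return (distorted input, well-exposed target) for unsupervised training."""
    rng = rng or np.random.default_rng()
    ev = rng.uniform(*ev_range)
    return emulate_exposure(img_srgb, ev), img_srgb

# Example on a synthetic gradient image.
image = np.tile(np.linspace(0.0, 1.0, 256), (128, 1))[..., None].repeat(3, axis=-1)
degraded, target = make_training_pair(image)
print(degraded.shape, target.shape, degraded.mean(), target.mean())
```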
[116] VisionTrap: Unanswerable Questions On Visual Data
Asir Saadat, Syem Aziz, Shahriar Mahmud, Abdullah Ibne Masud Mahi, Sabbir Ahmed
Main category: cs.CV
TL;DR: This paper introduces VisionTrap, a dataset to test whether Visual Question Answering (VQA) models can recognize when questions are unanswerable and appropriately abstain from responding, rather than generating incorrect answers.
Details
Motivation: While VQA has been extensively studied for answerable questions on real images, there's limited research on how models handle unanswerable questions or when they should abstain from answering. Understanding this capability is crucial for evaluating model reliability and knowledge limitations.Method: The researchers created VisionTrap dataset with three categories of unanswerable questions: (1) hybrid entities combining objects and animals, (2) objects in unconventional/impossible scenarios, and (3) fictional/non-existent figures. They tested VQA models on these logically structured but inherently unanswerable questions to assess abstention behavior.
Result: The study reveals how VQA models perform when faced with unanswerable questions, specifically whether they recognize their knowledge limitations or attempt to generate incorrect responses. The findings demonstrate the models’ tendency to answer even when abstention would be more appropriate.
Conclusion: The research emphasizes the importance of including unanswerable questions in VQA benchmarks to properly evaluate model behavior regarding abstention. This work contributes to understanding VQA model limitations and the need for better evaluation frameworks that test when models should not provide answers.
Abstract: Visual Question Answering (VQA) has been a widely studied topic, with extensive research focusing on how VLMs respond to answerable questions based on real-world images. However, there has been limited exploration of how these models handle unanswerable questions, particularly in cases where they should abstain from providing a response. This research investigates VQA performance on unrealistically generated images or asking unanswerable questions, assessing whether models recognize the limitations of their knowledge or attempt to generate incorrect answers. We introduced a dataset, VisionTrap, comprising three categories of unanswerable questions across diverse image types: (1) hybrid entities that fuse objects and animals, (2) objects depicted in unconventional or impossible scenarios, and (3) fictional or non-existent figures. The questions posed are logically structured yet inherently unanswerable, testing whether models can correctly recognize their limitations. Our findings highlight the importance of incorporating such questions into VQA benchmarks to evaluate whether models tend to answer, even when they should abstain.
[117] PolarAnything: Diffusion-based Polarimetric Image Synthesis
Kailong Zhang, Youwei Lyu, Heng Guo, Si Li, Zhanyu Ma, Boxin Shi
Main category: cs.CV
TL;DR: PolarAnything is a diffusion-based framework that synthesizes photorealistic polarization images from a single RGB input, eliminating the need for extensive 3D assets and enabling broader applications in image enhancement and 3D reconstruction.
Details
Motivation: Polarization images are valuable for image enhancement and 3D reconstruction, but limited accessibility of polarization cameras restricts their application. Existing simulators like Mitsuba require extensive 3D assets and cannot generate large-scale photorealistic images, creating a need for better polarization image synthesis methods.Method: The authors propose PolarAnything, a diffusion-based generative framework that leverages pretrained diffusion models’ zero-shot capabilities. They develop an effective representation strategy that preserves polarization properties while generating images from single RGB inputs without requiring 3D asset collections.
Result: The model successfully generates high-quality polarization images with both photorealism and physical accuracy. Experiments demonstrate that the synthesized images maintain polarization fidelity and can effectively support downstream applications such as shape from polarization tasks.
Conclusion: PolarAnything successfully addresses the limitations of existing polarization simulators by providing a practical solution for synthesizing polarization images from single RGB inputs, making polarization-based applications more accessible without requiring specialized hardware or extensive 3D datasets.
Abstract: Polarization images facilitate image enhancement and 3D reconstruction tasks, but the limited accessibility of polarization cameras hinders their broader application. This gap drives the need for synthesizing photorealistic polarization images. The existing polarization simulator Mitsuba relies on a parametric polarization image formation model and requires extensive 3D assets covering shape and PBR materials, preventing it from generating large-scale photorealistic images. To address this problem, we propose PolarAnything, capable of synthesizing polarization images from a single RGB input with both photorealism and physical accuracy, eliminating the dependency on 3D asset collections. Drawing inspiration from the zero-shot performance of pretrained diffusion models, we introduce a diffusion-based generative framework with an effective representation strategy that preserves the fidelity of polarization properties. Experiments show that our model generates high-quality polarization images and supports downstream tasks like shape from polarization.
[118] Fully Automated SAM for Single-source Domain Generalization in Medical Image Segmentation
Huanli Zhuo, Leilei Ma, Haifeng Zhao, Shiwei Zhou, Dengdi Sun, Yanping Fu
Main category: cs.CV
TL;DR: FA-SAM is a fully automated SAM-based framework for medical image segmentation that generates automatic prompts and mitigates poor prompt impact through uncertainty modeling and embedding fusion, achieving better cross-domain generalization.
Details
Motivation: SAM-based medical image segmentation models face two major challenges: (1) dependency on domain-specific expert-annotated prompts that prevent fully automated segmentation, and (2) poor prompts can mislead SAM into generating incorrect mask results, limiting clinical applications.Method: The paper proposes FA-SAM with two key innovations: (1) Auto-prompted Generation Model (AGM) with Shallow Feature Uncertainty Modeling (SUFM) module to automatically generate bounding box prompts by modeling uncertainty distribution of shallow features, and (2) Image-Prompt Embedding Fusion (IPEF) module integrated into SAM mask decoder to combine multiscale information from image and prompt embeddings.
Result: Extensive experiments on publicly available prostate and fundus vessel datasets validate the effectiveness of FA-SAM in achieving fully automated medical image segmentation with improved cross-domain generalization performance.
Conclusion: FA-SAM successfully addresses the automation and poor prompt challenges in SAM-based medical image segmentation by introducing automatic prompt generation and embedding fusion mechanisms, demonstrating potential for clinical applications with validated effectiveness on medical datasets.
Abstract: Although SAM-based single-source domain generalization models for medical image segmentation can mitigate the impact of domain shift on the model in cross-domain scenarios, these models still face two major challenges. First, the segmentation of SAM is highly dependent on domain-specific expert-annotated prompts, which prevents SAM from achieving fully automated medical image segmentation and therefore limits its application in clinical settings. Second, providing poor prompts (such as bounding boxes that are too small or too large) to the SAM prompt encoder can mislead SAM into generating incorrect mask results. Therefore, we propose the FA-SAM, a single-source domain generalization framework for medical image segmentation that achieves fully automated SAM. FA-SAM introduces two key innovations: an Auto-prompted Generation Model (AGM) branch equipped with a Shallow Feature Uncertainty Modeling (SUFM) module, and an Image-Prompt Embedding Fusion (IPEF) module integrated into the SAM mask decoder. Specifically, AGM models the uncertainty distribution of shallow features through the SUFM module to generate bounding box prompts for the target domain, enabling fully automated segmentation with SAM. The IPEF module integrates multiscale information from SAM image embeddings and prompt embeddings to capture global and local details of the target object, enabling SAM to mitigate the impact of poor prompts. Extensive experiments on publicly available prostate and fundus vessel datasets validate the effectiveness of FA-SAM and highlight its potential to address the above challenges.
[119] PointLAMA: Latent Attention meets Mamba for Efficient Point Cloud Pretraining
Xuanyu Lin, Xiaona Zeng, Xianwei Zheng, Xutao Li
Main category: cs.CV
TL;DR: PointLAMA is a point cloud pretraining framework that combines Mamba’s efficient global modeling with local attention mechanisms and conditional diffusion to achieve competitive performance on 3D tasks with minimal computational cost.
Details
Motivation: Mamba has shown promise for point cloud modeling with linear complexity global sequence modeling, but lacks local inductive bias to capture fine-grained geometric structures in 3D data, limiting its effectiveness for detailed point cloud understanding tasks.Method: The framework consists of three key components: (1) task-aware point cloud serialization using Hilbert/Trans-Hilbert curves and axis-wise sorting for different tasks, (2) a hybrid encoder combining lightweight Latent Attention blocks with Point-wise Multi-head Latent Attention (PMLA) and Mamba blocks for local-global modeling, and (3) conditional diffusion mechanism for enhanced representation learning during pretraining without explicit point-wise reconstruction.
Result: PointLAMA achieves competitive performance on multiple benchmark datasets while maintaining minimal parameter count and FLOPs, demonstrating the effectiveness of combining local attention with Mamba’s global modeling capabilities for efficient point cloud processing.
Conclusion: The proposed PointLAMA framework successfully addresses Mamba’s limitation in local geometric structure modeling by integrating task-aware serialization, hybrid attention-Mamba architecture, and conditional diffusion, resulting in an efficient and effective point cloud pretraining approach.
Abstract: Mamba has recently gained widespread attention as a backbone model for point cloud modeling, leveraging a state-space architecture that enables efficient global sequence modeling with linear complexity. However, its lack of local inductive bias limits its capacity to capture fine-grained geometric structures in 3D data. To address this limitation, we propose \textbf{PointLAMA}, a point cloud pretraining framework that combines task-aware point cloud serialization, a hybrid encoder with integrated Latent Attention and Mamba blocks, and a conditional diffusion mechanism built upon the Mamba backbone. Specifically, the task-aware point cloud serialization employs Hilbert/Trans-Hilbert space-filling curves and axis-wise sorting to structurally align point tokens for classification and segmentation tasks, respectively. Our lightweight Latent Attention block features a Point-wise Multi-head Latent Attention (PMLA) module, which is specifically designed to align with the Mamba architecture by leveraging the shared latent space characteristics of PMLA and Mamba. This enables enhanced local context modeling while preserving overall efficiency. To further enhance representation learning, we incorporate a conditional diffusion mechanism during pretraining, which denoises perturbed feature sequences without relying on explicit point-wise reconstruction. Experimental results demonstrate that PointLAMA achieves competitive performance on multiple benchmark datasets with minimal parameter count and FLOPs, validating its effectiveness for efficient point cloud pretraining.
[120] Learning-based Stage Verification System in Manual Assembly Scenarios
Xingjian Zhang, Yutong Duan, Zaishu Chen
Main category: cs.CV
TL;DR: A novel visual sensor-based monitoring system for Industry 4.0 assembly processes uses multiple machine learning models to achieve 92% accuracy in detecting assembly stages while reducing hardware costs and providing real-time operator guidance.
Details
Motivation: Traditional assembly monitoring methods require multiple sensor types or complex hardware setups that are cost-prohibitive and difficult to implement in dynamic industrial environments. There's a need for effective monitoring solutions that work with minimal visual sensors while maintaining high accuracy.Method: The approach leverages multiple machine learning models that integrate state information from identical timestamps to detect and confirm assembly process stages using only visual sensors. The system provides enhanced error detection and visualization capabilities for real-time operator guidance.
Result: The method achieves an average accuracy exceeding 92% in detecting current assembly stages. It surpasses conventional methods by offering better error detection and visualization while reducing dependency on expensive hardware solutions.
Conclusion: This visual sensor-based approach provides a more practical and cost-effective solution for assembly monitoring in Industry 4.0 environments, improving both accuracy and efficiency while offering real-time actionable guidance to operators without requiring complex hardware setups.
Abstract: In the context of Industry 4.0, effective monitoring of multiple targets and states during assembly processes is crucial, particularly when constrained to using only visual sensors. Traditional methods often rely on either multiple sensor types or complex hardware setups to achieve high accuracy in monitoring, which can be cost-prohibitive and difficult to implement in dynamic industrial environments. This study presents a novel approach that leverages multiple machine learning models to achieve precise monitoring under the limitation of using a minimal number of visual sensors. By integrating state information from identical timestamps, our method detects and confirms the current stage of the assembly process with an average accuracy exceeding 92%. Furthermore, our approach surpasses conventional methods by offering enhanced error detection and visualization capabilities, providing real-time, actionable guidance to operators. This not only improves the accuracy and efficiency of assembly monitoring but also reduces dependency on expensive hardware solutions, making it a more practical choice for modern industrial applications.
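A simple way to picture the fusion of per-timestamp state predictions is a voting rule that confirms a stage only when enough models agree and the stage does not regress. The agreement threshold and monotonicity rule below are illustrative assumptions, not the paper's exact verification logic.

```python
from collections import Counter

def confirm_stage(model_predictions, last_confirmed, min_agreement=2):
    """Fuse stage predictions from several models at the same timestamp.

    A stage is confirmed only when enough models agree and the stage does not
    move backwards (both rules are assumptions for illustration)."""
    stage, votes = Counter(model_predictions).most_common(1)[0]
    if votes >= min_agreement and stage >= last_confirmed:
        return stage
    return last_confirmed

# Simulated stream: three models vote on each frame's assembly stage.
stream = [[0, 0, 0], [0, 1, 0], [1, 1, 0], [1, 1, 2], [2, 2, 2]]
confirmed = 0
for t, preds in enumerate(stream):
    confirmed = confirm_stage(preds, confirmed)
    print(f"t={t}: votes={preds} -> confirmed stage {confirmed}")
```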
[121] CasP: Improving Semi-Dense Feature Matching Pipeline Leveraging Cascaded Correspondence Priors for Guidance
Peiqi Chen, Lei Yu, Yi Wan, Yingying Pei, Xinyi Liu, Yongxiang Yao, Yingying Zhang, Lixiang Ru, Liheng Zhong, Jingdong Chen, Ming Yang, Yongjun Zhang
Main category: cs.CV
TL;DR: CasP proposes a cascaded correspondence pipeline for semi-dense feature matching that decomposes matching into two progressive phases with region-based selective cross-attention, achieving ~2.2× speedup compared to ELoFTR while maintaining superior accuracy and cross-domain generalization.
Details
Motivation: Existing semi-dense feature matching methods rely on global search across entire feature maps to establish coarse matches, which limits improvements in both accuracy and efficiency, creating a need for a more targeted and efficient matching approach.Method: The paper introduces CasP, a cascaded correspondence pipeline that: (1) decomposes matching into two progressive phases, (2) uses region-based selective cross-attention to enhance feature discriminability, (3) restricts search range in the second phase to one-to-many prior areas from the first phase, and (4) incorporates high-level features to reduce computational costs.
Result: CasP achieves ~2.2× speedup at 1152 resolution compared to ELoFTR (the most efficient existing method), with acceleration gains increasing at higher resolutions. The method demonstrates superior performance in geometric estimation and impressive cross-domain generalization capabilities.
Conclusion: CasP successfully addresses the limitations of global search in semi-dense feature matching by introducing a cascaded approach that significantly improves both efficiency and accuracy, making it particularly suitable for latency-sensitive and high-robustness applications like SLAM and UAV systems.
Abstract: Semi-dense feature matching methods have shown strong performance in challenging scenarios. However, the existing pipeline relies on a global search across the entire feature map to establish coarse matches, limiting further improvements in accuracy and efficiency. Motivated by this limitation, we propose a novel pipeline, CasP, which leverages cascaded correspondence priors for guidance. Specifically, the matching stage is decomposed into two progressive phases, bridged by a region-based selective cross-attention mechanism designed to enhance feature discriminability. In the second phase, one-to-one matches are determined by restricting the search range to the one-to-many prior areas identified in the first phase. Additionally, this pipeline benefits from incorporating high-level features, which helps reduce the computational costs of low-level feature extraction. The acceleration gains of CasP increase with higher resolution, and our lite model achieves a speedup of $\sim2.2\times$ at a resolution of 1152 compared to the most efficient method, ELoFTR. Furthermore, extensive experiments demonstrate its superiority in geometric estimation, particularly with impressive cross-domain generalization. These advantages highlight its potential for latency-sensitive and high-robustness applications, such as SLAM and UAV systems. Code is available at https://github.com/pq-chen/CasP.
[122] CartoonAlive: Towards Expressive Live2D Modeling from Single Portraits
Chao He, Jianqiang Ren, Jianjing Xiang, Xiejie Shen
Main category: cs.CV
TL;DR: CartoonAlive is a method that generates high-quality Live2D digital humans from a single portrait image in under 30 seconds, using facial blendshapes and keypoint detection to create expressive and interactive 2D cartoon characters.
Details
Motivation: Current digital human approaches focus mainly on 3D models (complex and costly) or 2D video-based representations (inflexible). Interactive 2D cartoon-style digital humans using Live2D technology have received less attention despite offering a more efficient and expressive alternative that simulates 3D-like motion without traditional 3D modeling.Method: The method leverages shape basis concepts from 3D face modeling to construct facial blendshapes suitable for Live2D, then infers corresponding blendshape weights based on facial keypoints detected from the input portrait image. This enables rapid generation of expressive Live2D models.
Result: CartoonAlive can generate highly expressive and visually accurate Live2D models that closely resemble the input portrait within less than half a minute, providing real-time manipulation capabilities and dynamic interaction.
Conclusion: The work provides a practical and scalable solution for creating interactive 2D cartoon characters, opening new possibilities in digital content creation and virtual character animation by bridging the gap between efficiency and expressiveness in digital human generation.
Abstract: With the rapid advancement of large foundation models, AIGC, cloud rendering, and real-time motion capture technologies, digital humans are now capable of achieving synchronized facial expressions and body movements, engaging in intelligent dialogues driven by natural language, and enabling the fast creation of personalized avatars. While current mainstream approaches to digital humans primarily focus on 3D models and 2D video-based representations, interactive 2D cartoon-style digital humans have received relatively less attention. Compared to 3D digital humans that require complex modeling and high rendering costs, and 2D video-based solutions that lack flexibility and real-time interactivity, 2D cartoon-style Live2D models offer a more efficient and expressive alternative. By simulating 3D-like motion through layered segmentation without the need for traditional 3D modeling, Live2D enables dynamic and real-time manipulation. In this technical report, we present CartoonAlive, an innovative method for generating high-quality Live2D digital humans from a single input portrait image. CartoonAlive leverages the shape basis concept commonly used in 3D face modeling to construct facial blendshapes suitable for Live2D. It then infers the corresponding blendshape weights based on facial keypoints detected from the input image. This approach allows for the rapid generation of a highly expressive and visually accurate Live2D model that closely resembles the input portrait, within less than half a minute. Our work provides a practical and scalable solution for creating interactive 2D cartoon characters, opening new possibilities in digital content creation and virtual character animation. The project homepage is https://human3daigc.github.io/CartoonAlive_webpage/.
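Inferring blendshape weights from detected keypoints can be framed as a small least-squares problem, as in the sketch below. The solver, the clipping to [0, 1], and the toy keypoint/blendshape sizes are assumptions; the paper may use a different parameter-inference procedure.

```python
import numpy as np

def solve_blendshape_weights(detected_kps, neutral_kps, blendshape_deltas):
    """Estimate blendshape weights from detected facial keypoints.

    detected_kps:      (K, 2) keypoints found in the portrait.
    neutral_kps:       (K, 2) keypoints of the neutral Live2D face.
    blendshape_deltas: (B, K, 2) per-blendshape keypoint displacements.

    Solves min_w || neutral + deltas^T w - detected ||^2 with a plain
    least-squares fit and clips weights to [0, 1]."""
    target = (detected_kps - neutral_kps).reshape(-1)                    # (2K,)
    basis = blendshape_deltas.reshape(blendshape_deltas.shape[0], -1).T  # (2K, B)
    weights, *_ = np.linalg.lstsq(basis, target, rcond=None)
    return np.clip(weights, 0.0, 1.0)

# Tiny synthetic example: 4 keypoints, 2 blendshapes (e.g. smile, eye-open).
rng = np.random.default_rng(0)
neutral = rng.normal(size=(4, 2))
deltas = rng.normal(size=(2, 4, 2))
true_w = np.array([0.7, 0.2])
detected = neutral + np.tensordot(true_w, deltas, axes=1)
print(solve_blendshape_weights(detected, neutral, deltas))  # ~[0.7, 0.2]
```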
[123] PARTE: Part-Guided Texturing for 3D Human Reconstruction from a Single Image
Hyeongjin Nam, Donghwan Kim, Gyeongsik Moon, Kyoung Mu Lee
Main category: cs.CV
TL;DR: PARTE is a 3D human reconstruction method that uses human part segmentation to improve texture alignment, addressing the common issue of misaligned textures across different body parts in existing methods.
Details
Motivation: Existing 3D human reconstruction methods suffer from misaligned human textures across different body parts (like jackets and pants blending together), as they don't explicitly exploit part segmentation information that could serve as crucial cues for inferring textures in invisible regions.Method: The framework consists of two core components: (1) PartSegmenter - a 3D part segmentation module that reconstructs textureless human surface and predicts part labels, and (2) PartTexturer - a part-guided texturing module that incorporates part information into texture reconstruction using prior knowledge from pre-trained image generation networks.
Result: Extensive experiments demonstrate that PARTE achieves state-of-the-art quality in 3D human reconstruction with improved texture alignment across human parts.
Conclusion: By explicitly utilizing 3D human part information as guidance, PARTE successfully addresses the texture misalignment problem in 3D human reconstruction, achieving superior reconstruction quality compared to existing methods.
Abstract: The misaligned human texture across different human parts is one of the main limitations of existing 3D human reconstruction methods. Each human part, such as a jacket or pants, should maintain a distinct texture without blending into others. The structural coherence of human parts serves as a crucial cue to infer human textures in the invisible regions of a single image. However, most existing 3D human reconstruction methods do not explicitly exploit such part segmentation priors, leading to misaligned textures in their reconstructions. In this regard, we present PARTE, which utilizes 3D human part information as a key guide to reconstruct 3D human textures. Our framework comprises two core components. First, to infer 3D human part information from a single image, we propose a 3D part segmentation module (PartSegmenter) that initially reconstructs a textureless human surface and predicts human part labels based on the textureless surface. Second, to incorporate part information into texture reconstruction, we introduce a part-guided texturing module (PartTexturer), which acquires prior knowledge from a pre-trained image generation network on texture alignment of human parts. Extensive experiments demonstrate that our framework achieves state-of-the-art quality in 3D human reconstruction. The project page is available at https://hygenie1228.github.io/PARTE/.
[124] Temporal Point-Supervised Signal Reconstruction: A Human-Annotation-Free Framework for Weak Moving Target Detection
Weihua Gao, Chunxu Ren, Wenlong Niu, Xiaodong Peng
Main category: cs.CV
TL;DR: A novel Temporal Point-Supervised framework with TSRNet for detecting weak moving targets in low-altitude surveillance systems without manual annotations, achieving state-of-the-art performance at over 1000 FPS.
Details
Motivation: Detecting weak moving targets in low-altitude surveillance systems is challenging due to low signal energy, small spatial extent, and complex background clutter. Existing methods struggle with extracting robust features and lack reliable annotations for training.Method: Proposed Temporal Point-Supervised (TPS) framework that reformulates detection as pixel-wise temporal signal modeling. Developed Temporal Signal Reconstruction Network (TSRNet) with encoder-decoder architecture, Dynamic Multi-Scale Attention module, and graph-based trajectory mining strategy to suppress false alarms.
Result: Outperforms state-of-the-art methods on purpose-built low-SNR dataset while requiring no human annotations. Achieves strong detection performance with processing speed over 1000 FPS, demonstrating real-time capability.
Conclusion: The TPS framework successfully addresses weak target detection challenges by modeling temporal signals without manual annotations, achieving superior performance and real-time processing speeds suitable for practical surveillance deployment.
Abstract: In low-altitude surveillance and early warning systems, detecting weak moving targets remains a significant challenge due to low signal energy, small spatial extent, and complex background clutter. Existing methods struggle with extracting robust features and suffer from the lack of reliable annotations. To address these limitations, we propose a novel Temporal Point-Supervised (TPS) framework that enables high-performance detection of weak targets without any manual annotations. Instead of conventional frame-based detection, our framework reformulates the task as a pixel-wise temporal signal modeling problem, where weak targets manifest as short-duration pulse-like responses. A Temporal Signal Reconstruction Network (TSRNet) is developed under the TPS paradigm to reconstruct these transient signals. TSRNet adopts an encoder-decoder architecture and integrates a Dynamic Multi-Scale Attention (DMSAttention) module to enhance its sensitivity to diverse temporal patterns. Additionally, a graph-based trajectory mining strategy is employed to suppress false alarms and ensure temporal consistency. Extensive experiments on a purpose-built low-SNR dataset demonstrate that our framework outperforms state-of-the-art methods while requiring no human annotations. It achieves strong detection performance and operates at over 1000 FPS, underscoring its potential for real-time deployment in practical scenarios.
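The pixel-wise temporal reformulation can be illustrated with a tiny 1D encoder-decoder over per-pixel intensity sequences: the network reconstructs the background-like signal, so short pulse-like target responses stand out as residuals. This is a much-simplified stand-in for TSRNet, without its attention module or trajectory mining.

```python
import torch
import torch.nn as nn

class TinyTemporalReconstructor(nn.Module):
    """1D conv encoder-decoder over per-pixel intensity sequences (a simplified
    stand-in for TSRNet; architecture choices here are assumptions)."""

    def __init__(self, hidden=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(hidden, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):            # x: (batch_of_pixels, 1, T)
        return self.decoder(self.encoder(x))

# Per-pixel sequences of length 64; a weak target is a short bump on one pixel.
signals = 0.05 * torch.randn(8, 1, 64)
signals[0, 0, 30:33] += 0.8
model = TinyTemporalReconstructor()
residual = (signals - model(signals)).abs()
print(residual.shape)   # detection would threshold this residual per pixel
```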
[125] DeMo++: Motion Decoupling for Autonomous Driving
Bozhou Zhang, Nan Song, Xiatian Zhu, Li Zhang
Main category: cs.CV
TL;DR: DeMo++ is a motion forecasting and planning framework that decouples motion estimation into holistic intentions and fine spatiotemporal states, using a hybrid Attention-Mamba model to achieve state-of-the-art performance across multiple autonomous driving benchmarks.
Details
Motivation: Current one-query-one-trajectory methods struggle to model intricate spatiotemporal evolution of trajectories, leading to potential collisions or suboptimal outcomes in autonomous driving systems despite producing diverse motion intentions.Method: The paper proposes DeMo++, which decouples motion estimation into two components: (1) holistic motion intentions for capturing diverse movement directions, and (2) fine spatiotemporal states for tracking dynamic progress and enabling self-refinement. It incorporates a cross-scene trajectory interaction mechanism and uses a hybrid Attention-Mamba architecture for efficient scene aggregation and precise trajectory modeling.
Result: DeMo++ achieves state-of-the-art performance across multiple benchmarks including motion forecasting (Argoverse 2 and nuScenes), motion planning (nuPlan), and end-to-end planning (NAVSIM), demonstrating comprehensive modeling of both motion diversity and spatiotemporal evolution.
Conclusion: The decoupled approach successfully addresses limitations of existing methods by comprehensively modeling both motion intention diversity and spatiotemporal trajectory evolution, with the hybrid Attention-Mamba architecture proving effective for autonomous driving applications across various tasks.
Abstract: Motion forecasting and planning are tasked with estimating the trajectories of traffic agents and the ego vehicle, respectively, to ensure the safety and efficiency of autonomous driving systems in dynamically changing environments. State-of-the-art methods typically adopt a one-query-one-trajectory paradigm, where each query corresponds to a unique trajectory for predicting multi-mode trajectories. While this paradigm can produce diverse motion intentions, it often falls short in modeling the intricate spatiotemporal evolution of trajectories, which can lead to collisions or suboptimal outcomes. To overcome this limitation, we propose DeMo++, a framework that decouples motion estimation into two distinct components: holistic motion intentions to capture the diverse potential directions of movement, and fine spatiotemporal states to track the agent’s dynamic progress within the scene and enable a self-refinement capability. Further, we introduce a cross-scene trajectory interaction mechanism to explore the relationships between motions in adjacent scenes. This allows DeMo++ to comprehensively model both the diversity of motion intentions and the spatiotemporal evolution of each trajectory. To effectively implement this framework, we developed a hybrid model combining Attention and Mamba. This architecture leverages the strengths of both mechanisms for efficient scene information aggregation and precise trajectory state sequence modeling. Extensive experiments demonstrate that DeMo++ achieves state-of-the-art performance across various benchmarks, including motion forecasting (Argoverse 2 and nuScenes), motion planning (nuPlan), and end-to-end planning (NAVSIM).
[126] Swin-TUNA : A Novel PEFT Approach for Accurate Food Image Segmentation
Haotian Chen, Zhiyong Xiao
Main category: cs.CV
TL;DR: This paper introduces Swin-TUNA, a parameter-efficient fine-tuning method that achieves high-performance food image segmentation by updating only 4% of parameters while reducing parameter count by 98.7% compared to existing large-scale models like FoodSAM.
Details
Motivation: Existing large-scale Transformer-based models for food image segmentation (like FoodSAM) have massive parameter counts and high computational demands that make them impractical for industrial deployment, creating a need for more efficient alternatives.Method: The paper proposes TUNable Adapter module (Swin-TUNA), a Parameter Efficient Fine-Tuning (PEFT) method that integrates multiscale trainable adapters into Swin Transformer architecture. The core innovation includes hierarchical feature adaptation with separable convolutions, dimensional mappings of varying scales, and a dynamic balancing strategy for task-agnostic and task-specific features.
Result: Swin-TUNA achieves mIoU of 50.56% on FoodSeg103 dataset and 74.94% on UECFoodPix Complete dataset, surpassing the fully parameterized FoodSAM model while using only 8.13M parameters (98.7% reduction). The method also demonstrates faster convergence and stronger generalization in low-data scenarios.
Conclusion: Swin-TUNA provides an efficient solution for lightweight food image segmentation that maintains high performance while dramatically reducing computational requirements, making it suitable for practical industrial deployment applications.
Abstract: In the field of food image processing, efficient semantic segmentation techniques are crucial for industrial applications. However, existing large-scale Transformer-based models (such as FoodSAM) face challenges in meeting practical deployment requirements due to their massive parameter counts and high computational resource demands. This paper introduces TUNable Adapter module (Swin-TUNA), a Parameter Efficient Fine-Tuning (PEFT) method that integrates multiscale trainable adapters into the Swin Transformer architecture, achieving high-performance food image segmentation by updating only 4% of the parameters. The core innovation of Swin-TUNA lies in its hierarchical feature adaptation mechanism: it designs separable convolutions in depth and dimensional mappings of varying scales to address the differences in features between shallow and deep networks, combined with a dynamic balancing strategy for task-agnostic and task-specific features. Experiments demonstrate that this method achieves mIoU of 50.56% and 74.94% on the FoodSeg103 and UECFoodPix Complete datasets, respectively, surpassing the fully parameterized FoodSAM model while reducing the parameter count by 98.7% (to only 8.13M). Furthermore, Swin-TUNA exhibits faster convergence and stronger generalization capabilities in low-data scenarios, providing an efficient solution for lightweight food image segmentation.
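The PEFT idea behind Swin-TUNA can be sketched as a small bottleneck adapter with a depthwise convolution, trained while the backbone stays frozen. The exact placement inside Swin blocks and the dimensions below are assumptions, not the authors' module definition.

```python
import torch
import torch.nn as nn

class TunableAdapter(nn.Module):
    """Bottleneck adapter with a depthwise convolution, loosely following the
    description of multiscale trainable adapters (dimensions are assumptions)."""

    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.dwconv = nn.Conv2d(bottleneck, bottleneck, 3, padding=1, groups=bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, tokens, h, w):                 # tokens: B x (h*w) x dim
        b = tokens.shape[0]
        x = self.act(self.down(tokens))              # B x N x bottleneck
        x = x.transpose(1, 2).reshape(b, -1, h, w)   # to B x C x H x W
        x = self.dwconv(x).flatten(2).transpose(1, 2)
        return tokens + self.up(self.act(x))         # residual update

def freeze_backbone_train_adapters(model):
    """Parameter-efficient setup: freeze everything except adapter weights."""
    for name, p in model.named_parameters():
        p.requires_grad = "adapter" in name

adapter = TunableAdapter(dim=96)
out = adapter(torch.randn(2, 56 * 56, 96), h=56, w=56)
print(out.shape)  # torch.Size([2, 3136, 96])
```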
[127] Exploring Active Learning for Label-Efficient Training of Semantic Neural Radiance Field
Yuzhe Zhu, Lile Cai, Kangkang Lu, Fayao Liu, Xulei Yang
Main category: cs.CV
TL;DR: This paper proposes active learning strategies to reduce annotation costs for training semantically-aware Neural Radiance Fields (NeRFs), achieving more than 2X reduction in pixel-level labeling requirements compared to random sampling.
Details
Motivation: Training semantically-aware NeRFs requires expensive pixel-level class labels for semantic scene understanding. The high annotation burden limits the practical deployment of these models, necessitating methods to reduce labeling costs while maintaining performance.Method: The authors investigate active learning approaches for semantically-aware NeRF training, exploring different selection granularities and strategies. They propose a novel active learning strategy that incorporates 3D geometric constraints in the sample selection process to optimize annotation efficiency.
Result: The proposed active learning approach achieves more than 2X reduction in annotation cost compared to random sampling while maintaining comparable performance in training semantically-aware NeRFs. The method effectively reduces the pixel-level labeling burden.
Conclusion: Active learning is a viable solution for reducing annotation costs in semantically-aware NeRF training. The incorporation of 3D geometric constraints in sample selection significantly improves annotation efficiency, making semantic NeRF training more practical and cost-effective.
Abstract: Neural Radiance Field (NeRF) models are implicit neural scene representation methods that offer unprecedented capabilities in novel view synthesis. Semantically-aware NeRFs not only capture the shape and radiance of a scene, but also encode semantic information of the scene. The training of semantically-aware NeRFs typically requires pixel-level class labels, which can be prohibitively expensive to collect. In this work, we explore active learning as a potential solution to alleviate the annotation burden. We investigate various design choices for active learning of semantically-aware NeRF, including selection granularity and selection strategies. We further propose a novel active learning strategy that takes into account 3D geometric constraints in sample selection. Our experiments demonstrate that active learning can effectively reduce the annotation cost of training semantically-aware NeRF, achieving more than 2X reduction in annotation cost compared to random sampling.
[128] Exploring Active Learning for Semiconductor Defect Segmentation
Lile Cai, Ramanpreet Singh Pahwa, Xun Xu, Jie Wang, Richard Chang, Lining Zhang, Chuan-Sheng Foo
Main category: cs.CV
TL;DR: This paper proposes an active learning approach for semiconductor X-Ray microscopy defect detection that addresses domain shift and class imbalance through contrastive pretraining and rareness-aware sample selection, achieving state-of-the-art performance while reducing annotation requirements.
Details
Motivation: Deep learning models for semiconductor XRM defect detection require large amounts of annotated data which is time-consuming and expensive to obtain, especially for dense prediction tasks like semantic segmentation. The authors aim to reduce annotation burden through active learning.Method: The approach combines contrastive pretraining on unlabelled data to initialize weights for each active learning cycle, with a rareness-aware acquisition function that prioritizes selecting samples containing rare classes to address severe class imbalance in semiconductor XRM data.
Result: The method achieves state-of-the-art performance on a semiconductor dataset compiled from XRM scans of high bandwidth memory structures, demonstrating effective handling of both large domain shift and severe class imbalance challenges.
Conclusion: Active learning with contrastive pretraining and rareness-aware sample selection successfully reduces annotation requirements for semiconductor XRM defect detection while maintaining high performance, providing a practical solution for industrial semiconductor inspection applications.
Abstract: The development of X-Ray microscopy (XRM) technology has enabled non-destructive inspection of semiconductor structures for defect identification. Deep learning is widely used as the state-of-the-art approach to perform visual analysis tasks. However, deep learning based models require large amount of annotated data to train. This can be time-consuming and expensive to obtain especially for dense prediction tasks like semantic segmentation. In this work, we explore active learning (AL) as a potential solution to alleviate the annotation burden. We identify two unique challenges when applying AL on semiconductor XRM scans: large domain shift and severe class-imbalance. To address these challenges, we propose to perform contrastive pretraining on the unlabelled data to obtain the initialization weights for each AL cycle, and a rareness-aware acquisition function that favors the selection of samples containing rare classes. We evaluate our method on a semiconductor dataset that is compiled from XRM scans of high bandwidth memory structures composed of logic and memory dies, and demonstrate that our method achieves state-of-the-art performance.
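A rareness-aware acquisition function can be pictured as scoring each unlabelled image by a mix of predictive uncertainty and how much of it the current model believes belongs to rare classes. The specific combination and weight below are illustrative; the paper defines its own acquisition function.

```python
import numpy as np

def rareness_aware_scores(prob_maps, rare_classes, alpha=0.5):
    """Score unlabelled images for annotation.

    prob_maps: (N, C, H, W) softmax outputs of the current segmentation model.
    rare_classes: indices of classes that are under-represented so far.
    Combines mean pixel entropy with predicted rare-class coverage (the mix
    and the weight `alpha` are assumptions for illustration)."""
    eps = 1e-8
    entropy = -(prob_maps * np.log(prob_maps + eps)).sum(axis=1).mean(axis=(1, 2))
    pred = prob_maps.argmax(axis=1)                                  # (N, H, W)
    rare_fraction = np.isin(pred, rare_classes).mean(axis=(1, 2))    # (N,)
    return alpha * entropy + (1.0 - alpha) * rare_fraction

# Select the top-k images to send for annotation in this AL cycle.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(5), size=(10, 32, 32)).transpose(0, 3, 1, 2)
scores = rareness_aware_scores(probs, rare_classes=[3, 4])
print(np.argsort(scores)[::-1][:3])   # indices of the 3 highest-scoring images
```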
[129] Exploring Spatial Diversity for Region-based Active Learning
Lile Cai, Xun Xu, Lining Zhang, Chuan-Sheng Foo
Main category: cs.CV
TL;DR: This paper proposes a region-based active learning framework for semantic segmentation that incorporates spatial diversity alongside traditional uncertainty measures to reduce annotation costs while maintaining high performance.
Details
Motivation: Large-scale labeled datasets for semantic segmentation require expensive pixel-level annotations. The paper aims to reduce annotation costs while maintaining performance by strategically selecting informative image regions for labeling instead of entire images.Method: The authors propose a unified optimization framework that combines spatial diversity with traditional active selection criteria (uncertainty and feature diversity) for region-based active learning. The method enforces local spatial diversity when selecting batches of informative image regions for annotation.
Result: The framework achieves 95% performance of fully supervised methods using only 5-9% of labeled pixels on Cityscapes and PASCAL VOC datasets. It outperforms all state-of-the-art region-based active learning methods for semantic segmentation.
Conclusion: Incorporating spatial diversity into region-based active learning significantly improves the effectiveness of uncertainty-based and feature diversity-based methods, enabling substantial reduction in annotation costs while maintaining competitive performance for semantic segmentation tasks.
Abstract: State-of-the-art methods for semantic segmentation are based on deep neural networks trained on large-scale labeled datasets. Acquiring such datasets would incur large annotation costs, especially for dense pixel-level prediction tasks like semantic segmentation. We consider region-based active learning as a strategy to reduce annotation costs while maintaining high performance. In this setting, batches of informative image regions instead of entire images are selected for labeling. Importantly, we propose that enforcing local spatial diversity is beneficial for active learning in this case, and to incorporate spatial diversity along with the traditional active selection criterion, e.g., data sample uncertainty, in a unified optimization framework for region-based active learning. We apply this framework to the Cityscapes and PASCAL VOC datasets and demonstrate that the inclusion of spatial diversity effectively improves the performance of uncertainty-based and feature diversity-based active learning methods. Our framework achieves 95% performance of fully supervised methods with only 5-9% of the labeled pixels, outperforming all state-of-the-art region-based active learning methods for semantic segmentation.
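The spatial-diversity idea can be illustrated with a greedy selection rule that takes regions in decreasing informativeness order while skipping regions too close to already-selected regions from the same image. The paper formulates this as a unified optimization problem; this greedy rule and the distance threshold are simplifications.

```python
import numpy as np

def select_regions(uncertainty, centers, image_ids, budget, min_dist=96.0):
    """Greedy region selection with a local spatial-diversity constraint.

    uncertainty: (R,) per-region informativeness scores.
    centers:     (R, 2) region centre coordinates in pixels.
    image_ids:   (R,) which image each region comes from."""
    selected = []
    for idx in np.argsort(uncertainty)[::-1]:
        ok = all(image_ids[idx] != image_ids[j]
                 or np.linalg.norm(centers[idx] - centers[j]) >= min_dist
                 for j in selected)
        if ok:
            selected.append(int(idx))
        if len(selected) == budget:
            break
    return selected

rng = np.random.default_rng(1)
print(select_regions(rng.random(50), rng.random((50, 2)) * 512,
                     rng.integers(0, 5, 50), budget=8))
```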
[130] SFUOD: Source-Free Unknown Object Detection
Keon-Hee Park, Seun-An Choe, Gyeong-Moon Park
Main category: cs.CV
TL;DR: This paper introduces Source-Free Unknown Object Detection (SFUOD), which enables object detectors to adapt to new domains without source data access while detecting both known and unknown objects. The proposed CollaPAUL framework uses collaborative tuning and principal axis-based labeling to achieve state-of-the-art performance.
Details
Motivation: Existing source-free object detection methods operate under a restrictive closed-set assumption where only pre-defined objects from the source domain can be detected in the target domain. This prevents detectors from identifying undefined objects that may exist in real-world scenarios, limiting their practical applicability.Method: The paper proposes CollaPAUL (Collaborative tuning and Principal Axis-based Unknown Labeling) framework with two key components: (1) Collaborative tuning that integrates target-dependent knowledge from an auxiliary encoder with source-dependent knowledge from the pre-trained detector using cross-domain attention mechanism, and (2) Principal axes-based unknown labeling that assigns pseudo-labels to unknown objects by estimating objectness through principal axes projection and confidence scores.
Result: CollaPAUL achieves state-of-the-art performance on SFUOD benchmarks, demonstrating its effectiveness in detecting both known and unknown objects in target domains without access to source data during adaptation.
Conclusion: The proposed SFUOD scenario and CollaPAUL framework successfully address the limitations of closed-set assumptions in source-free domain adaptation, enabling practical object detection systems that can handle unknown objects while maintaining strong performance on known objects.
Abstract: Source-free object detection adapts a detector pre-trained on a source domain to an unlabeled target domain without requiring access to labeled source data. While this setting is practical as it eliminates the need for the source dataset during domain adaptation, it operates under the restrictive assumption that only pre-defined objects from the source domain exist in the target domain. This closed-set setting prevents the detector from detecting undefined objects. To ease this assumption, we propose Source-Free Unknown Object Detection (SFUOD), a novel scenario which enables the detector to not only recognize known objects but also detect undefined objects as unknown objects. To this end, we propose CollaPAUL (Collaborative tuning and Principal Axis-based Unknown Labeling), a novel framework for SFUOD. Collaborative tuning enhances knowledge adaptation by integrating target-dependent knowledge from the auxiliary encoder with source-dependent knowledge from the pre-trained detector through a cross-domain attention mechanism. Additionally, principal axes-based unknown labeling assigns pseudo-labels to unknown objects by estimating objectness via principal axes projection and confidence scores from model predictions. The proposed CollaPAUL achieves state-of-the-art performances on SFUOD benchmarks, and extensive experiments validate its effectiveness.
[131] A Conditional Probability Framework for Compositional Zero-shot Learning
Peng Wu, Qiuxia Lai, Hao Fang, Guo-Sen Xie, Yilong Yin, Xiankai Lu, Wenguan Wang
Main category: cs.CV
TL;DR: This paper proposes a Conditional Probability Framework (CPF) for Compositional Zero-Shot Learning that models attribute-object dependencies by decomposing composition probability into object likelihood and conditional attribute likelihood, achieving better generalization to unseen attribute-object combinations.
Details
Motivation: Traditional CZSL approaches treat attributes and objects as independent entities, overlooking semantic constraints and contextual dependencies within compositions. For example, certain attributes naturally pair with specific objects, and the same attribute can manifest differently in different contexts, making attribute-object interdependence a fundamental challenge in CZSL.
Method: The authors adopt a Conditional Probability Framework (CPF) that decomposes composition probability into two components: object likelihood and conditional attribute likelihood. They incorporate textual descriptors to enhance object feature learning by highlighting semantically relevant image regions, then use enhanced object features to guide attribute learning through a cross-attention mechanism for better contextual alignment.
Result: Extensive experiments on multiple CZSL benchmarks demonstrate the superiority of the proposed approach, showing that explicitly modeling attribute-object dependencies leads to better generalization to unseen compositions compared to traditional methods.
Conclusion: By jointly optimizing object likelihood and conditional attribute likelihood, the method effectively captures compositional dependencies and generalizes well to unseen compositions, addressing the long-ignored challenge of attribute-object interdependence in CZSL.
Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize unseen combinations of known objects and attributes by leveraging knowledge from previously seen compositions. Traditional approaches primarily focus on disentangling attributes and objects, treating them as independent entities during learning. However, this assumption overlooks the semantic constraints and contextual dependencies inside a composition. For example, certain attributes naturally pair with specific objects (e.g., “striped” applies to “zebra” or “shirts” but not “sky” or “water”), while the same attribute can manifest differently depending on context (e.g., “young” in “young tree” vs. “young dog”). Thus, capturing attribute-object interdependence remains a fundamental yet long-ignored challenge in CZSL. In this paper, we adopt a Conditional Probability Framework (CPF) to explicitly model attribute-object dependencies. We decompose the probability of a composition into two components: the likelihood of an object and the conditional likelihood of its attribute. To enhance object feature learning, we incorporate textual descriptors to highlight semantically relevant image regions. These enhanced object features then guide attribute learning through a cross-attention mechanism, ensuring better contextual alignment. By jointly optimizing object likelihood and conditional attribute likelihood, our method effectively captures compositional dependencies and generalizes well to unseen compositions. Extensive experiments on multiple CZSL benchmarks demonstrate the superiority of our approach. Code is available here.
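The conditional decomposition at the core of CPF, as described in the abstract, can be written compactly; the notation below (image $x$, object $o$, attribute $a$) is ours, not the paper's.

```latex
P(a, o \mid x) \;=\; P(o \mid x)\, P(a \mid o, x),
\qquad
(\hat{a}, \hat{o}) \;=\; \arg\max_{(a,\,o)} \Big[ \log P(o \mid x) + \log P(a \mid o, x) \Big]
```

Both terms are optimized jointly, so attribute predictions stay conditioned on object context rather than being scored independently of the object.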
[132] HiProbe-VAD: Video Anomaly Detection via Hidden States Probing in Tuning-Free Multimodal LLMs
Zhaolin Cai, Fan Li, Ziwei Zheng, Yanjun Qin
Main category: cs.CV
TL;DR: HiProbe-VAD is a novel training-free framework that uses pre-trained Multimodal Large Language Models (MLLMs) for video anomaly detection by extracting informative hidden states from intermediate layers rather than output layers, achieving better performance than existing methods while requiring no fine-tuning.
Details
Motivation: Traditional Video Anomaly Detection methods face significant challenges including substantial computational demands and heavy reliance on extensive labeled datasets, which restricts their practical applicability. There is a need for more efficient and scalable solutions that can work without requiring fine-tuning or large training datasets.
Method: The paper proposes the HiProbe-VAD framework with a Dynamic Layer Saliency Probing (DLSP) mechanism that intelligently identifies and extracts the most informative hidden states from optimal intermediate layers of pre-trained MLLMs during reasoning. The framework includes a lightweight anomaly scorer and a temporal localization module that process these extracted hidden states to detect anomalies and generate explanations.
Result: HiProbe-VAD outperforms existing training-free methods and most traditional approaches on UCF-Crime and XD-Violence datasets. The framework demonstrates remarkable cross-model generalization capabilities across different MLLMs without any tuning, showing that intermediate hidden states exhibit higher sensitivity and linear separability for anomalies compared to output layers.
Conclusion: The research successfully unlocks the potential of pre-trained MLLMs for video anomaly detection by leveraging intermediate hidden states rather than output layers. This approach paves the way for more practical and scalable VAD solutions that don’t require fine-tuning, demonstrating strong generalization across different models and datasets.
Abstract: Video Anomaly Detection (VAD) aims to identify and locate deviations from normal patterns in video sequences. Traditional methods often struggle with substantial computational demands and a reliance on extensive labeled datasets, thereby restricting their practical applicability. To address these constraints, we propose HiProbe-VAD, a novel framework that leverages pre-trained Multimodal Large Language Models (MLLMs) for VAD without requiring fine-tuning. In this paper, we discover that the intermediate hidden states of MLLMs contain information-rich representations, exhibiting higher sensitivity and linear separability for anomalies compared to the output layer. To capitalize on this, we propose a Dynamic Layer Saliency Probing (DLSP) mechanism that intelligently identifies and extracts the most informative hidden states from the optimal intermediate layer during the MLLM's reasoning. A lightweight anomaly scorer and a temporal localization module then efficiently detect anomalies using these extracted hidden states and finally generate explanations. Experiments on the UCF-Crime and XD-Violence datasets demonstrate that HiProbe-VAD outperforms existing training-free and most traditional approaches. Furthermore, our framework exhibits remarkable cross-model generalization capabilities in different MLLMs without any tuning, unlocking the potential of pre-trained MLLMs for video anomaly detection and paving the way for more practical and scalable solutions.
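A rough sketch of the hidden-state probing idea, assuming an HF-style multimodal model that exposes `hidden_states` when called with `output_hidden_states=True`; the mean pooling, the sigmoid scorer, and the interface are illustrative assumptions rather than HiProbe-VAD's actual modules.

```python
import torch

@torch.no_grad()
def score_frames(mllm, frames, layer_idx, probe):
    """Probe intermediate hidden states of a frozen MLLM for anomaly scoring.

    mllm      : any HF-style model whose forward accepts pixel inputs and
                returns `hidden_states` when `output_hidden_states=True` (assumption)
    layer_idx : index of the intermediate layer selected by saliency probing
    probe     : lightweight scorer, e.g. a single nn.Linear(dim, 1)
    """
    out = mllm(pixel_values=frames, output_hidden_states=True)
    h = out.hidden_states[layer_idx]            # (batch, tokens, dim)
    h = h.mean(dim=1)                           # pool over tokens
    return torch.sigmoid(probe(h)).squeeze(-1)  # per-frame anomaly score in [0, 1]
```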
[133] Content-based 3D Image Retrieval and a ColBERT-inspired Re-ranking for Tumor Flagging and Staging
Farnaz Khun Jush, Steffen Vogler, Matthias Lenga
Main category: cs.CV
TL;DR: This study introduces C-MIR, a novel content-based image retrieval system for 3D medical images that adapts ColBERT’s contextualized late interaction mechanism to enable efficient tumor detection and staging without requiring pre-segmented data, demonstrating improved performance particularly for colon and lung tumors.
Details
Motivation: The increasing volume of medical images creates challenges for radiologists in retrieving relevant cases. Existing content-based image retrieval (CBIR) systems lack standardized evaluation and comprehensive studies, and current approaches rely on pre-segmented data and organ-specific datasets, which don't align with real-world clinical practice using large, unstructured image archiving systems like PACS.
Method: The researchers developed C-MIR, a volumetric re-ranking method that adapts ColBERT’s contextualized late interaction mechanism for 3D medical imaging. They created a framework that eliminates reliance on pre-segmented data and organ-specific datasets, making it compatible with clinical PACS systems. The method enables context-aware re-ranking and effective localization of regions of interest without pre-segmentation.
Result: C-MIR demonstrated significant advantages across evaluations using four tumor sites, three feature extractors, and three database configurations. The system showed promising improvements in tumor flagging, particularly for colon and lung tumors (p<0.05). C-MIR also showed potential for improving tumor staging and successfully adapted the late interaction principle to volumetric medical images.
Conclusion: C-MIR successfully bridges the gap between advanced retrieval techniques and practical healthcare applications. The system offers a computationally efficient alternative to expensive data enrichment approaches by eliminating the need for pre-segmentation while maintaining effective tumor detection and staging capabilities, paving the way for improved diagnostic processes in clinical practice.
Abstract: The increasing volume of medical images poses challenges for radiologists in retrieving relevant cases. Content-based image retrieval (CBIR) systems offer potential for efficient access to similar cases, yet lack standardized evaluation and comprehensive studies. Building on prior studies for tumor characterization via CBIR, this study advances CBIR research for volumetric medical images through three key contributions: (1) a framework eliminating reliance on pre-segmented data and organ-specific datasets, aligning with large and unstructured image archiving systems, i.e. PACS in clinical practice; (2) introduction of C-MIR, a novel volumetric re-ranking method adapting ColBERT’s contextualized late interaction mechanism for 3D medical imaging; (3) comprehensive evaluation across four tumor sites using three feature extractors and three database configurations. Our evaluations highlight the significant advantages of C-MIR. We demonstrate the successful adaptation of the late interaction principle to volumetric medical images, enabling effective context-aware re-ranking. A key finding is C-MIR’s ability to effectively localize the region of interest, eliminating the need for pre-segmentation of datasets and offering a computationally efficient alternative to systems relying on expensive data enrichment steps. C-MIR demonstrates promising improvements in tumor flagging, achieving improved performance, particularly for colon and lung tumors (p<0.05). C-MIR also shows potential for improving tumor staging, warranting further exploration of its capabilities. Ultimately, our work seeks to bridge the gap between advanced retrieval techniques and their practical applications in healthcare, paving the way for improved diagnostic processes.
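A minimal sketch of the ColBERT-style late-interaction (MaxSim) scoring that C-MIR adapts to volumes; treating per-slice or per-patch embeddings as the "tokens" of a volume is our assumption for illustration.

```python
import torch

def late_interaction_score(query_tokens, doc_tokens):
    """ColBERT-style MaxSim: for each query token embedding, take its best
    cosine match among candidate-volume token embeddings and sum the maxima.

    query_tokens : (Nq, d) embeddings of the query volume (e.g. per slice/patch)
    doc_tokens   : (Nd, d) embeddings of a candidate volume from the database
    """
    q = torch.nn.functional.normalize(query_tokens, dim=-1)
    d = torch.nn.functional.normalize(doc_tokens, dim=-1)
    sim = q @ d.T                       # (Nq, Nd) cosine similarities
    return sim.max(dim=1).values.sum()  # MaxSim aggregation over query tokens
```

Candidates retrieved by the first-stage search can then be re-ranked by this score, which is what makes the re-ranking context-aware without requiring segmentation.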
[134] Physics-based Human Pose Estimation from a Single Moving RGB Camera
Ayce Idil Aytekin, Chuqiao Li, Diogo Luvizon, Rishabh Dabral, Martin Oswald, Marc Habermann, Christian Theobalt
Main category: cs.CV
TL;DR: This paper introduces MoviCam, the first dataset with ground-truth camera trajectories and 3D human motion in dynamic scenes, and proposes PhysDynPose, a physics-based method that improves human pose tracking in moving camera and non-flat environments by incorporating scene geometry and physical constraints.
Details
Motivation: Existing monocular and physics-based human pose tracking methods suffer from artifacts when dealing with non-flat ground planes or moving cameras, and are typically evaluated on datasets that fail to model real-world conditions like dynamic camera motion and complex scene geometry.
Method: PhysDynPose combines a state-of-the-art kinematics estimator for human pose estimation with a robust SLAM method to capture dynamic camera trajectories, enabling recovery of human pose in world coordinates. The kinematic pose estimates are then refined using a scene-aware physics optimizer that incorporates scene geometry and physical constraints.
Result: The new MoviCam benchmark reveals that state-of-the-art methods struggle with moving cameras and non-planar environments, while PhysDynPose robustly estimates both human and camera poses in world coordinates under these challenging conditions.
Conclusion: The paper successfully addresses limitations of existing human pose tracking methods by introducing a comprehensive dataset and a physics-based approach that handles dynamic camera motion and complex scene geometry, demonstrating improved robustness compared to existing state-of-the-art methods.
Abstract: Most monocular and physics-based human pose tracking methods, while achieving state-of-the-art results, suffer from artifacts when the scene does not have a strictly flat ground plane or when the camera is moving. Moreover, these methods are often evaluated on in-the-wild real world videos without ground-truth data or on synthetic datasets, which fail to model the real world light transport, camera motion, and pose-induced appearance and geometry changes. To tackle these two problems, we introduce MoviCam, the first non-synthetic dataset containing ground-truth camera trajectories of a dynamically moving monocular RGB camera, scene geometry, and 3D human motion with human-scene contact labels. Additionally, we propose PhysDynPose, a physics-based method that incorporates scene geometry and physical constraints for more accurate human motion tracking in case of camera motion and non-flat scenes. More precisely, we use a state-of-the-art kinematics estimator to obtain the human pose and a robust SLAM method to capture the dynamic camera trajectory, enabling the recovery of the human pose in the world frame. We then refine the kinematic pose estimate using our scene-aware physics optimizer. From our new benchmark, we found that even state-of-the-art methods struggle with this inherently challenging setting, i.e. a moving camera and non-planar environments, while our method robustly estimates both human and camera poses in world coordinates.
[135] CAPRI-CT: Causal Analysis and Predictive Reasoning for Image Quality Optimization in Computed Tomography
Sneha George Gnanakalavathy, Hairil Abdul Razak, Robert Meertens, Jonathan E. Fieldsend, Xujiong Ye, Mohammed M. Abdelsamea
Main category: cs.CV
TL;DR: CAPRI-CT is a causal-aware deep learning framework that uses VAEs and imaging metadata to predict CT image quality (SNR) and enable counterfactual simulations for optimizing scan protocols without repeated physical scans.
Details
Motivation: The key clinical challenge in CT imaging is achieving high image quality while minimizing radiation exposure to patients. Current approaches require repeated physical scans to optimize protocols, which is inefficient and increases radiation exposure.
Method: CAPRI-CT integrates CT images with acquisition metadata (tube voltage, tube current, contrast agent types) using an ensemble of Variational Autoencoders (VAEs) to extract features and model causal relationships. The framework employs ensemble learning and feature fusion to predict Signal-to-Noise Ratio (SNR) and support counterfactual inference for what-if simulations.
Result: CAPRI-CT achieved strong predictive performance in modeling image quality relationships and successfully enabled counterfactual simulations for different contrast agents and scan parameters, providing actionable insights for protocol optimization.
Conclusion: CAPRI-CT provides a practical solution for radiologists and technicians to design more efficient CT protocols by combining prediction and interpretability capabilities, potentially reducing the need for repeated physical scans while maintaining or improving image quality.
Abstract: In computed tomography (CT), achieving high image quality while minimizing radiation exposure remains a key clinical challenge. This paper presents CAPRI-CT, a novel causal-aware deep learning framework for Causal Analysis and Predictive Reasoning for Image Quality Optimization in CT imaging. CAPRI-CT integrates image data with acquisition metadata (such as tube voltage, tube current, and contrast agent types) to model the underlying causal relationships that influence image quality. An ensemble of Variational Autoencoders (VAEs) is employed to extract meaningful features and generate causal representations from observational data, including CT images and associated imaging parameters. These input features are fused to predict the Signal-to-Noise Ratio (SNR) and support counterfactual inference, enabling what-if simulations, such as changes in contrast agents (types and concentrations) or scan parameters. CAPRI-CT is trained and validated using an ensemble learning approach, achieving strong predictive performance. By facilitating both prediction and interpretability, CAPRI-CT provides actionable insights that could help radiologists and technicians design more efficient CT protocols without repeated physical scans. The source code and dataset are publicly available at https://github.com/SnehaGeorge22/capri-ct.
[136] Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection
Yehao Lu, Minghe Weng, Zekang Xiao, Rui Jiang, Wei Su, Guangcong Zheng, Ping Lu, Xi Li
Main category: cs.CV
TL;DR: This paper introduces Dynamic-DINO, which applies Mixture of Experts (MoE) architecture to real-time open-vocabulary object detection by extending Grounding DINO 1.5 Edge with dynamic inference and efficient MoE-Tuning, achieving better performance with less training data.
Details
Motivation: While the MoE architecture has shown success in Large Vision-Language Models, its potential in real-time open-vocabulary object detectors remains unexplored. The authors aim to investigate how MoE can benefit smaller models that also leverage large-scale vision-language datasets for object detection tasks.
Method: The authors propose Dynamic-DINO with three key components: (1) an MoE-Tuning strategy to convert dense models to dynamic inference frameworks, (2) a granularity decomposition mechanism that breaks down Feed-Forward Networks into multiple smaller expert networks, and (3) a pre-trained weight allocation strategy with specific router initialization to prevent performance degradation during fine-tuning.
Result: Dynamic-DINO outperforms Grounding DINO 1.5 Edge despite being pretrained with only 1.56M open-source data compared to the baseline’s private Grounding20M dataset. The study reveals that experts cooperate diversely in shallow layers but form fixed collaborative structures (2-3 partners) in deeper layers, with different expert combinations specializing in specific patterns.
Conclusion: MoE architecture can effectively enhance real-time open-vocabulary object detection performance. The proposed Dynamic-DINO demonstrates that strategic application of MoE with proper initialization and decomposition mechanisms can achieve superior results with significantly less training data, making it more accessible and efficient for practical applications.
Abstract: The Mixture of Experts (MoE) architecture has excelled in Large Vision-Language Models (LVLMs), yet its potential in real-time open-vocabulary object detectors, which also leverage large-scale vision-language datasets but smaller models, remains unexplored. This work investigates this domain, revealing intriguing insights. In the shallow layers, experts tend to cooperate with diverse peers to expand the search space, while in the deeper layers, fixed collaborative structures emerge, where each expert maintains 2-3 fixed partners and distinct expert combinations are specialized in processing specific patterns. Concretely, we propose Dynamic-DINO, which extends Grounding DINO 1.5 Edge from a dense model to a dynamic inference framework via an efficient MoE-Tuning strategy. Additionally, we design a granularity decomposition mechanism to decompose the Feed-Forward Network (FFN) of the base model into multiple smaller expert networks, expanding the subnet search space. To prevent performance degradation at the start of fine-tuning, we further propose a pre-trained weight allocation strategy for the experts, coupled with a specific router initialization. During inference, only the input-relevant experts are activated to form a compact subnet. Experiments show that, pretrained with merely 1.56M open-source data, Dynamic-DINO outperforms Grounding DINO 1.5 Edge, pretrained on the private Grounding20M dataset.
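A toy sketch of the kind of granularity decomposition and top-k routing described above: one dense FFN is replaced by several narrower expert FFNs behind a router. The dimensions, number of experts, and routing rule are illustrative; the paper's pre-trained weight allocation and router initialization are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    """Toy granularity decomposition: replace one dense FFN (d -> 4d -> d)
    with `n_experts` narrower FFNs plus a top-k router. Slicing weights from a
    pre-trained FFN, as the paper's allocation strategy suggests, is omitted."""
    def __init__(self, d, n_experts=4, top_k=2):
        super().__init__()
        hidden = 4 * d // n_experts
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden), nn.GELU(), nn.Linear(hidden, d))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d, n_experts)
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d)
        gates = F.softmax(self.router(x), dim=-1)
        topv, topi = gates.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):            # only the selected experts run
            for e, expert in enumerate(self.experts):
                mask = topi[:, k] == e
                if mask.any():
                    out[mask] += topv[mask, k, None] * expert(x[mask])
        return out
```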
[137] Probing Vision-Language Understanding through the Visual Entailment Task: promises and pitfalls
Elena Pitta, Tom Kouwenhoven, Tessa Verhoef
Main category: cs.CV
TL;DR: This study evaluates Visual Entailment (VE) as a probe for vision-language understanding using LLaMA 3.2 11B Vision, finding that while fine-tuning achieves strong performance (83.3% accuracy), the task has limitations including over-reliance on linguistic priors and questionable visual grounding.
Details
Motivation: To investigate whether Visual Entailment serves as a reliable probe for evaluating vision-language understanding in multimodal language models, and to understand the underlying possibilities and limitations of this evaluation task.
Method: Conducted experiments across zero-shot, few-shot, and fine-tuning settings using the LLaMA 3.2 11B Vision model. Explored factors including prompt design, number and order of in-context examples, and access to visual information. Used explanation-based evaluations to probe reasoning processes and compared performance with and without visual information.
Result: Three-shot inference outperformed zero-shot baselines, but additional examples introduced noise. Label order in prompts critically influenced predictions. Without visual information, the model showed strong tendency to hallucinate. Fine-tuning achieved 83.3% accuracy on e-SNLI-VE dataset, outperforming state-of-the-art OFA-X model. Explanation evaluation showed BERTScore F1-score of 89.2%, but comparable scores were found in limited vision experiments.
Conclusion: Visual Entailment has both utility and limitations as a diagnostic task for vision-language understanding. The results reveal concerns about models’ over-reliance on linguistic priors and questionable visual grounding, pointing to the need for refining multimodal evaluation methods.
Abstract: This study investigates the extent to which the Visual Entailment (VE) task serves as a reliable probe of vision-language understanding in multimodal language models, using the LLaMA 3.2 11B Vision model as a test case. Beyond reporting performance metrics, we aim to interpret what these results reveal about the underlying possibilities and limitations of the VE task. We conduct a series of experiments across zero-shot, few-shot, and fine-tuning settings, exploring how factors such as prompt design, the number and order of in-context examples and access to visual information might affect VE performance. To further probe the reasoning processes of the model, we used explanation-based evaluations. Results indicate that three-shot inference outperforms the zero-shot baselines. However, additional examples introduce more noise than they provide benefits. Additionally, the order of the labels in the prompt is a critical factor that influences the predictions. In the absence of visual information, the model has a strong tendency to hallucinate and imagine content, raising questions about the model’s over-reliance on linguistic priors. Fine-tuning yields strong results, achieving an accuracy of 83.3% on the e-SNLI-VE dataset and outperforming the state-of-the-art OFA-X model. Additionally, the explanation evaluation demonstrates that the fine-tuned model provides semantically meaningful explanations similar to those of humans, with a BERTScore F1-score of 89.2%. We do, however, find comparable BERTScore results in experiments with limited vision, questioning the visual grounding of this task. Overall, our results highlight both the utility and limitations of VE as a diagnostic task for vision-language understanding and point to directions for refining multimodal evaluation methods.
[138] VLM-Guided Visual Place Recognition for Planet-Scale Geo-Localization
Sania Waheed, Na Min An, Michael Milford, Sarvapali D. Ramchurn, Shoaib Ehsan
Main category: cs.CV
TL;DR: A hybrid geo-localization framework that combines vision-language models (VLMs) with visual place recognition (VPR) to achieve accurate planet-scale image geo-localization by using VLMs to generate geographic priors that guide retrieval-based search.
Details
Motivation: Traditional geo-localization methods face significant challenges: retrieval-based approaches struggle with scalability and perceptual aliasing, while classification-based methods lack generalization and require extensive training data. Although VLMs show promise with contextual understanding, they suffer from hallucinations and lack interpretability, making them unreliable as standalone solutions for the challenging task of planet-scale geo-localization.
Method: A hybrid framework that: (1) uses a VLM to generate geographic priors that effectively guide and constrain the retrieval search space, (2) employs a retrieval step using visual place recognition methods, and (3) implements a re-ranking mechanism that selects the most geographically plausible matches based on feature similarity and proximity to the initially estimated coordinates.
Result: The approach consistently outperforms prior state-of-the-art methods across multiple geo-localization benchmarks, achieving improvements of up to 4.51% at street level and up to 13.52% at city level. The combination demonstrates scalable, robust, and accurate geo-localization performance.
Conclusion: VLM-generated geographic priors combined with visual place recognition methods create an effective solution for planet-scale geo-localization that overcomes the individual limitations of both approaches, resulting in a scalable, robust, and accurate geo-localization system that significantly outperforms existing methods.
Abstract: Geo-localization from a single image at planet scale (essentially an advanced or extreme version of the kidnapped robot problem) is a fundamental and challenging task in applications such as navigation, autonomous driving and disaster response due to the vast diversity of locations, environmental conditions, and scene variations. Traditional retrieval-based methods for geo-localization struggle with scalability and perceptual aliasing, while classification-based approaches lack generalization and require extensive training data. Recent advances in vision-language models (VLMs) offer a promising alternative by leveraging contextual understanding and reasoning. However, while VLMs achieve high accuracy, they are often prone to hallucinations and lack interpretability, making them unreliable as standalone solutions. In this work, we propose a novel hybrid geo-localization framework that combines the strengths of VLMs with retrieval-based visual place recognition (VPR) methods. Our approach first leverages a VLM to generate a prior, effectively guiding and constraining the retrieval search space. We then employ a retrieval step, followed by a re-ranking mechanism that selects the most geographically plausible matches based on feature similarity and proximity to the initially estimated coordinates. We evaluate our approach on multiple geo-localization benchmarks and show that it consistently outperforms prior state-of-the-art methods, particularly at street (up to 4.51%) and city level (up to 13.52%). Our results demonstrate that VLM-generated geographic priors in combination with VPR lead to scalable, robust, and accurate geo-localization systems.
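One plausible way to realize the re-ranking step described above: combine VPR feature similarity with proximity to the VLM-estimated coordinates. The haversine distance, the exponential proximity term, and the weights `alpha` / `scale_km` are our assumptions, not the paper's exact scoring rule.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def rerank(candidates, vlm_lat, vlm_lon, alpha=0.7, scale_km=500.0):
    """Re-rank VPR candidates by feature similarity and proximity to the
    VLM-estimated coordinates. `alpha` and `scale_km` are assumed knobs.

    candidates : list of (similarity, lat, lon) tuples from the retrieval step
    """
    def score(c):
        sim, lat, lon = c
        proximity = np.exp(-haversine_km(lat, lon, vlm_lat, vlm_lon) / scale_km)
        return alpha * sim + (1 - alpha) * proximity
    return sorted(candidates, key=score, reverse=True)
```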
[139] Dynamic Scoring with Enhanced Semantics for Training-Free Human-Object Interaction Detection
Francesco Tonini, Lorenzo Vaquero, Alessandro Conti, Cigdem Beyan, Elisa Ricci
Main category: cs.CV
TL;DR: The paper proposes DYSCO, a training-free HOI detection framework that leverages Vision-Language Models to detect human-object interactions without requiring manual annotations, using a multimodal registry and multi-head attention mechanism to achieve competitive performance especially on rare interactions.
Details
Motivation: Existing HOI detection methods rely heavily on large manually annotated datasets which are labor-intensive, inconsistent, and limit scalability to new domains and rare interactions. The authors argue that Vision-Language Models offer untapped potential for enhancing interaction representation that current methods haven't fully exploited.
Method: The authors propose DYSCO (Dynamic Scoring with enhanced semantics), a training-free framework that utilizes textual and visual interaction representations within a multimodal registry. The method incorporates visual cues, uses innovative interaction signatures for better semantic alignment of verbs, and employs a multi-head attention mechanism that adaptively weights visual and textual features.
Result: DYSCO surpasses training-free state-of-the-art models and is competitive with training-based approaches. The method particularly excels in detecting rare interactions, demonstrating robust and nuanced interaction understanding.
Conclusion: The proposed training-free DYSCO framework successfully leverages Vision-Language Models to achieve competitive HOI detection performance without requiring manual annotations, offering a scalable solution that generalizes well to rare interactions and new domains.
Abstract: Human-Object Interaction (HOI) detection aims to identify humans and objects within images and interpret their interactions. Existing HOI methods rely heavily on large datasets with manual annotations to learn interactions from visual cues. These annotations are labor-intensive to create, prone to inconsistency, and limit scalability to new domains and rare interactions. We argue that recent advances in Vision-Language Models (VLMs) offer untapped potential, particularly in enhancing interaction representation. While prior work has injected such potential and even proposed training-free methods, there remain key gaps. Consequently, we propose a novel training-free HOI detection framework for Dynamic Scoring with enhanced semantics (DYSCO) that effectively utilizes textual and visual interaction representations within a multimodal registry, enabling robust and nuanced interaction understanding. This registry incorporates a small set of visual cues and uses innovative interaction signatures to improve the semantic alignment of verbs, facilitating effective generalization to rare interactions. Additionally, we propose a unique multi-head attention mechanism that adaptively weights the contributions of the visual and textual features. Experimental results demonstrate that our DYSCO surpasses training-free state-of-the-art models and is competitive with training-based approaches, particularly excelling in rare interactions. Code is available at https://github.com/francescotonini/dysco.
[140] Unsupervised anomaly detection using Bayesian flow networks: application to brain FDG PET in the context of Alzheimer’s disease
Hugues Roy, Reuben Dorent, Ninon Burgos
Main category: cs.CV
TL;DR: This paper introduces AnoBFN, a novel unsupervised anomaly detection method based on Bayesian Flow Networks for neuroimaging, specifically designed to detect Alzheimer’s disease-related anomalies in FDG PET images while maintaining subject specificity and reducing false positives.
Details
Motivation: Unsupervised anomaly detection is crucial in neuroimaging for identifying deviations from healthy brain data to facilitate neurological disorder diagnosis. Existing methods have limitations in handling spatially correlated noise and preserving subject-specific features. Bayesian Flow Networks, while promising, have not been applied to medical imaging or anomaly detection tasks.
Method: The authors propose AnoBFN, an extension of Bayesian Flow Networks for unsupervised anomaly detection. The method combines diffusion frameworks with Bayesian inference and includes two key innovations: (1) conditional image generation under high levels of spatially correlated noise, and (2) preservation of subject specificity through recursive feedback from the input image throughout the generative process.
Result: AnoBFN was evaluated on Alzheimer’s disease-related anomaly detection in FDG PET images and outperformed state-of-the-art methods including VAE-based (beta-VAE), GAN-based (f-AnoGAN), and diffusion model-based (AnoDDPM) approaches. The method demonstrated superior effectiveness at detecting anomalies while achieving reduced false positive rates.
Conclusion: AnoBFN successfully demonstrates the potential of Bayesian Flow Networks for medical imaging anomaly detection. The method effectively combines the strengths of diffusion models and Bayesian inference to achieve superior performance in neuroimaging anomaly detection, particularly for Alzheimer’s disease diagnosis, while maintaining subject specificity and reducing false positives compared to existing approaches.
Abstract: Unsupervised anomaly detection (UAD) plays a crucial role in neuroimaging for identifying deviations from healthy subject data and thus facilitating the diagnosis of neurological disorders. In this work, we focus on Bayesian flow networks (BFNs), a novel class of generative models, which have not yet been applied to medical imaging or anomaly detection. BFNs combine the strength of diffusion frameworks and Bayesian inference. We introduce AnoBFN, an extension of BFNs for UAD, designed to: i) perform conditional image generation under high levels of spatially correlated noise, and ii) preserve subject specificity by incorporating a recursive feedback from the input image throughout the generative process. We evaluate AnoBFN on the challenging task of Alzheimer’s disease-related anomaly detection in FDG PET images. Our approach outperforms other state-of-the-art methods based on VAEs (beta-VAE), GANs (f-AnoGAN), and diffusion models (AnoDDPM), demonstrating its effectiveness at detecting anomalies while reducing false positive rates.
[141] ERMV: Editing 4D Robotic Multi-view images to enhance embodied agents
Chang Nie, Guangming Wang, Zhe Lie, Hesheng Wang
Main category: cs.CV
TL;DR: ERMV is a novel data augmentation framework that efficiently edits 4D multi-view sequential robotic data to address data scarcity in robot imitation learning, using innovative attention mechanisms and sparse sampling to maintain consistency while reducing computational costs.
Details
Motivation: Robot imitation learning faces severe constraints due to high data collection costs and scarcity of high-quality 4D multi-view sequential images, which limits the generalization and application of Vision-Language-Action (VLA) models in embodied intelligence.
Method: ERMV introduces three key innovations: (1) an Epipolar Motion-Aware Attention (EMA-Attn) mechanism to maintain spatio-temporal consistency by learning pixel shifts before applying geometric constraints, (2) a Sparse Spatio-Temporal (STT) module that decouples temporal and spatial views through sparse sampling to reduce computational demands, and (3) a feedback intervention mechanism using Multimodal Large Language Models to check editing inconsistencies and request expert guidance when needed.
Result: Extensive experiments demonstrate that ERMV-augmented data significantly boosts the robustness and generalization of VLA models in both simulated and real-world environments, successfully addressing the three core challenges of maintaining consistency, expanding working windows with low computational costs, and preserving semantic integrity.
Conclusion: ERMV successfully addresses the data scarcity problem in robot imitation learning by providing an efficient framework for editing 4D multi-view sequential images while maintaining geometric, appearance, and semantic consistency across dynamic views and long time horizons.
Abstract: Robot imitation learning relies on 4D multi-view sequential images. However, the high cost of data collection and the scarcity of high-quality data severely constrain the generalization and application of embodied intelligence policies like Vision-Language-Action (VLA) models. Data augmentation is a powerful strategy to overcome data scarcity, but methods for editing 4D multi-view sequential images for manipulation tasks are currently lacking. Thus, we propose ERMV (Editing Robotic Multi-View 4D data), a novel data augmentation framework that efficiently edits an entire multi-view sequence based on single-frame editing and robot state conditions. This task presents three core challenges: (1) maintaining geometric and appearance consistency across dynamic views and long time horizons; (2) expanding the working window with low computational costs; and (3) ensuring the semantic integrity of critical objects like the robot arm. ERMV addresses these challenges through a series of innovations. First, to ensure spatio-temporal consistency in motion blur, we introduce a novel Epipolar Motion-Aware Attention (EMA-Attn) mechanism that learns the pixel shift caused by movement before applying geometric constraints. Second, to maximize the editing working window, ERMV pioneers a Sparse Spatio-Temporal (STT) module, which decouples the temporal and spatial views and remodels a single-frame multi-view problem through sparse sampling of the views to reduce computational demands. Third, to alleviate error accumulation, we incorporate a feedback intervention mechanism, which uses a Multimodal Large Language Model (MLLM) to check editing inconsistencies and request targeted expert guidance only when necessary. Extensive experiments demonstrate that ERMV-augmented data significantly boosts the robustness and generalization of VLA models in both simulated and real-world environments.
[142] SRMambaV2: Biomimetic Attention for Sparse Point Cloud Upsampling in Autonomous Driving
Chuang Chen, Xiaolin Qin, Jing Hu, Wenyi Ge
Main category: cs.CV
TL;DR: SRMambaV2 is a novel sparse point cloud upsampling method for LiDAR data in autonomous driving that uses biomimetic 2D selective scanning self-attention and dual-branch architecture to improve reconstruction quality in long-range sparse regions.
Details
Motivation: Existing LiDAR point cloud upsampling methods struggle with inherent sparsity and complex 3D structures in autonomous driving scenarios. Current approaches that convert 3D spatial scenes to 2D image super-resolution tasks face difficulties in accurately reconstructing detailed spatial topologies due to the sparse and blurry feature representation of range images.
Method: The paper proposes SRMambaV2 with three key components: (1) a biomimetic 2D selective scanning self-attention (2DSSA) mechanism inspired by human driver visual perception to model feature distribution in distant sparse areas, (2) a dual-branch network architecture to enhance sparse feature representation, and (3) a progressive adaptive loss (PAL) function to refine fine-grained detail reconstruction.
Result: SRMambaV2 achieves superior performance in both qualitative and quantitative evaluations, demonstrating enhanced upsampling accuracy in long-range sparse regions while preserving overall geometric reconstruction quality.
Conclusion: The proposed SRMambaV2 method effectively addresses the challenges of sparse point cloud upsampling in autonomous driving scenarios, showing practical value and effectiveness for automotive applications through its biomimetic attention mechanism and dual-branch architecture.
Abstract: Upsampling LiDAR point clouds in autonomous driving scenarios remains a significant challenge due to the inherent sparsity and complex 3D structures of the data. Recent studies have attempted to address this problem by converting the complex 3D spatial scenes into 2D image super-resolution tasks. However, due to the sparse and blurry feature representation of range images, accurately reconstructing detailed and complex spatial topologies remains a major difficulty. To tackle this, we propose a novel sparse point cloud upsampling method named SRMambaV2, which enhances the upsampling accuracy in long-range sparse regions while preserving the overall geometric reconstruction quality. Specifically, inspired by human driver visual perception, we design a biomimetic 2D selective scanning self-attention (2DSSA) mechanism to model the feature distribution in distant sparse areas. Meanwhile, we introduce a dual-branch network architecture to enhance the representation of sparse features. In addition, we introduce a progressive adaptive loss (PAL) function to further refine the reconstruction of fine-grained details during the upsampling process. Experimental results demonstrate that SRMambaV2 achieves superior performance in both qualitative and quantitative evaluations, highlighting its effectiveness and practical value in automotive sparse point cloud upsampling tasks.
[143] Illicit object detection in X-ray imaging using deep learning techniques: A comparative evaluation
Jorgen Cani, Christos Diou, Spyridon Evangelatos, Vasileios Argyriou, Panagiotis Radoglou-Grammatikis, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos
Main category: cs.CV
TL;DR: This paper presents a comprehensive comparative evaluation of deep learning methods for X-ray object detection in security screening, testing 10 state-of-the-art detection schemes across 6 large-scale datasets to provide systematic insights into performance, efficiency, and computational complexity.
Details
Motivation: Automated X-ray inspection faces challenges including object occlusion, varying physical properties, device diversity, and limited training data. Previous research evaluations are often incomplete with conflicting outcomes, necessitating a systematic comparative study to understand the research landscape and guide future work.
Method: Developed a comprehensive evaluation framework with: (1) Six large-scale public X-ray datasets (OPIXray, CLCXray, SIXray, EDS, HiXray, PIDray), (2) Ten state-of-the-art object detection schemes covering CNN, transformer, and hybrid architectures, and (3) Multiple evaluation metrics including detection performance (mAP50, mAP50:95) and computational efficiency (inference time, parameters, GFLOPS).
Result: The evaluation provided critical insights on: (1) Overall behavior of detection schemes, (2) Object-level detection performance across different methods, (3) Dataset-specific observations showing how different datasets affect performance, and (4) Time efficiency and computational complexity analysis revealing trade-offs between accuracy and speed.
Conclusion: The systematic evaluation reveals important performance patterns and trade-offs in X-ray object detection methods, providing researchers with comprehensive benchmarks and insights. The publicly available code and model weights support reproducibility and future research in automated X-ray security screening.
Abstract: Automated X-ray inspection is crucial for efficient and unobtrusive security screening in various public settings. However, challenges such as object occlusion, variations in the physical properties of items, diversity in X-ray scanning devices, and limited training data hinder accurate and reliable detection of illicit items. Despite the large body of research in the field, reported experimental evaluations are often incomplete, with frequently conflicting outcomes. To shed light on the research landscape and facilitate further research, a systematic, detailed, and thorough comparative evaluation of recent Deep Learning (DL)-based methods for X-ray object detection is conducted. For this, a comprehensive evaluation framework is developed, composed of: a) Six recent, large-scale, and widely used public datasets for X-ray illicit item detection (OPIXray, CLCXray, SIXray, EDS, HiXray, and PIDray), b) Ten different state-of-the-art object detection schemes covering all main categories in the literature, including generic Convolutional Neural Network (CNN), custom CNN, generic transformer, and hybrid CNN-transformer architectures, and c) Various detection (mAP50 and mAP50:95) and time/computational-complexity (inference time (ms), parameter size (M), and computational load (GFLOPS)) metrics. A thorough analysis of the results leads to critical observations and insights, emphasizing key aspects such as: a) Overall behavior of the object detection schemes, b) Object-level detection performance, c) Dataset-specific observations, and d) Time efficiency and computational complexity analysis. To support reproducibility of the reported experimental results, the evaluation code and model weights are made publicly available at https://github.com/jgenc/xray-comparative-evaluation.
[144] Accelerating Parallel Diffusion Model Serving with Residual Compression
Jiajun Luo, Yicheng Xiao, Jianru Xu, Yangxiu You, Rongwei Lu, Chen Tang, Jingyan Jiang, Zhi Wang
Main category: cs.CV
TL;DR: CompactFusion is a compression framework that reduces communication overhead in parallel diffusion model inference by exploiting temporal redundancy in activations, achieving 3.0x speedup on 4xL20 while improving generation quality.
Details
Motivation: Diffusion models require substantial computational resources and multi-accelerator parallelism for real-time deployment, but parallel inference suffers from significant communication overhead due to exchanging large activations between devices, which limits efficiency and scalability.
Method: CompactFusion uses Residual Compression that transmits only compressed residuals (step-wise activation differences) instead of full activations, exploiting the observation that diffusion activations exhibit strong temporal redundancy. The method also integrates lightweight error feedback to prevent error accumulation.
Result: On 4xL20, CompactFusion achieves 3.0x speedup while greatly improving fidelity. It also supports communication-heavy strategies like sequence parallelism on slow networks, achieving 6.7x speedup over prior overlap-based methods. The framework applies broadly across diffusion models and parallel settings.
Conclusion: CompactFusion establishes a new paradigm for parallel diffusion inference by effectively removing redundant data through residual compression, enabling substantial data reduction while maintaining high fidelity. It integrates easily without requiring pipeline rework and demonstrates superior performance across various settings.
Abstract: Diffusion models produce realistic images and videos but require substantial computational resources, necessitating multi-accelerator parallelism for real-time deployment. However, parallel inference introduces significant communication overhead from exchanging large activations between devices, limiting efficiency and scalability. We present CompactFusion, a compression framework that significantly reduces communication while preserving generation quality. Our key observation is that diffusion activations exhibit strong temporal redundancy: adjacent steps produce highly similar activations, saturating bandwidth with near-duplicate data carrying little new information. To address this inefficiency, we seek a more compact representation that encodes only the essential information. CompactFusion achieves this via Residual Compression that transmits only compressed residuals (step-wise activation differences). Based on empirical analysis and theoretical justification, we show that it effectively removes redundant data, enabling substantial data reduction while maintaining high fidelity. We also integrate lightweight error feedback to prevent error accumulation. CompactFusion establishes a new paradigm for parallel diffusion inference, delivering lower latency and significantly higher generation quality than prior methods. On 4xL20, it achieves 3.0x speedup while greatly improving fidelity. It also uniquely supports communication-heavy strategies like sequence parallelism on slow networks, achieving 6.7x speedup over the prior overlap-based method. CompactFusion applies broadly across diffusion models and parallel settings, and integrates easily without requiring pipeline rework. A portable implementation demonstrated on xDiT is publicly available at https://github.com/Cobalt-27/CompactFusion
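A small sketch of the residual-compression-with-error-feedback idea described above. Top-k sparsification stands in for whatever compressor CompactFusion actually uses, and the single-tensor bookkeeping is a simplification of a real multi-device pipeline.

```python
import torch

class ResidualCompressor:
    """Per-device sketch: send only the (compressed) change in activations
    between adjacent diffusion steps, with error feedback so that whatever the
    compressor drops is carried forward instead of accumulating as drift."""
    def __init__(self):
        self.prev = None     # activation reconstructed by the receiver so far
        self.err = None      # residual not yet transmitted (error feedback)

    def compress(self, act, top_frac=0.1):
        if self.prev is None:
            self.prev, self.err = act.clone(), torch.zeros_like(act)
            return act                                  # first step: send in full
        residual = act - self.prev + self.err           # step-wise difference + carried error
        k = max(1, int(top_frac * residual.numel()))
        thresh = residual.abs().flatten().topk(k).values.min()
        mask = residual.abs() >= thresh                 # keep only the largest entries
        sent = residual * mask
        self.err = residual - sent                      # remember what was dropped
        self.prev = self.prev + sent                    # receiver-side reconstruction
        return sent
```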
[145] PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving
Maciej K. Wozniak, Lianhang Liu, Yixi Cai, Patric Jensfelt
Main category: cs.CV
TL;DR: PRIX is an efficient end-to-end autonomous driving model that uses only camera data (no LiDAR) and operates directly on raw pixels without BEV representations, achieving state-of-the-art performance while being significantly more efficient than existing multimodal approaches.
Details
Motivation: Current end-to-end autonomous driving models face practical deployment challenges due to large model sizes, expensive LiDAR sensor requirements, and computationally intensive BEV feature representations, limiting their scalability for mass-market vehicles that only have cameras.
Method: The authors propose PRIX (Plan from Raw Pixels), which uses a visual feature extractor coupled with a generative planning head to predict trajectories directly from raw camera pixels. The key innovation is the Context-aware Recalibration Transformer (CaRT) module that enhances multi-level visual features for robust planning without requiring BEV representations or LiDAR data.
Result: PRIX achieves state-of-the-art performance on NavSim and nuScenes benchmarks, matching the capabilities of larger multimodal diffusion planners while being significantly more efficient in inference speed and model size, making it practical for real-world deployment.
Conclusion: PRIX demonstrates that efficient autonomous driving is possible using only camera data without LiDAR or BEV representations, providing a practical solution for mass-market vehicle deployment while maintaining competitive performance with more complex multimodal systems.
Abstract: While end-to-end autonomous driving models show promising results, their practical deployment is often hindered by large model sizes, a reliance on expensive LiDAR sensors and computationally intensive BEV feature representations. This limits their scalability, especially for mass-market vehicles equipped only with cameras. To address these challenges, we propose PRIX (Plan from Raw Pixels). Our novel and efficient end-to-end driving architecture operates using only camera data, without explicit BEV representation and forgoing the need for LiDAR. PRIX leverages a visual feature extractor coupled with a generative planning head to predict safe trajectories from raw pixel inputs directly. A core component of our architecture is the Context-aware Recalibration Transformer (CaRT), a novel module designed to effectively enhance multi-level visual features for more robust planning. We demonstrate through comprehensive experiments that PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks, matching the capabilities of larger, multimodal diffusion planners while being significantly more efficient in terms of inference speed and model size, making it a practical solution for real-world deployment. Our work is open-source and the code will be at https://maxiuw.github.io/prix.
[146] Multi-modal Multi-task Pre-training for Improved Point Cloud Understanding
Liwen Liu, Weidong Yang, Lipeng Ma, Ben Fei
Main category: cs.CV
TL;DR: The paper proposes MMPT, a Multi-modal Multi-task Pre-training framework that combines three pre-training tasks (token-level reconstruction, point-level reconstruction, and multi-modal contrastive learning) to enhance 3D point cloud understanding without requiring 3D annotations.
Details
Motivation: Existing multi-modal pre-training frameworks for 3D applications rely on single pre-training tasks, which limits their ability to gather abundant information from other relevant tasks. This limitation hinders performance in downstream tasks, especially in complex and diverse domains.
Method: The MMPT framework incorporates three complementary pre-training tasks: (1) Token-level reconstruction (TLR) to recover masked point tokens for representative learning, (2) Point-level reconstruction (PLR) to predict masked point positions directly, and (3) Multi-modal contrastive learning (MCL) to combine feature correspondences within and across 3D point cloud and 2D image modalities in a self-supervised manner.
Result: The framework operates without requiring 3D annotations, making it scalable for large datasets. The trained encoder can be effectively transferred to various downstream tasks. Performance evaluation shows effectiveness compared to state-of-the-art methods across various discriminant and generative applications under widely-used benchmarks.
Conclusion: MMPT successfully addresses the limitations of single-task pre-training by integrating multiple complementary tasks, demonstrating improved performance in 3D point cloud understanding tasks while maintaining scalability through annotation-free training.
Abstract: Recent advances in multi-modal pre-training methods have shown promising effectiveness in learning 3D representations by aligning multi-modal features between 3D shapes and their corresponding 2D counterparts. However, existing multi-modal pre-training frameworks primarily rely on a single pre-training task to gather multi-modal data in 3D applications. This limitation prevents the models from obtaining the abundant information provided by other relevant tasks, which can hinder their performance in downstream tasks, particularly in complex and diverse domains. In order to tackle this issue, we propose MMPT, a Multi-modal Multi-task Pre-training framework designed to enhance point cloud understanding. Specifically, three pre-training tasks are devised: (i) Token-level reconstruction (TLR) aims to recover masked point tokens, endowing the model with representative learning abilities. (ii) Point-level reconstruction (PLR) is integrated to predict the masked point positions directly, and the reconstructed point cloud can be considered as a transformed point cloud used in the subsequent task. (iii) Multi-modal contrastive learning (MCL) combines feature correspondences within and across modalities, thus assembling a rich learning signal from both 3D point cloud and 2D image modalities in a self-supervised manner. Moreover, this framework operates without requiring any 3D annotations, making it scalable for use with large datasets. The trained encoder can be effectively transferred to various downstream tasks. To demonstrate its effectiveness, we evaluated its performance compared to state-of-the-art methods in various discriminant and generative applications under widely-used benchmarks.
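As a concrete instance of the cross-modal contrastive objective (MCL) mentioned above, a symmetric InfoNCE loss between paired point-cloud and image embeddings is a common choice; the exact loss and temperature used by MMPT may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(feat_3d, feat_2d, temperature=0.07):
    """Symmetric InfoNCE between paired point-cloud and image embeddings.

    feat_3d, feat_2d : (B, d) embeddings where row i of each tensor comes from
                       the same point cloud / rendered image pair
    """
    z3 = F.normalize(feat_3d, dim=-1)
    z2 = F.normalize(feat_2d, dim=-1)
    logits = z3 @ z2.T / temperature          # (B, B) similarity matrix
    labels = torch.arange(z3.size(0), device=z3.device)
    # matching pairs sit on the diagonal; average both retrieval directions
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```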
[147] Vision Transformer attention alignment with human visual perception in aesthetic object evaluation
Miguel Carrasco, César González-Martín, José Aranda, Luis Oliveros
Main category: cs.CV
TL;DR: This study compares human visual attention patterns with Vision Transformer (ViT) attention mechanisms when evaluating handcrafted objects, finding that certain ViT attention heads can approximate human visual behavior, particularly attention head #12 at optimal correlation parameters.
Details
Motivation: Visual attention mechanisms are crucial for human perception and aesthetic evaluation, but the alignment between human visual attention patterns and Vision Transformer attention mechanisms remains underexplored, especially in aesthetic contexts involving handcrafted objects.
Method: Conducted eye-tracking experiment with 30 participants viewing 20 artisanal objects (basketry bags and ginger jars) using Pupil Labs eye-tracker to generate human attention heat maps. Analyzed same objects using pre-trained ViT model with DINO, extracting attention maps from 12 attention heads. Compared human and ViT attention distributions using Kullback-Leibler divergence across varying Gaussian parameters (sigma=0.1 to 3.0).
Result: Optimal correlation found at sigma=2.4 ±0.03, with attention head #12 showing strongest alignment with human visual patterns. Significant differences between attention heads identified, with heads #7 and #9 showing greatest divergence from human attention (p<0.05, Tukey HSD test). ViTs exhibit more global attention patterns compared to human focal attention, but certain heads can approximate human behavior for specific features like buckles.
Conclusion: While ViTs show fundamental differences in attention strategies compared to human perception, certain attention heads can approximate human visual behavior. These findings suggest potential applications in product design and aesthetic evaluation, highlighting both opportunities and limitations of current AI models in mimicking human visual attention patterns.
Abstract: Visual attention mechanisms play a crucial role in human perception and aesthetic evaluation. Recent advances in Vision Transformers (ViTs) have demonstrated remarkable capabilities in computer vision tasks, yet their alignment with human visual attention patterns remains underexplored, particularly in aesthetic contexts. This study investigates the correlation between human visual attention and ViT attention mechanisms when evaluating handcrafted objects. We conducted an eye-tracking experiment with 30 participants (9 female, 21 male, mean age 24.6 years) who viewed 20 artisanal objects comprising basketry bags and ginger jars. Using a Pupil Labs eye-tracker, we recorded gaze patterns and generated heat maps representing human visual attention. Simultaneously, we analyzed the same objects using a pre-trained ViT model with DINO (Self-DIstillation with NO Labels), extracting attention maps from each of the 12 attention heads. We compared human and ViT attention distributions using Kullback-Leibler divergence across varying Gaussian parameters (sigma=0.1 to 3.0). Statistical analysis revealed optimal correlation at sigma=2.4 +-0.03, with attention head #12 showing the strongest alignment with human visual patterns. Significant differences were found between attention heads, with heads #7 and #9 demonstrating the greatest divergence from human attention (p< 0.05, Tukey HSD test). Results indicate that while ViTs exhibit more global attention patterns compared to human focal attention, certain attention heads can approximate human visual behavior, particularly for specific object features like buckles in basketry items. These findings suggest potential applications of ViT attention mechanisms in product design and aesthetic evaluation, while highlighting fundamental differences in attention strategies between human perception and current AI models.
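To make the human-ViT comparison concrete, here is a minimal sketch (not the authors' code) of the core measurement: smooth both attention maps with a Gaussian of width sigma and compute the KL divergence between them. The array names are illustrative stand-ins for the eye-tracking heatmap and a DINO-ViT head's attention map.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def attention_kl(human_map: np.ndarray, vit_map: np.ndarray,
                 sigma: float, eps: float = 1e-8) -> float:
    """KL(human || ViT) between two non-negative 2D attention maps of the same
    shape, after Gaussian smoothing with the given sigma."""
    h = gaussian_filter(human_map.astype(np.float64), sigma=sigma)
    v = gaussian_filter(vit_map.astype(np.float64), sigma=sigma)
    h = h / (h.sum() + eps)   # normalize to probability distributions
    v = v / (v.sum() + eps)
    return float(np.sum(h * np.log((h + eps) / (v + eps))))

# Hypothetical usage: sweep sigma over the paper's reported range (0.1 to 3.0)
# for one attention head and keep the sigma with the lowest divergence.
# sigmas = np.arange(0.1, 3.01, 0.1)
# scores = {s: attention_kl(human_heatmap, head_attention_map, s) for s in sigmas}
```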
[148] Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step
Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Rui Huang, Haoquan Zhang, Manyuan Zhang, Jiaming Liu, Shanghang Zhang, Peng Gao, Hongsheng Li, Pheng-Ann Heng
Main category: cs.CV
TL;DR: This paper investigates applying Chain-of-Thought (CoT) reasoning to autoregressive image generation, introducing three techniques including test-time computation scaling, Direct Preference Optimization, and specialized reward models (PARM/PARM++) that incorporate reflection mechanisms, achieving +24% improvement on GenEval benchmark.
Details
Motivation: While Chain-of-Thought reasoning has been extensively explored in large language models for complex understanding tasks, it remains unclear whether such strategies can be effectively applied to image generation scenarios for verification and reinforcement purposes.Method: The paper employs three main techniques: (1) scaling test-time computation for verification, (2) aligning model preferences with Direct Preference Optimization (DPO), and (3) integrating these approaches. They propose PARM (Potential Assessment Reward Model) that adaptively assesses each generation step, and PARM++ which adds a reflection mechanism for self-correction in autoregressive image generation.
Result: The enhanced Show-o baseline model achieves a significant +24% improvement on the GenEval benchmark and surpasses Stable Diffusion 3 by +15%. The study demonstrates that CoT reasoning strategies can be effectively adapted and combined to significantly improve image generation performance.
Conclusion: CoT reasoning can be successfully integrated with autoregressive image generation, providing a new pathway for enhancing image generation quality. The proposed PARM++ represents the first incorporation of reflection mechanisms in autoregressive image generation, opening new research directions for combining reasoning strategies with generative models.
Abstract: Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it still remains an open question whether such strategies can be applied to verifying and reinforcing image generation scenarios. In this paper, we provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects. Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation. PARM adaptively assesses each generation step through a potential assessment approach, merging the strengths of existing reward models, and PARM++ further introduces a reflection mechanism to self-correct the generated unsatisfactory image, which is the first to incorporate reflection in autoregressive image generation. Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation. Code and models are released at https://github.com/ZiyuGuo99/Image-Generation-CoT
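The "scaling test-time computation for verification" ingredient is essentially best-of-N selection with a reward model. The sketch below illustrates that idea only; `generate` and `reward_model` are placeholders for the autoregressive image generator and a PARM-style verifier, not the released implementation.

```python
from typing import Any, Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], Any],
              reward_model: Callable[[str, Any], float],
              n: int = 8) -> Any:
    """Sample n candidate images for a prompt and keep the one the reward
    model scores highest (simple test-time verification)."""
    candidates: List[Any] = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, img) for img in candidates]
    return candidates[max(range(n), key=lambda i: scores[i])]
```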
[149] An h-space Based Adversarial Attack for Protection Against Few-shot Personalization
Xide Xu, Sandesh Kamath, Muhammad Atif Butt, Bogdan Raducanu
Main category: cs.CV
TL;DR: This paper proposes HAAD, a novel adversarial attack method that targets the semantic latent space (h-space) of diffusion models to prevent unauthorized image customization, offering stronger privacy protection with lower computational cost.
Details
Motivation: The versatility of diffusion models in generating customized images from few samples raises significant privacy concerns regarding unauthorized modifications of private content, necessitating the development of effective protection mechanisms.Method: The authors propose HAAD (h-space based Adversarial Attack for Diffusion models) that leverages adversarial attacks to craft perturbations based on the semantic latent space (h-space) to degrade image generation. They also introduce HAAD-KV, a more efficient variant that constructs perturbations solely based on the KV parameters of the h-space.
Result: Both HAAD and HAAD-KV methods outperform state-of-the-art adversarial attacks while being computationally less expensive, with HAAD-KV offering stronger protection than the base HAAD method.
Conclusion: The proposed h-space based adversarial attack methods effectively protect against unauthorized image customization in diffusion models, demonstrating superior performance compared to existing approaches while maintaining computational efficiency.
Abstract: The versatility of diffusion models in generating customized images from few samples raises significant privacy concerns, particularly regarding unauthorized modifications of private content. This concern has renewed efforts to develop protection mechanisms based on adversarial attacks, which generate effective perturbations to poison diffusion models. Our work is motivated by the observation that these models exhibit a high degree of abstraction within their semantic latent space ('h-space'), which encodes critical high-level features for generating coherent and meaningful content. In this paper, we propose a novel anti-customization approach, called HAAD (h-space based Adversarial Attack for Diffusion models), that leverages adversarial attacks to craft perturbations based on the h-space that can efficiently degrade the image generation process. Building upon HAAD, we further introduce a more efficient variant, HAAD-KV, that constructs perturbations solely based on the KV parameters of the h-space. This strategy offers stronger protection while being computationally less expensive. Despite their simplicity, our methods outperform state-of-the-art adversarial attacks, highlighting their effectiveness.
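As a rough illustration of the anti-customization idea (feature-space poisoning, not the exact HAAD objective), the sketch below runs a PGD-style attack that pushes an image's latent features away from their clean values; `feature_extractor` is a hypothetical probe of the diffusion U-Net bottleneck (h-space).

```python
import torch
import torch.nn.functional as F

def craft_protective_noise(x, feature_extractor, steps=40, alpha=2/255, eps=8/255):
    """PGD-style perturbation that maximizes the drift of mid-level diffusion
    features from their clean values, under an L_inf budget. A sketch of the
    general anti-customization idea, not the paper's exact loss."""
    x0 = x.detach()
    with torch.no_grad():
        clean_feat = feature_extractor(x0)
    x_adv = x0.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.mse_loss(feature_extractor(x_adv), clean_feat)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()            # ascend: push features away
            x_adv = x0 + (x_adv - x0).clamp(-eps, eps)     # project into the L_inf ball
            x_adv = x_adv.clamp(0, 1)
    return x_adv
```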
[150] Boosting Ray Search Procedure of Hard-label Attacks with Transfer-based Priors
Chen Ma, Xinjie Xu, Shuyu Cheng, Qi Xuan
Main category: cs.CV
TL;DR: This paper proposes a prior-guided approach for hard-label black-box adversarial attacks that uses transfer-based priors from surrogate models to improve gradient estimation and ray search efficiency, significantly outperforming existing methods in query efficiency.
Details
Motivation: Hard-label black-box adversarial attacks are challenging because only the top-1 predicted label is available, making gradient estimation expensive due to high query costs in binary search methods. Existing "sign trick" approaches for gradient estimation need improvement in query efficiency.Method: The authors develop a prior-guided gradient estimation approach that integrates transfer-based priors from surrogate models with random directions. They approximate the projection of the true gradient onto the subspace spanned by these priors and random directions in a query-efficient manner, transforming the hard-label attack into a continuous optimization problem.
Result: Extensive experiments on ImageNet and CIFAR-10 datasets demonstrate that the proposed approach significantly outperforms 11 state-of-the-art methods in terms of query efficiency. The authors also provide theoretical analysis showing improved expected cosine similarities between their gradient estimators and the true gradient.
Conclusion: The prior-guided approach successfully improves the efficiency of hard-label black-box adversarial attacks by leveraging transfer-based priors from surrogate models, achieving both theoretical and empirical improvements in gradient estimation quality and query efficiency compared to existing methods.
Abstract: One of the most practical and challenging types of black-box adversarial attacks is the hard-label attack, where only the top-1 predicted label is available. One effective approach is to search for the optimal ray direction from the benign image that minimizes the $\ell_p$-norm distance to the adversarial region. The unique advantage of this approach is that it transforms the hard-label attack into a continuous optimization problem. The objective function value is the ray’s radius, which can be obtained via binary search at a high query cost. Existing methods use a “sign trick” in gradient estimation to reduce the number of queries. In this paper, we theoretically analyze the quality of this gradient estimation and propose a novel prior-guided approach to improve ray search efficiency both theoretically and empirically. Specifically, we utilize the transfer-based priors from surrogate models, and our gradient estimators appropriately integrate them by approximating the projection of the true gradient onto the subspace spanned by these priors and random directions, in a query-efficient manner. We theoretically derive the expected cosine similarities between the obtained gradient estimators and the true gradient, and demonstrate the improvement achieved by incorporating priors. Extensive experiments on the ImageNet and CIFAR-10 datasets show that our approach significantly outperforms 11 state-of-the-art methods in terms of query efficiency.
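For intuition, the sketch below shows the two building blocks of the ray-search formulation: a binary search that evaluates the objective g(theta) (the radius at which the decision flips), and a simplified prior-guided gradient estimate built from a surrogate direction plus random directions. `is_adversarial` is a hypothetical hard-label oracle; the paper's actual estimator is more refined.

```python
import torch

def ray_radius(is_adversarial, x, theta, r_hi=10.0, tol=1e-3):
    """Binary search for the distance along unit direction theta at which the
    hard-label decision flips; this is the objective value g(theta)."""
    lo, hi = 0.0, r_hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if is_adversarial(x + mid * theta):
            hi = mid
        else:
            lo = mid
    return hi

def prior_guided_grad(g, theta, prior, n_random=10, delta=1e-2):
    """Finite-difference gradient estimate of g at theta, built from directional
    derivatives along a surrogate-model prior direction and random directions.
    A simplified sketch of the prior-guided idea."""
    dirs = [prior / prior.norm()] + [torch.randn_like(theta) for _ in range(n_random)]
    dirs = [d / d.norm() for d in dirs]
    g0 = g(theta)
    grad = torch.zeros_like(theta)
    for d in dirs:
        probe = theta + delta * d
        grad += ((g(probe / probe.norm()) - g0) / delta) * d   # directional slope times direction
    return grad
```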
[151] Yume: An Interactive World Generation Model
Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, Kaipeng Zhang
Main category: cs.CV
TL;DR: Yume is a system that creates interactive, realistic worlds from images, text, or videos, allowing users to explore these worlds using keyboard controls or neural signals through a framework combining camera motion quantization, video diffusion transformers, and advanced sampling techniques.
Details
Motivation: To create an interactive, realistic, and dynamic world generation system that allows users to explore and control virtual environments using various input modalities (images, text, videos) and interaction methods (peripheral devices, neural signals).Method: A four-component framework consisting of: (1) camera motion quantization for stable training and user-friendly keyboard interaction, (2) Masked Video Diffusion Transformer (MVDT) with memory module for infinite autoregressive video generation, (3) training-free Anti-Artifact Mechanism (AAM) and Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) for improved visual quality and control, and (4) model acceleration through adversarial distillation and caching mechanisms.
Result: The system achieves remarkable results in diverse scenes and applications when trained on the high-quality world exploration dataset Sekai. The preview version successfully creates dynamic worlds from input images and enables keyboard-based exploration with high fidelity and interactivity.
Conclusion: Yume demonstrates successful interactive world generation from visual inputs with keyboard control, providing a foundation for immersive virtual world exploration. The system will be updated monthly with plans to achieve its original goal of supporting multiple input modalities and neural signal control.
Abstract: Yume aims to use images, text, or videos to create an interactive, realistic, and dynamic world, which allows exploration and control using peripheral devices or neural signals. In this report, we present a preview version of Yume, which creates a dynamic world from an input image and allows exploration of the world using keyboard actions. To achieve this high-fidelity and interactive video world generation, we introduce a well-designed framework, which consists of four main components, including camera motion quantization, video generation architecture, advanced sampler, and model acceleration. First, we quantize camera motions for stable training and user-friendly interaction using keyboard inputs. Then, we introduce the Masked Video Diffusion Transformer (MVDT) with a memory module for infinite video generation in an autoregressive manner. After that, the training-free Anti-Artifact Mechanism (AAM) and Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) are introduced to the sampler for better visual quality and more precise control. Moreover, we investigate model acceleration by synergistic optimization of adversarial distillation and caching mechanisms. We use the high-quality world exploration dataset Sekai to train Yume, and it achieves remarkable results in diverse scenes and applications. All data, codebase, and model weights are available at https://github.com/stdstu12/YUME. Yume will be updated monthly to achieve its original goal. Project page: https://stdstu12.github.io/YUME-Project/.
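The camera-motion-quantization component can be pictured as snapping each frame's continuous camera motion to the nearest discrete keyboard-style action. The action set and mapping below are hypothetical, purely to illustrate the idea.

```python
import numpy as np

# Hypothetical discrete action directions (x: right, y: up, z: forward).
ACTIONS = {
    "forward":  np.array([0.0, 0.0, 1.0]),
    "backward": np.array([0.0, 0.0, -1.0]),
    "left":     np.array([-1.0, 0.0, 0.0]),
    "right":    np.array([1.0, 0.0, 0.0]),
}

def quantize_motion(translation: np.ndarray) -> str:
    """Snap a per-frame camera translation to the best-aligned discrete action."""
    scores = {name: float(np.dot(translation, d)) for name, d in ACTIONS.items()}
    return max(scores, key=scores.get)

# quantize_motion(np.array([0.1, 0.0, 0.9]))  # -> "forward"
```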
[152] OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles
Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang
Main category: cs.CV
TL;DR: OpenVLThinker is an open-source large vision-language model that combines supervised fine-tuning (SFT) and reinforcement learning (RL) in alternating iterations to achieve sophisticated chain-of-thought reasoning, demonstrating significant performance improvements on visual reasoning benchmarks.
Details
Motivation: While text-based reasoning models show promise, distilling their reasoning into vision-language models through SFT alone causes performance degradation due to imprecise visual grounding, and pure RL methods face large search spaces that hinder reasoning development in smaller models.Method: The paper proposes alternating between supervised fine-tuning (SFT) and reinforcement learning (RL) over multiple iterations. SFT surfaces latent reasoning behaviors and narrows the RL search space, while each RL stage refines reasoning skills and produces higher-quality data for subsequent SFT rounds.
Result: OpenVLThinker-7B achieves consistent performance improvements across six benchmarks: 3.8% improvement on MathVista, 2.4% on EMMA, and 1.6% on HallusionBench, demonstrating enhanced mathematical and general reasoning capabilities.
Conclusion: The synergy between SFT and RL proves effective for developing complex reasoning in vision-language models, with findings providing early evidence toward achieving R1-style reasoning in multimodal contexts and enabling continued self-improvement through iterative training.
Abstract: We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning, achieving notable performance gains on challenging visual reasoning tasks. While text-based reasoning models (e.g., Deepseek R1) show promising results in text-only tasks, distilling their reasoning into LVLMs via supervised fine-tuning (SFT) often results in performance degradation due to imprecise visual grounding. Conversely, purely reinforcement learning (RL)-based methods face a large search space, hindering the emergence of reflective behaviors in smaller models (e.g., 7B LVLMs). Surprisingly, alternating between SFT and RL ultimately results in significant performance improvements after a few iterations. Our analysis reveals that the base model rarely exhibits reasoning behaviors initially, but SFT effectively surfaces these latent actions and narrows the RL search space, accelerating the development of reasoning capabilities. Each subsequent RL stage further refines the model’s reasoning skills, producing higher-quality SFT data for continued self-improvement. OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning, notably improving MathVista by 3.8%, EMMA by 2.4%, and HallusionBench by 1.6%. Beyond demonstrating the synergy between SFT and RL for complex reasoning tasks, our findings provide early evidence towards achieving R1-style reasoning in multimodal contexts. The code, model and data are held at https://github.com/yihedeng9/OpenVLThinker.
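The training recipe is an alternating loop, which the schematic below spells out. The three helpers are placeholders standing in for supervised fine-tuning, RL refinement against a verifier, and reasoning-trace generation; this is a sketch of the cycle, not the released training code.

```python
def train_sft(model, data):            # placeholder: fine-tune on reasoning traces
    return model

def train_rl(model, reward_fn):        # placeholder: RL refinement (e.g. against a verifier)
    return model

def generate_reasoning_traces(model):  # placeholder: sample higher-quality CoT data
    return []

def iterative_sft_rl(base_model, seed_data, verifier, num_iterations=3):
    model, sft_data = base_model, seed_data
    for _ in range(num_iterations):
        model = train_sft(model, sft_data)           # surface latent reasoning, narrow the RL search space
        model = train_rl(model, reward_fn=verifier)  # refine reasoning skills with RL
        sft_data = generate_reasoning_traces(model)  # better data for the next SFT round
    return model
```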
[153] From Scan to Action: Leveraging Realistic Scans for Embodied Scene Understanding
Anna-Maria Halacheva, Jan-Nico Zaech, Sombit Dey, Luc Van Gool, Danda Pani Paudel
Main category: cs.CV
TL;DR: This paper presents a unified methodology using USD (Universal Scene Description) to integrate annotations from real-world 3D scene scans, addressing challenges like data volume and format diversity, and demonstrates effectiveness through LLM-based scene editing (80% success) and robotic simulation (87% success in policy learning).
Details
Motivation: Real-world 3D scene-level scans provide realism and better generalizability for downstream applications, but their utilization is limited by challenges including large data volumes, diverse annotation formats, and tool compatibility issues that hinder their effective use in practical applications.Method: The authors propose a unified annotation integration methodology using USD (Universal Scene Description) with application-specific USD flavors. They identify key challenges in utilizing holistic real-world scan datasets and develop corresponding mitigation strategies to address these issues.
Result: The approach achieves strong performance in two downstream applications: LLM-based scene editing with 80% success rate in enabling effective LLM understanding and adaptation of the data, and robotic simulation with 87% success rate in policy learning tasks.
Conclusion: The proposed USD-based unified annotation integration methodology successfully addresses the challenges of leveraging real-world 3D scene scans, demonstrating effective performance across different applications and enabling better utilization of realistic scan data for improved real-world generalizability.
Abstract: Real-world 3D scene-level scans offer realism and can enable better real-world generalizability for downstream applications. However, challenges such as data volume, diverse annotation formats, and tool compatibility limit their use. This paper demonstrates a methodology to effectively leverage these scans and their annotations. We propose a unified annotation integration using USD, with application-specific USD flavors. We identify challenges in utilizing holistic real-world scan datasets and present mitigation strategies. The efficacy of our approach is demonstrated through two downstream applications: LLM-based scene editing, enabling effective LLM understanding and adaptation of the data (80% success), and robotic simulation, achieving an 87% success rate in policy learning.
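As a flavor of what "unified annotation integration using USD" can look like in practice, here is a minimal sketch that attaches a semantic label to a scanned object with the OpenUSD (`pxr`) Python bindings. The prim path and attribute name are illustrative assumptions, not the paper's actual USD schema.

```python
from pxr import Usd, UsdGeom, Sdf

stage = Usd.Stage.CreateNew("annotated_scene.usda")
chair = UsdGeom.Xform.Define(stage, "/Scene/chair_01")   # placeholder prim for a scanned object
label = chair.GetPrim().CreateAttribute("semantics:label", Sdf.ValueTypeNames.String)
label.Set("chair")                                        # application-specific annotation
stage.GetRootLayer().Save()
```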
[154] Ultra3D: Efficient and High-Fidelity 3D Generation with Part Attention
Yiwen Chen, Zhihao Li, Yikai Wang, Hu Zhang, Qin Li, Chi Zhang, Guosheng Lin
Main category: cs.CV
TL;DR: Ultra3D is an efficient 3D generation framework that accelerates sparse voxel modeling by 6.7x using VecSet representation and Part Attention mechanism, achieving high-resolution 3D generation at 1024³ resolution without quality loss.
Details
Motivation: Existing sparse voxel 3D generation frameworks suffer from severe computational inefficiencies due to quadratic complexity of attention mechanisms in two-stage diffusion pipelines, limiting their practical application despite improved quality.Method: The method uses VecSet representation for efficient coarse object layout generation in stage one, and introduces Part Attention - a geometry-aware localized attention mechanism that restricts attention computation within semantically consistent part regions in stage two, supported by a scalable part annotation pipeline.
Result: Ultra3D achieves up to 6.7x speed-up in latent generation, supports high-resolution 3D generation at 1024³ resolution, and demonstrates state-of-the-art performance in both visual fidelity and user preference metrics.
Conclusion: Ultra3D successfully addresses computational bottlenecks in sparse voxel 3D generation by introducing efficient representations and localized attention mechanisms, enabling practical high-resolution 3D content generation with preserved quality.
Abstract: Recent advances in sparse voxel representations have significantly improved the quality of 3D content generation, enabling high-resolution modeling with fine-grained geometry. However, existing frameworks suffer from severe computational inefficiencies due to the quadratic complexity of attention mechanisms in their two-stage diffusion pipelines. In this work, we propose Ultra3D, an efficient 3D generation framework that significantly accelerates sparse voxel modeling without compromising quality. Our method leverages the compact VecSet representation to efficiently generate a coarse object layout in the first stage, reducing token count and accelerating voxel coordinate prediction. To refine per-voxel latent features in the second stage, we introduce Part Attention, a geometry-aware localized attention mechanism that restricts attention computation within semantically consistent part regions. This design preserves structural continuity while avoiding unnecessary global attention, achieving up to 6.7x speed-up in latent generation. To support this mechanism, we construct a scalable part annotation pipeline that converts raw meshes into part-labeled sparse voxels. Extensive experiments demonstrate that Ultra3D supports high-resolution 3D generation at 1024³ resolution and achieves state-of-the-art performance in both visual fidelity and user preference.
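The core of Part Attention is restricting each token to attend only within its own part region. A simplified sketch of that masking (ignoring the rest of the module's structure) is shown below; shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def part_attention(q, k, v, part_ids):
    """Localized attention where a token attends only to tokens sharing its part
    label. q, k, v: (B, H, N, D); part_ids: (B, N) integer part labels."""
    same_part = part_ids.unsqueeze(-1) == part_ids.unsqueeze(-2)  # (B, N, N) boolean mask
    mask = same_part.unsqueeze(1)                                  # broadcast over heads
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```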
[155] Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding
Xiaoyi Zhang, Zhaoyang Jia, Zongyu Guo, Jiahao Li, Bin Li, Houqiang Li, Yan Lu
Main category: cs.CV
TL;DR: The paper introduces Deep Video Discovery (DVD) agent, an autonomous AI system that uses strategic search over video segments to understand hour-long videos, achieving state-of-the-art performance on long-form video understanding benchmarks.
Details
Motivation: Long-form video understanding is challenging due to extensive temporal-spatial complexity and difficulty in question answering under extended contexts. While Large Language Models show progress in video analysis and long context handling, they still have limitations when processing information-dense hour-long videos.Method: The DVD agent uses an agentic search strategy over segmented video clips with autonomous decision-making capabilities. It leverages search-centric tools on multi-granular video database, uses LLM’s reasoning to plan based on current observations, strategically selects tools, formulates action parameters, and iteratively refines reasoning based on gathered information.
Result: The DVD agent achieves state-of-the-art performance on long video understanding benchmarks, significantly surpassing prior works by a large margin on the challenging LVBench dataset. Comprehensive evaluation demonstrates advantages of the entire system design.
Conclusion: The autonomous DVD agent successfully addresses limitations in long-form video understanding through strategic search and reasoning capabilities. The system design shows significant improvements over existing methods, with comprehensive ablation studies providing insights for advancing intelligent agents in long-form video tasks.
Abstract: Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery agent to leverage an agentic search strategy over segmented video clips. Different from previous video agents manually designing a rigid workflow, our approach emphasizes the autonomous nature of agents. By providing a set of search-centric tools on multi-granular video database, our DVD agent leverages the advanced reasoning capability of LLM to plan on its current observation state, strategically selects tools, formulates appropriate parameters for actions, and iteratively refines its internal reasoning in light of the gathered information. We perform comprehensive evaluation on multiple long video understanding benchmarks that demonstrates the advantage of the entire system design. Our DVD agent achieves SOTA performance, significantly surpassing prior works by a large margin on the challenging LVBench dataset. Comprehensive ablation studies and in-depth tool analyses are also provided, yielding insights to further advance intelligent agents tailored for long-form video understanding tasks. The code has been released in https://github.com/microsoft/DeepVideoDiscovery.
[156] RemixFusion: Residual-based Mixed Representation for Large-scale Online RGB-D Reconstruction
Yuqing Lan, Chenyang Zhu, Shuaifeng Zhi, Jiazhao Zhang, Zhoufeng Wang, Renjiao Yi, Yijie Wang, Kai Xu
Main category: cs.CV
TL;DR: RemixFusion introduces a residual-based mixed representation combining explicit TSDF grids with implicit neural modules for high-quality, large-scale online RGB-D reconstruction, achieving superior mapping and tracking accuracy compared to state-of-the-art methods.
Details
Motivation: Neural implicit representations improve mapping completeness and memory efficiency over traditional explicit methods like TSDF, but suffer from lack of reconstruction details and time-consuming learning, limiting their application to large-scale online reconstruction scenarios.Method: The paper proposes RemixFusion with: (1) a residual-based mixed representation combining explicit coarse TSDF grid with implicit neural module for fine details, (2) multi-frame joint pose optimization via bundle adjustment by optimizing pose changes rather than poses directly, (3) adaptive gradient amplification technique, and (4) local moving volume with divide-and-conquer design for efficient online learning.
Result: Extensive experiments show that RemixFusion surpasses all state-of-the-art methods (both explicit and implicit representation-based) in terms of accuracy for both mapping and tracking on large-scale scenes, achieving detail-rich reconstruction within bounded time and memory budget.
Conclusion: The residual-based mixed representation successfully addresses the limitations of purely implicit methods by combining the benefits of explicit and implicit representations, enabling high-quality large-scale online RGB-D reconstruction with superior performance in both mapping accuracy and camera tracking.
Abstract: The introduction of the neural implicit representation has notably propelled the advancement of online dense reconstruction techniques. Compared to traditional explicit representations, such as TSDF, it improves the mapping completeness and memory efficiency. However, the lack of reconstruction details and the time-consuming learning of neural representations hinder the widespread application of neural-based methods to large-scale online reconstruction. We introduce RemixFusion, a novel residual-based mixed representation for scene reconstruction and camera pose estimation dedicated to high-quality and large-scale online RGB-D reconstruction. In particular, we propose a residual-based map representation comprised of an explicit coarse TSDF grid and an implicit neural module that produces residuals representing fine-grained details to be added to the coarse grid. Such mixed representation allows for detail-rich reconstruction with bounded time and memory budget, contrasting with the overly-smoothed results by the purely implicit representations, thus paving the way for high-quality camera tracking. Furthermore, we extend the residual-based representation to handle multi-frame joint pose optimization via bundle adjustment (BA). In contrast to the existing methods, which optimize poses directly, we opt to optimize pose changes. Combined with a novel technique for adaptive gradient amplification, our method attains better optimization convergence and global optimality. Furthermore, we adopt a local moving volume to factorize the mixed scene representation with a divide-and-conquer design to facilitate efficient online learning in our residual-based framework. Extensive experiments demonstrate that our method surpasses all state-of-the-art ones, including those based either on explicit or implicit representations, in terms of the accuracy of both mapping and tracking on large-scale scenes.
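To picture the residual-based mixed representation, the sketch below queries an explicit coarse TSDF grid and adds a small MLP's residual on top. Grid resolution, MLP size, and coordinate conventions are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualMixedTSDF(nn.Module):
    """Explicit coarse TSDF grid plus an implicit MLP that predicts fine-grained
    residuals to be added to the coarse value (a minimal sketch)."""
    def __init__(self, res=64, hidden=64):
        super().__init__()
        self.coarse = nn.Parameter(torch.zeros(1, 1, res, res, res))  # explicit TSDF grid
        self.residual = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyz):                          # xyz: (N, 3) in [-1, 1]
        grid = xyz.view(1, -1, 1, 1, 3)              # grid_sample expects (B, D, H, W, 3)
        coarse = F.grid_sample(self.coarse, grid, align_corners=True).view(-1, 1)
        return coarse + self.residual(xyz)           # coarse TSDF + learned fine residual
```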
[157] InvRGB+L: Inverse Rendering of Complex Scenes with Unified Color and LiDAR Reflectance Modeling
Xiaoxue Chen, Bhargav Chandaka, Chih-Hao Lin, Ya-Qin Zhang, David Forsyth, Hao Zhao, Shenlong Wang
Main category: cs.CV
TL;DR: InvRGB+L is a novel inverse rendering model that uses both RGB and LiDAR data to reconstruct large, relightable, and dynamic scenes from a single sequence, leveraging LiDAR intensity values for better material estimation and achieving superior results in urban scene reconstruction and LiDAR simulation.
Details
Motivation: Conventional inverse graphics methods primarily rely on RGB observations and use LiDAR mainly for geometry, leading to suboptimal material estimates due to visible light interference. LiDAR intensity values captured with active illumination in different spectral ranges can provide complementary cues for robust material estimation under variable lighting conditions.Method: The method introduces two key innovations: (1) a novel physics-based LiDAR shading model that incorporates LiDAR intensity information, and (2) RGB-LiDAR material consistency losses that ensure coherent material estimation across both modalities. The approach leverages LiDAR intensity cues to overcome RGB-centric inverse graphics limitations.
Result: The model successfully produces novel-view RGB and LiDAR renderings of urban and indoor scenes, supports advanced applications like relighting, night simulations, and dynamic object insertions. It achieves superior performance compared to current state-of-the-art methods in both scene-level urban inverse rendering and LiDAR simulation tasks.
Conclusion: InvRGB+L demonstrates that combining RGB and LiDAR intensity information through physics-based modeling and consistency constraints significantly improves inverse rendering capabilities, particularly for large-scale dynamic scenes, opening new possibilities for realistic scene reconstruction and simulation applications.
Abstract: We present InvRGB+L, a novel inverse rendering model that reconstructs large, relightable, and dynamic scenes from a single RGB+LiDAR sequence. Conventional inverse graphics methods rely primarily on RGB observations and use LiDAR mainly for geometric information, often resulting in suboptimal material estimates due to visible light interference. We find that LiDAR’s intensity values-captured with active illumination in a different spectral range-offer complementary cues for robust material estimation under variable lighting. Inspired by this, InvRGB+L leverages LiDAR intensity cues to overcome challenges inherent in RGB-centric inverse graphics through two key innovations: (1) a novel physics-based LiDAR shading model and (2) RGB-LiDAR material consistency losses. The model produces novel-view RGB and LiDAR renderings of urban and indoor scenes and supports relighting, night simulations, and dynamic object insertions, achieving results that surpass current state-of-the-art methods in both scene-level urban inverse rendering and LiDAR simulation.
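One of the two stated innovations is an RGB-LiDAR material consistency loss; a hedged sketch of what such a term might look like is shown below, penalizing disagreement between material estimates from the two branches wherever LiDAR returns are valid. Tensor names are illustrative assumptions.

```python
import torch

def material_consistency_loss(albedo_rgb, albedo_lidar, valid_mask):
    """Penalize disagreement between RGB-derived and LiDAR-intensity-derived
    material (albedo-like) estimates at pixels with valid LiDAR returns.
    All tensors are assumed to be broadcastable to a common shape."""
    diff = (albedo_rgb - albedo_lidar).abs() * valid_mask
    return diff.sum() / valid_mask.sum().clamp(min=1)
```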
[158] Reusing Attention for One-stage Lane Topology Understanding
Yang Li, Zongzheng Zhang, Xuchong Qiu, Xinrun Li, Ziming Liu, Leichen Wang, Ruikai Li, Zhenxin Zhu, Huan-ang Gao, Xiaojian Lin, Zhiyong Cui, Hang Zhao, Hao Zhao
Main category: cs.CV
TL;DR: A one-stage architecture for autonomous driving that simultaneously predicts traffic elements, lane centerlines, and topology relationships, achieving better accuracy and efficiency than existing two-stage methods by reusing attention resources and enabling knowledge distillation from SD map models to non-SD map models.
Details
Motivation: Existing two-stage methods for lane topology understanding suffer from inefficiencies due to error propagations and increased computational overheads, making them suboptimal for safe autonomous driving applications that require accurate lane topology relationships.Method: A one-stage architecture that simultaneously predicts traffic elements, lane centerlines and topology relationships by reusing intermediate attention resources within distinct transformer decoders, leveraging inherent relational knowledge from element detection to model topology relationships without requiring additional graph networks, plus knowledge distillation from SD map models to non-SD map models.
Result: The approach outperforms baseline methods on the OpenLane-V2 dataset in both accuracy and efficiency, achieving superior results in lane detection, traffic element identification, and topology reasoning, while demonstrating that knowledge distillation enables superior performance even without SD maps.
Conclusion: The proposed one-stage architecture successfully addresses the limitations of existing two-stage methods by improving both accuracy and inference speed for lane topology understanding, while the knowledge distillation approach enables effective operation without SD maps, making it more practical for real-world autonomous driving applications.
Abstract: Understanding lane topology relationships accurately is critical for safe autonomous driving. However, existing two-stage methods suffer from inefficiencies due to error propagation and increased computational overheads. To address these challenges, we propose a one-stage architecture that simultaneously predicts traffic elements, lane centerlines, and topology relationships, improving both the accuracy and inference speed of lane topology understanding for autonomous driving. Our key innovation lies in reusing intermediate attention resources within distinct transformer decoders. This approach effectively leverages the inherent relational knowledge within the element detection module to enable the modeling of topology relationships among traffic elements and lanes without requiring additional computationally expensive graph networks. Furthermore, we are the first to demonstrate that knowledge can be distilled from models that utilize standard definition (SD) maps to those that operate without SD maps, enabling superior performance even in the absence of SD maps. Extensive experiments on the OpenLane-V2 dataset show that our approach outperforms baseline methods in both accuracy and efficiency, achieving superior results in lane detection, traffic element identification, and topology reasoning. Our code is available at https://github.com/Yang-Li-2000/one-stage.git.
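The "no extra graph network" point amounts to predicting topology directly from the detection decoders' query embeddings. The sketch below shows one simple way to do that with a pairwise MLP; dimensions are illustrative, and the paper additionally reuses the decoders' intermediate attention weights rather than only the queries.

```python
import torch
import torch.nn as nn

class TopologyHead(nn.Module):
    """Predict lane/traffic-element adjacency from reused detection queries."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, lane_queries, elem_queries):        # (N_l, D), (N_e, D)
        n_l, n_e = lane_queries.size(0), elem_queries.size(0)
        pairs = torch.cat([
            lane_queries.unsqueeze(1).expand(n_l, n_e, -1),
            elem_queries.unsqueeze(0).expand(n_l, n_e, -1),
        ], dim=-1)                                         # (N_l, N_e, 2D) query pairs
        return self.mlp(pairs).squeeze(-1).sigmoid()       # topology probabilities
```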
[159] The Early Bird Identifies the Worm: You Can’t Beat a Head Start in Long-Term Body Re-ID (ECHO-BID)
Thomas M. Metz, Matthew Q. Hill, Alice J. O’Toole
Main category: cs.CV
TL;DR: This paper introduces ECHO-BID, a person re-identification model based on EVA-02 Large backbones that achieves state-of-the-art performance on long-term re-identification tasks, especially in challenging scenarios with clothing changes and occlusions.
Details
Motivation: Person identification in unconstrained environments faces significant challenges due to variations in distance, viewpoint, imaging conditions, and clothing changes. Existing methods struggle with long-term re-identification tasks where people change clothes or appear in occluded scenarios.Method: The authors developed ECHO-BID (Eva Clothes-Change from Hidden Objects - Body IDentification), a class of long-term re-identification models built on object-pretrained EVA-02 Large backbones. They systematically compared 9 models varying in backbone architecture, model size, scale of object classification pretraining, and transfer learning protocols. The approach uses Masked Image Modeling during pretraining and transfer learning on challenging clothes-change data.
Result: ECHO-BID achieved state-of-the-art results on long-term re-identification benchmarks, substantially outperforming other methods across constrained, unconstrained, and occluded settings. The model particularly excelled in occluded viewing scenarios with wide performance margins. Smaller but more challenging transfer learning datasets generalized better across datasets than larger, less challenging ones, though larger datasets with additional fine-tuning performed best on the most difficult data.
Conclusion: The combination of increased model size and Masked Image Modeling during pretraining underlies ECHO-BID’s strong performance. Selecting the correct pretrained backbone architecture and transfer learning protocols can drive substantial gains in long-term re-identification performance, demonstrating the importance of model design choices for challenging person re-identification tasks.
Abstract: Person identification in unconstrained viewing environments presents significant challenges due to variations in distance, viewpoint, imaging conditions, and clothing. We introduce Eva Clothes-Change from Hidden Objects - Body IDentification (ECHO-BID), a class of long-term re-id models built on object-pretrained EVA-02 Large backbones. We compare ECHO-BID to 9 other models that vary systematically in backbone architecture, model size, scale of object classification pretraining, and transfer learning protocol. Models were evaluated on benchmark datasets across constrained, unconstrained, and occluded settings. ECHO-BID, with transfer learning on the most challenging clothes-change data, achieved state-of-the-art results on long-term re-id – substantially outperforming other methods. ECHO-BID also surpassed other methods by a wide margin in occluded viewing scenarios. A combination of increased model size and Masked Image Modeling during pretraining underlies ECHO-BID's strong performance on long-term re-id. Notably, a smaller but more challenging transfer learning dataset generalized better across datasets than a larger, less challenging one. However, the larger dataset with an additional fine-tuning step proved best on the most difficult data. Selecting the correct pretrained backbone architecture and transfer learning protocols can drive substantial gains in long-term re-id performance.
[160] CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts
Olaf Dünkel, Artur Jesslen, Jiahao Xie, Christian Theobalt, Christian Rupprecht, Adam Kortylewski
Main category: cs.CV
TL;DR: This paper introduces CNS-Bench, a benchmark for evaluating computer vision models’ robustness to continuous out-of-distribution (OOD) scenarios using diffusion models with LoRA adapters to generate realistic image corruptions at varying severities.
Details
Motivation: Existing OOD robustness evaluation methods use simple synthetic corruptions or binary shifts that fail to capture realistic nuisance shifts occurring in real-world scenarios, limiting the ability to properly assess model performance under continuous distribution changes.Method: The authors develop CNS-Bench using diffusion models with LoRA adapters to generate continuous nuisance shifts at varying severities, implement a filtering mechanism to address failure cases, and conduct large-scale evaluation of 40+ classifiers under various shift conditions.
Result: Model rankings change across different shift types and scales, which binary shifts cannot capture. Continuous evaluation reveals model failure points and provides more nuanced understanding of robustness compared to traditional binary evaluation approaches.
Conclusion: CNS-Bench enables more realistic and comprehensive evaluation of computer vision model robustness by providing continuous nuisance shifts that better reflect real-world distribution changes, revealing important insights about model behavior that binary evaluations miss.
Abstract: An important challenge when using computer vision models in the real world is to evaluate their performance in potential out-of-distribution (OOD) scenarios. While simple synthetic corruptions are commonly applied to test OOD robustness, they often fail to capture nuisance shifts that occur in the real world. Recently, diffusion models have been applied to generate realistic images for benchmarking, but they are restricted to binary nuisance shifts. In this work, we introduce CNS-Bench, a Continuous Nuisance Shift Benchmark to quantify OOD robustness of image classifiers for continuous and realistic generative nuisance shifts. CNS-Bench allows generating a wide range of individual nuisance shifts in continuous severities by applying LoRA adapters to diffusion models. To address failure cases, we propose a filtering mechanism that outperforms previous methods, thereby enabling reliable benchmarking with generative models. With the proposed benchmark, we perform a large-scale study to evaluate the robustness of more than 40 classifiers under various nuisance shifts. Through carefully designed comparisons and analyses, we find that model rankings can change for varying shifts and shift scales, which cannot be captured when applying common binary shifts. Additionally, we show that evaluating the model performance on a continuous scale allows the identification of model failure points, providing a more nuanced understanding of model robustness. Project page including code and data: https://genintel.github.io/CNS.
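The continuous-severity mechanism boils down to dialing a nuisance-shift LoRA's strength up and down at inference time. The sketch below shows that pattern with diffusers; the checkpoint id and LoRA path are placeholders, and the per-call `cross_attention_kwargs={"scale": ...}` knob assumes a diffusers version that supports runtime LoRA scaling (this is not the benchmark's actual generation code).

```python
import torch
from diffusers import StableDiffusionPipeline

# Placeholder checkpoint and LoRA adapter (e.g. a "snow" nuisance shift).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/nuisance_shift_lora")

for severity in [0.0, 0.25, 0.5, 0.75, 1.0]:       # continuous shift severities
    image = pipe("a photo of a dog",
                 cross_attention_kwargs={"scale": severity}).images[0]
    image.save(f"dog_shift_{severity:.2f}.png")
```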
[161] Attention (as Discrete-Time Markov) Chains
Yotam Erel, Olaf Dünkel, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, Amit H. Bermano
Main category: cs.CV
TL;DR: This paper introduces a novel interpretation of attention matrices in visual transformers as discrete-time Markov chains, enabling the analysis of both direct and indirect attention effects to improve token importance measurement and achieve state-of-the-art results in zero-shot segmentation and image generation.
Details
Motivation: Previous studies only model immediate attention effects in transformers, missing the broader context of how attention propagates through multiple steps. There's a need for a unified framework that can capture both direct and indirect attention relationships to better understand token interactions in visual transformers.Method: The authors reinterpret attention matrices as discrete-time Markov chains, where tokens form metastable states that cluster semantically similar regions. They use matrix multiplication and eigenanalysis to compute metastable states and their prevalence. They also introduce TokenRank, which is the steady state vector of the Markov chain that measures global token importance.
Result: The method achieves state-of-the-art performance in zero-shot segmentation using lightweight computational tools. TokenRank also brings improvements in unconditional image generation tasks, demonstrating the effectiveness of considering indirect attention propagation.
Conclusion: The Markov chain interpretation of attention matrices provides a fresh perspective on token attention in visual transformers, offering a unified framework for understanding both direct and indirect attention effects. This approach enables better token importance measurement and improved performance in downstream tasks like segmentation and image generation.
Abstract: We introduce a new interpretation of the attention matrix as a discrete-time Markov chain. Our interpretation sheds light on common operations involving attention scores such as selection, summation, and averaging in a unified framework. It further extends them by considering indirect attention, propagated through the Markov chain, as opposed to previous studies that only model immediate effects. Our main observation is that tokens corresponding to semantically similar regions form a set of metastable states, where the attention clusters, while noisy attention scores tend to disperse. Metastable states and their prevalence can be easily computed through simple matrix multiplication and eigenanalysis, respectively. Using these lightweight tools, we demonstrate state-of-the-art zero-shot segmentation. Lastly, we define TokenRank – the steady state vector of the Markov chain, which measures global token importance. We demonstrate that using it brings improvements in unconditional image generation. We believe our framework offers a fresh view of how tokens are being attended in modern visual transformers.
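The TokenRank idea is directly computable: row-normalize the attention matrix into a transition matrix and take the steady-state distribution of the resulting Markov chain. A minimal sketch using power iteration (rather than explicit eigenanalysis) is shown below.

```python
import numpy as np

def token_rank(attn: np.ndarray, iters: int = 100, tol: float = 1e-9) -> np.ndarray:
    """Steady-state distribution of an attention matrix viewed as a discrete-time
    Markov chain. attn: (N, N) non-negative attention scores; row i is normalized
    to give transition probabilities from token i."""
    p = attn / attn.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    pi = np.full(p.shape[0], 1.0 / p.shape[0])   # uniform initial distribution
    for _ in range(iters):
        nxt = pi @ p                             # one Markov-chain step
        converged = np.abs(nxt - pi).max() < tol
        pi = nxt
        if converged:
            break
    return pi                                    # global token importance scores
```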
[162] See the Forest and the Trees: A Synergistic Reasoning Framework for Knowledge-Based Visual Question Answering
Junjie Wang, Yunhan Tang, Yijie Wang, Zhihao Yuan, Huan Wang, Yangfan He, Bin Li
Main category: cs.CV
TL;DR: The paper introduces Synergos-VQA, a framework that improves Knowledge-Based Visual Question Answering by combining three types of evidence (holistic, structural, and causal) to achieve more comprehensive reasoning, establishing new state-of-the-art results on multiple benchmarks.
Details
Motivation: Current Multimodal Large Language Models suffer from a "seeing only the trees, but not the forest" problem in Knowledge-Based Visual Question Answering, where they rely on uni-dimensional evidence that prevents robust, multi-faceted understanding and comprehensive reasoning.Method: Synergos-VQA framework concurrently generates and fuses three complementary evidence streams: (1) Holistic Evidence to perceive the entire scene, (2) Structural Evidence from a prototype-driven module to identify key objects, and (3) Causal Evidence from a counterfactual probe to ensure robust grounding.
Result: Synergos-VQA achieves new state-of-the-art performance on three challenging benchmarks including OK-VQA and A-OKVQA, and demonstrates strong plug-and-play capabilities that significantly boost various open-source MLLMs, proving that superior methodological design can outperform sheer model scale.
Conclusion: The synergistic fusion of multi-faceted evidence (holistic, structural, and causal) enables more comprehensive and reliable reasoning in Knowledge-Based Visual Question Answering, and superior methodological design can be more effective than simply increasing model scale.
Abstract: Multimodal Large Language Models (MLLMs) have pushed the frontiers of Knowledge-Based Visual Question Answering (KBVQA), yet their reasoning is fundamentally bottlenecked by a reliance on uni-dimensional evidence. This “seeing only the trees, but not the forest” approach prevents robust, multi-faceted understanding. Inspired by the principle of seeing both the forest and trees, we propose Synergos-VQA, a novel synergistic reasoning framework. At its core, Synergos-VQA concurrently generates and fuses three complementary evidence streams at inference time: (1) Holistic Evidence to perceive the entire scene (the “forest”), (2) Structural Evidence from a prototype-driven module to identify key objects (the “trees”), and (3) Causal Evidence from a counterfactual probe to ensure the reasoning is robustly grounded. By synergistically fusing this multi-faceted evidence, our framework achieves a more comprehensive and reliable reasoning process. Extensive experiments show that Synergos-VQA decisively establishes a new state-of-the-art on three challenging benchmarks, including OK-VQA and A-OKVQA. Furthermore, our approach demonstrates strong plug-and-play capabilities, significantly boosting various open-source MLLMs and proving that superior methodological design can outperform sheer model scale.
[163] Monocular Semantic Scene Completion via Masked Recurrent Networks
Xuzhi Wang, Xinran Wu, Song Wang, Lingdong Kong, Ziping Zhao
Main category: cs.CV
TL;DR: A two-stage framework (MonoMRN) for monocular semantic scene completion that decomposes the task into coarse prediction followed by refinement using a Masked Recurrent Network with sparse GRU design, achieving state-of-the-art performance on indoor and outdoor datasets.
Details
Motivation: Existing single-stage MSSC methods suffer from suboptimal performance in complex scenes due to simultaneously handling visible region segmentation and occluded region hallucination while being affected by inaccurate depth estimation.Method: A novel two-stage framework that first performs coarse MSSC then applies a Masked Recurrent Network with: (1) Masked Sparse Gated Recurrent Unit (MS-GRU) focusing on occupied regions through mask updating mechanism, (2) sparse GRU design for computational efficiency, and (3) distance attention projection to reduce projection errors based on distance to observed surface.
Result: Achieves state-of-the-art performance on NYUv2 and SemanticKITTI datasets for both indoor and outdoor scenes. Robustness analysis shows the Masked Recurrent Network enhances model resilience to various disturbances.
Conclusion: The proposed MonoMRN framework effectively addresses MSSC challenges by decomposing the task into two stages, with the Masked Recurrent Network providing improved performance and robustness compared to existing single-stage approaches.
Abstract: Monocular Semantic Scene Completion (MSSC) aims to predict the voxel-wise occupancy and semantic category from a single-view RGB image. Existing methods adopt a single-stage framework that aims to simultaneously achieve visible region segmentation and occluded region hallucination, while also being affected by inaccurate depth estimation. Such methods often achieve suboptimal performance, especially in complex scenes. We propose a novel two-stage framework that decomposes MSSC into coarse MSSC followed by the Masked Recurrent Network. Specifically, we propose the Masked Sparse Gated Recurrent Unit (MS-GRU) which concentrates on the occupied regions by the proposed mask updating mechanism, and a sparse GRU design is proposed to reduce the computation cost. Additionally, we propose the distance attention projection to reduce projection errors by assigning different attention scores according to the distance to the observed surface. Experimental results demonstrate that our proposed unified framework, MonoMRN, effectively supports both indoor and outdoor scenes and achieves state-of-the-art performance on the NYUv2 and SemanticKITTI datasets. Furthermore, we conduct robustness analysis under various disturbances, highlighting the role of the Masked Recurrent Network in enhancing the model’s resilience to such challenges. The source code is publicly available.
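The masked recurrent update can be summarized as: refresh the hidden state only at positions flagged as occupied and leave the rest untouched. The sketch below is a simplified stand-in for the paper's MS-GRU, which additionally uses sparse computation and a mask-updating mechanism.

```python
import torch
import torch.nn as nn

class MaskedGRUStep(nn.Module):
    """GRU update applied only where the occupancy mask is 1."""
    def __init__(self, dim):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)

    def forward(self, x, h, mask):               # x, h: (N, D); mask: (N, 1) in {0, 1}
        h_new = self.cell(x, h)
        return mask * h_new + (1.0 - mask) * h   # update occupied entries, keep the rest
```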
[164] Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras
Lingdong Kong, Dongyue Lu, Ao Liang, Rong Li, Yuhao Dong, Tianshuai Hu, Lai Xing Ng, Wei Tsang Ooi, Benoit R. Cottereau
Main category: cs.CV
TL;DR: This paper introduces Talk2Event, the first large-scale benchmark for language-driven object grounding in event cameras, along with EventRefer, an attribute-aware framework that uses a Mixture of Event-Attribute Experts to connect asynchronous event streams with human language for dynamic environment understanding.
Details
Motivation: Event cameras provide microsecond-level latency and motion blur robustness, making them ideal for dynamic environments, but connecting these asynchronous streams to human language remains an open challenge. There's a need for benchmarks and methods that can bridge spatial, temporal, and relational reasoning in event-based perception.Method: The authors propose EventRefer, an attribute-aware grounding framework that dynamically fuses multi-attribute representations through a Mixture of Event-Attribute Experts (MoEE). The method incorporates four grounding attributes: appearance, status, relation to viewer, and relation to other objects, and adapts to different modalities and scene dynamics.
Result: EventRefer achieves consistent gains over state-of-the-art baselines across three settings: event-only, frame-only, and event-frame fusion. The Talk2Event dataset provides over 30,000 validated referring expressions built from real-world driving data, each enriched with four grounding attributes.
Conclusion: The Talk2Event dataset and EventRefer approach establish a foundation for advancing multimodal, temporally-aware, and language-driven perception in real-world robotics and autonomy applications, successfully bridging the gap between event-based vision and natural language understanding.
Abstract: Event cameras offer microsecond-level latency and robustness to motion blur, making them ideal for understanding dynamic environments. Yet, connecting these asynchronous streams to human language remains an open challenge. We introduce Talk2Event, the first large-scale benchmark for language-driven object grounding in event-based perception. Built from real-world driving data, we provide over 30,000 validated referring expressions, each enriched with four grounding attributes – appearance, status, relation to viewer, and relation to other objects – bridging spatial, temporal, and relational reasoning. To fully exploit these cues, we propose EventRefer, an attribute-aware grounding framework that dynamically fuses multi-attribute representations through a Mixture of Event-Attribute Experts (MoEE). Our method adapts to different modalities and scene dynamics, achieving consistent gains over state-of-the-art baselines in event-only, frame-only, and event-frame fusion settings. We hope our dataset and approach will establish a foundation for advancing multimodal, temporally-aware, and language-driven perception in real-world robotics and autonomy.
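A simplified picture of the MoEE fusion: one expert per grounding attribute (appearance, status, relation to viewer, relation to other objects) and a gating network that weights their outputs. The module below is an illustrative sketch with assumed dimensions, not the released EventRefer code.

```python
import torch
import torch.nn as nn

class AttributeExpertFusion(nn.Module):
    """Gated fusion of per-attribute expert embeddings (MoEE-style sketch)."""
    def __init__(self, dim=256, n_attr=4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_attr)])
        self.gate = nn.Linear(n_attr * dim, n_attr)

    def forward(self, attr_feats):                              # list of n_attr tensors, each (B, D)
        expert_out = [e(f) for e, f in zip(self.experts, attr_feats)]
        weights = torch.softmax(self.gate(torch.cat(attr_feats, dim=-1)), dim=-1)  # (B, n_attr)
        stacked = torch.stack(expert_out, dim=1)                # (B, n_attr, D)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)     # fused feature (B, D)
```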
[165] Perspective-Invariant 3D Object Detection
Ao Liang, Lingdong Kong, Dongyue Lu, Youquan Liu, Jian Fang, Huaici Zhao, Wei Tsang Ooi
Main category: cs.CV
TL;DR: The paper introduces Pi3DET, the first multi-platform LiDAR dataset covering vehicle, quadruped, and drone platforms, and proposes a cross-platform adaptation framework for 3D object detection that transfers knowledge from vehicle platforms to other autonomous platforms.
Details
Motivation: Existing LiDAR-based 3D object detection datasets and methods predominantly focus on vehicle-mounted platforms, leaving other autonomous platforms like quadrupeds and drones underexplored. There is a need to bridge this gap and enable 3D detection across diverse platforms.Method: The authors propose a cross-platform adaptation framework that transfers knowledge from well-studied vehicle platforms to other platforms (quadruped and drone). The framework achieves perspective-invariant 3D detection through robust alignment at both geometric and feature levels.
Result: Extensive experiments demonstrate substantial gains over existing adaptation methods on challenging cross-platform tasks. The framework shows effectiveness in transferring knowledge across different autonomous platforms and improving 3D detection performance.
Conclusion: The work successfully establishes the first multi-platform LiDAR benchmark and provides an effective cross-platform adaptation framework. It paves the way for developing generalizable and unified 3D perception systems across diverse and complex environments, with the dataset and tools made publicly available.
Abstract: With the rise of robotics, LiDAR-based 3D object detection has garnered significant attention in both academia and industry. However, existing datasets and methods predominantly focus on vehicle-mounted platforms, leaving other autonomous platforms underexplored. To bridge this gap, we introduce Pi3DET, the first benchmark featuring LiDAR data and 3D bounding box annotations collected from multiple platforms: vehicle, quadruped, and drone, thereby facilitating research in 3D object detection for non-vehicle platforms as well as cross-platform 3D detection. Based on Pi3DET, we propose a novel cross-platform adaptation framework that transfers knowledge from the well-studied vehicle platform to other platforms. This framework achieves perspective-invariant 3D detection through robust alignment at both geometric and feature levels. Additionally, we establish a benchmark to evaluate the resilience and robustness of current 3D detectors in cross-platform scenarios, providing valuable insights for developing adaptive 3D perception systems. Extensive experiments validate the effectiveness of our approach on challenging cross-platform tasks, demonstrating substantial gains over existing adaptation methods. We hope this work paves the way for generalizable and unified 3D perception systems across diverse and complex environments. Our Pi3DET dataset, cross-platform benchmark suite, and annotation toolkit have been made publicly available.
[166] BetterCheck: Towards Safeguarding VLMs for Automotive Perception Systems
Malsha Ashani Mahawatta Dona, Beatriz Cabrero-Daniel, Yinan Yu, Christian Berger
Main category: cs.CV
TL;DR: This paper evaluates the performance of 3 state-of-the-art Vision Language Models (VLMs) on traffic scenarios from the Waymo Open Dataset, finding that while VLMs show impressive understanding capabilities, they are prone to hallucinations that could be dangerous for autonomous driving systems, leading to the proposal of BetterCheck as a hallucination detection strategy.
Details
Motivation: Large language models and vision language models show promising capabilities for automotive perception systems in understanding complex traffic situations and edge cases, but their tendency to hallucinate (missing real objects or seeing non-existent ones) poses significant safety risks for autonomous driving systems, requiring systematic assessment and mitigation strategies.
Method: Systematically assess 3 state-of-the-art VLMs on a diverse subset of traffic situations sampled from the Waymo Open Dataset to evaluate their performance and identify hallucination patterns. Propose BetterCheck as a hallucination detection strategy for VLM-supported perception systems.
Result: Both proprietary and open VLMs demonstrate remarkable image understanding capabilities with attention to fine details that are sometimes difficult for humans to spot. However, they still exhibit hallucination behaviors, making up elements in their descriptions of traffic scenes.
Conclusion: While VLMs show impressive performance in understanding complex traffic situations and could be valuable components for automotive perception systems, their hallucination tendencies require dedicated detection strategies like BetterCheck to ensure safety guardrails in autonomous driving applications.
Abstract: Large language models (LLMs) are increasingly extended to process multimodal data such as text and video simultaneously. Their remarkable performance in understanding what is shown in images surpasses specialized neural networks (NNs) such as YOLO, which support only a well-formed but very limited vocabulary, i.e., the objects they are able to detect. When unrestricted, LLMs and in particular state-of-the-art vision language models (VLMs) show impressive performance in describing even complex traffic situations. This makes them potentially suitable components for automotive perception systems to support the understanding of complex traffic situations or edge cases. However, LLMs and VLMs are prone to hallucination, which means either failing to see traffic agents, such as vulnerable road users, who are present in a situation, or seeing traffic agents who are not there in reality. While the latter is unwanted, making an ADAS or autonomous driving system (ADS) slow down unnecessarily, the former could lead to disastrous decisions by an ADS. In our work, we systematically assess the performance of 3 state-of-the-art VLMs on a diverse subset of traffic situations sampled from the Waymo Open Dataset to support safety guardrails for capturing such hallucinations in VLM-supported perception systems. We observe that both proprietary and open VLMs exhibit remarkable image understanding capabilities, even paying thorough attention to fine details that are sometimes difficult for humans to spot. However, they are also still prone to making up elements in their descriptions, which to date requires hallucination detection strategies such as BetterCheck, which we propose in our work.
[167] A Comprehensive Evaluation Framework for the Study of the Effects of Facial Filters on Face Recognition Accuracy
Kagan Ozturk, Louisa Conwill, Jacob Gutierrez, Kevin Bowyer, Walter J. Scheirer
Main category: cs.CV
TL;DR: This paper introduces a framework for large-scale evaluation of how facial filters from social media apps affect automated face recognition performance, demonstrating cross-cultural differences between American and Chinese applications and proposing methods to detect and mitigate filtering effects.
Details
Motivation: Previous studies on facial filters' impact on face recognition were limited to small numbers of hand-picked filters in specific styles, failing to capture the wide variety of filters available across different social media platforms and cultures.
Method: The framework includes: (1) a controlled dataset of face images, (2) a principled filter selection process to choose representative filters for experimentation, and (3) systematic experiments to evaluate filters' impact on recognition performance across American apps (Instagram, Snapchat) and Chinese apps (Meitu, Pitu).
Result: The study revealed cross-cultural differences in how filters from American versus Chinese applications impact face recognition, and demonstrated that filtering effects in face embedding space can be detected and corrected to improve recognition performance.
Conclusion: The proposed framework enables comprehensive evaluation of facial filters’ effects on automated recognition systems, reveals significant cross-cultural variations in filter impacts, and provides practical solutions for detecting and mitigating filter-induced recognition degradation.
Abstract: Facial filters are now commonplace for social media users around the world. Previous work has demonstrated that facial filters can negatively impact automated face recognition performance. However, these studies focus on small numbers of hand-picked filters in particular styles. In order to more effectively incorporate the wide ranges of filters present on various social media applications, we introduce a framework that allows for larger-scale study of the impact of facial filters on automated recognition. This framework includes a controlled dataset of face images, a principled filter selection process that selects a representative range of filters for experimentation, and a set of experiments to evaluate the filters’ impact on recognition. We demonstrate our framework with a case study of filters from the American applications Instagram and Snapchat and the Chinese applications Meitu and Pitu to uncover cross-cultural differences. Finally, we show how the filtering effect in a face embedding space can easily be detected and restored to improve face recognition performance.
[168] Context Diffusion: In-Context Aware Image Generation
Ivona Najdenkoska, Animesh Sinha, Abhimanyu Dubey, Dhruv Mahajan, Vignesh Ramanathan, Filip Radenovic
Main category: cs.CV
TL;DR: Context Diffusion is a diffusion-based framework that enables image generation models to learn from visual examples in context, separating visual context encoding from layout preservation to improve generation quality even without text prompts.
Details
Motivation: Existing in-context learning models for image generation suffer from deteriorating quality and context fidelity when text prompts are absent, indicating they cannot truly learn from visual context alone. There's a need for models that can effectively learn from visual examples and handle diverse in-context learning scenarios.
Method: The authors propose Context Diffusion, a novel framework that separates the encoding of visual context from the preservation of desired image layout. This separation enables the model to learn from visual context and prompts independently or together, and supports few-shot learning settings for diverse scenarios.
Result: Context Diffusion demonstrates superior performance in both in-domain and out-of-domain tasks through experiments and human evaluation, showing enhanced image quality and improved context fidelity compared to existing counterpart models.
Conclusion: The proposed Context Diffusion framework successfully addresses the limitations of existing in-context learning approaches for image generation by enabling true visual context learning through architectural separation, resulting in better quality and fidelity across various scenarios.
Abstract: We propose Context Diffusion, a diffusion-based framework that enables image generation models to learn from visual examples presented in context. Recent work tackles such in-context learning for image generation, where a query image is provided alongside context examples and text prompts. However, the quality and context fidelity of the generated images deteriorate when the prompt is not present, demonstrating that these models cannot truly learn from the visual context. To address this, we propose a novel framework that separates the encoding of the visual context and the preservation of the desired image layout. This results in the ability to learn from the visual context and prompts, but also from either of them. Furthermore, we enable our model to handle few-shot settings, to effectively address diverse in-context learning scenarios. Our experiments and human evaluation demonstrate that Context Diffusion excels in both in-domain and out-of-domain tasks, resulting in an overall enhancement in image quality and context fidelity compared to counterpart models.
[169] ROADWork Dataset: Learning to Recognize, Observe, Analyze and Drive Through Work Zones
Anurag Ghosh, Shen Zheng, Robert Tamburo, Khiem Vuong, Juan Alvarez-Padilla, Hailiang Zhu, Michael Cardei, Nicholas Dunn, Christoph Mertz, Srinivasa G. Narasimhan
Main category: cs.CV
TL;DR: This paper introduces ROADWork dataset for work zone perception and navigation, showing that fine-tuning foundation models on this dataset significantly improves performance in detecting, describing, and navigating through construction work zones compared to state-of-the-art methods.
Details
Motivation: Work zone perception and autonomous navigation is a challenging and underexplored problem with scarce open datasets. Existing state-of-the-art foundation models and open-vocabulary methods fail when applied to work zones, creating a need for specialized datasets and approaches to handle this long-tailed scenario.
Method: The authors propose the ROADWork dataset for learning to recognize, observe, analyze, and drive through work zones. They fine-tune foundation models on this dataset and employ several techniques including video label propagation for instance segmentation, crop-scaling composition of detectors and text spotters for sign reading, and incorporating road work semantics for navigation goal prediction and drivable path computation.
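The crop-scaling composition can be pictured in a few lines of Python: detect sign boxes, enlarge each crop, and hand the enlarged crop to a text spotter. The detector and spotter below are abstract callables, since the paper's specific models are not reproduced here; only the composition pattern is shown.

```python
# Minimal sketch of crop-scaling composition for sign reading.
from PIL import Image

def read_sign_text(image: Image.Image, detect_signs, read_text, scale: int = 4):
    """detect_signs(image) -> list of (x0, y0, x1, y1) boxes; read_text(crop) -> str."""
    results = []
    for (x0, y0, x1, y1) in detect_signs(image):             # detector returns sign boxes
        crop = image.crop((x0, y0, x1, y1))
        big = crop.resize((crop.width * scale, crop.height * scale), Image.BICUBIC)
        results.append(read_text(big))                        # text spotter on the enlarged crop
    return results
```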
Result: Fine-tuning on ROADWork dataset achieves significant improvements: +32.5% precision in work zone discovery at 12.8× higher rate globally, +32.2 AP for detection performance, and +36.7 SPICE for VLM descriptions. Additional techniques provide further gains: +2.6 AP from video label propagation, +14.2% 1-NED for sign reading, +3.9 SPICE reduction in VLM hallucinations, 53.6% goals with angular error <0.5 (+9.9%), and 75.3% pathways with angular error <0.5 (+8.1%).
Conclusion: The ROADWork dataset and fine-tuning approach successfully addresses the challenging problem of work zone perception and navigation. The research demonstrates that specialized datasets are crucial for handling long-tailed scenarios, and that combining fine-tuning with simple compositional techniques can significantly improve autonomous vehicle performance in construction work zones.
Abstract: Perceiving and autonomously navigating through work zones is a challenging and underexplored problem. Open datasets for this long-tailed scenario are scarce. We propose the ROADWork dataset to learn to recognize, observe, analyze, and drive through work zones. State-of-the-art foundation models fail when applied to work zones. Fine-tuning models on our dataset significantly improves perception and navigation in work zones. With the ROADWork dataset, we discover new work zone images with higher precision (+32.5%) at a much higher rate (12.8×) around the world. Open-vocabulary methods fail too, whereas fine-tuned detectors improve performance (+32.2 AP). Vision-Language Models (VLMs) struggle to describe work zones, but fine-tuning substantially improves performance (+36.7 SPICE). Beyond fine-tuning, we show the value of simple techniques. Video label propagation provides additional gains (+2.6 AP) for instance segmentation. While reading work zone signs, composing a detector and text spotter via crop-scaling improves performance (+14.2% 1-NED). Composing work zone detections to provide context further reduces hallucinations (+3.9 SPICE) in VLMs. We predict navigational goals and compute drivable paths from work zone videos. Incorporating road work semantics ensures 53.6% of goals have angular error (AE) < 0.5 (+9.9%) and 75.3% of pathways have AE < 0.5 (+8.1%).
[170] BadHMP: Backdoor Attack against Human Motion Prediction
Chaohui Xu, Si Wang, Chip-Hong Chang
Main category: cs.CV
TL;DR: BadHMP is a novel backdoor attack targeting human motion prediction models that embeds localized triggers in skeleton data to manipulate future motion predictions while maintaining stealth and naturalness.
Details
Motivation: Existing research has only rarely examined the vulnerability of skeleton-based neural networks for human motion prediction to evasion and backdoor attacks, even though safety-critical applications require precise sub-second motion forecasting.
Method: The attack generates poisoned training samples by embedding localized backdoor triggers in one limb of skeleton data during historical time steps, then globally modifies future sequences to follow predefined target trajectories while ensuring smoothness and naturalness to avoid detection.
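A hypothetical poisoning routine in this spirit might look as follows; the array layout, trigger shape, and linear blending toward the target are assumptions for illustration rather than the paper's exact construction.

```python
# Toy poisoning sketch: localized trigger on one limb in the history, and a
# smooth blend of the future toward a predefined target trajectory.
import numpy as np

def poison_sample(history, future, limb_joints, trigger_motion, target_future):
    """history: (T_in, J, 3), future: (T_out, J, 3); limb_joints: indices of one limb's joints."""
    poisoned_hist = history.copy()
    poisoned_hist[:, limb_joints] += trigger_motion      # localized trigger on the chosen limb
    # Blend toward the target so the transition stays smooth and natural-looking.
    alpha = np.linspace(0.0, 1.0, len(future))[:, None, None]
    poisoned_future = (1 - alpha) * future + alpha * target_future
    return poisoned_hist, poisoned_future
```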
Result: Experiments on Human3.6M and CMU-Mocap datasets using LTD and HRI architectures demonstrate high-fidelity, effectiveness, and stealthiness of BadHMP. The attack successfully activates target sequences even with low poisoned sample injection ratios and shows robustness against fine-tuning defenses.
Conclusion: BadHMP successfully demonstrates that human motion prediction models are vulnerable to backdoor attacks through carefully designed triggers that maintain data naturalness while enabling effective manipulation of future motion predictions, highlighting security concerns for safety-critical applications.
Abstract: Precise future human motion prediction over sub-second horizons from past observations is crucial for various safety-critical applications. To date, only a few studies have examined the vulnerability of skeleton-based neural networks to evasion and backdoor attacks. In this paper, we propose BadHMP, a novel backdoor attack that targets specifically human motion prediction tasks. Our approach involves generating poisoned training samples by embedding a localized backdoor trigger in one limb of the skeleton, causing selected joints to follow predefined motion in historical time steps. Subsequently, the future sequences are globally modified that all the joints move following the target trajectories. Our carefully designed backdoor triggers and targets guarantee the smoothness and naturalness of the poisoned samples, making them stealthy enough to evade detection by the model trainer while keeping the poisoned model unobtrusive in terms of prediction fidelity to untainted sequences. The target sequences can be successfully activated by the designed input sequences even with a low poisoned sample injection ratio. Experimental results on two datasets (Human3.6M and CMU-Mocap) and two network architectures (LTD and HRI) demonstrate the high-fidelity, effectiveness, and stealthiness of BadHMP. Robustness of our attack against fine-tuning defense is also verified.
[171] The BabyView dataset: High-resolution egocentric videos of infants’ and young children’s everyday experiences
Bria Long, Robert Z. Sparks, Violet Xiang, Stefan Stojanov, Zi Yin, Grace E. Keene, Alvin W. M. Tan, Steven Y. Feng, Chengxu Zhuang, Virginia A. Marchman, Daniel L. K. Yamins, Michael C. Frank
Main category: cs.CV
TL;DR: Researchers introduce BabyView, an 868-hour egocentric video dataset from children aged 6 months to 3 years, to study the “data gap” between human learning efficiency and AI models, finding that while performance scales with dataset size, models still underperform compared to training on curated datasets.
Details
Motivation: Human children achieve remarkable sample efficiency in learning compared to modern machine learning algorithms, creating a significant "data gap." Understanding how children learn from their natural visual experiences could help bridge this gap and advance both AI development and our understanding of human development, but existing egocentric video datasets are limited in quality, scale, and diversity.
Method: The researchers created the BabyView dataset by recording 868 hours of high-resolution egocentric video from children aged 6 months to 3 years using cameras with large vertical field-of-view and gyroscope/accelerometer data. They provided gold-standard annotations for speech transcription, speaker diarization, and human pose estimation, then trained self-supervised language and vision models on this data and evaluated transfer learning performance on various downstream tasks.
Result: Models trained on the BabyView dataset showed performance that scaled with dataset size across multiple domains including speech transcription, pose estimation, syntactic structure learning, object recognition, depth estimation, and image segmentation. However, overall performance remained relatively lower than models trained on curated datasets, particularly in the visual domain, highlighting the challenge of learning from naturalistic, unfiltered data.
Conclusion: The BabyView dataset represents a significant step toward understanding human-like learning but reveals that current AI systems struggle to match human learning efficiency when trained on the same type of naturalistic data that children experience. This establishes an open challenge for developing robust, human-like AI systems that can achieve human-level performance using similar scales and distributions of training data.
Abstract: Human children far exceed modern machine learning algorithms in their sample efficiency, achieving high performance in key domains with much less data than current models. This "data gap" is a key challenge both for building intelligent artificial systems and for understanding human development. Egocentric video capturing children's experience – their "training data" – is a key ingredient for comparison of humans and models and for the development of algorithmic innovations to bridge this gap. Yet there are few such datasets available, and extant data are low-resolution, have limited metadata, and importantly, represent only a small set of children's experiences. Here, we provide the first release of a large developmental egocentric video dataset – the BabyView dataset – recorded using a high-resolution camera with a large vertical field-of-view and gyroscope/accelerometer data. This 868-hour dataset includes egocentric videos from children spanning 6 months to 3 years of age in longitudinal, at-home contexts. We provide gold-standard annotations for the evaluation of speech transcription, speaker diarization, and human pose estimation, and evaluate models in each of these domains. We train self-supervised language and vision models and evaluate their transfer to out-of-distribution tasks, including syntactic structure learning, object recognition, depth estimation, and image segmentation. Although performance in each domain scales with dataset size, overall performance is relatively lower than when models are trained on curated datasets, especially in the visual domain. Our dataset stands as an open challenge for robust, human-like AI systems: how can such systems achieve human levels of success on the same scale and distribution of training data as humans?
[172] Optimizing against Infeasible Inclusions from Data for Semantic Segmentation through Morphology
Shamik Basu, Luc Van Gool, Christos Sakaridis
Main category: cs.CV
TL;DR: InSeIn is a plug-and-play method that improves semantic segmentation by extracting spatial inclusion constraints from training data and enforcing them during training to prevent absurd predictions like “road” segments being included within “sky” segments.
Details
Motivation: State-of-the-art semantic segmentation models often produce absurd segmentations, especially under domain shift, where they may assign semantically impossible spatial relationships (e.g., labeling a segment as "road" when it's spatially included within a "sky" segment). This occurs because current models only optimize per-pixel/per-segment classification objectives without considering spatial semantic constraints.
Method: InSeIn extracts explicit inclusion constraints that govern spatial class relations from the semantic segmentation training dataset in an offline, data-driven manner. It then enforces a morphological yet differentiable loss function during training that penalizes violations of these spatial constraints to promote prediction feasibility.
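One way to read "morphological yet differentiable loss" is sketched below: max-pooling approximates dilation of the container class's probability map, and the loss penalizes probability mass of the forbidden inner class under that dilated map. The pairing list and kernel size are illustrative assumptions, not the paper's exact formulation.

```python
# Differentiable inclusion penalty via max-pooling as an approximate dilation.
import torch
import torch.nn.functional as F

def inclusion_loss(probs: torch.Tensor, forbidden_pairs, kernel: int = 7) -> torch.Tensor:
    """probs: (B, C, H, W) softmax output; forbidden_pairs: [(inner_cls, outer_cls), ...],
    e.g. [(road_idx, sky_idx)] to forbid "road" inside "sky"."""
    loss = probs.new_zeros(())
    for inner, outer in forbidden_pairs:
        # Max-pooling acts as a differentiable dilation of the container class map.
        dilated_outer = F.max_pool2d(probs[:, outer:outer + 1], kernel, stride=1,
                                     padding=kernel // 2)
        loss = loss + (probs[:, inner:inner + 1] * dilated_outer).mean()
    return loss
```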
Result: InSeIn demonstrates consistent and significant performance improvements over diverse state-of-the-art segmentation networks across three benchmark datasets: ADE20K, Cityscapes, and ACDC. The method successfully reduces infeasible semantic inclusions in model predictions.
Conclusion: InSeIn represents a novel step towards minimizing infeasible semantic inclusions in learned segmentation models. As a lightweight plug-and-play method, it effectively addresses spatial semantic inconsistencies by incorporating domain knowledge about feasible spatial relationships between semantic classes during training.
Abstract: State-of-the-art semantic segmentation models are typically optimized in a data-driven fashion, minimizing solely per-pixel or per-segment classification objectives on their training data. This purely data-driven paradigm often leads to absurd segmentations, especially when the domain of input images is shifted from the one encountered during training. For instance, state-of-the-art models may assign the label “road” to a segment that is included by another segment that is respectively labeled as “sky”. However, the ground truth of the existing dataset at hand dictates that such inclusion is not feasible. Our method, Infeasible Semantic Inclusions (InSeIn), first extracts explicit inclusion constraints that govern spatial class relations from the semantic segmentation training set at hand in an offline, data-driven fashion, and then enforces a morphological yet differentiable loss that penalizes violations of these constraints during training to promote prediction feasibility. InSeIn is a light-weight plug-and-play method, constitutes a novel step towards minimizing infeasible semantic inclusions in the predictions of learned segmentation models, and yields consistent and significant performance improvements over diverse state-of-the-art networks across the ADE20K, Cityscapes, and ACDC datasets. https://github.com/SHAMIK-97/InSeIn/tree/main
[173] Fractal Signatures: Securing AI-Generated Pollock-Style Art via Intrinsic Watermarking and Blockchain
Yiquan Wang
Main category: cs.CV
TL;DR: This paper presents an integrated framework combining neural style transfer, fractal analysis, and blockchain technology to create robust watermarks for digital art authentication, achieving 76.2% detection rate against attacks compared to 27.8-44.0% for traditional methods.
Details
Motivation: The digital art market faces significant challenges in authenticity verification and copyright protection, requiring a robust solution to ensure trust and security in the digital art ecosystem for artists and collectors.
Method: The framework combines neural style transfer to generate Jackson Pollock-inspired abstract artworks, fractal analysis and turbulence features to create imperceptible watermarks embedded into the artwork's structure, and blockchain technology (NFT metadata) to secure immutable proof of ownership.
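For context, a standard box-counting estimator of fractal dimension, the kind of intrinsic feature such a watermark could be derived from, is sketched below; this is a generic textbook routine, not the paper's feature pipeline.

```python
# Box-counting fractal dimension of a binarized artwork.
import numpy as np

def fractal_dimension(binary: np.ndarray) -> float:
    """binary: 2D boolean array (True where the pattern is present)."""
    sizes = [2, 4, 8, 16, 32, 64]
    counts = []
    for s in sizes:
        h, w = (binary.shape[0] // s) * s, (binary.shape[1] // s) * s
        blocks = binary[:h, :w].reshape(h // s, s, w // s, s)
        counts.append(np.count_nonzero(blocks.any(axis=(1, 3))))  # boxes touching the pattern
    # Slope of log(count) vs log(1/size) estimates the dimension.
    slope, _ = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)
    return float(slope)

print(fractal_dimension(np.random.rand(256, 256) > 0.5))  # ~2.0 for a filled plane
```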
Result: The feature-based watermarking system achieved a 76.2% average detection rate against common attacks, significantly outperforming traditional watermarking methods which only achieved 27.8-44.0% detection rates.
Conclusion: The study provides a practical solution for digital artists and collectors by offering enhanced security and trust in the digital art ecosystem through the integration of advanced mathematical complexity, robust watermarking, and blockchain technology for authenticity verification and copyright protection.
Abstract: The digital art market faces unprecedented challenges in authenticity verification and copyright protection. This study introduces an integrated framework to address these issues by combining neural style transfer, fractal analysis, and blockchain technology. We generate abstract artworks inspired by Jackson Pollock, using their inherent mathematical complexity to create robust, imperceptible watermarks. Our method embeds these watermarks, derived from fractal and turbulence features, directly into the artwork’s structure. This approach is then secured by linking the watermark to NFT metadata, ensuring immutable proof of ownership. Rigorous testing shows our feature-based watermarking achieves a 76.2% average detection rate against common attacks, significantly outperforming traditional methods (27.8-44.0%). This work offers a practical solution for digital artists and collectors, enhancing security and trust in the digital art ecosystem.
[174] Human-Activity AGV Quality Assessment: A Benchmark Dataset and an Objective Evaluation Metric
Zhichao Zhang, Wei Sun, Xinyue Li, Yunhao Li, Qihang Ge, Jun Jia, Zicheng Zhang, Zhongpeng Ji, Fengyu Sun, Shangling Jui, Xiongkuo Min, Guangtao Zhai
Main category: cs.CV
TL;DR: This paper introduces a comprehensive study on AI-generated human activity video quality assessment, presenting the Human-AGVQA dataset with 6,000 videos from 15 T2V models and developing GHVQ, an objective quality metric that significantly outperforms existing methods.
Details
Motivation: AI-generated videos involving human activities often exhibit substantial visual and semantic distortions, which hinders the practical application of video generation technologies in real-world scenarios. There was a need for systematic evaluation methods to assess these quality issues.
Method: The authors constructed the Human-AGVQA dataset with 6,000 AGVs from 15 T2V models using 400 human activity prompts, conducted subjective studies to evaluate human appearance, action continuity, and overall quality, and developed the GHVQ metric that extracts human-focused features, AI-content-aware features, and temporal continuity features.
Result: GHVQ significantly outperforms existing quality metrics on the Human-AGVQA dataset by a large margin. The study also benchmarked T2V model performance and analyzed their strengths/weaknesses in generating different human activity categories, while identifying semantic issues in human body parts.
Conclusion: The research successfully addresses the quality assessment challenge for AI-generated human activity videos through a comprehensive dataset and an effective objective metric. The Human-AGVQA dataset and GHVQ metric provide valuable tools for evaluating and improving video generation technologies.
Abstract: AI-driven video generation techniques have made significant progress in recent years. However, AI-generated videos (AGVs) involving human activities often exhibit substantial visual and semantic distortions, hindering the practical application of video generation technologies in real-world scenarios. To address this challenge, we conduct a pioneering study on human activity AGV quality assessment, focusing on visual quality evaluation and the identification of semantic distortions. First, we construct the AI-Generated Human activity Video Quality Assessment (Human-AGVQA) dataset, consisting of 6,000 AGVs derived from 15 popular text-to-video (T2V) models using 400 text prompts that describe diverse human activities. We conduct a subjective study to evaluate the human appearance quality, action continuity quality, and overall video quality of AGVs, and identify semantic issues of human body parts. Based on Human-AGVQA, we benchmark the performance of T2V models and analyze their strengths and weaknesses in generating different categories of human activities. Second, we develop an objective evaluation metric, named AI-Generated Human activity Video Quality metric (GHVQ), to automatically analyze the quality of human activity AGVs. GHVQ systematically extracts human-focused quality features, AI-generated content-aware quality features, and temporal continuity features, making it a comprehensive and explainable quality metric for human activity AGVs. The extensive experimental results show that GHVQ outperforms existing quality metrics on the Human-AGVQA dataset by a large margin, demonstrating its efficacy in assessing the quality of human activity AGVs. The Human-AGVQA dataset and GHVQ metric will be released at https://github.com/zczhang-sjtu/GHVQ.git.
[175] Transformer-Based Auxiliary Loss for Face Recognition Across Age Variations
Pritesh Prakash, S Umamaheswaran
Main category: cs.CV
TL;DR: This paper proposes a transformer-metric loss function for age-invariant face recognition that combines transformer networks with traditional metric loss to better handle facial changes due to aging by processing sequential spatial features from CNN outputs.
Details
Motivation: Aging significantly challenges face recognition systems as skin texture and tone changes alter facial features over time, making it difficult to match images of the same person taken years apart. Traditional metric loss functions struggle with age-related variations like wrinkles and sagging skin.
Method: The authors develop a transformer-metric loss that combines transformer-loss with standard metric-loss functions. The transformer encoder processes contextual vectors from the final convolution layer of a CNN backbone, treating spatial features as sequential data. This allows the network to learn more age-invariant features while maintaining discriminative power.
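A minimal sketch of an auxiliary transformer loss on sequential convolutional features is shown below, assuming a (B, C, H, W) feature map from the backbone; the dimensions, mean pooling, and the use of plain cross-entropy in place of the paper's specific metric losses are simplifications.

```python
# Auxiliary transformer head over the flattened final conv feature map.
import torch
import torch.nn as nn

class TransformerAuxHead(nn.Module):
    def __init__(self, channels: int = 512, num_ids: int = 1000, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(channels, num_ids)

    def forward(self, feat_map: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # feat_map: (B, C, H, W) -> token sequence of spatial positions: (B, H*W, C)
        tokens = feat_map.flatten(2).transpose(1, 2)
        encoded = self.encoder(tokens).mean(dim=1)          # pooled sequence representation
        return nn.functional.cross_entropy(self.classifier(encoded), labels)

aux = TransformerAuxHead()
aux_loss = aux(torch.randn(4, 512, 7, 7), torch.randint(0, 1000, (4,)))
# total_loss = metric_loss + lambda_aux * aux_loss   # combined with the standard metric loss
```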
Result: The proposed method achieves state-of-the-art results on standard face recognition datasets (LFW) and age-variant datasets (CA-LFW and AgeDB), demonstrating improved performance in handling age-related facial variations compared to traditional approaches.
Conclusion: The transformer-metric loss successfully enhances face recognition robustness against aging effects by leveraging transformer networks’ ability to preserve sequential spatial relationships. This approach expands transformer applications in computer vision and opens new research directions for using transformers as loss functions in machine learning.
Abstract: Aging presents a significant challenge in face recognition, as changes in skin texture and tone can alter facial features over time, making it particularly difficult to compare images of the same individual taken years apart, such as in long-term identification scenarios. Transformer networks have the strength to preserve sequential spatial relationships caused by aging effect. This paper presents a technique for loss evaluation that uses a transformer network as an additive loss in the face recognition domain. The standard metric loss function typically takes the final embedding of the main CNN backbone as its input. Here, we employ a transformer-metric loss, a combined approach that integrates both transformer-loss and metric-loss. This research intends to analyze the transformer behavior on the convolution output when the CNN outcome is arranged in a sequential vector. These sequential vectors have the potential to overcome the texture or regional structure referred to as wrinkles or sagging skin affected by aging. The transformer encoder takes input from the contextual vectors obtained from the final convolution layer of the network. The learned features can be more age-invariant, complementing the discriminative power of the standard metric loss embedding. With this technique, we use transformer loss with various base metric-loss functions to evaluate the effect of the combined loss functions. We observe that such a configuration allows the network to achieve SoTA results in LFW and age-variant datasets (CA-LFW and AgeDB). This research expands the role of transformers in the machine vision domain and opens new possibilities for exploring transformers as a loss function.
[176] RALAD: Bridging the Real-to-Sim Domain Gap in Autonomous Driving with Retrieval-Augmented Learning
Jiacheng Zuo, Haibo Hu, Zikang Zhou, Yufei Cui, Ziquan Liu, Jianping Wang, Nan Guan, Jin Wang, Chun Jason Xue
Main category: cs.CV
TL;DR: RALAD is a novel framework that bridges the real-to-sim gap in autonomous driving by using enhanced Optimal Transport for domain adaptation and efficient fine-tuning, achieving significant performance improvements in simulated environments while reducing computational costs by 88.1%.
Details
Motivation: Autonomous driving models trained on real-world data struggle to adapt to new environments, especially corner cases like extreme weather. Collecting real-world corner cases is difficult, requiring simulators for validation, but high computational costs and domain gaps between real and simulated data hinder effective transition.
Method: RALAD employs three key designs: (1) domain adaptation using enhanced Optimal Transport that considers both individual and grouped image distances, (2) a unified framework applicable to various models, and (3) efficient fine-tuning that freezes computationally expensive layers while maintaining robustness.
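To illustrate the optimal-transport step, here is a small from-scratch Sinkhorn solver over a cost matrix that mixes per-image distances with a group-level distance term; the 0.5/0.5 mixing and the regularization value are arbitrary illustrative choices, not the paper's enhanced OT formulation.

```python
# Entropic OT (Sinkhorn) over a cost combining individual and grouped distances.
import numpy as np

def sinkhorn(cost: np.ndarray, reg: float = 0.1, iters: int = 200) -> np.ndarray:
    n, m = cost.shape
    K = np.exp(-cost / reg)
    a, b = np.ones(n) / n, np.ones(m) / m          # uniform marginals
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]             # transport plan

real_feats, sim_feats = np.random.rand(8, 128), np.random.rand(10, 128)
pairwise = np.linalg.norm(real_feats[:, None] - sim_feats[None], axis=-1)   # individual distances
group = np.linalg.norm(real_feats.mean(0) - sim_feats.mean(0))              # grouped distance term
plan = sinkhorn(0.5 * pairwise + 0.5 * group)
print(plan.shape)  # (8, 10)
```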
Result: RALAD maintains real-world performance while significantly improving simulation performance across three different models. For Cross View, real-world mIOU and mAP remain stable, while simulation environments show 10.30% and 12.29% improvements respectively. Re-training costs are reduced by approximately 88.1%.
Conclusion: RALAD successfully addresses the real-to-sim domain gap in autonomous driving at low computational cost, demonstrating effective performance compensation in simulated environments while preserving real-world accuracy, making it a practical solution for robust autonomous driving system development.
Abstract: In the pursuit of robust autonomous driving systems, models trained on real-world datasets often struggle to adapt to new environments, particularly when confronted with corner cases such as extreme weather conditions. Collecting these corner cases in the real world is non-trivial, which necessitates the use of simulators for validation. However, the high computational cost and the domain gap in data distribution have hindered the seamless transition between real and simulated driving scenarios. To tackle this challenge, we propose Retrieval-Augmented Learning for Autonomous Driving (RALAD), a novel framework designed to bridge the real-to-sim gap at a low cost. RALAD features three primary designs, including (1) domain adaptation via an enhanced Optimal Transport (OT) method that accounts for both individual and grouped image distances, (2) a simple and unified framework that can be applied to various models, and (3) efficient fine-tuning techniques that freeze the computationally expensive layers while maintaining robustness. Experimental results demonstrate that RALAD compensates for the performance degradation in simulated environments while maintaining accuracy in real-world scenarios across three different models. Taking Cross View as an example, the mIOU and mAP metrics in real-world scenarios remain stable before and after RALAD fine-tuning, while in simulated environments, the mIOU and mAP metrics are improved by 10.30% and 12.29%, respectively. Moreover, the re-training cost of our approach is reduced by approximately 88.1%. Our code is available at https://github.com/JiachengZuo/RALAD.git.
[177] Temporally Consistent Dynamic Scene Graphs: An End-to-End Approach for Action Tracklet Generation
Raphael Ruschel, Md Awsafur Rahman, Hardik Prajapati, Suya You, B. S. Manjuanth
Main category: cs.CV
TL;DR: TCDSG introduces an end-to-end framework for generating temporally consistent dynamic scene graphs in videos, achieving over 60% improvement in temporal recall through novel bipartite matching with adaptive decoder queries and feedback loops.
Details
Motivation: Understanding video content is crucial for real-world applications like activity recognition, autonomous systems, and human-computer interaction. While scene graphs effectively capture spatial relationships in individual frames, extending these representations to capture dynamic interactions across video sequences remains a significant challenge.
Method: The paper presents TCDSG (Temporally Consistent Dynamic Scene Graphs), an end-to-end framework that detects, tracks, and links subject-object relationships across time. The approach uses a novel bipartite matching mechanism enhanced by adaptive decoder queries and feedback loops to ensure temporal coherence and robust tracking over extended sequences.
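The temporally consistent matching idea can be approximated with a Hungarian assignment whose cost is discounted for pairings kept from the previous frame, as in this sketch; the cost terms and consistency weight are assumptions for illustration, not the paper's exact mechanism.

```python
# Bipartite matching with a simple temporal-consistency bonus.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_frame(cls_cost: np.ndarray, prev_assignment: dict, consistency_weight: float = 0.5):
    """cls_cost: (num_queries, num_gt) classification/box cost;
    prev_assignment: {query_idx: gt_track_idx} from the previous frame."""
    cost = cls_cost.copy()
    for q, g in prev_assignment.items():
        cost[q, g] -= consistency_weight        # encourage keeping the same query-track pairing
    rows, cols = linear_sum_assignment(cost)
    return dict(zip(rows.tolist(), cols.tolist()))

prev = {0: 1, 2: 0}
print(match_frame(np.random.rand(5, 3), prev))
```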
Result: The method achieves over 60% improvement in temporal recall@k on three datasets: Action Genome, OpenPVSG, and MEVA. The work also pioneers the augmentation of MEVA dataset with persistent object ID annotations for comprehensive tracklet generation, establishing a new benchmark in multi-frame video analysis.
Conclusion: By seamlessly integrating spatial and temporal dynamics, this work sets a new standard in multi-frame video analysis and opens new avenues for high-impact applications in surveillance, autonomous navigation, and other domains requiring sophisticated video understanding capabilities.
Abstract: Understanding video content is pivotal for advancing real-world applications like activity recognition, autonomous systems, and human-computer interaction. While scene graphs are adept at capturing spatial relationships between objects in individual frames, extending these representations to capture dynamic interactions across video sequences remains a significant challenge. To address this, we present TCDSG, Temporally Consistent Dynamic Scene Graphs, an innovative end-to-end framework that detects, tracks, and links subject-object relationships across time, generating action tracklets, temporally consistent sequences of entities and their interactions. Our approach leverages a novel bipartite matching mechanism, enhanced by adaptive decoder queries and feedback loops, ensuring temporal coherence and robust tracking over extended sequences. This method not only establishes a new benchmark by achieving over 60% improvement in temporal recall@k on the Action Genome, OpenPVSG, and MEVA datasets but also pioneers the augmentation of MEVA with persistent object ID annotations for comprehensive tracklet generation. By seamlessly integrating spatial and temporal dynamics, our work sets a new standard in multi-frame video analysis, opening new avenues for high-impact applications in surveillance, autonomous navigation, and beyond.
[178] Qffusion: Controllable Portrait Video Editing via Quadrant-Grid Attention Learning
Maomao Li, Lijian Lin, Yunfei Liu, Ye Zhu, Yu Li
Main category: cs.CV
TL;DR: Qffusion is a dual-frame-guided framework for portrait video editing that uses an "animation for editing" design principle, employing a Quadrant-grid Arrangement scheme to organize reference images and facial conditions for stable video editing without additional networks.
Details
Motivation: Existing portrait video editing methods require complex training stages and additional networks. The authors aim to develop a simpler approach that can achieve stable video editing by leveraging the generative power of Stable Diffusion with only input format modifications.
Method: The method introduces Qffusion with two key components: 1) Quadrant-grid Arrangement (QGA) scheme that arranges latent codes of two reference images and four facial conditions in a four-grid fashion, and 2) Quadrant-grid Propagation (QGP) inference strategy for processing reference and condition frames recursively. The framework uses self-attention for both appearance and temporal learning under the QGA structure.
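The quadrant-grid arrangement is easy to picture as tiling four latent maps into one 2x2 grid so that a single self-attention pass sees them jointly; which latents occupy which quadrant is an assumption in this sketch.

```python
# Tiling four latents into a single quadrant-grid latent.
import torch

def quadrant_grid(top_left, top_right, bottom_left, bottom_right):
    """Each input: (B, C, H, W) latent; output: (B, C, 2H, 2W) grid latent."""
    top = torch.cat([top_left, top_right], dim=-1)
    bottom = torch.cat([bottom_left, bottom_right], dim=-1)
    return torch.cat([top, bottom], dim=-2)

latents = [torch.randn(1, 4, 32, 32) for _ in range(4)]
print(quadrant_grid(*latents).shape)  # torch.Size([1, 4, 64, 64])
```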
Result: Through extensive experiments, Qffusion consistently outperforms state-of-the-art techniques on portrait video editing tasks. The method achieves stable arbitrary-length video generation and enables effective portrait video editing using the proposed dual-frame guidance approach.
Conclusion: Qffusion successfully demonstrates that portrait video editing can be achieved through a simplified framework that modifies only the input format of Stable Diffusion. The quadrant-grid based approach provides an effective solution for stable video editing without requiring additional networks or complex training procedures.
Abstract: This paper presents Qffusion, a dual-frame-guided framework for portrait video editing. Specifically, we consider a design principle of "animation for editing", and train Qffusion as a general animation framework from two still reference images, while we can use it for portrait video editing easily by applying modified start and end frames as references during inference. Leveraging the powerful generative capability of Stable Diffusion, we propose a Quadrant-grid Arrangement (QGA) scheme for latent re-arrangement, which arranges the latent codes of two reference images and those of four facial conditions into a four-grid fashion, separately. Then, we fuse features of these two modalities and use self-attention for both appearance and temporal learning, where representations at different times are jointly modeled under QGA. Our Qffusion can achieve stable video editing without additional networks or complex training stages, where only the input format of Stable Diffusion is modified. Further, we propose a Quadrant-grid Propagation (QGP) inference strategy, which enjoys a unique advantage in stable arbitrary-length video generation by processing reference and condition frames recursively. Through extensive experiments, Qffusion consistently outperforms state-of-the-art techniques on portrait video editing. Project page: https://qffusion.github.io/page/.
[179] FE-UNet: Frequency Domain Enhanced U-Net for Low-Frequency Information-Rich Image Segmentation
Guohao Huo, Ruiting Dai, Ling Shao, Jinliang Liu, Hao Tang
Main category: cs.CV
TL;DR: The paper proposes FE-UNet, a deep learning model that addresses frequency feature attenuation in challenging visual environments (deep-sea and surgical robotics) by incorporating wavelet adaptive spectrum fusion (WASF) inspired by human vision mechanisms, achieving state-of-the-art performance in cross-domain segmentation tasks.
Details
Motivation: Deep-sea exploration and surgical robotics face environmental lighting and device resolution limitations that cause high-frequency feature attenuation. Existing CNNs have different frequency sensitivity compared to human visual systems, which have mid-frequency sensitivity with low-frequency sensitivity surpassing high-frequency sensitivity.
Method: The authors experimentally quantified the CNN contrast sensitivity function and developed a wavelet adaptive spectrum fusion (WASF) method inspired by biological vision mechanisms. They designed a perception frequency block (PFB) integrating WASF for enhanced frequency-domain feature extraction, and built the FE-UNet model using a SAM2 backbone with fine-tuned Hiera-Large modules.
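A rough sketch of wavelet-based spectrum re-weighting with PyWavelets is given below; the fixed scalar weights stand in for whatever adaptive mechanism WASF actually learns, and the single-level Haar decomposition is an illustrative choice.

```python
# Single-level 2D DWT, re-weight low/high-frequency sub-bands, then reconstruct.
import numpy as np
import pywt

def wavelet_reweight(feature: np.ndarray, low_w: float = 1.2, high_w: float = 0.8):
    """feature: 2D array; low_w/high_w: example weights favouring low frequencies."""
    cA, (cH, cV, cD) = pywt.dwt2(feature, "haar")
    return pywt.idwt2((low_w * cA, (high_w * cH, high_w * cV, high_w * cD)), "haar")

print(wavelet_reweight(np.random.rand(64, 64)).shape)  # (64, 64)
```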
Result: FE-UNet achieves state-of-the-art performance in cross-domain tasks including marine organism segmentation and polyp segmentation, demonstrating robust adaptability and strong generalization capability across different domains.
Conclusion: The proposed FE-UNet model successfully addresses frequency feature challenges in challenging visual environments by balancing cross-frequency image features through biologically-inspired mechanisms, showing significant application potential for deep-sea exploration and surgical robotics scenarios.
Abstract: In deep-sea exploration and surgical robotics scenarios, environmental lighting and device resolution limitations often cause high-frequency feature attenuation. Addressing the differences in frequency band sensitivity between CNNs and the human visual system (mid-frequency sensitivity with low-frequency sensitivity surpassing high-frequency), we experimentally quantified the CNN contrast sensitivity function and proposed a wavelet adaptive spectrum fusion (WASF) method inspired by biological vision mechanisms to balance cross-frequency image features. Furthermore, we designed a perception frequency block (PFB) that integrates WASF to enhance frequency-domain feature extraction. Based on this, we developed the FE-UNet model, which employs a SAM2 backbone network and incorporates fine-tuned Hiera-Large modules to ensure segmentation accuracy while improving generalization capability. Experiments demonstrate that FE-UNet achieves state-of-the-art performance in cross-domain tasks such as marine organism segmentation and polyp segmentation, showcasing robust adaptability and significant application potential.
[180] Gungnir: Exploiting Stylistic Features in Images for Backdoor Attacks on Diffusion Models
Yu Pan, Jiahao Chen, Bingrong Dai, Lin Wang, Yi Du, Jiao Liu
Main category: cs.CV
TL;DR: This paper introduces Gungnir, a novel backdoor attack method for Diffusion Models that uses style triggers instead of traditional visual or text triggers, achieving complete evasion of existing defense mechanisms with 0% detection rate.
Details
Motivation: Existing backdoor attacks on Diffusion Models use easily detectable triggers like visual patches or phrases that are vulnerable to current defense strategies. The authors aim to explore more sophisticated attack possibilities by developing harder-to-detect backdoor triggers.
Method: The paper proposes Gungnir, which uses stylistic features as triggers for the first time in DM backdoor attacks. The method incorporates Reconstructing-Adversarial Noise (RAN) and Short-Term Timesteps-Retention (STTR) techniques to embed style-based triggers in images that are perceptually indistinguishable from clean images.
Result: Gungnir successfully bypasses all existing defense methods, achieving a 0% backdoor detection rate (BDR) across current DM defense frameworks. The style-triggered images are imperceptible to both manual inspection and automated detection systems.
Conclusion: The research demonstrates that style-based triggers represent a significant advancement in backdoor attack sophistication for Diffusion Models, completely evading current defense mechanisms and highlighting the need for more robust security measures in generative AI systems.
Abstract: In recent years, Diffusion Models (DMs) have demonstrated significant advances in the field of image generation. However, according to current research, DMs are vulnerable to backdoor attacks, which allow attackers to control the model's output by inputting data containing covert triggers, such as a specific visual patch or phrase. Existing defense strategies are well equipped to thwart such attacks through backdoor detection and trigger inversion because previous attack methods are constrained by limited input spaces and low-dimensional triggers. For example, visual triggers are easily observed by defenders, while text-based or attention-based triggers are more susceptible to neural network detection. To explore more possibilities of backdoor attacks in DMs, we propose Gungnir, a novel method that enables attackers to activate the backdoor in DMs through style triggers within input images. Our approach proposes using stylistic features as triggers for the first time and implements backdoor attacks successfully in image-to-image tasks by introducing Reconstructing-Adversarial Noise (RAN) and Short-Term Timesteps-Retention (STTR). Our technique generates trigger-embedded images that are perceptually indistinguishable from clean images, thus bypassing both manual inspection and automated detection neural networks. Experiments demonstrate that Gungnir can easily bypass existing defense methods. Among existing DM defense frameworks, our approach achieves a 0% backdoor detection rate (BDR). Our codes are available at https://github.com/paoche11/Gungnir.
[181] A Deep Learning Approach for Augmenting Perceptional Understanding of Histopathology Images
Xiaoqian Hu
Main category: cs.CV
TL;DR: This paper presents a multi-modal model combining Vision Transformers (ViT) with GPT-2 for generating accurate captions of histopathology images, fine-tuned on the specialized ARCH dataset to enhance diagnostic accuracy and augment healthcare professionals’ cognitive capabilities in medical image analysis.
Details
Motivation: Digital technologies have made significant progress in augmenting human health, cognition, and perception in computational pathology. There is a need to enhance histopathology image analysis by capturing the complexities of pathology images including tissue morphologies, staining variations, and pathological conditions to improve diagnostic accuracy and help detect subtle pathological features that might otherwise go unnoticed.
Method: The authors developed a multi-modal model that combines Vision Transformers (ViT) with GPT-2 for image captioning. The model is fine-tuned on the specialized ARCH dataset, which includes dense image captions derived from clinical and academic resources, to generate accurate and contextually relevant captions for histopathology images.
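A generic ViT-encoder / GPT-2-decoder captioner can be assembled with Hugging Face's VisionEncoderDecoderModel as shown below; the checkpoints named are common public ones, and the ARCH fine-tuning pipeline itself is not reproduced here.

```python
# Generic ViT + GPT-2 captioning setup via Hugging Face Transformers.
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, GPT2TokenizerFast

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2")
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# GPT-2 has no dedicated pad/start tokens by default; reuse BOS/EOS so generation works.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.eos_token_id

# After fine-tuning on captioned histopathology images, captioning looks roughly like:
# pixel_values = processor(images=image, return_tensors="pt").pixel_values
# caption = tokenizer.decode(model.generate(pixel_values, max_length=64)[0],
#                            skip_special_tokens=True)
```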
Result: The model successfully generates accurate, contextually relevant captions for histopathology images and enhances the perception of subtle pathological features that might otherwise go unnoticed. It augments the cognitive capabilities of healthcare professionals, enabling more efficient disease classification, segmentation, and detection, thereby improving diagnostic accuracy.
Conclusion: The approach demonstrates the potential for digital technologies to augment human cognitive abilities in medical image analysis, providing steps toward more personalized and accurate healthcare outcomes. The multi-modal model shows promise for enhancing computational pathology applications and supporting healthcare professionals in diagnostic tasks.
Abstract: In recent years, digital technologies have made significant strides in augmenting human health, cognition, and perception, particularly within the field of computational pathology. This paper presents a novel approach to enhancing the analysis of histopathology images by leveraging a multi-modal model that combines Vision Transformers (ViT) with GPT-2 for image captioning. The model is fine-tuned on the specialized ARCH dataset, which includes dense image captions derived from clinical and academic resources, to capture the complexities of pathology images such as tissue morphologies, staining variations, and pathological conditions. By generating accurate, contextually relevant captions, the model augments the cognitive capabilities of healthcare professionals, enabling more efficient disease classification, segmentation, and detection. The model enhances the perception of subtle pathological features in images that might otherwise go unnoticed, thereby improving diagnostic accuracy. Our approach demonstrates the potential for digital technologies to augment human cognitive abilities in medical image analysis, providing steps toward more personalized and accurate healthcare outcomes.
[182] Fine-Grained Alignment and Noise Refinement for Compositional Text-to-Image Generation
Amir Mohammad Izadi, Seyed Mohammad Hadi Hosseini, Soroush Vafaie Tabar, Ali Abdollahi, Armin Saghafian, Mahdieh Soleymani Baghshah
Main category: cs.CV
TL;DR: A training-free method for text-to-image generation that improves compositional accuracy by incorporating textual constraint losses and feedback-driven noise refinement, achieving 24% improvement in human evaluation and 25% gain in spatial relationships.
Details
Motivation: Text-to-image models struggle with accurately capturing intricate textual details including entity missing, attribute binding errors, and incorrect spatial relationships. Existing layout-based approaches are too rigid and limit diversity in generated scenes.
Method: A training-free approach that formulates textual constraints as unified losses (entity missing, entity mixing, attribute binding, spatial relationships) applied during generation. Includes a feedback-driven system with a verifier that evaluates generated images, identifies inconsistencies, and provides corrective feedback for fine-grained initial noise refinement through selective loss optimization.
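As one example of such a constraint loss, an "entity missing" penalty over cross-attention maps could be written as follows; the max-based formulation, averaging over heads, and index handling are assumptions for illustration, not the paper's exact losses.

```python
# Encourage each entity's tokens to receive a strong attention peak somewhere in the image.
import torch

def entity_missing_loss(attn: torch.Tensor, entity_token_ids: list) -> torch.Tensor:
    """attn: (heads, H*W, num_tokens) cross-attention; entity_token_ids: list of token-index lists,
    one list per entity mentioned in the prompt."""
    losses = []
    for token_ids in entity_token_ids:
        entity_map = attn[..., token_ids].mean(dim=(0, -1))   # (H*W,) spatial map for this entity
        losses.append(1.0 - entity_map.max())                 # penalize a weak maximum
    return torch.stack(losses).mean()

loss = entity_missing_loss(torch.rand(8, 64 * 64, 77).softmax(-1), [[2, 3], [5]])
```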
Result: Achieved 24% improvement in human evaluation and 25% gain in spatial relationships compared to baseline methods. The fine-grained noise refinement component contributed an additional 5% performance boost. The method demonstrates significant enhancement in compositional accuracy while maintaining generation flexibility.
Conclusion: The proposed training-free method successfully addresses compositional challenges in text-to-image generation through constraint-based losses and feedback-driven refinement, offering a flexible alternative to rigid layout-based approaches while significantly improving accuracy in capturing textual details.
Abstract: Text-to-image generative models have made significant advancements in recent years; however, accurately capturing intricate details in textual prompts, such as entity missing, attribute binding errors, and incorrect relationships, remains a formidable challenge. In response, we present an innovative, training-free method that directly addresses these challenges by incorporating tailored objectives to account for textual constraints. Unlike layout-based approaches that enforce rigid structures and limit diversity, our proposed approach offers a more flexible arrangement of the scene by imposing just the extracted constraints from the text, without any unnecessary additions. These constraints are formulated as losses (entity missing, entity mixing, attribute binding, and spatial relationships) integrated into a unified loss that is applied in the first generation stage. Furthermore, we introduce a feedback-driven system for fine-grained initial noise refinement. This system integrates a verifier that evaluates the generated image, identifies inconsistencies, and provides corrective feedback. Leveraging this feedback, our refinement method first targets the unmet constraints by refining the faulty attention maps caused by initial noise, through the optimization of selective losses associated with these constraints. Subsequently, our unified loss function is reapplied to proceed to the second generation phase. Experimental results demonstrate that our method, relying solely on our proposed objective functions, significantly enhances compositionality, achieving a 24% improvement in human evaluation and a 25% gain in spatial relationships. Furthermore, our fine-grained noise refinement proves effective, boosting performance by up to 5%. Code is available at https://github.com/hadi-hosseini/noise-refinement.
[183] Att-Adapter: A Robust and Precise Domain-Specific Multi-Attributes T2I Diffusion Adapter via Conditional Variational Autoencoder
Wonwoong Cho, Yan-Ying Chen, Matthew Klenk, David I. Inouye, Yanxia Zhang
Main category: cs.CV
TL;DR: The paper introduces Att-Adapter, a plug-and-play module that enables fine-grained control of multiple continuous attributes in text-to-image diffusion models without requiring paired training data.
Details
Motivation: Text-to-image diffusion models struggle with precise control of continuous attributes (like eye openness or car width) and simultaneous control of multiple attributes in new domains using only text guidance, which limits their practical applications.
Method: The authors propose Att-Adapter, which uses a decoupled cross attention module to harmonize multiple domain attributes with text conditioning, and incorporates a Conditional Variational Autoencoder (CVAE) to prevent overfitting and handle the diverse nature of visual data (a toy sketch of decoupled cross-attention follows the abstract below).
Result: Att-Adapter outperforms LoRA-based baselines in controlling continuous attributes on two public datasets, enables broader control range, improves disentanglement across multiple attributes compared to StyleGAN-based techniques, and works with unpaired training data.
Conclusion: Att-Adapter successfully addresses the challenge of multi-attribute control in diffusion models by providing a flexible, scalable solution that doesn’t require paired synthetic data and can handle multiple attributes within a single model.
Abstract: Text-to-Image (T2I) Diffusion Models have achieved remarkable performance in generating high quality images. However, enabling precise control of continuous attributes, especially multiple attributes simultaneously, in a new domain (e.g., numeric values like eye openness or car width) with text-only guidance remains a significant challenge. To address this, we introduce the Attribute (Att) Adapter, a novel plug-and-play module designed to enable fine-grained, multi-attributes control in pretrained diffusion models. Our approach learns a single control adapter from a set of sample images that can be unpaired and contain multiple visual attributes. The Att-Adapter leverages the decoupled cross attention module to naturally harmonize the multiple domain attributes with text conditioning. We further introduce Conditional Variational Autoencoder (CVAE) to the Att-Adapter to mitigate overfitting, matching the diverse nature of the visual world. Evaluations on two public datasets show that Att-Adapter outperforms all LoRA-based baselines in controlling continuous attributes. Additionally, our method enables a broader control range and also improves disentanglement across multiple attributes, surpassing StyleGAN-based techniques. Notably, Att-Adapter is flexible, requiring no paired synthetic data for training, and is easily scalable to multiple attributes within a single model.
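As a rough illustration of what a decoupled cross-attention module generally looks like (separate key/value projections for the extra attribute condition, added to the text branch), here is a hedged PyTorch sketch. The class name, dimensions, and simple additive fusion are assumptions; Att-Adapter's actual design and its CVAE component are described in the paper.

```python
import torch
import torch.nn.functional as F
from torch import nn

class DecoupledCrossAttention(nn.Module):
    """Sketch of decoupled cross-attention: the attribute condition gets its own
    key/value projections, and its output is added to the text branch."""
    def __init__(self, dim, text_dim, attr_dim, scale=1.0):
        super().__init__()
        self.scale = scale
        self.to_q = nn.Linear(dim, dim)
        self.to_k_text = nn.Linear(text_dim, dim)
        self.to_v_text = nn.Linear(text_dim, dim)
        self.to_k_attr = nn.Linear(attr_dim, dim)   # hypothetical attribute branch
        self.to_v_attr = nn.Linear(attr_dim, dim)

    def forward(self, x, text_emb, attr_emb):
        q = self.to_q(x)
        text_out = F.scaled_dot_product_attention(q, self.to_k_text(text_emb), self.to_v_text(text_emb))
        attr_out = F.scaled_dot_product_attention(q, self.to_k_attr(attr_emb), self.to_v_attr(attr_emb))
        return text_out + self.scale * attr_out

# Usage with made-up shapes: 100 image tokens, 77 text tokens, 4 attribute tokens
attn = DecoupledCrossAttention(dim=64, text_dim=768, attr_dim=32)
out = attn(torch.randn(1, 100, 64), torch.randn(1, 77, 768), torch.randn(1, 4, 32))
```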
[184] TaoAvatar: Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting
Jianchuan Chen, Jingchuan Hu, Gaige Wang, Zhonghua Jiang, Tiansong Zhou, Zhiwen Chen, Chengfei Lv
Main category: cs.CV
TL;DR: TaoAvatar presents a high-fidelity, lightweight 3D Gaussian Splatting-based system for creating realistic full-body talking avatars that can run in real-time on mobile devices, achieving 90 FPS on devices like Apple Vision Pro.
Details
Motivation: Existing 3D Gaussian Splatting methods for avatar creation struggle with fine-grained control of facial expressions and body movements in full-body talking tasks, lack sufficient details, and cannot run in real-time on mobile devices, limiting their practical applications in AR, e-commerce, and holographic communication.
Method: The approach creates a personalized clothed human parametric template that binds Gaussians for appearance representation, pre-trains a StyleUnet-based network for complex pose-dependent non-rigid deformation, then “bakes” these deformations into a lightweight MLP-based network using a distillation technique, and develops blend shapes to compensate for details.
Result: TaoAvatar achieves state-of-the-art rendering quality while maintaining real-time performance across various devices, specifically running at 90 FPS on high-definition stereo devices such as the Apple Vision Pro.
Conclusion: The research successfully develops a practical solution for high-fidelity full-body talking avatars that balances quality and performance, making real-time avatar applications feasible on mobile and AR devices through efficient 3D Gaussian Splatting and network distillation techniques.
Abstract: Realistic 3D full-body talking avatars hold great potential in AR, with applications ranging from e-commerce live streaming to holographic communication. Despite advances in 3D Gaussian Splatting (3DGS) for lifelike avatar creation, existing methods struggle with fine-grained control of facial expressions and body movements in full-body talking tasks. Additionally, they often lack sufficient details and cannot run in real-time on mobile devices. We present TaoAvatar, a high-fidelity, lightweight, 3DGS-based full-body talking avatar driven by various signals. Our approach starts by creating a personalized clothed human parametric template that binds Gaussians to represent appearances. We then pre-train a StyleUnet-based network to handle complex pose-dependent non-rigid deformation, which can capture high-frequency appearance details but is too resource-intensive for mobile devices. To overcome this, we “bake” the non-rigid deformations into a lightweight MLP-based network using a distillation technique and develop blend shapes to compensate for details. Extensive experiments show that TaoAvatar achieves state-of-the-art rendering quality while running in real-time across various devices, maintaining 90 FPS on high-definition stereo devices such as the Apple Vision Pro.
[185] AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference
Kai Huang, Hao Zou, Bochen Wang, Ye Xi, Zhen Xie, Hao Wang
Main category: cs.CV
TL;DR: AirCache is a KV cache compression method that accelerates Large Visual Language Models (LVLMs) inference by strategically eliminating redundant visual tokens while preserving model performance, achieving 29-66% latency reduction with only 10% visual cache retention.
Details
Motivation: Large Visual Language Models face computational bottlenecks due to excessive key-value (KV) cache demands when processing numerous visual tokens and generating long-context outputs, creating a need for efficient cache compression methods to accelerate inference.
Method: The approach introduces AirCache with two key components: (1) an elite observation window that assesses visual token importance in the KV cache through stable inter-modal relevancy modeling with multi-perspective consistency, and (2) an adaptive layer-wise budget allocation strategy that leverages token importance distribution strength and skewness for superior efficiency compared to uniform allocation (a hypothetical budget-allocation sketch follows the abstract below).
Result: Comprehensive evaluations across multiple LVLMs and benchmarks show that AirCache achieves comparable performance to full cache while retaining only 10% of visual KV cache, reducing decoding latency by 29% to 66% across various batch sizes and prompt lengths, with increasing performance advantages as cache retention rates decrease.
Conclusion: AirCache successfully addresses the computational bottleneck in LVLMs by systematically identifying and eliminating redundant visual tokens in the KV cache, demonstrating that strategic token elimination can maintain model performance while significantly improving inference efficiency.
Abstract: Recent advancements in Large Visual Language Models (LVLMs) have gained significant attention due to their remarkable reasoning capabilities and proficiency in generalization. However, processing a large number of visual tokens and generating long-context outputs impose substantial computational overhead, leading to excessive demands for key-value (KV) cache. To address this critical bottleneck, we propose AirCache, a novel KV cache compression method aimed at accelerating LVLMs inference. This work systematically investigates the correlations between visual and textual tokens within the attention mechanisms of LVLMs. Our empirical analysis reveals considerable redundancy in cached visual tokens, wherein strategically eliminating these tokens preserves model performance while significantly accelerating context generation. Inspired by these findings, we introduce an elite observation window for assessing the importance of visual components in the KV cache, focusing on stable inter-modal relevancy modeling with enhanced multi-perspective consistency. Additionally, we develop an adaptive layer-wise budget allocation strategy that capitalizes on the strength and skewness of token importance distribution, showcasing superior efficiency compared to uniform allocation. Comprehensive evaluations across multiple LVLMs and benchmarks demonstrate that our method achieves comparable performance to the full cache while retaining only 10% of visual KV cache, thereby reducing decoding latency by 29% to 66% across various batch size and prompt length of inputs. Notably, as cache retention rates decrease, our method exhibits increasing performance advantages over existing approaches.
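Below is a hypothetical sketch of layer-wise budget allocation driven by the skewness of token-importance distributions, only to show the kind of computation involved. The mapping from skewness to budget (peakier distribution gets a smaller share here) is an arbitrary illustrative choice, not AirCache's actual allocation rule.

```python
import torch

def allocate_budgets(importance_per_layer, total_keep_ratio=0.1):
    """Toy sketch: per-layer token budgets derived from the skewness of each
    layer's visual-token importance distribution.
    importance_per_layer: list of 1-D tensors of visual-token importance scores."""
    skews = []
    for imp in importance_per_layer:
        mu, sigma = imp.mean(), imp.std() + 1e-8
        skews.append((((imp - mu) / sigma) ** 3).mean())   # sample skewness
    skews = torch.stack(skews)
    weights = torch.softmax(-skews, dim=0)                  # peakier layer -> smaller share (illustrative choice)
    budgets = (weights * total_keep_ratio * len(importance_per_layer)).clamp(max=1.0)
    # Keep the top-k most important tokens per layer under each layer's budget.
    return [imp.topk(max(1, int(b * imp.numel()))).indices
            for imp, b in zip(importance_per_layer, budgets)]

scores = [torch.rand(576) for _ in range(4)]   # fake importance scores for 4 layers of 576 visual tokens
print([len(k) for k in allocate_budgets(scores)])
```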
[186] Parasite: A Steganography-based Backdoor Attack Framework for Diffusion Models
Jiahao Chen, Yu Pan, Yi Du, Chunkai Wu, Lin Wang
Main category: cs.CV
TL;DR: This paper introduces “Parasite”, a novel backdoor attack method for image-to-image diffusion models that uses steganography to hide triggers and allows flexible target content embedding, achieving 0% detection rate against existing defense frameworks.
Details
Motivation: Existing backdoor attacks on diffusion models have limitations: they mainly target noise-to-image and text-to-image tasks with limited work on image-to-image tasks, and traditional attacks use conspicuous single triggers to generate fixed targets, lacking concealability and flexibility.
Method: The paper proposes “Parasite”, a backdoor attack method that leverages steganography to hide triggers in image-to-image diffusion models and allows attackers to embed target content as backdoor triggers for more flexible attacks (a generic steganography sketch follows the abstract below).
Result: “Parasite” achieved a 0% backdoor detection rate against mainstream defense frameworks, effectively bypassing existing detection methods. Ablation studies examined the influence of different hiding coefficients on attack performance.
Conclusion: “Parasite” successfully demonstrates a new type of backdoor attack that is both more concealed (using steganography) and more flexible (embedding target content as triggers) while being highly effective at evading current detection systems in image-to-image diffusion models.
Abstract: Recently, the diffusion model has gained significant attention as one of the most successful image generation models, which can generate high-quality images by iteratively sampling noise. However, recent studies have shown that diffusion models are vulnerable to backdoor attacks, allowing attackers to enter input data containing triggers to activate the backdoor and generate their desired output. Existing backdoor attack methods primarily focused on target noise-to-image and text-to-image tasks, with limited work on backdoor attacks in image-to-image tasks. Furthermore, traditional backdoor attacks often rely on a single, conspicuous trigger to generate a fixed target image, lacking concealability and flexibility. To address these limitations, we propose a novel backdoor attack method called “Parasite” for image-to-image tasks in diffusion models, which not only is the first to leverage steganography for triggers hiding, but also allows attackers to embed the target content as a backdoor trigger to achieve a more flexible attack. “Parasite” as a novel attack method effectively bypasses existing detection frameworks to execute backdoor attacks. In our experiments, “Parasite” achieved a 0 percent backdoor detection rate against the mainstream defense frameworks. In addition, in the ablation study, we discuss the influence of different hiding coefficients on the attack results. You can find our code at https://anonymous.4open.science/r/Parasite-1715/.
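The abstract does not spell out the hiding scheme, so the following is a generic least-significant-bit (LSB) steganography sketch, purely to illustrate how a trigger image can be hidden imperceptibly inside a cover image; “Parasite” may embed its triggers differently.

```python
import numpy as np

def embed_lsb(cover, secret, bits=2):
    """Generic LSB steganography: hide the top `bits` bits of `secret` in the
    lowest `bits` bits of `cover`. Both are uint8 arrays of the same shape."""
    cover_hi = cover & ~np.uint8((1 << bits) - 1)   # clear the low bits of the cover
    secret_hi = secret >> (8 - bits)                # keep only the top bits of the secret
    return cover_hi | secret_hi

def extract_lsb(stego, bits=2):
    """Recover an approximation of the hidden image from the low bits."""
    return np.uint8((stego & ((1 << bits) - 1)) << (8 - bits))

cover = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
secret = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
stego = embed_lsb(cover, secret)       # visually close to `cover`
recovered = extract_lsb(stego)         # coarse reconstruction of `secret`
```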
[187] DMS-Net:Dual-Modal Multi-Scale Siamese Network for Binocular Fundus Image Classification
Guohao Huo, Zibo Lin, Zitong Wang, Ruiting Dai, Hao Tang
Main category: cs.CV
TL;DR: DMS-Net is a dual-modal multi-scale Siamese network that processes paired fundus images to improve ophthalmic disease classification by leveraging binocular pathological correlations, achieving 82.9% accuracy on the ODIR-5K dataset.
Details
Motivation: Traditional diagnosis methods and existing single-eye deep learning approaches fail to account for binocular pathological correlations in ophthalmic diseases, which poses a significant global health challenge and limits diagnostic accuracy.
Method: The authors propose DMS-Net with three key components: (1) weight-shared Siamese ResNet-152 backbones for feature extraction from paired fundus images, (2) a Multi-Scale Context-Aware Module (MSCAM) integrating adaptive pooling and attention mechanisms for multi-resolution feature aggregation, and (3) a Dual-Modal Feature Fusion (DMFF) module for cross-modal interaction through spatial-semantic recalibration and bidirectional attention (a simplified weight-sharing sketch follows the abstract below).
Result: DMS-Net achieves state-of-the-art performance on the ODIR-5K dataset with 82.9% accuracy, 84.5% recall, and 83.2% Cohen’s kappa, demonstrating superior capability in detecting symmetric pathologies compared to existing approaches.
Conclusion: The proposed DMS-Net effectively addresses limitations of single-eye approaches by leveraging binocular correlations, successfully handling challenges like lesion boundary ambiguity and scattered pathological distributions, ultimately advancing clinical decision-making for ocular diseases.
Abstract: Ophthalmic diseases pose a significant global health challenge, yet traditional diagnosis methods and existing single-eye deep learning approaches often fail to account for binocular pathological correlations. To address this, we propose DMS-Net, a dual-modal multi-scale Siamese network for binocular fundus image classification. Our framework leverages weight-shared Siamese ResNet-152 backbones to extract deep semantic features from paired fundus images. To tackle challenges such as lesion boundary ambiguity and scattered pathological distributions, we introduce a Multi-Scale Context-Aware Module (MSCAM) that integrates adaptive pooling and attention mechanisms for multi-resolution feature aggregation. Additionally, a Dual-Modal Feature Fusion (DMFF) module enhances cross-modal interaction through spatial-semantic recalibration and bidirectional attention, effectively combining global context and local edge features. Evaluated on the ODIR-5K dataset, DMS-Net achieves state-of-the-art performance with 82.9% accuracy, 84.5% recall, and 83.2% Cohen’s kappa, demonstrating superior capability in detecting symmetric pathologies and advancing clinical decision-making for ocular diseases.
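A simplified sketch of the weight-shared Siamese idea: one ResNet-152 backbone processes both eyes' images and a small head fuses the features. The plain concatenation head here is only a stand-in for the paper's MSCAM and DMFF modules, and the class count is an assumption.

```python
import torch
from torch import nn
from torchvision import models

class SiameseFundusClassifier(nn.Module):
    """Weight-shared Siamese backbone over paired (left, right) fundus images,
    with a simple concatenation head standing in for MSCAM / DMFF."""
    def __init__(self, num_classes=8):
        super().__init__()
        backbone = models.resnet152(weights=None)
        backbone.fc = nn.Identity()                # 2048-d feature vector per eye
        self.backbone = backbone                   # the same weights see both eyes
        self.head = nn.Linear(2 * 2048, num_classes)

    def forward(self, left, right):
        f_left = self.backbone(left)
        f_right = self.backbone(right)
        return self.head(torch.cat([f_left, f_right], dim=1))

model = SiameseFundusClassifier()
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
```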
[188] Mapping of Weed Management Methods in Orchards using Sentinel-2 and PlanetScope Data
Ioannis Kontogiorgakis, Iason Tsardanidis, Dimitrios Bormpoudakis, Ilias Tsoumas, Dimitra A. Loka, Christos Noulas, Alexandros Tsitouras, Charalampos Kontoes
Main category: cs.CV
TL;DR: This paper develops machine learning models using satellite data (Sentinel-2 and PlanetScope) to classify four weed management methods in orchards, offering a cost-effective alternative to ground-based surveys for monitoring agricultural practices.
Details
Motivation: Traditional ground-based field surveys for monitoring weed management methods are costly, time-consuming, and subject to delays. Accurate mapping of weed management practices is essential for policymakers to assess farmer practices, evaluate environmental impacts, and ensure policy compliance, but current monitoring approaches are inadequate.
Method: The researchers developed separate machine learning models using satellite time series data from two sources: Sentinel-2 and PlanetScope. These models were trained to classify four distinct weed management methods (Mowing, Tillage, Chemical-spraying, and No practice) in orchard environments (a toy classification sketch follows the abstract below).
Result: The ML models successfully demonstrated the ability to classify different weed management methods using satellite data. The findings show that remote sensing combined with machine learning can effectively identify and map weed management practices in orchards.
Conclusion: Machine learning-driven remote sensing presents a promising solution for enhancing the efficiency and accuracy of weed management mapping in orchards, potentially replacing costly and time-consuming ground-based surveys while providing valuable information for agricultural policy and environmental monitoring.
Abstract: Effective weed management is crucial for improving agricultural productivity, as weeds compete with crops for vital resources like nutrients and water. Accurate maps of weed management methods are essential for policymakers to assess farmer practices, evaluate impacts on vegetation health, biodiversity, and climate, as well as ensure compliance with policies and subsidies. However, monitoring weed management methods is challenging as they commonly rely on ground-based field surveys, which are often costly, time-consuming and subject to delays. In order to tackle this problem, we leverage earth observation data and Machine Learning (ML). Specifically, we developed separate ML models using Sentinel-2 and PlanetScope satellite time series data, respectively, to classify four distinct weed management methods (Mowing, Tillage, Chemical-spraying, and No practice) in orchards. The findings demonstrate the potential of ML-driven remote sensing to enhance the efficiency and accuracy of weed management mapping in orchards.
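As a hedged illustration of the overall pipeline (per-parcel satellite time-series features fed to a supervised classifier over the four classes), here is a toy scikit-learn example on synthetic data; the paper's actual features, models, and sampling strategy are not specified in this summary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic stand-in: 200 orchard parcels x (20 acquisition dates x 4 spectral features) per parcel
rng = np.random.default_rng(0)
X = rng.random((200, 20 * 4))
y = rng.choice(["Mowing", "Tillage", "Chemical-spraying", "No practice"], size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), zero_division=0))
```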
[189] Monitoring digestate application on agricultural crops using Sentinel-2 Satellite imagery
Andreas Kalogeras, Dimitrios Bormpoudakis, Iason Tsardanidis, Dimitra A. Loka, Charalampos Kontoes
Main category: cs.CV
TL;DR: This study demonstrates the use of Sentinel-2 satellite imagery combined with machine learning to detect digestate application in agricultural fields, achieving F1-scores up to 0.85 for monitoring soil fertility enhancement practices.
Details
Motivation: The widespread use of Exogenous Organic Matter (EOM) in agriculture requires monitoring to assess its effects on soil and crop health, particularly digestate application, which enhances soil fertility but poses environmental risks like microplastic contamination and nitrogen losses.
Method: The study used Sentinel-2 satellite image time series (SITS) analysis with specific indices (EOMI, NDVI, EVI) to characterize EOM’s spectral behavior, and applied Machine Learning models including Random Forest, k-NN, Gradient Boosting, and Feed-Forward Neural Networks to detect digestate presence across four different crop types in Thessaly, Greece (an index-computation sketch follows the abstract below).
Result: The machine learning models achieved F1-scores up to 0.85 for digestate presence detection, successfully demonstrating the capability to monitor EOM applications using remote sensing data.
Conclusion: The findings highlight the potential of combining remote sensing and machine learning for scalable and cost-effective monitoring of EOM applications, supporting precision agriculture and sustainability goals.
Abstract: The widespread use of Exogenous Organic Matter in agriculture necessitates monitoring to assess its effects on soil and crop health. This study evaluates optical Sentinel-2 satellite imagery for detecting digestate application, a practice that enhances soil fertility but poses environmental risks like microplastic contamination and nitrogen losses. In the first instance, Sentinel-2 satellite image time series (SITS) analysis of specific indices (EOMI, NDVI, EVI) was used to characterize EOM’s spectral behavior after application on the soils of four different crop types in Thessaly, Greece. Furthermore, Machine Learning (ML) models (namely Random Forest, k-NN, Gradient Boosting and a Feed-Forward Neural Network), were used to investigate digestate presence detection, achieving F1-scores up to 0.85. The findings highlight the potential of combining remote sensing and ML for scalable and cost-effective monitoring of EOM applications, supporting precision agriculture and sustainability.
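For reference, the standard NDVI and EVI indices mentioned above can be computed directly from Sentinel-2 band reflectances (B08 = NIR, B04 = red, B02 = blue). The sketch below uses random arrays as stand-ins for real reflectance rasters and omits EOMI, whose definition is not given in this summary.

```python
import numpy as np

def ndvi(nir, red):
    """Normalized Difference Vegetation Index."""
    return (nir - red) / (nir + red + 1e-6)

def evi(nir, red, blue):
    """Enhanced Vegetation Index (standard MODIS/Sentinel coefficients)."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)

# Placeholder reflectance rasters scaled to [0, 1]; in practice these come from B08, B04, B02
nir, red, blue = (np.random.rand(100, 100) for _ in range(3))
print(ndvi(nir, red).mean(), evi(nir, red, blue).mean())
```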
[190] RGBX-DiffusionDet: A Framework for Multi-Modal RGB-X Object Detection Using DiffusionDet
Eliraz Orfaig, Inna Stainvas, Igal Bilik
Main category: cs.CV
TL;DR: RGBX-DiffusionDet is a multimodal object detection framework that extends DiffusionDet to fuse RGB imagery with heterogeneous 2D data (depth, polarimetric, infrared) using adaptive encoders and novel attention mechanisms, achieving superior performance over RGB-only baselines while maintaining efficiency.
Details
Motivation: Existing object detection models primarily rely on RGB imagery alone, missing the opportunity to leverage complementary information from other 2D sensing modalities like depth, polarimetric, and infrared data that could improve detection performance in challenging scenarios.
Method: The paper proposes RGBX-DiffusionDet with three key components: (1) Dynamic Channel Reduction within a Convolutional Block Attention Module (DCR-CBAM) for cross-modal interaction, (2) a Dynamic Multi-Level Aggregation Block (DMLAB) for adaptive multiscale spatial feature fusion, and (3) novel regularization losses that enforce channel saliency and spatial selectivity to create compact and discriminative feature embeddings.
Result: Extensive experiments on RGB-Depth (KITTI), RGB-Polarimetric, and RGB-Infrared (M³FD) datasets demonstrate consistent superiority over baseline RGB-only DiffusionDet while maintaining the original decoding complexity and efficiency of the framework.
Conclusion: RGBX-DiffusionDet establishes a flexible and efficient multimodal object detection approach that successfully integrates diverse 2D sensing modalities into diffusion-based detection pipelines, providing new insights for multimodal fusion in object detection tasks.
Abstract: This work introduces RGBX-DiffusionDet, an object detection framework extending the DiffusionDet model to fuse the heterogeneous 2D data (X) with RGB imagery via an adaptive multimodal encoder. To enable cross-modal interaction, we design the dynamic channel reduction within a convolutional block attention module (DCR-CBAM), which facilitates cross-talk between subnetworks by dynamically highlighting salient channel features. Furthermore, the dynamic multi-level aggregation block (DMLAB) is proposed to refine spatial feature representations through adaptive multiscale fusion. Finally, novel regularization losses that enforce channel saliency and spatial selectivity are introduced, leading to compact and discriminative feature embeddings. Extensive experiments using RGB-Depth (KITTI), a novel annotated RGB-Polarimetric dataset, and the RGB-Infrared (M³FD) benchmark dataset were conducted. We demonstrate consistent superiority of the proposed approach over the baseline RGB-only DiffusionDet. The modular architecture maintains the original decoding complexity, ensuring efficiency. These results establish the proposed RGBX-DiffusionDet as a flexible multimodal object detection approach, providing new insights into integrating diverse 2D sensing modalities into diffusion-based detection pipelines.
[191] Application of YOLOv8 in monocular downward multiple Car Target detection
Shijie Lyu
Main category: cs.CV
TL;DR: This paper presents an improved YOLOv8-based object detection network for autonomous driving that integrates structural reparameterization, a bidirectional pyramid structure, and a novel detection pipeline to achieve 65% detection accuracy on multi-scale objects, particularly excelling at small and remote object detection for autonomous driving applications.
Details
Motivation: Current autonomous driving object detection technologies face significant challenges including high costs, vulnerability to weather and lighting conditions, and limited resolution. Traditional methods using radar, cameras, and vehicle sensor networks have limitations that hinder effective object detection, particularly for small and remote objects in autonomous driving scenarios.
Method: The paper proposes an improved autonomous target detection network based on the YOLOv8 framework by integrating three key components: (1) structural reparameterization technology, (2) a bidirectional pyramid structure network model, and (3) a novel detection pipeline. These enhancements are designed to enable highly efficient and precise detection of multi-scale, small, and remote objects (a reparameterization sketch follows the abstract below).
Result: The enhanced model achieves 65% detection accuracy and can effectively detect both large and small objects. Experimental results demonstrate significant advancements over traditional methods, showing the model’s capability to handle multi-scale object detection with improved precision and efficiency.
Conclusion: The improved YOLOv8-based model shows substantial potential for real-world autonomous driving applications and is particularly well-suited for autonomous driving competitions like Formula Student Autonomous China (FSAC). The model excels in scenarios involving single-target and small-object detection, representing a significant advancement in autonomous driving object detection technology.
Abstract: Autonomous driving technology is progressively transforming traditional car driving methods, marking a significant milestone in modern transportation. Object detection serves as a cornerstone of autonomous systems, playing a vital role in enhancing driving safety, enabling autonomous functionality, improving traffic efficiency, and facilitating effective emergency responses. However, current technologies such as radar for environmental perception, cameras for road perception, and vehicle sensor networks face notable challenges, including high costs, vulnerability to weather and lighting conditions, and limited resolution. To address these limitations, this paper presents an improved autonomous target detection network based on YOLOv8. By integrating structural reparameterization technology, a bidirectional pyramid structure network model, and a novel detection pipeline into the YOLOv8 framework, the proposed approach achieves highly efficient and precise detection of multi-scale, small, and remote objects. Experimental results demonstrate that the enhanced model can effectively detect both large and small objects with a detection accuracy of 65%, showcasing significant advancements over traditional methods. This improved model holds substantial potential for real-world applications and is well-suited for autonomous driving competitions, such as the Formula Student Autonomous China (FSAC), particularly excelling in scenarios involving single-target and small-object detection.
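Structural reparameterization generally refers to merging parallel training-time branches into a single convolution for inference (as popularized by RepVGG). The sketch below fuses a 3x3 and a 1x1 branch without batch normalization; it illustrates the general technique, not the specific blocks used in this paper.

```python
import torch
from torch import nn
import torch.nn.functional as F

def fuse_rep_branches(conv3x3: nn.Conv2d, conv1x1: nn.Conv2d) -> nn.Conv2d:
    """Merge parallel 3x3 and 1x1 conv branches (both with bias, no BN) into a
    single 3x3 conv that produces identical outputs at inference time."""
    fused = nn.Conv2d(conv3x3.in_channels, conv3x3.out_channels, 3, padding=1, bias=True)
    with torch.no_grad():
        # Place the 1x1 kernel at the center of a 3x3 kernel, then sum the branches.
        fused.weight.copy_(conv3x3.weight + F.pad(conv1x1.weight, [1, 1, 1, 1]))
        fused.bias.copy_(conv3x3.bias + conv1x1.bias)
    return fused

c3, c1 = nn.Conv2d(8, 8, 3, padding=1), nn.Conv2d(8, 8, 1)
x = torch.randn(1, 8, 32, 32)
print(torch.allclose(c3(x) + c1(x), fuse_rep_branches(c3, c1)(x), atol=1e-5))  # True
```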
[192] ORL-LDM: Offline Reinforcement Learning Guided Latent Diffusion Model Super-Resolution Reconstruction
Shijie Lyu
Main category: cs.CV
TL;DR: This paper proposes a reinforcement learning-based latent diffusion model (LDM) fine-tuning method for remote sensing image super-resolution, using proximal policy optimization (PPO) to optimize the reverse denoising process and achieving significant improvements in image quality metrics.
Details
Motivation: Existing deep learning methods for remote sensing image super-resolution face limitations in handling complex scenes and preserving image details, despite the rapid advancement of remote sensing technology making super-resolution reconstruction increasingly important for research and practical applications.
Method: The method constructs a reinforcement learning environment with states, actions, and rewards, and uses proximal policy optimization (PPO) to optimize decision objectives during the reverse denoising process of the latent diffusion model (LDM) for remote sensing image super-resolution (a PPO-objective sketch follows the abstract below).
Result: Experiments on the RESISC45 dataset show significant improvements over baseline models: PSNR increased by 3-4dB, SSIM improved by 0.08-0.11, and LPIPS reduced by 0.06-0.10, with particularly strong performance in structured and complex natural scenes.
Conclusion: The proposed reinforcement learning-based LDM fine-tuning method effectively enhances super-resolution quality and demonstrates improved adaptability across different scene types for remote sensing image reconstruction.
Abstract: With the rapid advancement of remote sensing technology, super-resolution image reconstruction is of great research and practical significance. Existing deep learning methods have made progress but still face limitations in handling complex scenes and preserving image details. This paper proposes a reinforcement learning-based latent diffusion model (LDM) fine-tuning method for remote sensing image super-resolution. The method constructs a reinforcement learning environment with states, actions, and rewards, optimizing decision objectives through proximal policy optimization (PPO) during the reverse denoising process of the LDM model. Experiments on the RESISC45 dataset show significant improvements over the baseline model in PSNR, SSIM, and LPIPS, with PSNR increasing by 3-4dB, SSIM improving by 0.08-0.11, and LPIPS reducing by 0.06-0.10, particularly in structured and complex natural scenes. The results demonstrate the method’s effectiveness in enhancing super-resolution quality and adaptability across scenes.
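The PPO objective itself is standard; below is a minimal sketch of the clipped surrogate loss such a fine-tuning loop would minimize. How states, actions, and rewards are defined over the LDM's reverse denoising steps is specific to the paper and not reproduced here.

```python
import torch

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective (to be maximized; returned as a loss)."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy example: in this setting, advantages would be derived from image-quality
# rewards (e.g., PSNR/SSIM-style scores of the super-resolved output).
loss = ppo_clipped_loss(torch.randn(32), torch.randn(32), torch.randn(32))
```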
[193] SurgXBench: Explainable Vision-Language Model Benchmark for Surgery
Jiajun Cheng, Xianwu Zhao, Sainan Liu, Xiaofan Yu, Ravi Prakash, Patrick J. Codd, Jonathan Elliott Katz, Shan Lin
Main category: cs.CV
TL;DR: This paper benchmarks vision-language models (VLMs) for surgical instrument and action recognition in robotic surgery, revealing that despite their potential, current surgical VLMs rely on weak contextual cues rather than clinically relevant features, indicating a need for improved training approaches.
Details
Motivation: Real-time awareness of surgical instruments and actions is essential for intelligent robotic surgery systems. While VLMs show great generalization potential, surgical VLMs remain under-explored with limited performance, creating a need for comprehensive benchmark studies to assess their capabilities and guide future development.
Method: The authors benchmark zero-shot performance of advanced VLMs on two public robotic-assisted laparoscopic datasets for instrument and action classification. They integrate explainable AI techniques to visualize VLM attention patterns and uncover causal explanations behind predictions, proposing new explainability-based evaluation metrics beyond standard performance measures.
Result: The analysis shows that surgical VLMs, despite domain-specific training, often rely on weak contextual cues rather than clinically relevant visual evidence. The explainable AI analysis reveals reliability issues in model predictions and provides insights into the decision-making process of these models.
Conclusion: Current surgical VLMs demonstrate limited performance and questionable reliability, as they depend on weak contextual information rather than meaningful clinical visual features. This highlights the critical need for stronger visual and reasoning supervision in surgical applications to improve model robustness and clinical applicability.
Abstract: Innovations in digital intelligence are transforming robotic surgery with more informed decision-making. Real-time awareness of surgical instrument presence and actions (e.g., cutting tissue) is essential for such systems. Yet, despite decades of research, most machine learning models for this task are trained on small datasets and still struggle to generalize. Recently, vision-Language Models (VLMs) have brought transformative advances in reasoning across visual and textual modalities. Their unprecedented generalization capabilities suggest great potential for advancing intelligent robotic surgery. However, surgical VLMs remain under-explored, and existing models show limited performance, highlighting the need for benchmark studies to assess their capabilities and limitations and to inform future development. To this end, we benchmark the zero-shot performance of several advanced VLMs on two public robotic-assisted laparoscopic datasets for instrument and action classification. Beyond standard evaluation, we integrate explainable AI to visualize VLM attention and uncover causal explanations behind their predictions. This provides a previously underexplored perspective in this field for evaluating the reliability of model predictions. We also propose several explainability analysis-based metrics to complement standard evaluations. Our analysis reveals that surgical VLMs, despite domain-specific training, often rely on weak contextual cues rather than clinically relevant visual evidence, highlighting the need for stronger visual and reasoning supervision in surgical applications.
[194] JEDI: The Force of Jensen-Shannon Divergence in Disentangling Diffusion Models
Eric Tillmann Bill, Enis Simsar, Thomas Hofmann
Main category: cs.CV
TL;DR: JEDI is a test-time adaptation method that improves subject separation and compositional alignment in diffusion models by minimizing semantic entanglement in attention maps using Jensen-Shannon divergence, without requiring retraining or external supervision.
Details
Motivation: Diffusion models struggle with subject separation and compositional alignment in complex scenes, particularly when generating images with multiple subjects or compositional elements that can become semantically entangled.
Method: JEDI uses a Jensen-Shannon divergence based objective to minimize semantic entanglement in attention maps during test-time adaptation. The method leverages adversarial optimization to reduce the number of updating steps required and is model-agnostic, working with architectures like Stable Diffusion 1.5 and 3.5 (a toy divergence sketch follows the abstract below).
Result: JEDI consistently improves prompt alignment and disentanglement in complex scenes across different diffusion model architectures. The method also provides a lightweight, CLIP-free disentanglement score derived from internal attention distributions for benchmarking compositional alignment.
Conclusion: JEDI offers an efficient, model-agnostic solution for enhancing compositional understanding in diffusion models at test-time, providing both improved generation quality and a principled evaluation metric for compositional alignment without requiring model retraining.
Abstract: We introduce JEDI, a test-time adaptation method that enhances subject separation and compositional alignment in diffusion models without requiring retraining or external supervision. JEDI operates by minimizing semantic entanglement in attention maps using a novel Jensen-Shannon divergence based objective. To improve efficiency, we leverage adversarial optimization, reducing the number of updating steps required. JEDI is model-agnostic and applicable to architectures such as Stable Diffusion 1.5 and 3.5, consistently improving prompt alignment and disentanglement in complex scenes. Additionally, JEDI provides a lightweight, CLIP-free disentanglement score derived from internal attention distributions, offering a principled benchmark for compositional alignment under test-time conditions. Code and results are available at https://ericbill21.github.io/JEDI/.
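To ground the Jensen-Shannon divergence idea, here is a toy sketch that computes pairwise JSD between (flattened) subject attention maps and negates it, so that minimizing the value pushes different subjects' attention apart. This is one plausible reading of a JSD-based disentanglement objective, not JEDI's exact formulation.

```python
import torch

def js_divergence(p, q, eps=1e-8):
    """Jensen-Shannon divergence between two discrete distributions (flattened attention maps)."""
    p, q = p.flatten() + eps, q.flatten() + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a / b).log()).sum()
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def disentanglement_objective(attn_maps):
    """Encourage distinct subjects' attention maps to occupy different regions
    by maximizing pairwise JSD (returned negated so it can be minimized)."""
    total, n = 0.0, 0
    for i in range(len(attn_maps)):
        for j in range(i + 1, len(attn_maps)):
            total, n = total + js_divergence(attn_maps[i], attn_maps[j]), n + 1
    return -total / max(n, 1)

maps = [torch.rand(16, 16) for _ in range(3)]   # fake attention maps for three subjects
print(disentanglement_objective(maps))
```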
[195] Text2Stereo: Repurposing Stable Diffusion for Stereo Generation with Consistency Rewards
Aakash Garg, Libing Zeng, Andrii Tsarov, Nima Khademi Kalantari
Main category: cs.CV
TL;DR: A novel diffusion-based approach that fine-tunes Stable Diffusion on stereo image datasets to generate high-quality stereo images from text prompts, using prompt alignment and stereo consistency reward functions to improve performance.
Details
Motivation: Stereo image datasets with large baselines are scarce, making it infeasible to train a diffusion model from scratch for stereo image generation from text prompts. This motivates leveraging the strong priors of pre-trained models while adapting them for stereo generation tasks.
Method: Fine-tuning Stable Diffusion on stereo image datasets to adapt it for stereo generation, combined with prompt alignment and novel stereo consistency reward functions to enhance stereo consistency and text-to-image alignment.
Result: Comprehensive experiments show the approach generates high-quality stereo images across diverse scenarios and outperforms existing methods in stereo image generation tasks.
Conclusion: The proposed diffusion-based approach successfully addresses the challenge of stereo image generation from text by leveraging pre-trained models and specialized fine-tuning techniques, demonstrating superior performance compared to existing methods.
Abstract: In this paper, we propose a novel diffusion-based approach to generate stereo images given a text prompt. Since stereo image datasets with large baselines are scarce, training a diffusion model from scratch is not feasible. Therefore, we propose leveraging the strong priors learned by Stable Diffusion and fine-tuning it on stereo image datasets to adapt it to the task of stereo generation. To improve stereo consistency and text-to-image alignment, we further tune the model using prompt alignment and our proposed stereo consistency reward functions. Comprehensive experiments demonstrate the superiority of our approach in generating high-quality stereo images across diverse scenarios, outperforming existing methods.
[196] Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers
Zhengyao Lv, Tianlin Pan, Chenyang Si, Zhaoxi Chen, Wangmeng Zuo, Ziwei Liu, Kwan-Yee K. Wong
Main category: cs.CV
TL;DR: This paper proposes Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method to improve text-image alignment in Multimodal Diffusion Transformers by addressing attention mechanism issues through temperature scaling and timestep-dependent adjustment.
Details
Motivation: State-of-the-art MM-DiT models like FLUX struggle with precise alignment between text prompts and generated content due to two key issues: 1) suppression of cross-modal attention caused by token imbalance between visual and textual modalities, and 2) lack of timestep-aware attention weighting in the attention mechanism.
Method: The authors propose Temperature-Adjusted Cross-modal Attention (TACA), which dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. This parameter-efficient method is combined with LoRA fine-tuning to enhance text-image alignment with minimal computational overhead (a toy temperature-scaling sketch follows the abstract below).
Result: TACA significantly enhances text-image alignment on the T2I-CompBench benchmark when tested on state-of-the-art models like FLUX and SD3.5. The method demonstrates improvements in object appearance, attribute binding, and spatial relationships between text and generated images.
Conclusion: The research highlights the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models. TACA provides an effective solution for enhancing multimodal alignment in MM-DiT models with minimal computational cost.
Abstract: Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle with achieving precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT, namely 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities and 2) the lack of timestep-aware attention weighting, which hinder the alignment. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models. Our codes are publicly available at https://github.com/Vchitect/TACA.
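A hedged sketch of what temperature-adjusted cross-attention can look like: the text-attention logits are sharpened by a temperature factor that varies with the denoising timestep. The linear schedule and the direction of the adjustment are assumptions for illustration; TACA's actual scaling rule is defined in the paper.

```python
import torch
import torch.nn.functional as F

def temperature_adjusted_attention(q, k, v, t, t_max, base_temp=1.5):
    """Cross-attention whose logits are scaled by a timestep-dependent temperature
    factor (illustrative linear schedule; assumed, not the paper's rule)."""
    temp = 1.0 + (base_temp - 1.0) * (t / t_max)     # stronger boost at early (noisy) timesteps, by assumption
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(temp * scores, dim=-1) @ v

# Image-token queries attending to text-token keys/values at timestep t = 800 of 1000
q, k, v = torch.randn(1, 256, 64), torch.randn(1, 77, 64), torch.randn(1, 77, 64)
out = temperature_adjusted_attention(q, k, v, t=800, t_max=1000)
```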
[197] InceptionMamba: An Efficient Hybrid Network with Large Band Convolution and Bottleneck Mamba
Yuhang Wang, Jun Li, Zhijian Wu, Jifeng Shen, Jianhua Xu, Wankou Yang
Main category: cs.CV
TL;DR: InceptionMamba is a novel backbone architecture that improves upon InceptionNeXt by replacing one-dimensional strip convolutions with orthogonal band convolutions and incorporating a bottleneck Mamba module for better spatial modeling and global context capture, achieving state-of-the-art performance with superior efficiency.
Details
Motivation: InceptionNeXt, while competitive in image classification, has limitations in capturing spatial dependencies along different dimensions, fails to fully explore spatial modeling in local neighborhoods, and suffers from locality constraints that hinder effective global context modeling.
Method: The proposed InceptionMamba replaces traditional one-dimensional strip convolutions with orthogonal band convolutions for cohesive spatial modeling, and incorporates a bottleneck Mamba module to achieve global contextual modeling with enhanced cross-channel information fusion and an enlarged receptive field.
Result: Extensive evaluations on classification and various downstream tasks show that InceptionMamba achieves state-of-the-art performance while maintaining superior parameter and computational efficiency compared to existing methods.
Conclusion: InceptionMamba successfully addresses the limitations of InceptionNeXt by providing better spatial dependency modeling and global context capture, resulting in improved performance across multiple tasks with enhanced efficiency.
Abstract: Within the family of convolutional neural networks, InceptionNeXt has shown excellent competitiveness in image classification and a number of downstream tasks. Built on parallel one-dimensional strip convolutions, however, it suffers from limited ability of capturing spatial dependencies along different dimensions and fails to fully explore spatial modeling in local neighborhood. Besides, inherent locality constraints of convolution operations are detrimental to effective global context modeling. To overcome these limitations, we propose a novel backbone architecture termed InceptionMamba in this study. More specifically, the traditional one-dimensional strip convolutions are replaced by orthogonal band convolutions in our InceptionMamba to achieve cohesive spatial modeling. Furthermore, global contextual modeling can be achieved via a bottleneck Mamba module, facilitating enhanced cross-channel information fusion and enlarged receptive field. Extensive evaluations on classification and various downstream tasks demonstrate that the proposed InceptionMamba achieves state-of-the-art performance with superior parameter and computational efficiency. The source code will be available at https://github.com/Wake1021/InceptionMamba.
[198] Rethinking Range-View LiDAR Segmentation in Adverse Weather
Longyu Yang, Lu Zhang, Jun Liu, Yap-Peng Tan, Heng Tao Shen, Xiaofeng Zhu, Ping Hu
Main category: cs.CV
TL;DR: This paper proposes a lightweight framework to improve LiDAR segmentation performance under adverse weather conditions by separately processing geometric and reflectance features through specialized modules that suppress weather-induced noise and distortions.
Details
Motivation: Range-view LiDAR segmentation methods, while computationally efficient, suffer from poor generalization under adverse weather conditions due to weather-induced spatial noise and reflectance distortions, limiting their reliability in real-world deployments.
Method: The approach reformulates the stem block of range-view networks into two branches: a Geometric Abnormality Suppression (GAS) module to reduce weather-induced spatial noise, and a Reflectance Distortion Calibration (RDC) module using memory-guided adaptive instance normalization to correct reflectance distortions. The processed features are then fused for segmentation (an adaptive-instance-normalization sketch follows the abstract below).
Result: Extensive experiments across different benchmarks and baseline models show significant improvement in generalization to adverse weather conditions with minimal inference overhead, demonstrating the framework’s practical effectiveness.
Conclusion: The proposed modular framework successfully enhances the robustness of range-view LiDAR segmentation under severe weather without altering core architectures, providing a practical solution for real-world applications with maintained computational efficiency.
Abstract: LiDAR segmentation has emerged as an important task to enrich scene perception and understanding. Range-view-based methods have gained popularity due to their high computational efficiency and compatibility with real-time deployment. However, their generalized performance under adverse weather conditions remains underexplored, limiting their reliability in real-world environments. In this work, we identify and analyze the unique challenges that affect the generalization of range-view LiDAR segmentation in severe weather. To address these challenges, we propose a modular and lightweight framework that enhances robustness without altering the core architecture of existing models. Our method reformulates the initial stem block of standard range-view networks into two branches to process geometric attributes and reflectance intensity separately. Specifically, a Geometric Abnormality Suppression (GAS) module reduces the influence of weather-induced spatial noise, and a Reflectance Distortion Calibration (RDC) module corrects reflectance distortions through memory-guided adaptive instance normalization. The processed features are then fused and passed to the original segmentation pipeline. Extensive experiments on different benchmarks and baseline models demonstrate that our approach significantly improves generalization to adverse weather with minimal inference overhead, offering a practical and effective solution for real-world LiDAR segmentation.
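The RDC module is described as memory-guided adaptive instance normalization. The sketch below shows plain adaptive instance normalization (re-normalizing reflectance features toward reference statistics); the memory lookup that would supply those reference statistics is omitted, and the shapes are assumptions.

```python
import torch

def adaptive_instance_norm(x, ref_mean, ref_std, eps=1e-5):
    """Adaptive instance normalization: replace per-channel statistics of `x`
    with reference (e.g., memory-retrieved) statistics.
    x: (B, C, H, W) reflectance feature map; ref_mean / ref_std: (C,) tensors."""
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = x.std(dim=(2, 3), keepdim=True) + eps
    x_norm = (x - mu) / sigma
    return x_norm * ref_std.view(1, -1, 1, 1) + ref_mean.view(1, -1, 1, 1)

feat = torch.randn(2, 32, 64, 512)                 # toy range-view feature map (B, C, H, W)
out = adaptive_instance_norm(feat, torch.zeros(32), torch.ones(32))
```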
[199] Prompt Guidance and Human Proximal Perception for HOT Prediction with Regional Joint Loss
Yuxiao Wang, Yu Lei, Zhenao Wei, Weiying Xue, Xinyu Jiang, Nan Zhuang, Qi Liu
Main category: cs.CV
TL;DR: P3HOT is a novel framework for Human-Object conTact (HOT) detection that combines prompt guidance and human proximal perception to identify specific areas where human body touches objects, achieving state-of-the-art performance across multiple metrics on benchmark datasets.
Details
Motivation: Current HOT detection models are limited to single image types, causing over-segmentation in low-interaction areas and failing to maintain category consistency within regions. There is a need for better handling of human-object spatial relationships and negative samples in contact detection.
Method: The P3HOT framework integrates: (1) a semantic-driven prompt mechanism using image-text correlation to guide network attention, (2) a human proximal perception mechanism with learnable parameters to dynamically perceive key depth ranges and eliminate non-interaction regions, (3) a Regional Joint Loss (RJLoss) to prevent abnormal categories in the same area, and (4) a new evaluation metric, “AD-Acc.”, for better negative sample assessment.
Result: P3HOT achieves state-of-the-art performance on two benchmark datasets with improvements of 0.7↑ in SC-Acc., 2.0↑ in mIoU, 1.6↑ in wIoU, and 11.0↑ in AD-Acc. metrics on the HOT-Annotated dataset, demonstrating superior performance across four evaluation metrics.
Conclusion: The proposed P3HOT framework successfully addresses the limitations of existing HOT detection methods by incorporating prompt guidance and depth-aware human proximal perception, leading to more accurate contact detection and better handling of spatial relationships between humans and objects.
Abstract: The task of Human-Object conTact (HOT) detection involves identifying the specific areas of the human body that are touching objects. Nevertheless, current models are restricted to just one type of image, often leading to too much segmentation in areas with little interaction, and struggling to maintain category consistency within specific regions. To tackle this issue, a HOT framework, termed P3HOT, is proposed, which blends Prompt guidance and human Proximal Perception. To begin with, we utilize a semantic-driven prompt mechanism to direct the network’s attention towards the relevant regions based on the correlation between image and text. Then a human proximal perception mechanism is employed to dynamically perceive key depth range around the human, using learnable parameters to effectively eliminate regions where interactions are not expected. Calculating depth resolves the uncertainty of the overlap between humans and objects in a 2D perspective, providing a quasi-3D viewpoint. Moreover, a Regional Joint Loss (RJLoss) has been created as a new loss to inhibit abnormal categories in the same area. A new evaluation metric called “AD-Acc.” is introduced to address the shortcomings of existing methods in addressing negative samples. Comprehensive experimental results demonstrate that our approach achieves state-of-the-art performance in four metrics across two benchmark datasets. Specifically, our model achieves an improvement of 0.7↑, 2.0↑, 1.6↑, and 11.0↑ in the SC-Acc., mIoU, wIoU, and AD-Acc. metrics, respectively, on the HOT-Annotated dataset. The source code is available at https://github.com/YuxiaoWang-AI/P3HOT.
[200] BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion
Yuqing Lan, Chenyang Zhu, Zhirui Gao, Jiazhao Zhang, Yihan Cao, Renjiao Yi, Yijie Wang, Kai Xu
Main category: cs.CV
TL;DR: A reconstruction-free online framework for open-vocabulary 3D object detection that uses streaming RGB-D video, Cubify Anything VFM, and CLIP to achieve real-time performance without dense point cloud reconstruction, demonstrating state-of-the-art results on ScanNetV2 and CA-1M datasets.
Details
Motivation: Existing 3D object detection methods rely on dense point cloud reconstruction, which creates substantial computational overhead and memory constraints, preventing real-time deployment in applications like autonomous driving and embodied AI.
Method: The framework uses Cubify Anything as a pre-trained visual foundation model for single-view 3D object detection with bounding boxes, CLIP for open-vocabulary semantics, an association module with 3D NMS and correspondence matching for multi-view fusion, and an optimization module using IoU-guided particle filtering to ensure multi-view consistency of 3D bounding boxes (a toy 3D-IoU sketch follows the abstract below).
Result: Achieves state-of-the-art performance among online methods on ScanNetV2 and CA-1M datasets, demonstrates great generalization abilities across various scenarios, and enables real-time perception in environments exceeding 1000 square meters.
Conclusion: The reconstruction-free paradigm successfully addresses computational and memory limitations of existing methods, enabling efficient real-time 3D object detection with strong generalization capabilities for large-scale environments.
Abstract: Open-vocabulary 3D object detection has gained significant interest due to its critical applications in autonomous driving and embodied AI. Existing detection methods, whether offline or online, typically rely on dense point cloud reconstruction, which imposes substantial computational overhead and memory constraints, hindering real-time deployment in downstream tasks. To address this, we propose a novel reconstruction-free online framework tailored for memory-efficient and real-time 3D detection. Specifically, given streaming posed RGB-D video input, we leverage Cubify Anything as a pre-trained visual foundation model (VFM) for single-view 3D object detection by bounding boxes, coupled with CLIP to capture open-vocabulary semantics of detected objects. To fuse all detected bounding boxes across different views into a unified one, we employ an association module for correspondences of multi-views and an optimization module to fuse the 3D bounding boxes of the same instance predicted in multi-views. The association module utilizes 3D Non-Maximum Suppression (NMS) and a box correspondence matching module, while the optimization module uses an IoU-guided efficient random optimization technique based on particle filtering to enforce multi-view consistency of the 3D bounding boxes while minimizing computational complexity. Extensive experiments on ScanNetV2 and CA-1M datasets demonstrate that our method achieves state-of-the-art performance among online methods. Benefiting from this novel reconstruction-free paradigm for 3D object detection, our method exhibits great generalization abilities in various scenarios, enabling real-time perception even in environments exceeding 1000 square meters.
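Multi-view fusion of detections hinges on a 3D overlap measure. As a simplified stand-in (the paper's boxes may be oriented, and its fusion uses particle filtering), here is axis-aligned 3D IoU, the quantity a 3D NMS or IoU-guided matching step would threshold.

```python
import numpy as np

def iou_3d_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(box_a[:3], box_b[:3])
    hi = np.minimum(box_a[3:], box_b[3:])
    inter = np.prod(np.clip(hi - lo, 0, None))      # intersection volume (0 if disjoint)
    vol_a = np.prod(box_a[3:] - box_a[:3])
    vol_b = np.prod(box_b[3:] - box_b[:3])
    return inter / (vol_a + vol_b - inter + 1e-9)

a = np.array([0, 0, 0, 2, 2, 2], dtype=float)
b = np.array([1, 1, 1, 3, 3, 3], dtype=float)
print(iou_3d_axis_aligned(a, b))   # 1 / 15
```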
[201] Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning
Bob Zhang, Haoran Li, Tao Zhang, Cilin Yan, Jiayin Cai, Yanbin Hao
Main category: cs.CV
TL;DR: This paper proposes a Reinforcement Learning-based post-training strategy to improve Multimodal Large Language Models’ performance on multi-image grounding tasks, achieving significant improvements over supervised fine-tuning baselines across multiple benchmarks.
Details
Motivation: While MLLMs excel at visual grounding in single-image scenarios, they struggle with real-world applications involving complex multi-image compositions and multi-modal instructions, showing limitations in cross-image reasoning and generalization capabilities.
Method: The approach uses a two-stage training strategy: (1) cold-start initialization with synthesized chain-of-thought data followed by supervised fine-tuning using LoRA, and (2) Reinforcement Learning with rejection sampling to curate high-quality RL data and rule-based RL to guide the model toward optimal reasoning paths (a toy rule-based-reward sketch follows the abstract below).
Result: The method achieves substantial improvements: +9.04% on MIG-Bench, +6.37% on MC-Bench, +4.98% on out-of-domain reasoning grounding benchmarks compared to SFT baseline, and shows strong generalization with +3.1% and +2.4% gains on BLINK and MMIU benchmarks respectively.
Conclusion: The RL-based post-training strategy effectively enhances MLLMs’ multi-image grounding capabilities, demonstrating significant performance improvements and strong generalization across various benchmarks, making it a promising approach for real-world multi-modal applications.
Abstract: Recently, Multimodal Large Language Models (MLLMs) excel at visual grounding in single-image scenarios with textual references. However, their performance degrades when handling real-world applications that involve complex multi-image compositions and multi-modal instructions, revealing limitations in cross-image reasoning and generalization. To address these challenges, we adopt a Reinforcement Learning (RL) based post-training strategy to improve the reasoning of MLLMs in multi-image grounding tasks. Our approach begins with synthesizing high-quality chain-of-thought (CoT) data for cold-start initialization, followed by supervised fine-tuning (SFT) using low-rank adaptation (LoRA). The cold-start training stage enables the model to identify correct solutions. Subsequently, we perform rejection sampling using the merged SFT model to curate high-quality RL data and leverage rule-based RL to guide the model toward optimal reasoning paths. Extensive experimental results demonstrate the effectiveness of our approach, yielding improvements of +9.04% on MIG-Bench, +6.37% on MC-Bench, and +4.98% on several out-of-domain reasoning grounding benchmarks compared to the SFT baseline. Furthermore, our method exhibits strong generalization in multi-image perception, with gains of +3.1% and +2.4% over the base model on BLINK and MMIU benchmarks, respectively.
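Rule-based RL rewards for grounding are typically simple, verifiable functions of the model output. The toy reward below combines a format check with an IoU threshold on the predicted box; the weights, threshold, and structure are illustrative assumptions, not the paper's exact rule.

```python
def grounding_reward(pred_box, gt_box, answer_is_well_formed, iou_thresh=0.5):
    """Toy rule-based reward: a small format term plus a binary IoU term."""
    def iou(a, b):
        # Boxes as (x1, y1, x2, y2)
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area = lambda bb: (bb[2] - bb[0]) * (bb[3] - bb[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    reward = 0.1 if answer_is_well_formed else 0.0          # format reward
    reward += 1.0 if iou(pred_box, gt_box) >= iou_thresh else 0.0  # accuracy reward
    return reward

print(grounding_reward([10, 10, 50, 50], [12, 8, 55, 48], True))
```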
[202] How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
Rahul Ramachandran, Ali Garjani, Roman Bachmann, Andrei Atanov, Oğuzhan Fatih Kar, Amir Zamir
Main category: cs.CV
TL;DR: This paper benchmarks popular multimodal foundation models (GPT-4o, Gemini, Claude, etc.) on standard computer vision tasks through a novel prompt-chaining framework, finding that while these models don’t match specialist performance, they serve as respectable generalists with better semantic than geometric understanding.
Details
Motivation: Current multimodal foundation models like GPT-4o show impressive capabilities, but their exact standing in computer vision understanding remains unclear. There's a need to systematically evaluate these models on standard vision tasks despite challenges like text-only outputs and API-only access.
Method: The authors develop a standardized benchmarking framework using prompt chaining to translate standard computer vision tasks (semantic segmentation, object detection, image classification, depth prediction, surface normal prediction) into text-promptable and API-compatible formats. They evaluate models on established datasets like COCO and ImageNet.
Result: Key findings include: (1) Models don’t match state-of-the-art specialists but are respectable generalists, (2) Better performance on semantic vs geometric tasks, (3) GPT-4o leads non-reasoning models in 4/6 tasks, (4) Reasoning models like o3 improve on geometric tasks, (5) Better models show less sensitivity to prompt variations, and (6) Native image generation models exhibit hallucinations and spatial misalignments.
Conclusion: Multimodal foundation models demonstrate remarkable generalist capabilities across vision tasks despite being primarily trained on image-text data, but significant gaps remain compared to specialized models, particularly in geometric understanding tasks.
Abstract: Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants, etc). The main challenges to performing this are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining to create a standardized benchmarking framework. We observe that 1) the models are not close to the state-of-the-art specialist models at any task. However, 2) they are respectable generalists; this is remarkable as they are presumably trained on primarily image-text-based tasks. 3) They perform semantic tasks notably better than geometric ones. 4) While the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations. 5) GPT-4o performs the best among non-reasoning models, securing the top position in 4 out of 6 tasks, 6) reasoning models, e.g. o3, show improvements in geometric tasks, and 7) a preliminary analysis of models with native image generation, like the latest GPT-4o, shows they exhibit quirks like hallucinations and spatial misalignments.
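Prompt chaining here means decomposing a vision task into a sequence of text-only queries that an API-gated model can answer. The sketch below shows the idea for classification with a coarse-to-fine chain of multiple-choice questions; `ask_model` and the two-level label tree are placeholders, not the paper's actual protocol.

```python
from typing import Callable, Dict, List

def classify_by_prompt_chaining(
    image_ref: str,
    label_tree: Dict[str, List[str]],        # coarse group -> list of fine labels
    ask_model: Callable[[str, str], str],    # (image_ref, question) -> model's text answer
) -> str:
    """Two-step prompt chain: first pick a coarse group, then a fine label within it."""
    groups = list(label_tree)
    q1 = "Which group best describes the main object? Options: " + ", ".join(groups)
    group = ask_model(image_ref, q1).strip()
    if group not in label_tree:              # fall back if the free-form answer is off-list
        group = groups[0]
    fine = label_tree[group]
    q2 = "Which label fits best? Options: " + ", ".join(fine)
    answer = ask_model(image_ref, q2).strip()
    return answer if answer in fine else fine[0]
```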
[203] AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding
Weili Xu, Enxin Song, Wenhao Chai, Xuexiang Wen, Tian Ye, Gaoang Wang
Main category: cs.CV
TL;DR: AuroraLong replaces transformer-based LLMs with linear RNN language models in multimodal models to enable efficient long video understanding with constant memory usage, achieving comparable performance to larger transformer models while using only 2B parameters and public data.
Details
Motivation: Long video understanding faces high computational complexity and prohibitive memory costs because transformer-based LLMs require memory and computation that scale quadratically with input sequence length, creating barriers for processing lengthy video sequences.
Method: The authors replace the LLM component in multimodal LLMs (MLLMs) with a linear RNN language model that can handle arbitrary-length input sequences with constant-size hidden states. They also combine visual token merging with linear RNN models by reordering visual tokens by size in ascending order to further improve throughput and efficiency.
Result: Despite having only 2B parameters and being trained exclusively on public data, AuroraLong achieves performance comparable to transformer-based models of similar size that were trained on private datasets across multiple video benchmarks. This is the first work to use a linear RNN-based LLM backbone in a LLaVA-like model for open-ended video understanding.
Conclusion: The work demonstrates that efficient linear RNNs can democratize long video understanding by significantly lowering computational entry barriers while maintaining competitive performance, showing the potential of alternative architectures to transformers for multimodal video tasks.
Abstract: The challenge of long video understanding lies in its high computational complexity and prohibitive memory cost, since the memory and computation required by transformer-based LLMs scale quadratically with input sequence length. We propose AuroraLong to address this challenge by replacing the LLM component in MLLMs with a linear RNN language model that handles input sequence of arbitrary length with constant-size hidden states. To further increase throughput and efficiency, we combine visual token merge with linear RNN models by reordering the visual tokens by their sizes in ascending order. Despite having only 2B parameters and being trained exclusively on public data, AuroraLong achieves performance comparable to Transformer-based models of similar size trained on private datasets across multiple video benchmarks. This demonstrates the potential of efficient, linear RNNs to democratize long video understanding by lowering its computational entry barrier. To our best knowledge, we are the first to use a linear RNN based LLM backbone in a LLaVA-like model for open-ended video understanding.
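The memory argument behind AuroraLong is that a linear recurrence folds an arbitrarily long token stream into a fixed-size hidden state. The toy sketch below illustrates that constant-memory property only; the actual recurrence, gating, and visual token merging in AuroraLong are more involved.

```python
import numpy as np

def linear_rnn(tokens, A, B):
    """Fold an arbitrary-length token stream into one fixed-size state h.

    h_t = A @ h_{t-1} + B @ x_t, so memory stays O(d) no matter how long the video is.
    """
    h = np.zeros(A.shape[0])
    for x in tokens:              # tokens: iterable of per-frame feature vectors
        h = A @ h + B @ x
    return h

rng = np.random.default_rng(0)
d, d_in = 16, 8
A = 0.9 * np.eye(d)                                       # toy stable transition
B = rng.normal(size=(d, d_in)) * 0.1
stream = (rng.normal(size=d_in) for _ in range(10_000))   # long "video" as a generator
state = linear_rnn(stream, A, B)                          # constant memory regardless of length
```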
[204] Diffusion-Guided Knowledge Distillation for Weakly-Supervised Low-Light Semantic Segmentation
Chunyan Wang, Dong Zhang, Jinhui Tang
Main category: cs.CV
TL;DR: This paper proposes DGKD-WLSS, a novel framework that combines diffusion-guided knowledge distillation with depth-guided feature fusion to improve weakly-supervised semantic segmentation performance in low-light environments, addressing issues of image quality degradation and unreliable supervision signals.
Details
Motivation: Existing weakly-supervised semantic segmentation methods perform poorly in low-light environments due to severe image quality degradation (low contrast, noise, color distortion) and inherent constraints of weak supervision, leading to unreliable class activation maps and semantically ambiguous pseudo-labels that compromise discriminative feature learning.
Method: The paper introduces the DGKD-WLSS framework that synergistically combines: (1) Diffusion-Guided Knowledge Distillation (DGKD) to align normal-light and low-light features through diffusion-based denoising and knowledge distillation, and (2) Depth-Guided Feature Fusion (DGF2) that integrates depth maps as illumination-invariant geometric priors to enhance structural feature learning.
Result: Extensive experiments demonstrate that DGKD-WLSS achieves state-of-the-art performance in weakly supervised semantic segmentation tasks under low-light conditions, showing the effectiveness of the proposed approach.
Conclusion: The proposed DGKD-WLSS framework successfully addresses the challenges of weakly-supervised semantic segmentation in low-light environments by leveraging diffusion-based knowledge distillation and depth information, achieving superior performance compared to existing methods.
Abstract: Weakly-supervised semantic segmentation aims to assign category labels to each pixel using weak annotations, significantly reducing manual annotation costs. Although existing methods have achieved remarkable progress in well-lit scenarios, their performance significantly degrades in low-light environments due to two fundamental limitations: severe image quality degradation (e.g., low contrast, noise, and color distortion) and the inherent constraints of weak supervision. These factors collectively lead to unreliable class activation maps and semantically ambiguous pseudo-labels, ultimately compromising the model’s ability to learn discriminative feature representations. To address these problems, we propose Diffusion-Guided Knowledge Distillation for Weakly-Supervised Low-light Semantic Segmentation (DGKD-WLSS), a novel framework that synergistically combines Diffusion-Guided Knowledge Distillation (DGKD) with Depth-Guided Feature Fusion (DGF2). DGKD aligns normal-light and low-light features via diffusion-based denoising and knowledge distillation, while DGF2 integrates depth maps as illumination-invariant geometric priors to enhance structural feature learning. Extensive experiments demonstrate the effectiveness of DGKD-WLSS, which achieves state-of-the-art performance in weakly supervised semantic segmentation tasks under low-light conditions. The source codes have been released at: https://github.com/ChunyanWang1/DGKD-WLSS.
[205] MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models
Qiyan Zhao, Xiaofeng Zhang, Yiheng Li, Yun Xing, Xiaosong Yuan, Feilong Tang, Sinan Fan, Xuhang Chen, Xuyao Zhang, Dahan Wang
Main category: cs.CV
TL;DR: This paper identifies that Rotary Position Encoding (RoPE) causes image alignment bias in Large Vision Language Models, where instruction tokens unevenly perceive image tokens based on their position, and proposes MCA-LLaVA using Manhattan distance-based two-dimensional spatial decay to mitigate hallucinations.
Details
Motivation: Large Vision Language Models suffer from hallucinations due to misalignment between multimodal features. The authors discovered that RoPE's long-term decay causes instruction tokens to have biased perception of image tokens, prioritizing those from the bottom-right region due to their closer positional proximity in the one-dimensional sequence, leading to insufficient image-instruction interaction.
Method: The authors propose MCA-LLaVA, which uses Manhattan distance to extend the traditional one-dimensional long-term decay to a two-dimensional, multi-directional spatial decay. This approach integrates both the one-dimensional sequence order and two-dimensional spatial position of image tokens for improved positional modeling.
Result: MCA-LLaVA demonstrates effectiveness across various hallucination and general benchmarks, showing improved performance in mitigating hallucinations and enhancing multimodal alignment by reducing image alignment bias.
Conclusion: The paper successfully identifies and addresses a fundamental issue in LVLMs where RoPE-induced image alignment bias contributes to hallucinations. MCA-LLaVA’s two-dimensional spatial decay approach effectively mitigates this bias, leading to better multimodal alignment and reduced hallucinations in vision-language tasks.
Abstract: Hallucinations pose a significant challenge in Large Vision Language Models (LVLMs), with misalignment between multimodal features identified as a key contributing factor. This paper reveals the negative impact of the long-term decay in Rotary Position Encoding (RoPE), used for positional modeling in LVLMs, on multimodal alignment. Concretely, under long-term decay, instruction tokens exhibit uneven perception of image tokens located at different positions within the two-dimensional space: prioritizing image tokens from the bottom-right region since in the one-dimensional sequence, these tokens are positionally closer to the instruction tokens. This biased perception leads to insufficient image-instruction interaction and suboptimal multimodal alignment. We refer to this phenomenon as image alignment bias. To enhance instruction’s perception of image tokens at different spatial locations, we propose MCA-LLaVA, based on Manhattan distance, which extends the long-term decay to a two-dimensional, multi-directional spatial decay. MCA-LLaVA integrates the one-dimensional sequence order and two-dimensional spatial position of image tokens for positional modeling, mitigating hallucinations by alleviating image alignment bias. Experimental results of MCA-LLaVA across various hallucination and general benchmarks demonstrate its effectiveness and generality. The code can be accessed in https://github.com/ErikZ719/MCA-LLaVA.
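The core fix is to replace the one-dimensional positional decay, which makes bottom-right image tokens look "closest" to the instruction, with a decay based on two-dimensional Manhattan distance over the vision-token grid. The sketch below contrasts the two decay patterns; the grid size, decay rate, and anchor position are illustrative assumptions rather than MCA-LLaVA's exact formulation.

```python
import numpy as np

def manhattan_decay(grid_h, grid_w, anchor_rc, gamma=0.95):
    """Per-image-token decay weights from 2D Manhattan distance to an anchor position."""
    rows, cols = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    dist = np.abs(rows - anchor_rc[0]) + np.abs(cols - anchor_rc[1])
    return gamma ** dist          # nearby tokens decay less, in every direction

def flattened_decay(grid_h, grid_w, gamma=0.95):
    """1D decay over the raster-scan index: bottom-right tokens end up closest to the text."""
    idx = np.arange(grid_h * grid_w).reshape(grid_h, grid_w)
    return gamma ** (idx.max() - idx)

two_d = manhattan_decay(24, 24, anchor_rc=(23, 23))   # 24x24 vision-token grid
one_d = flattened_decay(24, 24)
print(one_d[0, 0], two_d[0, 0])   # the 1D decay punishes the top-left corner far more
```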
[206] Spatial Frequency Modulation for Semantic Segmentation
Linwei Chen, Ying Fu, Lin Gu, Dezhi Zheng, Jifeng Dai
Main category: cs.CV
TL;DR: This paper proposes Spatial Frequency Modulation (SFM) to preserve high-frequency information in semantic segmentation by modulating high-frequency features to lower frequencies before downsampling and demodulating them back during upsampling, effectively addressing aliasing issues while maintaining fine details.
Details
Motivation: High-frequency information (fine details, textures) is crucial for semantic segmentation accuracy, but according to the Nyquist-Shannon Sampling Theorem, these components are vulnerable to aliasing and distortion when passing through downsampling layers like strided convolutions in neural networks.
Method: The authors propose Spatial Frequency Modulation (SFM) consisting of two key components: (1) Adaptive Resampling (ARS) for modulation - densely samples high-frequency areas to scale up signals and lower their frequency before downsampling; (2) Multi-Scale Adaptive Upsampling (MSAU) for demodulation - recovers high-frequency information through non-uniform upsampling and exploits information interaction between densely and sparsely resampled areas at multiple scales.
Result: Feature visualization and analysis confirm that SFM effectively alleviates aliasing while successfully retaining details after demodulation. The method demonstrates broad applicability across various architectures (CNNs to transformers) and tasks including semantic segmentation, image classification, adversarial robustness, instance segmentation, and panoptic segmentation.
Conclusion: SFM provides an effective solution to preserve high-frequency information in deep neural networks by addressing fundamental aliasing issues through frequency modulation and demodulation. The lightweight add-on modules can be seamlessly integrated into existing architectures and show consistent improvements across multiple computer vision tasks.
Abstract: High spatial frequency information, including fine details like textures, significantly contributes to the accuracy of semantic segmentation. However, according to the Nyquist-Shannon Sampling Theorem, high-frequency components are vulnerable to aliasing or distortion when propagating through downsampling layers such as strided-convolution. Here, we propose a novel Spatial Frequency Modulation (SFM) that modulates high-frequency features to a lower frequency before downsampling and then demodulates them back during upsampling. Specifically, we implement modulation through adaptive resampling (ARS) and design a lightweight add-on that can densely sample the high-frequency areas to scale up the signal, thereby lowering its frequency in accordance with the Frequency Scaling Property. We also propose Multi-Scale Adaptive Upsampling (MSAU) to demodulate the modulated feature and recover high-frequency information through non-uniform upsampling. This module further improves segmentation by explicitly exploiting information interaction between densely and sparsely resampled areas at multiple scales. Both modules can seamlessly integrate with various architectures, extending from convolutional neural networks to transformers. Feature visualization and analysis confirm that our method effectively alleviates aliasing while successfully retaining details after demodulation. Finally, we validate the broad applicability and effectiveness of SFM by extending it to image classification, adversarial robustness, instance segmentation, and panoptic segmentation tasks. The code is available at https://github.com/Linwei-Chen/SFM.
[207] Latent Diffusion Models with Masked AutoEncoders
Junho Lee, Jeongwoo Shin, Hyungwook Choi, Joonseok Lee
Main category: cs.CV
TL;DR: This paper proposes Variational Masked AutoEncoders (VMAEs) to improve autoencoders in Latent Diffusion Models by addressing three key properties: latent smoothness, perceptual compression quality, and reconstruction quality, leading to better LDMs called LDMAEs.
Details
Motivation: Existing autoencoders in Latent Diffusion Models fail to simultaneously satisfy three crucial properties (latent smoothness, perceptual compression quality, and reconstruction quality), which limits the optimal performance of LDMs despite their remarkable potential in image generation.
Method: The authors propose Variational Masked AutoEncoders (VMAEs) that leverage hierarchical features maintained by Masked AutoEncoders, and integrate these VMAEs into the LDM framework to create Latent Diffusion Models with Masked AutoEncoders (LDMAEs).
Result: The proposed VMAEs successfully address the limitations of existing autoencoders by simultaneously satisfying all three key properties (latent smoothness, perceptual compression quality, and reconstruction quality) that were previously not achievable together.
Conclusion: By integrating VMAEs into LDMs to create LDMAEs, the authors demonstrate that better autoencoder design can significantly improve the performance of Latent Diffusion Models, addressing a previously underexplored but critical component of these generative models.
Abstract: In spite of the remarkable potential of Latent Diffusion Models (LDMs) in image generation, the desired properties and optimal design of the autoencoders have been underexplored. In this work, we analyze the role of autoencoders in LDMs and identify three key properties: latent smoothness, perceptual compression quality, and reconstruction quality. We demonstrate that existing autoencoders fail to simultaneously satisfy all three properties, and propose Variational Masked AutoEncoders (VMAEs), taking advantage of the hierarchical features maintained by Masked AutoEncoders. We integrate VMAEs into the LDM framework, introducing Latent Diffusion Models with Masked AutoEncoders (LDMAEs).
[208] Frequency-Dynamic Attention Modulation for Dense Prediction
Linwei Chen, Lin Gu, Ying Fu
Main category: cs.CV
TL;DR: The paper proposes Frequency-Dynamic Attention Modulation (FDAM), a circuit-theory-inspired method that addresses frequency vanishing in Vision Transformers by introducing Attention Inversion and Frequency Dynamic Scaling to preserve critical details and textures.
Details
Motivation: Vision Transformers suffer from frequency vanishing due to their attention mechanism acting as low-pass filters and stacked-layer architecture, leading to loss of critical details and textures in visual tasks.
Method: FDAM consists of two techniques: (1) Attention Inversion (AttInv) that generates complementary high-pass filtering by inverting the low-pass filter in the attention matrix, and (2) Frequency Dynamic Scaling (FreqScale) that weights different frequency components for fine-grained adjustments to the target response function.
Result: Consistent performance improvements across various models (SegFormer, DeiT, MaskDINO) in semantic segmentation, object detection, and instance segmentation tasks. Achieved state-of-the-art results in remote sensing detection under single-scale settings.
Conclusion: FDAM effectively addresses frequency vanishing in ViTs by modulating overall frequency response, avoiding representation collapse and improving performance across diverse computer vision tasks through a plug-and-play approach.
Abstract: Vision Transformers (ViTs) have significantly advanced computer vision, demonstrating strong performance across various tasks. However, the attention mechanism in ViTs makes each layer function as a low-pass filter, and the stacked-layer architecture in existing transformers suffers from frequency vanishing. This leads to the loss of critical details and textures. We propose a novel, circuit-theory-inspired strategy called Frequency-Dynamic Attention Modulation (FDAM), which can be easily plugged into ViTs. FDAM directly modulates the overall frequency response of ViTs and consists of two techniques: Attention Inversion (AttInv) and Frequency Dynamic Scaling (FreqScale). Since circuit theory uses low-pass filters as fundamental elements, we introduce AttInv, a method that generates complementary high-pass filtering by inverting the low-pass filter in the attention matrix, and dynamically combining the two. We further design FreqScale to weight different frequency components for fine-grained adjustments to the target response function. Through feature similarity analysis and effective rank evaluation, we demonstrate that our approach avoids representation collapse, leading to consistent performance improvements across various models, including SegFormer, DeiT, and MaskDINO. These improvements are evident in tasks such as semantic segmentation, object detection, and instance segmentation. Additionally, we apply our method to remote sensing detection, achieving state-of-the-art results in single-scale settings. The code is available at https://github.com/Linwei-Chen/FDAM.
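AttInv's premise is that a row-stochastic attention matrix A acts as a low-pass filter over tokens, so I - A yields a complementary high-pass response that can be mixed back in. A minimal sketch of that mixing is below, with a fixed mixing weight standing in for the learned combination; it is not the released FDAM code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attinv_like_mixing(scores, values, alpha=0.3):
    """Combine low-pass attention with its inverted, high-pass complement.

    A (row-stochastic) averages tokens, i.e. low-pass; (I - A) keeps what averaging
    removes, i.e. high-pass. alpha mixes the two outputs.
    """
    A = softmax(scores, axis=-1)          # low-pass filter over tokens
    I = np.eye(A.shape[-1])
    low = A @ values
    high = (I - A) @ values               # complementary high-frequency part
    return (1.0 - alpha) * low + alpha * high

rng = np.random.default_rng(1)
out = attinv_like_mixing(rng.normal(size=(6, 6)), rng.normal(size=(6, 4)))
print(out.shape)                          # (6, 4)
```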
[209] DeepShade: Enable Shade Simulation by Text-conditioned Image Generation
Longchao Da, Xiangrui Liu, Mithun Shivakoti, Thirulogasankar Pranav Kutralingam, Yezhou Yang, Hua Wei
Main category: cs.CV
TL;DR: This paper introduces DeepShade, a diffusion-based model that generates accurate shade predictions from satellite imagery to help with route planning during heatwaves, trained on a comprehensive dataset created using Blender-based 3D simulations of building shadows.
Details
Motivation: Current routing systems fail to incorporate shade information for public health protection during heatwaves due to difficulties in estimating shades from noisy satellite imagery and limited training data availability. As global warming intensifies, there's an urgent need for shade-aware navigation systems to protect people from extreme heat.
Method: The authors create an extensive dataset using Blender-based 3D simulations to capture building shadows under various solar conditions, then develop DeepShade - a diffusion-based model that learns shade patterns by jointly considering RGB and Canny edge features, incorporates contrastive learning for temporal shade changes, and uses textual conditioning for known conditions like time of day and solar angles.
Result: The DeepShade model successfully generates shade images with improved performance when conditioned on textual descriptions. The approach was validated through real-world application in calculating shade ratios for route planning in Tempe, Arizona, demonstrating practical utility for shade-aware navigation.
Conclusion: This work provides a viable solution for incorporating shade information into routing systems, which can benefit society by enabling safer navigation during extreme heat events and serving as a reference for urban planning in hot weather conditions with potential broader environmental applications.
Abstract: Heatwaves pose a significant threat to public health, especially as global warming intensifies. However, current routing systems (e.g., online maps) fail to incorporate shade information due to the difficulty of estimating shades directly from noisy satellite imagery and the limited availability of training data for generative models. In this paper, we address these challenges through two main contributions. First, we build an extensive dataset covering diverse longitude-latitude regions, varying levels of building density, and different urban layouts. Leveraging Blender-based 3D simulations alongside building outlines, we capture building shadows under various solar zenith angles throughout the year and at different times of day. These simulated shadows are aligned with satellite images, providing a rich resource for learning shade patterns. Second, we propose the DeepShade, a diffusion-based model designed to learn and synthesize shade variations over time. It emphasizes the nuance of edge features by jointly considering RGB with the Canny edge layer, and incorporates contrastive learning to capture the temporal change rules of shade. Then, by conditioning on textual descriptions of known conditions (e.g., time of day, solar angles), our framework provides improved performance in generating shade images. We demonstrate the utility of our approach by using our shade predictions to calculate shade ratios for real-world route planning in Tempe, Arizona. We believe this work will benefit society by providing a reference for urban planning in extreme heat weather and its potential practical applications in the environment.
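The route-planning application in Tempe boils down to scoring each candidate route by the fraction of its length that falls in shade, computed from the generated shade maps. A toy sketch under assumed data structures (per-segment shade fractions) is shown below; it is not the paper's pipeline.

```python
def shade_ratio(route_segments, shade_lookup):
    """Fraction of route length that falls in shade.

    route_segments: list of (length_m, segment_id)
    shade_lookup: dict segment_id -> fraction of that segment covered by shade (0..1)
    """
    total = sum(length for length, _ in route_segments)
    shaded = sum(length * shade_lookup.get(seg, 0.0) for length, seg in route_segments)
    return shaded / total if total > 0 else 0.0

route_a = [(120.0, "s1"), (80.0, "s2")]
route_b = [(100.0, "s3"), (100.0, "s4")]
shade = {"s1": 0.7, "s2": 0.2, "s3": 0.9, "s4": 0.4}
best = max([route_a, route_b], key=lambda r: shade_ratio(r, shade))
print(best is route_b)   # route_b has the higher shade ratio
```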
[210] Feature-Enhanced TResNet for Fine-Grained Food Image Classification
Lulu Liu, Zhiyong Xiao
Main category: cs.CV
TL;DR: The paper proposes FE-TResNet, a deep learning model that enhances food image classification by integrating Style-based Recalibration Module and Deep Channel-wise Attention into TResNet architecture, achieving over 80% accuracy on Chinese food datasets for precision nutrition applications.
Details
Motivation: Fine-grained food classification is challenging due to subtle visual differences among similar dishes, yet it is critical for precision nutrition applications including dietary monitoring, nutrient estimation, and personalized health management.
Method: The authors propose Feature-Enhanced TResNet (FE-TResNet) built on the TResNet architecture, integrating two key components: a Style-based Recalibration Module (StyleRM) and Deep Channel-wise Attention (DCA) to enhance feature extraction and emphasize subtle distinctions between food items.
Result: FE-TResNet achieved high classification accuracies of 81.37% on ChineseFoodNet dataset and 80.29% on CNFOOD-241 dataset, demonstrating its effectiveness in fine-grained food image recognition.
Conclusion: The proposed FE-TResNet model effectively addresses fine-grained food classification challenges and shows potential as a key enabler for intelligent dietary assessment and personalized recommendations in precision nutrition systems.
Abstract: Food is not only essential to human health but also serves as a medium for cultural identity and emotional connection. In the context of precision nutrition, accurately identifying and classifying food images is critical for dietary monitoring, nutrient estimation, and personalized health management. However, fine-grained food classification remains challenging due to the subtle visual differences among similar dishes. To address this, we propose Feature-Enhanced TResNet (FE-TResNet), a novel deep learning model designed to improve the accuracy of food image recognition in fine-grained scenarios. Built on the TResNet architecture, FE-TResNet integrates a Style-based Recalibration Module (StyleRM) and Deep Channel-wise Attention (DCA) to enhance feature extraction and emphasize subtle distinctions between food items. Evaluated on two benchmark Chinese food datasets, ChineseFoodNet and CNFOOD-241, FE-TResNet achieved high classification accuracies of 81.37% and 80.29%, respectively. These results demonstrate its effectiveness and highlight its potential as a key enabler for intelligent dietary assessment and personalized recommendations in precision nutrition systems.
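Deep Channel-wise Attention is, in spirit, a squeeze-and-excitation style reweighting of feature channels so that subtly discriminative channels are emphasized. The sketch below shows a generic module of that kind; the dimensions and reduction ratio are assumptions, and this is not the FE-TResNet implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel reweighting."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global spatial average
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # per-channel gate in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # emphasize discriminative channels

feats = torch.randn(2, 64, 14, 14)
print(ChannelAttention(64)(feats).shape)             # torch.Size([2, 64, 14, 14])
```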
[211] PoemTale Diffusion: Minimising Information Loss in Poem to Image Generation with Multi-Stage Prompt Refinement
Sofia Jamil, Bollampalli Areen Reddy, Raghvendra Kumar, Sriparna Saha, Koustava Goswami, K. J. Joseph
Main category: cs.CV
TL;DR: This paper introduces PoemTale Diffusion, a training-free approach that improves text-to-image generation for poetic verse by using multi-stage prompt refinement and consistent self-attention mechanisms to generate multiple coherent images that better capture abstract poetic meanings.
Details
Motivation: Text-to-image diffusion models struggle with creative expressions, particularly complex, abstract, or highly descriptive poetic language that features layered meanings and dual interpretations, leading to information loss during conversion.
Method: The approach integrates a multi-stage prompt refinement loop into Language Models to enhance poetic text interpretability, and modifies existing diffusion models’ self-attention mechanisms with a consistent self-attention technique to generate multiple consistent images that collectively convey the poem’s meaning.
Result: The method was validated through both human evaluations by poetry experts and quantitative assessments, demonstrating improved efficacy in poem-to-image generation with enhanced information capture. The authors also created the P4I dataset containing 1111 poems from various sources.
Conclusion: PoemTale Diffusion successfully addresses the challenge of converting abstract poetic language to visual content by minimizing information loss and generating multiple consistent images, contributing a novel perspective to creative text-to-image generation and encouraging further research in poetry-based image generation.
Abstract: Recent advancements in text-to-image diffusion models have achieved remarkable success in generating realistic and diverse visual content. A critical factor in this process is the model’s ability to accurately interpret textual prompts. However, these models often struggle with creative expressions, particularly those involving complex, abstract, or highly descriptive language. In this work, we introduce a novel training-free approach tailored to improve image generation for a unique form of creative language: poetic verse, which frequently features layered, abstract, and dual meanings. Our proposed PoemTale Diffusion approach aims to minimise the information that is lost during poetic text-to-image conversion by integrating a multi-stage prompt refinement loop into Language Models to enhance the interpretability of poetic texts. To support this, we adapt existing state-of-the-art diffusion models by modifying their self-attention mechanisms with a consistent self-attention technique to generate multiple consistent images, which are then collectively used to convey the poem’s meaning. Moreover, to encourage research in the field of poetry, we introduce the P4I (PoemForImage) dataset, consisting of 1111 poems sourced from multiple online and offline resources. We engaged a panel of poetry experts for qualitative assessments. The results from both human and quantitative evaluations validate the efficacy of our method and contribute a novel perspective to poem-to-image generation with enhanced information capture in the generated images.
[212] GEMINUS: Dual-aware Global and Scene-Adaptive Mixture-of-Experts for End-to-End Autonomous Driving
Chi Wan, Yixin Cui, Jiatong Du, Shuo Yang, Yulong Bai, Yanjun Huang
Main category: cs.CV
TL;DR: GEMINUS is a Mixture-of-Experts autonomous driving framework that combines a Global Expert with Scene-Adaptive Experts via a Dual-aware Router to achieve both robust and adaptive performance across diverse traffic scenarios, outperforming existing methods on the Bench2Drive benchmark.
Details
Motivation: Single-mode planning methods in autonomous driving struggle to learn diversified driving skills for handling diverse scenarios, as they attempt to learn an overall policy that lacks adaptability to different traffic environments while maintaining robustness.
Method: The paper proposes GEMINUS, a Mixture-of-Experts framework with three key components: (1) a Global Expert trained on the overall dataset for robust performance, (2) Scene-Adaptive Experts trained on specific scene subsets for adaptive performance, and (3) a Dual-aware Router that considers both scenario-level features and routing uncertainty to dynamically activate appropriate expert modules.
Result: GEMINUS achieves state-of-the-art performance on the Bench2Drive closed-loop benchmark with only monocular vision input. Compared to the single-expert baseline, it shows significant improvements: 7.67% in Driving Score, 22.06% in Success Rate, and 19.41% in MultiAbility-Mean.
Conclusion: The effective coupling of Global Expert and Scene-Adaptive Experts through the Dual-aware Router enables GEMINUS to achieve both adaptive and robust performance in diverse autonomous driving scenarios, demonstrating the effectiveness of the Mixture-of-Experts approach for end-to-end autonomous driving systems.
Abstract: End-to-end autonomous driving requires adaptive and robust handling of complex and diverse traffic environments. However, prevalent single-mode planning methods attempt to learn an overall policy while struggling to acquire diversified driving skills to handle diverse scenarios. Therefore, this paper proposes GEMINUS, a Mixture-of-Experts end-to-end autonomous driving framework featuring a Global Expert, a Scene-Adaptive Experts Group, and equipped with a Dual-aware Router. Specifically, the Global Expert is trained on the overall dataset, possessing robust performance. The Scene-Adaptive Experts are trained on corresponding scene subsets, achieving adaptive performance. The Dual-aware Router simultaneously considers scenario-level features and routing uncertainty to dynamically activate expert modules. Through the effective coupling of the Global Expert and the Scene-Adaptive Experts Group via the Dual-aware Router, GEMINUS achieves adaptive and robust performance in diverse scenarios. GEMINUS outperforms existing methods in the Bench2Drive closed-loop benchmark and achieves state-of-the-art performance in Driving Score and Success Rate, even with only monocular vision input. Furthermore, ablation studies demonstrate significant improvements over the original single-expert baseline: 7.67% in Driving Score, 22.06% in Success Rate, and 19.41% in MultiAbility-Mean. The code will be available at https://github.com/newbrains1/GEMINUS.
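The Dual-aware Router can be pictured as a gate that blends the scene-adaptive experts' outputs with the Global Expert, leaning on the Global Expert when the routing distribution is uncertain. The sketch below uses normalized entropy as the uncertainty signal, which is an assumption; GEMINUS's actual router is learned end to end.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dual_aware_route(scene_logits, expert_plans, global_plan):
    """Blend scene-expert outputs with a global expert, backing off when routing is uncertain.

    scene_logits: gating scores over scene experts
    expert_plans: (num_experts, plan_dim) outputs of the scene-adaptive experts
    global_plan:  (plan_dim,) output of the global expert
    """
    p = softmax(scene_logits)
    uncertainty = -(p * np.log(p + 1e-9)).sum() / np.log(len(p))  # normalized entropy in [0, 1]
    scene_plan = p @ expert_plans
    return uncertainty * global_plan + (1.0 - uncertainty) * scene_plan

plan = dual_aware_route(np.array([2.0, 0.1, -1.0]),
                        np.random.default_rng(0).normal(size=(3, 4)),
                        np.zeros(4))
print(plan.shape)   # (4,)
```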
[213] SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models
Jiaji Zhang, Ruichao Sun, Hailiang Zhao, Jiaju Wu, Peng Chen, Hao Li, Yuying Liu, Xinkui Zhao, Kingsum Chow, Gang Xiong, Shuiguang Deng
Main category: cs.CV
TL;DR: SegQuant is a unified post-training quantization framework for diffusion models that combines segment-aware graph-based quantization and dual-scale quantization to reduce computational costs while maintaining performance across different model architectures.
Details
Motivation: Diffusion models are computationally intensive and challenging to deploy in resource-constrained environments. Existing post-training quantization methods rely on architecture-specific heuristics that limit generalizability and industrial deployment compatibility.
Method: The SegQuant framework combines two complementary techniques: (1) SegLinear - a segment-aware, graph-based quantization strategy that captures structural semantics and spatial heterogeneity, and (2) DualScale - a dual-scale quantization scheme that preserves polarity-asymmetric activations to maintain visual fidelity.
Result: SegQuant achieves strong performance across different diffusion model architectures beyond just Transformer-based models, while ensuring seamless compatibility with mainstream deployment tools and maintaining visual fidelity in generated outputs.
Conclusion: SegQuant provides a unified, architecture-agnostic quantization solution that effectively reduces model size and computational cost for diffusion models while preserving generation quality and enabling practical industrial deployment.
Abstract: Diffusion models have demonstrated exceptional generative capabilities but are computationally intensive, posing significant challenges for deployment in resource-constrained or latency-sensitive environments. Quantization offers an effective means to reduce model size and computational cost, with post-training quantization (PTQ) being particularly appealing due to its compatibility with pre-trained models without requiring retraining or training data. However, existing PTQ methods for diffusion models often rely on architecture-specific heuristics that limit their generalizability and hinder integration with industrial deployment pipelines. To address these limitations, we propose SegQuant, a unified quantization framework that adaptively combines complementary techniques to enhance cross-model versatility. SegQuant consists of a segment-aware, graph-based quantization strategy (SegLinear) that captures structural semantics and spatial heterogeneity, along with a dual-scale quantization scheme (DualScale) that preserves polarity-asymmetric activations, which is crucial for maintaining visual fidelity in generated outputs. SegQuant is broadly applicable beyond Transformer-based diffusion models, achieving strong performance while ensuring seamless compatibility with mainstream deployment tools.
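DualScale targets polarity-asymmetric activations, where a single symmetric scale wastes resolution on the smaller-magnitude sign. A minimal sketch of quantizing the positive and negative parts with separate scales is below; it is an int8-style illustration, not SegQuant's kernels.

```python
import numpy as np

def dual_scale_quantize(x, n_bits=8):
    """Quantize positive and negative parts with their own scales (int8-style)."""
    qmax = 2 ** (n_bits - 1) - 1
    pos_scale = max(x.max(), 1e-8) / qmax
    neg_scale = max(-x.min(), 1e-8) / qmax
    q = np.where(x >= 0,
                 np.round(x / pos_scale),
                 np.round(x / neg_scale))
    deq = np.where(q >= 0, q * pos_scale, q * neg_scale)
    return q.astype(np.int8), deq

# Toy polarity-asymmetric activations: large positives, tiny negatives.
acts = np.concatenate([np.random.default_rng(0).normal(2.0, 0.5, 1000),
                       np.random.default_rng(1).normal(-0.2, 0.05, 1000)])
_, deq = dual_scale_quantize(acts)
print(np.abs(acts - deq).mean())   # mean round-trip error of the dual-scale scheme
```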
[214] Rethinking Occlusion in FER: A Semantic-Aware Perspective and Go Beyond
Huiyu Zhai, Xingxing Yang, Yalan Ye, Chenyang Li, Bin Fan, Changze Li
Main category: cs.CV
TL;DR: ORSANet tackles facial expression recognition under occlusion by using multi-modal semantic guidance (segmentation maps and landmarks), a Multi-scale Cross-interaction Module for feature fusion, and a Dynamic Adversarial Repulsion Enhancement Loss for better classification of similar expressions.
Details
Motivation: Existing facial expression recognition models struggle with partially occluded facial information and dataset biases, leading to inaccurate classifications when facial features cannot be effectively extracted from occluded faces.
Method: The paper proposes ORSANet with three key components: 1) Multi-modal semantic guidance using semantic segmentation maps as dense priors and facial landmarks as sparse geometric priors, 2) Multi-scale Cross-interaction Module (MCM) for adaptive fusion of landmark and semantic features across different scales, and 3) Dynamic Adversarial Repulsion Enhancement Loss (DARELoss) that dynamically adjusts margins for ambiguous expression classes.
Result: ORSANet achieves state-of-the-art performance on both public benchmarks and the newly constructed Occlu-FER dataset, demonstrating superior robustness for facial expression recognition under various real-world occlusion conditions.
Conclusion: The proposed ORSANet effectively addresses occlusion challenges in facial expression recognition through multi-modal semantic guidance and adaptive feature fusion, establishing new state-of-the-art performance and providing a specialized dataset for occlusion-oriented FER research.
Abstract: Facial expression recognition (FER) is a challenging task due to pervasive occlusion and dataset biases. Especially when facial information is partially occluded, existing FER models struggle to extract effective facial features, leading to inaccurate classifications. In response, we present ORSANet, which introduces the following three key contributions: First, we introduce auxiliary multi-modal semantic guidance to disambiguate facial occlusion and learn high-level semantic knowledge, which is two-fold: 1) we introduce semantic segmentation maps as dense semantics prior to generate semantics-enhanced facial representations; 2) we introduce facial landmarks as sparse geometric prior to mitigate intrinsic noises in FER, such as identity and gender biases. Second, to facilitate the effective incorporation of these two multi-modal priors, we customize a Multi-scale Cross-interaction Module (MCM) to adaptively fuse the landmark feature and semantics-enhanced representations within different scales. Third, we design a Dynamic Adversarial Repulsion Enhancement Loss (DARELoss) that dynamically adjusts the margins of ambiguous classes, further enhancing the model’s ability to distinguish similar expressions. We further construct the first occlusion-oriented FER dataset to facilitate specialized robustness analysis on various real-world occlusion conditions, dubbed Occlu-FER. Extensive experiments on both public benchmarks and Occlu-FER demonstrate that our proposed ORSANet achieves SOTA recognition performance. Code is publicly available at https://github.com/Wenyuzhy/ORSANet-master.
[215] Visual-Language Model Knowledge Distillation Method for Image Quality Assessment
Yongkang Hou, Jiarun Song
Main category: cs.CV
TL;DR: This paper proposes a knowledge distillation method that transfers CLIP’s image quality assessment capabilities to smaller, more efficient student models while maintaining superior performance and reducing computational complexity.
Details
Motivation: CLIP-based multimodal methods show excellent generalization in IQA tasks but suffer from excessive parameter burden and insufficient ability to identify local distorted features, limiting their practical deployment.
Method: The approach involves three key steps: (1) designing quality-graded prompt templates to guide CLIP’s quality score output, (2) fine-tuning CLIP to enhance IQA capabilities, and (3) implementing a modality-adaptive knowledge distillation strategy to transfer knowledge from the CLIP teacher model to architecturally advantaged student models.
Result: Experiments on multiple IQA datasets demonstrate that the proposed method significantly reduces model complexity while outperforming existing IQA methods, showing strong potential for practical deployment.
Conclusion: The knowledge distillation approach successfully addresses CLIP’s limitations in IQA by creating more efficient models that maintain superior performance, making the technology more suitable for real-world applications.
Abstract: Image Quality Assessment (IQA) is a core task in computer vision. Multimodal methods based on vision-language models, such as CLIP, have demonstrated exceptional generalization capabilities in IQA tasks. To address the issues of excessive parameter burden and insufficient ability to identify local distorted features in CLIP for IQA, this study proposes a visual-language model knowledge distillation method aimed at guiding the training of models with architectural advantages using CLIP’s IQA knowledge. First, quality-graded prompt templates were designed to guide CLIP to output quality scores. Then, CLIP is fine-tuned to enhance its capabilities in IQA tasks. Finally, a modality-adaptive knowledge distillation strategy is proposed to achieve guidance from the CLIP teacher model to the student model. Our experiments were conducted on multiple IQA datasets, and the results show that the proposed method significantly reduces model complexity while outperforming existing IQA methods, demonstrating strong potential for practical deployment.
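The quality-graded prompt idea can be pictured as scoring an image by its similarity to a ladder of quality descriptions and taking the expected grade. The sketch below uses generic embedding vectors in place of a real CLIP encoder, and the prompt wording and temperature are assumptions, not the paper's templates.

```python
import numpy as np

GRADE_PROMPTS = ["a photo of terrible quality", "a photo of poor quality",
                 "a photo of fair quality", "a photo of good quality",
                 "a photo of excellent quality"]
GRADE_SCORES = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def quality_score(image_emb, prompt_embs, temperature=0.07):
    """Expected quality grade from image-vs-prompt cosine similarities (CLIP-style)."""
    sims = prompt_embs @ image_emb / (
        np.linalg.norm(prompt_embs, axis=1) * np.linalg.norm(image_emb) + 1e-9)
    probs = softmax(sims / temperature)
    return float(probs @ GRADE_SCORES)

rng = np.random.default_rng(0)
print(quality_score(rng.normal(size=512), rng.normal(size=(5, 512))))
```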
[216] EarthCrafter: Scalable 3D Earth Generation via Dual-Sparse Latent Diffusion
Shang Liu, Chenjie Cao, Chaohui Yu, Wen Qian, Jing Wang, Fan Wang
Main category: cs.CV
TL;DR: This paper introduces EarthCrafter, a framework for large-scale 3D Earth generation using the largest 3D aerial dataset (Aerial-Earth3D) with 50k scenes across the U.S. mainland, employing sparse-decoupled latent diffusion to generate thousands of square kilometers of realistic terrain.
Details
Motivation: Existing 3D generation methods cannot scale to geographic extents like modeling thousands of square kilometers of Earth's surface, which is essential for large-scale terrain and urban layout generation applications.
Method: The approach combines: 1) the Aerial-Earth3D dataset with 50k curated 600m×600m scenes from Google Earth providing multi-view images, depth maps, normals, and semantic segmentation; 2) the EarthCrafter framework using dual sparse 3D-VAEs to compress geometric voxels and textural 2D Gaussian Splats into compact latent spaces; 3) Condition-aware flow matching models trained on mixed inputs for flexible geometry and texture generation.
Result: EarthCrafter demonstrates substantially better performance in extremely large-scale 3D generation compared to existing methods, supporting versatile applications from semantic-guided urban layout generation to unconditional terrain synthesis while maintaining geographic plausibility.
Conclusion: The framework successfully addresses the challenge of large-scale 3D Earth generation by combining comprehensive aerial data infrastructure with efficient sparse-decoupled latent diffusion architecture, enabling realistic terrain synthesis at unprecedented geographic scales.
Abstract: Despite the remarkable developments achieved by recent 3D generation works, scaling these methods to geographic extents, such as modeling thousands of square kilometers of Earth’s surface, remains an open challenge. We address this through a dual innovation in data infrastructure and model architecture. First, we introduce Aerial-Earth3D, the largest 3D aerial dataset to date, consisting of 50k curated scenes (each measuring 600m x 600m) captured across the U.S. mainland, comprising 45M multi-view Google Earth frames. Each scene provides pose-annotated multi-view images, depth maps, normals, semantic segmentation, and camera poses, with explicit quality control to ensure terrain diversity. Building on this foundation, we propose EarthCrafter, a tailored framework for large-scale 3D Earth generation via sparse-decoupled latent diffusion. Our architecture separates structural and textural generation: 1) Dual sparse 3D-VAEs compress high-resolution geometric voxels and textural 2D Gaussian Splats (2DGS) into compact latent spaces, largely alleviating the costly computation suffering from vast geographic scales while preserving critical information. 2) We propose condition-aware flow matching models trained on mixed inputs (semantics, images, or neither) to flexibly model latent geometry and texture features independently. Extensive experiments demonstrate that EarthCrafter performs substantially better in extremely large-scale generation. The framework further supports versatile applications, from semantic-guided urban layout generation to unconditional terrain synthesis, while maintaining geographic plausibility through our rich data priors from Aerial-Earth3D. Our project page is available at https://whiteinblue.github.io/earthcrafter/
cs.AI
[217] Towards Autonomous Sustainability Assessment via Multimodal AI Agents
Zhihan Zhang, Alexander Metzger, Yuxuan Mei, Felix Hähnlein, Zachary Englhardt, Tingyu Cheng, Gregory D. Abowd, Shwetak Patel, Adriana Schulz, Vikram Iyer
Main category: cs.AI
TL;DR: This paper introduces multimodal AI agents that can perform life cycle assessments (LCA) for electronic devices in under one minute, achieving carbon footprint estimates within 19% of expert LCAs without requiring proprietary data, revolutionizing traditional LCA workflows that typically take weeks or months.
Details
Motivation: Traditional LCA requires extensive data that is often unavailable and expert time measured in weeks or months. There is a growing need for sustainability information, but current LCA processes are too slow and data-intensive to meet demand, creating significant barriers to environmental impact assessment.
Method: The authors develop multimodal AI agents that emulate interactions between LCA experts and stakeholders, using custom data abstraction and software tools to extract information from online text and images from repair communities and government certifications. They also create a direct estimation method comparing products to similar clusters and a data-driven approach for generating emission factors using weighted sums of similar materials.
Result: The AI system reduces LCA time from weeks/months to under one minute while achieving carbon footprint estimates within 19% accuracy of expert LCAs. The direct estimation method runs in 3ms with 12.28% MAPE on electronic products. The data-driven emission factor generation improves MAPE by 120.26% compared to human experts selecting closest database entries.
Conclusion: The multimodal AI approach successfully democratizes LCA by dramatically reducing time requirements and eliminating the need for proprietary data while maintaining reasonable accuracy. This breakthrough could transform LCA workflows and make sustainability assessments more accessible across industries.
Abstract: Interest in sustainability information has surged in recent years. However, the data required for a life cycle assessment (LCA) that maps the materials and processes from product manufacturing to disposal into environmental impacts (EI) are often unavailable. Here we reimagine conventional LCA by introducing multimodal AI agents that emulate interactions between LCA experts and stakeholders like product managers and engineers to calculate the cradle-to-gate (production) carbon emissions of electronic devices. The AI agents iteratively generate a detailed life-cycle inventory leveraging a custom data abstraction and software tools that extract information from online text and images from repair communities and government certifications. This approach reduces weeks or months of expert time to under one minute and closes data availability gaps while yielding carbon footprint estimates within 19% of expert LCAs with zero proprietary data. Additionally, we develop a method to directly estimate EI by comparing an input to a cluster of products with similar descriptions and known carbon footprints. This runs in 3 ms on a laptop with a MAPE of 12.28% on electronic products. Further, we develop a data-driven method to generate emission factors. We use the properties of an unknown material to represent it as a weighted sum of emission factors for similar materials. Compared to human experts picking the closest LCA database entry, this improves MAPE by 120.26%. We analyze the data and compute scaling of this approach and discuss its implications for future LCA workflows.
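The data-driven emission-factor method represents an unknown material as a weighted combination of known materials' factors. A short sketch of a similarity-weighted estimate is below; the property vectors, distance metric, and inverse-distance weighting are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def estimate_emission_factor(query_props, known_props, known_factors, k=3):
    """Weighted sum of emission factors of the k most similar known materials.

    query_props:   (d,) property vector of the unknown material
    known_props:   (n, d) property vectors of materials with known factors
    known_factors: (n,) emission factors (e.g., kg CO2e per kg) for those materials
    """
    dists = np.linalg.norm(known_props - query_props, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + 1e-9)     # closer materials contribute more
    weights /= weights.sum()
    return float(weights @ known_factors[nearest])

rng = np.random.default_rng(0)
props = rng.normal(size=(20, 6))                # toy material property database
factors = rng.uniform(0.5, 8.0, size=20)        # toy emission factors
print(estimate_emission_factor(props[0] + 0.05, props, factors))
```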
[218] New Mechanisms in Flex Distribution for Bounded Suboptimal Multi-Agent Path Finding
Shao-Hung Chan, Thomy Phan, Jiaoyang Li, Sven Koenig
Main category: cs.AI
TL;DR: This paper improves the EECBS algorithm for Multi-Agent Path Finding by proposing new flex distribution mechanisms (Conflict-Based, Delay-Based, and Mixed-Strategy) that better allocate computational resources to resolve collisions more efficiently while maintaining bounded-suboptimal guarantees.
Details
Motivation: The original EECBS algorithm with flex distribution can become inefficient because increasing thresholds may push the sum of costs beyond the bound, forcing the algorithm to switch between different path sets instead of resolving collisions on a particular set, thus reducing efficiency.
Method: The paper proposes three new flex distribution mechanisms: (1) Conflict-Based Flex Distribution that distributes flex proportional to the number of collisions, (2) Delay-Based Flex Distribution that estimates delays needed to satisfy constraints, and (3) Mixed-Strategy Flex Distribution that combines both approaches in a hierarchical framework.
Result: Experimental results demonstrate that the proposed flex distribution mechanisms outperform the original greedy flex distribution approach, while maintaining completeness and bounded-suboptimality guarantees for the EECBS algorithm.
Conclusion: The new flex distribution mechanisms successfully address the efficiency issues of the original EECBS algorithm by better allocating computational resources based on collision patterns and constraint satisfaction delays, leading to improved performance in Multi-Agent Path Finding problems.
Abstract: Multi-Agent Path Finding (MAPF) is the problem of finding a set of collision-free paths, one for each agent in a shared environment. Its objective is to minimize the sum of path costs (SOC), where the path cost of each agent is defined as the travel time from its start location to its target location. Explicit Estimation Conflict-Based Search (EECBS) is the leading algorithm for bounded-suboptimal MAPF, with the SOC of the solution being at most a user-specified factor $w$ away from optimal. EECBS maintains sets of paths and a lower bound $LB$ on the optimal SOC. Then, it iteratively selects a set of paths whose SOC is at most $w \cdot LB$ and introduces constraints to resolve collisions. For each path in a set, EECBS maintains a lower bound on its optimal path that satisfies constraints. By finding an individually bounded-suboptimal path with cost at most a threshold of $w$ times its lower bound, EECBS guarantees to find a bounded-suboptimal solution. To speed up EECBS, previous work uses flex distribution to increase the threshold. Though EECBS with flex distribution guarantees to find a bounded-suboptimal solution, increasing the thresholds may push the SOC beyond $w \cdot LB$, forcing EECBS to switch among different sets of paths instead of resolving collisions on a particular set of paths, and thus reducing efficiency. To address this issue, we propose Conflict-Based Flex Distribution that distributes flex in proportion to the number of collisions. We also estimate the delays needed to satisfy constraints and propose Delay-Based Flex Distribution. On top of that, we propose Mixed-Strategy Flex Distribution, combining both in a hierarchical framework. We prove that EECBS with our new flex distribution mechanisms is complete and bounded-suboptimal. Our experiments show that our approaches outperform the original (greedy) flex distribution.
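Conflict-Based Flex Distribution allocates the available flex to agents in proportion to how many collisions each is involved in, so the extra threshold budget goes where constraints are most likely to be added. A toy sketch of that allocation rule is below; the surrounding EECBS machinery is omitted, and the zero-collision fallback is an assumption.

```python
def distribute_flex(total_flex, collisions_per_agent):
    """Split flex among agents proportionally to how many collisions each is involved in."""
    total_collisions = sum(collisions_per_agent.values())
    if total_collisions == 0:
        even = total_flex / max(len(collisions_per_agent), 1)
        return {a: even for a in collisions_per_agent}
    return {a: total_flex * c / total_collisions
            for a, c in collisions_per_agent.items()}

# Each agent's path-cost threshold would then be w * lower_bound + its assigned flex.
flex = distribute_flex(total_flex=6.0, collisions_per_agent={"a1": 3, "a2": 1, "a3": 0})
print(flex)   # {'a1': 4.5, 'a2': 1.5, 'a3': 0.0}
```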
[219] LoRA is All You Need for Safety Alignment of Reasoning LLMs
Yihao Xue, Baharan Mirzasoleiman
Main category: cs.AI
TL;DR: This paper addresses the “Safety Tax” problem where safety alignment fine-tuning degrades LLM reasoning abilities, proposing LoRA-based safety fine-tuning as a solution that maintains both safety and reasoning performance.
Details
Motivation: Large Language Models require safety alignment to prevent harmful outputs, but traditional safety fine-tuning significantly degrades reasoning capabilities (the "Safety Tax" phenomenon). There's a need for methods that can achieve safety alignment without compromising reasoning performance.
Method: The authors use Low-Rank Adaptation (LoRA) for Supervised Fine-Tuning (SFT) on refusal datasets instead of full-model fine-tuning. LoRA restricts safety weight updates to a low-rank space, minimizing interference with reasoning weights. They also explore regularization and weight merging techniques to further reduce weight overlap.
Result: Extensive experiments across four benchmarks (math, science, coding) show that LoRA-based safety fine-tuning achieves safety levels comparable to full-model fine-tuning while preserving reasoning abilities. LoRA produces smaller weight overlap with initial weights compared to full-model fine-tuning, and additional regularization methods show some improvement on certain tasks.
Conclusion: LoRA-based safety fine-tuning effectively solves the reasoning-safety trade-off by maintaining high safety standards without degrading reasoning capabilities. This approach offers a practical solution to the Safety Tax problem and motivates further research into methods that optimize the reasoning-safety balance.
Abstract: Reasoning LLMs have demonstrated remarkable breakthroughs in solving complex problems that were previously out of reach. To ensure LLMs do not assist with harmful requests, safety alignment fine-tuning is necessary in the post-training phase. However, safety alignment fine-tuning has recently been shown to significantly degrade reasoning abilities, a phenomenon known as the “Safety Tax”. In this work, we show that using LoRA for SFT on refusal datasets effectively aligns the model for safety without harming its reasoning capabilities. This is because restricting the safety weight updates to a low-rank space minimizes the interference with the reasoning weights. Our extensive experiments across four benchmarks covering math, science, and coding show that this approach produces highly safe LLMs – with safety levels comparable to full-model fine-tuning – without compromising their reasoning abilities. Additionally, we observe that LoRA induces weight updates with smaller overlap with the initial weights compared to full-model fine-tuning. We also explore methods that further reduce such overlap – via regularization or during weight merging – and observe some improvement on certain tasks. We hope this result motivates designing approaches that yield more consistent improvements in the reasoning-safety trade-off.
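The mechanism at work is the standard LoRA reparameterization: the base weights stay frozen and the safety update lives in a low-rank factor pair. A minimal sketch of such a layer is below; it is not the authors' training setup.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # reasoning weights stay untouched
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64))
print(layer(torch.randn(2, 64)).shape)           # only A and B receive safety-SFT gradients
```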
[220] HySafe-AI: Hybrid Safety Architectural Analysis Framework for AI Systems: A Case Study
Mandar Pitale, Jelena Frtunikj, Abhinaw Priyadershi, Vasu Singh, Maria Spence
Main category: cs.AI
TL;DR: This paper introduces HySAFE-AI, a hybrid framework that adapts traditional safety analysis methods (FMEA and FTA) to evaluate the safety of modern AI systems, particularly end-to-end architectures like LLMs and VLMs used in safety-critical applications such as autonomous driving.
Details
Motivation: As AI systems, especially end-to-end monolithic architectures like LLMs and VLMs, become increasingly integral to safety-critical applications such as autonomous driving and robotics, there is a critical need to adapt traditional safety analysis methods to handle the intricate nature of foundational models and their latent representations.
Method: The paper reviews different architectural solutions and evaluates traditional safety analysis techniques (FMEA and FTA), then proposes HySAFE-AI - a Hybrid Safety Architectural Analysis Framework for AI Systems that adapts these traditional methods to better suit the complex nature of modern AI systems, particularly focusing on how foundational models form and utilize latent representations.
Result: The paper demonstrates how traditional safety analysis techniques can be improved and adapted for foundational models, presenting the HySAFE-AI framework as a solution for evaluating AI system safety in safety-critical applications.
Conclusion: The paper concludes by offering guidance for future work and suggestions to guide the evolution of AI safety standards, establishing HySAFE-AI as a viable framework for hybrid safety analysis of modern AI systems in critical applications.
Abstract: AI has become integral to safety-critical areas like autonomous driving systems (ADS) and robotics. The architecture of recent autonomous systems is trending toward end-to-end (E2E) monolithic architectures such as large language models (LLMs) and vision language models (VLMs). In this paper, we review different architectural solutions and then evaluate the efficacy of common safety analyses such as failure modes and effect analysis (FMEA) and fault tree analysis (FTA). We show how these techniques can be improved for the intricate nature of the foundational models, particularly in how they form and utilize latent representations. We introduce HySAFE-AI, Hybrid Safety Architectural Analysis Framework for AI Systems, a hybrid framework that adapts traditional methods to evaluate the safety of AI systems. Lastly, we offer hints of future work and suggestions to guide the evolution of future AI safety standards.
[221] Improving LLMs’ Generalized Reasoning Abilities by Graph Problems
Qifan Zhang, Nuo Chen, Zehua Li, Miao Peng, Jing Tang, Jia Li
Main category: cs.AI
TL;DR: This paper introduces GraphPile, a large-scale dataset for graph problem reasoning (GPR), and trains GraphMind models to enhance general reasoning capabilities of LLMs, achieving significant improvements in both mathematical and non-mathematical reasoning tasks.
Details
Motivation: Existing domain-specific continued pretraining methods for LLMs show promise in specific areas like mathematical reasoning but lack transferability to broader reasoning tasks. The authors aim to bridge this gap by using graph problem reasoning to enhance general reasoning capabilities across diverse tasks.
Method: The authors create GraphPile, a 10.9 billion token corpus spanning 23 graph tasks including pathfinding, network analysis, numerical computation, and topological reasoning. The dataset incorporates chain-of-thought, program-of-thought, trace of execution, and real-world graph data. They then perform continued pretraining on popular base models (Llama 3, 3.1, and Gemma 2) to create GraphMind models.
Result: GraphMind achieves up to 4.9% higher accuracy in mathematical reasoning and up to 21.2% improvement in non-mathematical reasoning tasks including logical and commonsense reasoning compared to baseline models.
Conclusion: This work successfully demonstrates that graph problem reasoning can enhance universal reasoning capabilities in LLMs, bridging the gap between domain-specific pretraining and general reasoning abilities, thereby advancing the adaptability and robustness of language models.
Abstract: Large Language Models (LLMs) have made remarkable strides in reasoning tasks, yet their performance often falters on novel and complex problems. Domain-specific continued pretraining (CPT) methods, such as those tailored for mathematical reasoning, have shown promise but lack transferability to broader reasoning tasks. In this work, we pioneer the use of Graph Problem Reasoning (GPR) to enhance the general reasoning capabilities of LLMs. GPR tasks, spanning pathfinding, network analysis, numerical computation, and topological reasoning, require sophisticated logical and relational reasoning, making them ideal for teaching diverse reasoning patterns. To achieve this, we introduce GraphPile, the first large-scale corpus specifically designed for CPT using GPR data. Spanning 10.9 billion tokens across 23 graph tasks, the dataset includes chain-of-thought, program-of-thought, trace of execution, and real-world graph data. Using GraphPile, we train GraphMind on popular base models Llama 3 and 3.1, as well as Gemma 2, achieving up to 4.9 percent higher accuracy in mathematical reasoning and up to 21.2 percent improvement in non-mathematical reasoning tasks such as logical and commonsense reasoning. By being the first to harness GPR for enhancing reasoning patterns and introducing the first dataset of its kind, our work bridges the gap between domain-specific pretraining and universal reasoning capabilities, advancing the adaptability and robustness of LLMs.
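To make the "trace of execution" data style concrete, here is a toy generator for one pathfinding sample; the record layout is an assumption for illustration, not the GraphPile pipeline.

```python
from collections import deque

def bfs_trace(graph, start, goal):
    """Toy 'trace of execution' sample for a shortest-path question:
    the BFS expansion order becomes the reasoning trace."""
    trace, visited, queue = [], {start}, deque([[start]])
    while queue:
        path = queue.popleft()
        trace.append(f"expand {path[-1]}, frontier size {len(queue)}")
        if path[-1] == goal:
            return {"question": f"Shortest path {start}->{goal}?",
                    "trace": trace, "answer": path}
        for nb in graph.get(path[-1], []):
            if nb not in visited:
                visited.add(nb)
                queue.append(path + [nb])
    return {"question": f"Shortest path {start}->{goal}?",
            "trace": trace, "answer": None}

sample = bfs_trace({"A": ["B", "C"], "B": ["D"], "C": ["D"]}, "A", "D")
print(sample["trace"], sample["answer"])
```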
[222] Our Cars Can Talk: How IoT Brings AI to Vehicles
Amod Kant Agrawal
Main category: cs.AI
TL;DR: This paper proposes integrating AI copilots into vehicles to enable proactive maintenance by creating systems that can communicate with both machines and drivers, transforming traditional reactive maintenance approaches.
Details
Motivation: The need to transform vehicle maintenance from reactive to proactive approaches by leveraging AI technology and enabling vehicles as intelligent sensing platforms that can predict issues before they occur.
Method: Integration of AI copilots into vehicle systems that are designed to communicate in both machine language (for technical diagnostics) and human language (for driver interaction), creating a bridge between technical systems and users.
Result: A conceptual framework for intelligent vehicle systems that combines predictive maintenance capabilities with AI-powered user interaction, enabling vehicles to serve as comprehensive sensing and communication platforms.
Conclusion: The integration of bilingual AI copilots in vehicles represents a significant opportunity to revolutionize vehicle maintenance and user interaction, requiring interdisciplinary collaboration to advance research and development in intelligent vehicle systems.
Abstract: Bringing AI to vehicles and enabling them as sensing platforms is key to transforming maintenance from reactive to proactive. Now is the time to integrate AI copilots that speak both languages: machine and driver. This article offers a conceptual and technical perspective intended to spark interdisciplinary dialogue and guide future research and development in intelligent vehicle systems, predictive maintenance, and AI-powered user interaction.
[223] Agent Identity Evals: Measuring Agentic Identity
Elija Perrier, Michael Timothy Bennett
Main category: cs.AI
TL;DR: This paper introduces Agent Identity Evals (AIE), a framework for measuring how well language model agents maintain stable identity over time, addressing issues like inconsistency and unreliability that stem from underlying LLM limitations.
Details
Motivation: Language model agents (LMAs) suffer from identity instability due to inherited pathologies from large language models including statelessness, stochasticity, prompt sensitivity, and linguistic intermediation. This identity attrition undermines their reliability, trustworthiness, and agentic capabilities like reasoning and planning.
Method: The authors develop Agent Identity Evals (AIE), a rigorous, statistically-driven empirical framework with novel metrics to measure how LMA systems exhibit and maintain their agentic identity over time. The framework evaluates capabilities, properties, and recovery from state perturbations, and can be applied throughout the LMA lifecycle.
Result: AIE provides a comprehensive measurement system that can integrate with other performance and capability metrics to assist in designing optimal LMA infrastructure including memory systems and tools. The framework includes formal definitions, methods, and worked examples for practical application.
Conclusion: The AIE framework offers a systematic approach to evaluate and improve LMA identity consistency, which is essential for building more reliable and trustworthy language model agents with enhanced agentic capabilities.
Abstract: Central to the agentic capability and trustworthiness of language model agents (LMAs) is the extent to which they maintain a stable, reliable identity over time. However, LMAs inherit pathologies from large language models (LLMs) (statelessness, stochasticity, sensitivity to prompts, and linguistic intermediation) which can undermine their identifiability, continuity, persistence and consistency. This attrition of identity can erode their reliability, trustworthiness and utility by interfering with their agentic capabilities such as reasoning, planning and action. To address these challenges, we introduce \textit{agent identity evals} (AIE), a rigorous, statistically-driven, empirical framework for measuring the degree to which an LMA system exhibits and maintains its agentic identity over time, including its capabilities, properties and ability to recover from state perturbations. AIE comprises a set of novel metrics which can integrate with other measures of performance, capability and agentic robustness to assist in the design of optimal LMA infrastructure and scaffolding such as memory and tools. We set out formal definitions and methods that can be applied at each stage of the LMA life-cycle, and worked examples of how to apply them.
[224] Students’ Feedback Requests and Interactions with the SCRIPT Chatbot: Do They Get What They Ask For?
Andreas Scholl, Natalie Kiesler
Main category: cs.AI
TL;DR: Researchers developed SCRIPT, a ChatGPT-4o-mini based chatbot for programming education, and tested it with 136 introductory programming students to analyze feedback preferences and interaction patterns.
Details
Motivation: To support novice programming learners by developing an AI-powered chatbot that can provide both open-ended interactions and structured guidance, addressing the need for effective GenAI tools in programming education.
Method: Developed SCRIPT chatbot using ChatGPT-4o-mini with predefined prompts for structured guidance. Conducted an experiment with 136 students from an introductory programming course at a German university, analyzing their interactions and feedback preferences while solving programming tasks.
Result: Students’ feedback requests followed specific sequences, the chatbot responses aligned well with requested feedback types (75% accuracy), and the system successfully adhered to prompt constraints. The study provided insights into student interaction patterns with AI-based learning tools.
Conclusion: The research provides valuable insights for designing GenAI-based learning support systems and reveals the challenge of balancing guidance and flexibility in AI-assisted educational tools for programming education.
Abstract: Building on prior research on Generative AI (GenAI) and related tools for programming education, we developed SCRIPT, a chatbot based on ChatGPT-4o-mini, to support novice learners. SCRIPT allows for open-ended interactions and structured guidance through predefined prompts. We evaluated the tool via an experiment with 136 students from an introductory programming course at a large German university and analyzed how students interacted with SCRIPT while solving programming tasks with a focus on their feedback preferences. The results reveal that students’ feedback requests seem to follow a specific sequence. Moreover, the chatbot responses aligned well with students’ requested feedback types (in 75%), and it adhered to the system prompt constraints. These insights inform the design of GenAI-based learning support systems and highlight challenges in balancing guidance and flexibility in AI-assisted tools.
[225] Compliance Brain Assistant: Conversational Agentic AI for Assisting Compliance Tasks in Enterprise Environments
Shitong Zhu, Chenhao Fang, Derek Larson, Neel Reddy Pochareddy, Rajeev Rao, Sophie Zeng, Yanqing Peng, Wendy Summer, Alex Goncalves, Arya Pudota, Herve Robert
Main category: cs.AI
TL;DR: The paper introduces CBA (Compliance Brain Assistant), an AI assistant for enterprise compliance tasks that uses a smart routing system to balance response quality and latency by choosing between fast retrieval mode and complex agentic mode based on query complexity.
Details
Motivation: Enterprise compliance personnel need efficient AI assistance for daily tasks, but there's a challenge in balancing response quality with latency. Simple queries don't need complex processing, while complicated requests require multi-step actions and tool invocations across various compliance artifacts.
Method: The system employs a user query router that intelligently selects between two modes: (1) FastTrack mode for simple requests requiring only context retrieval from knowledge corpora, and (2) FullAgentic mode for complex requests needing composite actions, tool invocations, and API calls to discover and enrich context across compliance artifacts.
Result: CBA significantly outperformed vanilla LLM on real-world privacy/compliance queries with 83.7% vs 41.7% average keyword match rate and 82.0% vs 20.0% LLM-judge pass rate. The routing-based design achieved better performance than individual modes while maintaining similar runtime.
Conclusion: The routing mechanism successfully validates the hypothesis that intelligent mode selection creates an effective trade-off between response quality and processing efficiency, making CBA a practical solution for enterprise compliance assistance.
Abstract: This paper presents Compliance Brain Assistant (CBA), a conversational, agentic AI assistant designed to boost the efficiency of daily compliance tasks for personnel in enterprise environments. To strike a good balance between response quality and latency, we design a user query router that can intelligently choose between (i) FastTrack mode: to handle simple requests that only need additional relevant context retrieved from knowledge corpora; and (ii) FullAgentic mode: to handle complicated requests that need composite actions and tool invocations to proactively discover context across various compliance artifacts, and/or involving other APIs/models for accommodating requests. A typical example would be to start with a user query, use its description to find a specific entity and then use the entity’s information to query other APIs for curating and enriching the final AI response. Our experimental evaluations compared CBA against an out-of-the-box LLM on various real-world privacy/compliance-related queries targeting various personas. We found that CBA substantially improved upon the vanilla LLM’s performance on metrics such as average keyword match rate (83.7% vs. 41.7%) and LLM-judge pass rate (82.0% vs. 20.0%). We also compared metrics for the full routing-based design against the fast-track only and full-agentic modes and found that it had a better average match-rate and pass-rate while keeping the run-time approximately the same. This finding validated our hypothesis that the routing mechanism leads to a good trade-off between the two worlds.
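A minimal sketch of the routing pattern described above, assuming a generic LLM callable, retriever, and agent (all hypothetical; CBA's actual router, prompts, and tooling are not public in this summary):

```python
def route(query: str, llm) -> str:
    """Cheap classification call deciding whether retrieval alone suffices."""
    decision = llm(
        "Answer FAST if this compliance query only needs document retrieval, "
        "FULL if it needs multi-step actions or API calls.\nQuery: " + query
    ).strip().upper()
    return "FastTrack" if decision.startswith("FAST") else "FullAgentic"

def answer(query: str, llm, retriever, agent) -> str:
    if route(query, llm) == "FastTrack":
        context = retriever(query)  # retrieve relevant policy passages
        return llm(f"Context:\n{context}\n\nQuestion: {query}")
    return agent.run(query)          # tool-using agentic loop for complex requests
```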
[226] Ctx2TrajGen: Traffic Context-Aware Microscale Vehicle Trajectories using Generative Adversarial Imitation Learning
Joobin Jin, Seokjun Hong, Gyeongseon Baek, Yeeun Kim, Byeongjoon Noh
Main category: cs.AI
TL;DR: Ctx2TrajGen is a context-aware framework that uses GAIL with PPO and WGAN-GP to generate realistic vehicle trajectories for traffic analysis and autonomous driving, outperforming existing methods on the DRIFT dataset.
Details
Motivation: Precise modeling of microscopic vehicle trajectories is critical for traffic behavior analysis and autonomous driving systems, but existing approaches struggle with nonlinear interdependencies, training instability, and lack of contextual awareness in microscopic settings.
Method: The paper proposes Ctx2TrajGen, a context-aware trajectory generation framework that uses Generative Adversarial Imitation Learning (GAIL) combined with Proximal Policy Optimization (PPO) and Wasserstein GAN with Gradient Penalty (WGAN-GP). The model explicitly conditions on surrounding vehicles and road geometry to generate interaction-aware trajectories.
Result: Experiments on the drone-captured DRIFT dataset show that Ctx2TrajGen achieves superior performance compared to existing methods across three key metrics: realism, behavioral diversity, and contextual fidelity.
Conclusion: Ctx2TrajGen offers a robust solution to data scarcity and domain shift challenges in trajectory generation without requiring simulation, making it valuable for traffic behavior analysis and autonomous driving applications.
Abstract: Precise modeling of microscopic vehicle trajectories is critical for traffic behavior analysis and autonomous driving systems. We propose Ctx2TrajGen, a context-aware trajectory generation framework that synthesizes realistic urban driving behaviors using GAIL. Leveraging PPO and WGAN-GP, our model addresses nonlinear interdependencies and training instability inherent in microscopic settings. By explicitly conditioning on surrounding vehicles and road geometry, Ctx2TrajGen generates interaction-aware trajectories aligned with real-world context. Experiments on the drone-captured DRIFT dataset demonstrate superior performance over existing methods in terms of realism, behavioral diversity, and contextual fidelity, offering a robust solution to data scarcity and domain shift without simulation.
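For readers unfamiliar with the WGAN-GP component, the gradient penalty is typically implemented as below (a generic PyTorch sketch of the standard technique, not the authors' discriminator for trajectory data):

```python
import torch

def gradient_penalty(discriminator, real, fake, lam=10.0):
    """Standard WGAN-GP penalty: push the critic's gradient norm toward 1
    on points interpolated between real and generated samples."""
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = discriminator(interp)
    grads = torch.autograd.grad(scores.sum(), interp, create_graph=True)[0]
    return lam * ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
```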
[227] An Uncertainty-Driven Adaptive Self-Alignment Framework for Large Language Models
Haoran Sun, Zekun Zhang, Shaoning Zeng
Main category: cs.AI
TL;DR: This paper proposes UDASA, an automated framework that improves large language model alignment with human intent by using uncertainty quantification across semantics, factuality, and value alignment to create progressive training stages without requiring human annotations.
Details
Motivation: Achieving high-quality alignment between LLMs and human intent/safety norms without human annotations remains a fundamental challenge, as current methods still require extensive human supervision for effective alignment.
Method: UDASA generates multiple responses for each input, quantifies uncertainty across three dimensions (semantics, factuality, value alignment), constructs preference pairs based on uncertainty scores, categorizes training samples into conservative/moderate/exploratory stages, and progressively optimizes the model across these stages.
Result: UDASA outperforms existing alignment methods across multiple evaluation tasks including harmlessness, helpfulness, truthfulness, and controlled sentiment generation, demonstrating significant improvements in model performance.
Conclusion: The uncertainty-driven adaptive self-alignment framework successfully enables automated LLM alignment improvement without human annotations, providing a viable path toward scalable and effective model alignment with human values and safety requirements.
Abstract: Large Language Models (LLMs) have demonstrated remarkable progress in instruction following and general-purpose reasoning. However, achieving high-quality alignment with human intent and safety norms without human annotations remains a fundamental challenge. In this work, we propose an Uncertainty-Driven Adaptive Self-Alignment (UDASA) framework designed to improve LLM alignment in a fully automated manner. UDASA first generates multiple responses for each input and quantifies output uncertainty across three dimensions: semantics, factuality, and value alignment. Based on these uncertainty scores, the framework constructs preference pairs and categorizes training samples into three stages, conservative, moderate, and exploratory, according to their uncertainty difference. The model is then optimized progressively across these stages. In addition, we conduct a series of preliminary studies to validate the core design assumptions and provide strong empirical motivation for the proposed framework. Experimental results show that UDASA outperforms existing alignment methods across multiple tasks, including harmlessness, helpfulness, truthfulness, and controlled sentiment generation, significantly improving model performance.
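A toy sketch of the pairing-and-staging step, assuming each response already carries a scalar uncertainty score; the thresholds and the mapping from uncertainty gap to stage are guesses for illustration, not the paper's values.

```python
def build_stages(samples, low=0.2, high=0.5):
    """Form preference pairs from the least/most uncertain responses per prompt
    and bucket them by uncertainty gap (illustrative thresholds and mapping)."""
    stages = {"conservative": [], "moderate": [], "exploratory": []}
    for prompt, responses in samples:            # responses: [(text, uncertainty)]
        ranked = sorted(responses, key=lambda r: r[1])
        chosen, rejected = ranked[0], ranked[-1]
        gap = rejected[1] - chosen[1]
        stage = ("conservative" if gap >= high
                 else "moderate" if gap >= low
                 else "exploratory")
        stages[stage].append({"prompt": prompt,
                              "chosen": chosen[0], "rejected": rejected[0]})
    return stages
```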
[228] LTLZinc: a Benchmarking Framework for Continual Learning and Neuro-Symbolic Temporal Reasoning
Luca Salvatore Lorello, Nikolaos Manginas, Marco Lippi, Stefano Melacci
Main category: cs.AI
TL;DR: The paper introduces LTLZinc, a benchmarking framework that generates temporal reasoning and continual learning datasets from linear temporal logic specifications, revealing limitations in current neuro-symbolic methods when dealing with time-dependent problems.
Details
Motivation: Most neuro-symbolic AI approaches only work in static scenarios, but real-world applications require reasoning along temporal dimensions. There's a lack of benchmarking frameworks that can evaluate neuro-symbolic and continual learning methods on temporal reasoning tasks, creating a gap in understanding how these methods perform when time and sequential learning are involved.
Method: The authors developed LTLZinc, a framework that generates datasets by combining linear temporal logic (LTL) specifications with MiniZinc constraints and arbitrary image classification datasets. The framework creates expressive temporal reasoning tasks and continual learning scenarios with fine-grained annotations that support multiple neural and neuro-symbolic training configurations.
Result: Experiments on six neuro-symbolic sequence classification tasks and four class-continual learning tasks generated by LTLZinc demonstrated the challenging nature of temporal learning and reasoning. The results revealed significant limitations in current state-of-the-art neuro-symbolic and continual learning methods when applied to temporal scenarios.
Conclusion: The LTLZinc framework successfully exposes the limitations of current methods in temporal reasoning scenarios. The authors release the generator and ten ready-to-use tasks to foster research towards developing unified temporal learning and reasoning frameworks that can better handle time-dependent neuro-symbolic problems.
Abstract: Neuro-symbolic artificial intelligence aims to combine neural architectures with symbolic approaches that can represent knowledge in a human-interpretable formalism. Continual learning concerns agents that expand their knowledge over time, improving their skills while avoiding forgetting previously learned concepts. Most of the existing approaches for neuro-symbolic artificial intelligence are applied to static scenarios only, and the challenging setting where reasoning along the temporal dimension is necessary has been seldom explored. In this work we introduce LTLZinc, a benchmarking framework that can be used to generate datasets covering a variety of different problems, against which neuro-symbolic and continual learning methods can be evaluated along the temporal and constraint-driven dimensions. Our framework generates expressive temporal reasoning and continual learning tasks from a linear temporal logic specification over MiniZinc constraints, and arbitrary image classification datasets. Fine-grained annotations allow multiple neural and neuro-symbolic training settings on the same generated datasets. Experiments on six neuro-symbolic sequence classification and four class-continual learning tasks generated by LTLZinc demonstrate the challenging nature of temporal learning and reasoning, and highlight limitations of current state-of-the-art methods. We release the LTLZinc generator and ten ready-to-use tasks to the neuro-symbolic and continual learning communities, in the hope of fostering research towards unified temporal learning and reasoning frameworks.
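To illustrate what "evaluating a sequence against an LTL specification" means in the simplest case, here are finite-trace checkers for a few LTL operators; this is a toy illustration, not the LTLZinc generator, which compiles full LTL-over-MiniZinc specifications.

```python
def holds_eventually(seq, pred):
    """F(pred) over a finite trace: pred holds at some step."""
    return any(pred(x) for x in seq)

def holds_globally(seq, pred):
    """G(pred) over a finite trace: pred holds at every step."""
    return all(pred(x) for x in seq)

def holds_until(seq, p, q):
    """p U q over a finite trace: p holds until q eventually holds."""
    for x in seq:
        if q(x):
            return True
        if not p(x):
            return False
    return False

# e.g. labels predicted by an image classifier over a sequence of frames
trace = [3, 1, 4, 1, 5]
assert holds_eventually(trace, lambda v: v == 5)
assert holds_until(trace, lambda v: v < 5, lambda v: v == 5)
```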
[229] Constructing Ophthalmic MLLM for Positioning-diagnosis Collaboration Through Clinical Cognitive Chain Reasoning
Xinyao Liu, Diping Song
Main category: cs.AI
TL;DR: This paper presents FundusExpert, an ophthalmology-specific multimodal large language model that integrates positioning and diagnosis reasoning capabilities, achieving superior performance in ophthalmic tasks through the FundusGen dataset and Fundus-Engine system.
Details
Motivation: Existing multimodal large language models face critical challenges in specialized medical domains like ophthalmology, including fragmentation of annotation granularity and inconsistencies in clinical reasoning logic, which hinder precise cross-modal understanding and accurate diagnosis.
Method: The authors developed FundusExpert using the FundusGen dataset constructed through an intelligent Fundus-Engine system that automates localization and uses MLLM-based semantic expansion to integrate global disease classification, local object detection, and fine-grained feature analysis. They also constructed a clinically aligned cognitive chain to guide interpretable reasoning paths.
Result: FundusExpert achieved best performance in ophthalmic question-answering tasks, surpassing 40B MedRegA by 26.6% average accuracy. In zero-shot report generation, it achieved 77.0% clinical consistency, significantly outperforming GPT-4o’s 47.6%. The study also revealed a scaling law between data quality and model capability (L ∝ N^0.068).
Conclusion: The work successfully develops a scalable, clinically-aligned MLLM for ophthalmology by integrating region-level localization with diagnostic reasoning chains. The cognitive alignment annotations in FundusGen enhance data utilization efficiency, providing a pathway toward bridging the visual-language gap in specialized medical MLLMs.
Abstract: Multimodal large language models (MLLMs) demonstrate significant potential in the field of medical diagnosis. However, they face critical challenges in specialized domains such as ophthalmology, particularly the fragmentation of annotation granularity and inconsistencies in clinical reasoning logic, which hinder precise cross-modal understanding. This paper introduces FundusExpert, an ophthalmology-specific MLLM with integrated positioning-diagnosis reasoning capabilities, along with FundusGen, a dataset constructed through the intelligent Fundus-Engine system. Fundus-Engine automates localization and leverages MLLM-based semantic expansion to integrate global disease classification, local object detection, and fine-grained feature analysis within a single fundus image. Additionally, by constructing a clinically aligned cognitive chain, it guides the model to generate interpretable reasoning paths. FundusExpert, fine-tuned with instruction data from FundusGen, achieves the best performance in ophthalmic question-answering tasks, surpassing the average accuracy of the 40B MedRegA by 26.6%. It also excels in zero-shot report generation tasks, achieving a clinical consistency of 77.0%, significantly outperforming GPT-4o’s 47.6%. Furthermore, we reveal a scaling law between data quality and model capability ($L \propto N^{0.068}$), demonstrating that the cognitive alignment annotations in FundusGen enhance data utilization efficiency. By integrating region-level localization with diagnostic reasoning chains, our work develops a scalable, clinically-aligned MLLM and explores a pathway toward bridging the visual-language gap in specific MLLMs. Our project can be found at https://github.com/MeteorElf/FundusExpert.
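A power-law relation such as the reported $L \propto N^{0.068}$ is typically estimated by a linear fit in log-log space. The snippet below uses synthetic data only, to show how such an exponent is recovered; it is not the paper's data or fitting code.

```python
import numpy as np

rng = np.random.default_rng(0)
N = np.logspace(4, 7, 20)                              # synthetic data-scale values
L = 0.5 * N ** 0.068 * rng.normal(1.0, 0.01, N.size)   # synthetic capability metric
exponent = np.polyfit(np.log(N), np.log(L), 1)[0]      # slope in log-log space
print(round(exponent, 3))                              # ~0.068 recovered by the fit
```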
[230] CQE under Epistemic Dependencies: Algorithms and Experiments (extended version)
Lorenzo Marconi, Flavia Ricci, Riccardo Rosati
Main category: cs.AI
TL;DR: This paper investigates Controlled Query Evaluation (CQE) over ontologies using epistemic dependencies to regulate information disclosure, focusing on answering Boolean unions of conjunctive queries while maintaining security guarantees through optimal GA censors.
Details
Motivation: The need to control information disclosure in ontology-based systems while maintaining query answering capabilities. Existing CQE frameworks require better integration with epistemic dependencies and optimal censoring mechanisms to ensure both security and computational efficiency.
Method: The authors combine epistemic dependencies (EDs) with optimal GA censors (maximal sets of safely revealable ground atoms) and use an intersection-based approach for answering Boolean unions of conjunctive queries (BUCQs). They develop a first-order rewriting algorithm for DL-Lite_R ontologies and a subclass of EDs.
Result: The intersection-based approach ensures security for full EDs. For a subclass of EDs and DL-Lite_R ontologies, answering BUCQs has AC^0 data complexity. Experimental evaluation in two scenarios demonstrates practical feasibility of the rewriting algorithm.
Conclusion: The proposed CQE framework successfully combines epistemic dependencies with optimal GA censors to provide secure and computationally efficient query answering over ontologies, with the intersection-based approach offering strong security guarantees while maintaining tractable complexity for specific classes of EDs and ontologies.
Abstract: We investigate Controlled Query Evaluation (CQE) over ontologies, where information disclosure is regulated by epistemic dependencies (EDs), a family of logical rules recently proposed for the CQE framework. In particular, we combine EDs with the notion of optimal GA censors, i.e. maximal sets of ground atoms that are entailed by the ontology and can be safely revealed. We focus on answering Boolean unions of conjunctive queries (BUCQs) with respect to the intersection of all optimal GA censors - an approach that has been shown in other contexts to ensure strong security guarantees with favorable computational behavior. First, we characterize the security of this intersection-based approach and identify a class of EDs (namely, full EDs) for which it remains safe. Then, for a subclass of EDs and for DL-Lite_R ontologies, we show that answering BUCQs in the above CQE semantics is in AC^0 in data complexity by presenting a suitable, detailed first-order rewriting algorithm. Finally, we report on experiments conducted in two different evaluation scenarios, showing the practical feasibility of our rewriting function.
[231] Automated Hybrid Grounding Using Structural and Data-Driven Heuristics
Alexander Beiser, Markus Hecher, Stefan Woltran
Main category: cs.AI
TL;DR: This paper presents an automated hybrid grounding approach for Answer Set Programming that intelligently switches between standard bottom-up grounding and body-decoupled grounding using data-structural heuristics to overcome the grounding bottleneck.
Details
Motivation: The grounding bottleneck is a major challenge preventing widespread adoption of Answer Set Programming in industry. While hybrid grounding combining standard bottom-up and body-decoupled techniques shows promise, it was unclear when to use each approach optimally.
Method: The authors develop a splitting algorithm based on data-structural heuristics that automatically detects when to use body-decoupled grounding versus standard bottom-up grounding. The heuristics are based on rule structure analysis and an estimation procedure that incorporates instance data.
Result: Experimental results on a prototypical implementation show promising performance with improvements on hard-to-ground scenarios, while achieving near state-of-the-art performance on hard-to-solve instances.
Conclusion: The automated hybrid grounding approach successfully addresses the decision problem of when to apply different grounding techniques, demonstrating improvements in challenging grounding scenarios while maintaining competitive performance overall.
Abstract: The grounding bottleneck poses one of the key challenges that hinders the widespread adoption of Answer Set Programming in industry. Hybrid Grounding is a step in alleviating the bottleneck by combining the strength of standard bottom-up grounding with recently proposed techniques where rule bodies are decoupled during grounding. However, it has remained unclear when hybrid grounding shall use body-decoupled grounding and when to use standard bottom-up grounding. In this paper, we address this issue by developing automated hybrid grounding: we introduce a splitting algorithm based on data-structural heuristics that detects when to use body-decoupled grounding and when standard grounding is beneficial. We base our heuristics on the structure of rules and an estimation procedure that incorporates the data of the instance. The experiments conducted on our prototypical implementation demonstrate promising results, which show an improvement on hard-to-ground scenarios, whereas on hard-to-solve instances we approach state-of-the-art performance.
[232] Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning
Yu Li, Zhuoshi Pan, Honglin Lin, Mengyuan Sun, Conghui He, Lijun Wu
Main category: cs.AI
TL;DR: This paper systematically investigates multi-domain reasoning in Reinforcement Learning with Verifiable Rewards (RLVR) by studying how mathematical reasoning, code generation, and logical puzzle solving interact when training LLMs, revealing key insights about domain interactions and cross-domain generalization.
Details
Motivation: Real-world reasoning scenarios require integrated application of multiple cognitive skills, but existing RLVR research focuses on isolated reasoning domains. The interplay among different reasoning skills under reinforcement learning remains poorly understood, creating a need to bridge this gap through systematic multi-domain investigation.
Method: The study uses the GRPO algorithm and the Qwen-2.5-7B model family to conduct a comprehensive four-component analysis: (1) evaluating in-domain improvements and cross-domain generalization on single-domain datasets, (2) examining interactions during combined cross-domain training, (3) comparing base vs instruct models under identical RL configurations, and (4) exploring curriculum learning strategies, reward design variations, and language-specific factors.
Result: The extensive experiments reveal significant insights into domain interaction dynamics, identifying key factors that influence both specialized and generalizable reasoning performance. The results show how different reasoning domains can mutually enhance or conflict with each other during combined training.
Conclusion: The findings provide valuable guidance for optimizing RL methodologies to develop comprehensive, multi-domain reasoning capabilities in LLMs, offering insights into how to effectively train models that can handle integrated reasoning tasks across mathematical, coding, and logical domains.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of LLMs. Existing research has predominantly concentrated on isolated reasoning domains such as mathematical problem-solving, coding tasks, or logical reasoning. However, real world reasoning scenarios inherently demand an integrated application of multiple cognitive skills. Despite this, the interplay among these reasoning skills under reinforcement learning remains poorly understood. To bridge this gap, we present a systematic investigation of multi-domain reasoning within the RLVR framework, explicitly focusing on three primary domains: mathematical reasoning, code generation, and logical puzzle solving. We conduct a comprehensive study comprising four key components: (1) Leveraging the GRPO algorithm and the Qwen-2.5-7B model family, our study thoroughly evaluates the models’ in-domain improvements and cross-domain generalization capabilities when trained on single-domain datasets. (2) Additionally, we examine the intricate interactions including mutual enhancements and conflicts that emerge during combined cross-domain training. (3) To further understand the influence of SFT on RL, we also analyze and compare performance differences between base and instruct models under identical RL configurations. (4) Furthermore, we delve into critical RL training details, systematically exploring the impacts of curriculum learning strategies, variations in reward design, and language-specific factors. Through extensive experiments, our results offer significant insights into the dynamics governing domain interactions, revealing key factors influencing both specialized and generalizable reasoning performance. These findings provide valuable guidance for optimizing RL methodologies to foster comprehensive, multi-domain reasoning capabilities in LLMs.
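For context on the GRPO component, the core of the algorithm is a group-relative advantage: each sampled response is scored against the other responses to the same prompt. A generic sketch of that computation (shapes and reward values are illustrative, not the paper's setup):

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalize each response's reward by the
    mean/std of its own prompt group. rewards has shape [num_prompts, group_size]."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True).clamp_min(1e-6)
    return (rewards - mean) / std

# e.g. verifiable 0/1 rewards for 4 sampled answers to each of 2 prompts
adv = grpo_advantages(torch.tensor([[1., 0., 0., 1.], [0., 0., 1., 0.]]))
print(adv)
```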
[233] TAI Scan Tool: A RAG-Based Tool With Minimalistic Input for Trustworthy AI Self-Assessment
Athanasios Davvetas, Xenia Ziouvelou, Ypatia Dami, Alexis Kaponis, Konstantina Giouvanopoulou, Michael Papademas
Main category: cs.AI
TL;DR: This paper presents the TAI Scan Tool, a RAG-based self-assessment system that helps determine AI system risk levels under the AI Act with minimal input requirements, using a two-step pre-screening and assessment approach.
Details
Motivation: The need for a practical tool to facilitate compliance with the AI Act by providing automated assessment of AI systems' risk levels and retrieving relevant regulatory articles with minimal user input.
Method: A two-step RAG-based approach consisting of (1) a pre-screening phase and (2) an assessment phase that evaluates AI systems against AI Act requirements and retrieves relevant compliance articles.
Result: Qualitative evaluation using use-case scenarios showed promising results with correct risk level predictions and successful retrieval of relevant articles across three distinct semantic groups. The tool’s reasoning was found to rely on comparison with high-risk system settings.
Conclusion: The TAI Scan Tool successfully provides AI Act compliance assessment with minimalistic input, correctly identifying risk levels and retrieving relevant regulatory articles, with its reasoning process focusing on high-risk system comparisons due to their frequent presence in AI Act documentation.
Abstract: This paper introduces the TAI Scan Tool, a RAG-based TAI self-assessment tool with minimalistic input. The current version of the tool supports the legal TAI assessment, with a particular emphasis on facilitating compliance with the AI Act. It involves a two-step approach with a pre-screening and an assessment phase. The assessment output of the system includes insight regarding the risk-level of the AI system according to the AI Act, while at the same time retrieving relevant articles to aid with compliance and notify on their obligations. Our qualitative evaluation using use-case scenarios yields promising results, correctly predicting risk levels while retrieving relevant articles across three distinct semantic groups. Furthermore, interpretation of results shows that the tool’s reasoning relies on comparison with the setting of high-risk systems, a behaviour attributed to their deployment requiring careful consideration, and therefore frequently presented within the AI Act.
[234] Simulating multiple human perspectives in socio-ecological systems using large language models
Yongchao Zeng, Calum Brown, Ioannis Kyriakou, Ronja Hotz, Mark Rounsevell
Main category: cs.AI
TL;DR: The paper introduces HoPeS, a simulation framework using LLM-powered agents to help users experience different stakeholder perspectives in socio-ecological systems, demonstrated through land use change scenarios where users can role-play as researchers, observers, and policymakers to understand perspective differences and conflicts.
Details
Motivation: Understanding socio-ecological systems requires insights from diverse stakeholder perspectives, which are often difficult to access. There's a need for alternative methods to explore and understand these different viewpoints to better comprehend complex system dynamics and stakeholder interactions.
Method: The authors developed HoPeS (Human-Oriented Perspective Shifting), a modeling framework that uses LLM-powered agents to represent various stakeholders. The system includes a simulation protocol that acts as a “scaffold” to facilitate multiple perspective-taking simulations, allowing users to step into different agent roles and experience perspectival differences while supporting reflection, transition, and integration across perspectives.
Result: A prototype system was developed and tested in the context of institutional dynamics and land use change. In an illustrative experiment, a user adopted perspectives of both a system observer and a researcher, experiencing the challenge of making evidence-based policy recommendations to LLM agents representing various institutions. The experiment revealed persistent discrepancies between policy recommendations and implementation due to competing stakeholder advocacies, mirroring real-world researcher-policymaker misalignment. The user experienced frustration and disappointment while trying to maintain political neutrality, but showed high motivation to experiment with alternative approaches.
Conclusion: The HoPeS framework successfully demonstrates the potential for enabling users to explore different stakeholder perspectives in socio-ecological systems. Despite challenges like policy implementation gaps and emotional frustration experienced by users, the system shows promise for facilitating new forms of interdisciplinary collaboration in socio-ecological simulations, with further refinement likely to enhance its capabilities.
Abstract: Understanding socio-ecological systems requires insights from diverse stakeholder perspectives, which are often hard to access. To enable alternative, simulation-based exploration of different stakeholder perspectives, we develop the HoPeS (Human-Oriented Perspective Shifting) modelling framework. HoPeS employs agents powered by large language models (LLMs) to represent various stakeholders; users can step into the agent roles to experience perspectival differences. A simulation protocol serves as a “scaffold” to streamline multiple perspective-taking simulations, supporting users in reflecting on, transitioning between, and integrating across perspectives. A prototype system is developed to demonstrate HoPeS in the context of institutional dynamics and land use change, enabling both narrative-driven and numerical experiments. In an illustrative experiment, a user successively adopts the perspectives of a system observer and a researcher - a role that analyses data from the embedded land use model to inform evidence-based decision-making for other LLM agents representing various institutions. Despite the user’s effort to recommend technically sound policies, discrepancies persist between the policy recommendation and implementation due to stakeholders’ competing advocacies, mirroring real-world misalignment between researcher and policymaker perspectives. The user’s reflection highlights the subjective feelings of frustration and disappointment as a researcher, especially due to the challenge of maintaining political neutrality while attempting to gain political influence. Despite this, the user exhibits high motivation to experiment with alternative narrative framing strategies, suggesting the system’s potential in exploring different perspectives. Further system and protocol refinement are likely to enable new forms of interdisciplinary collaboration in socio-ecological simulations.
[235] Symbiotic Agents: A Novel Paradigm for Trustworthy AGI-driven Networks
Ilias Chatzistefanidis, Navid Nikaein
Main category: cs.AI
TL;DR: This paper introduces symbiotic agents that combine Large Language Models (LLMs) with real-time optimization algorithms for 6G networks, achieving 5x reduction in decision errors and 99.9% GPU resource reduction while enabling trustworthy AGI-driven network management.
Details
Motivation: The evolution from specialized AI algorithms handling isolated tasks to artificial general intelligence (AGI)-driven networks requires autonomous agents capable of real-time decision-making for network management and service provisioning. Current LLM-based agents lack the precision and efficiency needed for critical network functions, necessitating a new paradigm that combines reasoning capabilities with optimization algorithms for trustworthy AI in 6G networks.
Method: The authors propose symbiotic agents that integrate LLMs with dual-level optimization: (1) input-level optimizers providing bounded uncertainty steering for numerically precise tasks, and (2) output-level optimizers supervised by LLMs for adaptive real-time control. They implement two agent types - Radio Access Network optimizers and multi-agent negotiators for Service-Level Agreements - within an end-to-end AGI network architecture.
Result: Symbiotic agents demonstrated 5x reduction in decision errors compared to standalone LLM-based agents. Smaller language models (SLM) achieved similar accuracy with 99.9% reduction in GPU resource overhead and near-real-time performance (82 ms loops). Multi-agent collaboration on a real-world 5G testbed showed 44% reduction in RAN over-utilization and significant improvements in service-level agreement flexibility and resource allocation.
Conclusion: The symbiotic paradigm establishes a foundation for next-generation AGI-driven network systems that remain adaptable, efficient, and trustworthy as LLMs advance. By combining LLM reasoning with optimization algorithms, this approach enables practical deployment of autonomous agents in critical network infrastructure while maintaining numerical precision and real-time performance requirements.
Abstract: Large Language Model (LLM)-based autonomous agents are expected to play a vital role in the evolution of 6G networks, by empowering real-time decision-making related to management and service provisioning to end-users. This shift facilitates the transition from a specialized intelligence approach, where artificial intelligence (AI) algorithms handle isolated tasks, to artificial general intelligence (AGI)-driven networks, where agents possess broader reasoning capabilities and can manage diverse network functions. In this paper, we introduce a novel agentic paradigm that combines LLMs with real-time optimization algorithms towards Trustworthy AI, defined as symbiotic agents. Optimizers at the LLM’s input-level provide bounded uncertainty steering for numerically precise tasks, whereas output-level optimizers supervised by the LLM enable adaptive real-time control. We design and implement two novel agent types including: (i) Radio Access Network optimizers, and (ii) multi-agent negotiators for Service-Level Agreements (SLAs). We further propose an end-to-end architecture for AGI networks and evaluate it on a 5G testbed capturing channel fluctuations from moving vehicles. Results show that symbiotic agents reduce decision errors fivefold compared to standalone LLM-based agents, while smaller language models (SLM) achieve similar accuracy with a 99.9% reduction in GPU resource overhead and in near-real-time loops of 82 ms. A multi-agent demonstration for collaborative RAN on the real-world testbed highlights significant flexibility in service-level agreement and resource allocation, reducing RAN over-utilization by approximately 44%. Drawing on our findings and open-source implementations, we introduce the symbiotic paradigm as the foundation for next-generation, AGI-driven networks-systems designed to remain adaptable, efficient, and trustworthy even as LLMs advance.
[236] Thinking Isn’t an Illusion: Overcoming the Limitations of Reasoning Models via Tool Augmentations
Zhao Song, Song Yue, Jiahao Zhang
Main category: cs.AI
TL;DR: This paper investigates whether Large Reasoning Models (LRMs) that show step-by-step thinking processes actually improve reasoning ability, finding that while LRMs may underperform without tools, they consistently outperform non-reasoning models when augmented with Python interpreters and scratchpads across all complexity levels.
Details
Motivation: Recent empirical studies suggested that LRMs' explicit step-by-step thinking processes may not enhance reasoning ability, with non-reasoning LLMs sometimes outperforming LRMs on various complexity tasks. This challenges the fundamental assumption behind LRM design and necessitates investigation into whether tool augmentation can address these limitations.
Method: The researchers incorporated two types of tools (Python interpreters and scratchpads) and evaluated three representative LLMs alongside their LRM counterparts on Apple’s benchmark reasoning puzzles across different complexity levels to test whether tool augmentation improves LRM performance.
Result: With proper tool use, LRMs consistently outperformed their non-reasoning counterparts across all levels of task complexity, demonstrating that tool augmentation can overcome the previously observed limitations of reasoning models.
Conclusion: The findings challenge the recent narrative that reasoning in LLMs is merely an illusion and highlight the significant potential of tool-augmented LRMs for solving complex problems, suggesting that the issue may not be with reasoning itself but with how it’s implemented without proper tool support.
Abstract: Large Reasoning Models (LRMs) have become a central focus in today’s large language model (LLM) research, where models are designed to output a step-by-step thinking process before arriving at a final answer to handle complex reasoning tasks. Despite their promise, recent empirical studies (e.g., [Shojaee et al., 2025] from Apple) suggest that this thinking process may not actually enhance reasoning ability, where LLMs without explicit reasoning actually outperform LRMs on tasks with low or high complexity. In this work, we revisit these findings and investigate whether the limitations of LRMs persist when tool augmentations are introduced. We incorporate two types of tools, Python interpreters and scratchpads, and evaluate three representative LLMs and their LRM counterparts on Apple’s benchmark reasoning puzzles. Our results show that, with proper tool use, LRMs consistently outperform their non-reasoning counterparts across all levels of task complexity. These findings challenge the recent narrative that reasoning is an illusion and highlight the potential of tool-augmented LRMs for solving complex problems.
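A minimal sketch of a Python-interpreter tool loop of the kind such evaluations rely on; the LLM callable, prompt format, and turn limit are hypothetical, and the in-process `exec` is an illustration rather than a safe sandbox.

```python
import contextlib
import io
import re

FENCE = "`" * 3  # code-fence marker, built indirectly to keep this snippet self-contained

def run_python_tool(code: str) -> str:
    """Execute model-emitted Python and capture stdout (toy, unsandboxed)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})  # illustrative only; never do this with untrusted code
    return buf.getvalue()

def solve_with_tools(llm, puzzle: str, max_turns: int = 4) -> str:
    """Generic tool-use loop: call the model, run any code block it emits,
    append the interpreter output, and repeat until it stops calling the tool."""
    pattern = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.S)
    transcript = puzzle
    for _ in range(max_turns):
        reply = llm(transcript)
        match = pattern.search(reply)
        if not match:
            return reply  # final answer with no further tool call
        transcript += reply + "\nTool output:\n" + run_python_tool(match.group(1))
    return transcript
```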
[237] Online Submission and Evaluation System Design for Competition Operations
Zhe Chen, Daniel Harabor, Ryan Hechnenberger, Nathan R. Sturtevant
Main category: cs.AI
TL;DR: The paper presents an automated online competition system that streamlines the submission and evaluation process for research competitions, addressing operational burdens and compatibility issues faced by organizers and participants.
Details
Motivation: Research communities face difficulties tracking progress across domains due to scattered publications and conflicting state-of-the-art claims. While periodic competitions help evaluate advancements, they create significant operational burdens for organizers who must manage large volumes of submissions and deal with compatibility issues from participants' diverse development environments.
Method: The authors developed an online competition system that automates the submission and evaluation process. The system uses isolated environments to evaluate submissions, enabling organizers to efficiently manage large numbers of submissions while avoiding compatibility issues.
Result: The system has been successfully deployed and used for several competitions, including the Grid-Based Pathfinding Competition and the League of Robot Runners competition, demonstrating its practical effectiveness in real-world scenarios.
Conclusion: The automated online competition system effectively addresses the operational challenges of research competitions by streamlining submission management and evaluation processes, making it easier for research communities to track progress and compare algorithmic performance across domains.
Abstract: Research communities have developed benchmark datasets across domains to compare the performance of algorithms and techniques. However, tracking the progress in these research areas is not easy, as publications appear in different venues at the same time, and many of them claim to represent the state-of-the-art. To address this, research communities often organise periodic competitions to evaluate the performance of various algorithms and techniques, thereby tracking advancements in the field. However, these competitions pose a significant operational burden. The organisers must manage and evaluate a large volume of submissions. Furthermore, participants typically develop their solutions in diverse environments, leading to compatibility issues during the evaluation of their submissions. This paper presents an online competition system that automates the submission and evaluation process for a competition. The competition system allows organisers to manage large numbers of submissions efficiently, utilising isolated environments to evaluate submissions. This system has already been used successfully for several competitions, including the Grid-Based Pathfinding Competition and the League of Robot Runners competition.
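A minimal sketch of the kind of isolated evaluation such a system performs, assuming a Docker-based setup; the image name, mount layout, resource limits, and entrypoint contract are hypothetical, not the actual competition infrastructure.

```python
import subprocess

def evaluate_submission(image: str, benchmark_dir: str, timeout_s: int = 3600) -> str:
    """Run one participant image against a benchmark in an isolated container."""
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",                  # no outbound access during evaluation
        "--cpus", "4", "--memory", "16g",     # uniform resource limits for all entries
        "-v", f"{benchmark_dir}:/benchmarks:ro",
        image,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    return result.stdout                      # e.g. solution paths or scores to be parsed
```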
[238] From Hypothesis to Publication: A Comprehensive Survey of AI-Driven Research Support Systems
Zekun Zhou, Xiaocheng Feng, Lei Huang, Xiachong Feng, Ziyun Song, Ruihan Chen, Liang Zhao, Weitao Ma, Yuxuan Gu, Baoxin Wang, Dayong Wu, Guoping Hu, Ting Liu, Bing Qin
Main category: cs.AI
TL;DR: This paper presents a systematic review of how AI can accelerate and enhance the research process, organizing studies into three main categories: hypothesis formulation, hypothesis validation, and manuscript publication, while identifying current challenges and future directions.
Details
Motivation: Research is fundamental to human civilization advancement but requires substantial time and effort from researchers. The rapid development of AI technologies has inspired exploration of how AI can accelerate and enhance research processes, necessitating a systematic review to monitor relevant advancements in this domain.
Method: The authors conducted a systematic review organizing relevant studies into three main categories: (1) hypothesis formulation (knowledge synthesis and hypothesis generation), (2) hypothesis validation (verification of scientific claims, theorem proving, and experiment validation), and (3) manuscript publication (manuscript writing and peer review process). They also identified current challenges and potential future directions.
Result: The paper provides a comprehensive categorization of AI applications in research across the three identified domains, discusses current challenges in these areas, identifies potential future research directions, and offers an overview of existing benchmarks and tools that support AI integration into the research process.
Conclusion: This systematic review serves as an introduction for beginners in AI-assisted research and aims to foster future research in this domain. The authors have made resources publicly available to support further development in the field of AI for research acceleration and enhancement.
Abstract: Research is a fundamental process driving the advancement of human civilization, yet it demands substantial time and effort from researchers. In recent years, the rapid development of artificial intelligence (AI) technologies has inspired researchers to explore how AI can accelerate and enhance research. To monitor relevant advancements, this paper presents a systematic review of the progress in this domain. Specifically, we organize the relevant studies into three main categories: hypothesis formulation, hypothesis validation, and manuscript publication. Hypothesis formulation involves knowledge synthesis and hypothesis generation. Hypothesis validation includes the verification of scientific claims, theorem proving, and experiment validation. Manuscript publication encompasses manuscript writing and the peer review process. Furthermore, we identify and discuss the current challenges faced in these areas, as well as potential future directions for research. Finally, we also offer a comprehensive overview of existing benchmarks and tools across various domains that support the integration of AI into the research process. We hope this paper serves as an introduction for beginners and fosters future research. Resources have been made publicly available at https://github.com/zkzhou126/AI-for-Research.
[239] Conflict Detection for Temporal Knowledge Graphs: A Fast Constraint Mining Algorithm and New Benchmarks
Jianhao Chen, Junyang Ren, Wentao Ding, Haoyuan Ouyang, Wei Hu, Yuzhong Qu
Main category: cs.AI
TL;DR: This paper proposes PaTeCon, an automatic pattern-based method for mining temporal constraints in knowledge graphs to detect conflicts, eliminating the need for manual constraint enumeration by experts.
Details
Motivation: Previous studies for maintaining temporal consistency in knowledge graphs rely on manually enumerated temporal constraints, which are labor-intensive and may have granularity issues. The introduction of time restrictions in temporal facts brings new challenges to knowledge graph quality management.Method: The authors propose PaTeCon, a pattern-based temporal constraint mining method that uses graph patterns and statistical information from the given knowledge graph to automatically generate temporal constraints without requiring human experts. The method is optimized for significant speed improvement.
Result: The authors built two new benchmarks by annotating Wikidata and Freebase for conflict detection. Extensive experiments demonstrate that the pattern-based automatic constraint mining approach is highly effective in generating valuable temporal constraints.
Conclusion: PaTeCon successfully addresses the limitations of manual temporal constraint enumeration by providing an automatic, pattern-based solution that effectively generates temporal constraints for knowledge graph quality management, as validated through comprehensive experiments on real-world datasets.
Abstract: Temporal facts, which are used to describe events that occur during specific time periods, have become a topic of increased interest in the field of knowledge graph (KG) research. In terms of quality management, the introduction of time restrictions brings new challenges to maintaining the temporal consistency of KGs. Previous studies rely on manually enumerated temporal constraints to detect conflicts, which are labor-intensive and may have granularity issues. To address this problem, we start from the common pattern of temporal facts and propose a pattern-based temporal constraint mining method, PaTeCon. Unlike previous studies, PaTeCon uses graph patterns and statistical information relevant to the given KG to automatically generate temporal constraints, without the need for human experts. In this paper, we illustrate how this method can be optimized to achieve significant speed improvement. We also annotate Wikidata and Freebase to build two new benchmarks for conflict detection. Extensive experiments demonstrate that our pattern-based automatic constraint mining approach is highly effective in generating valuable temporal constraints.
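As a hedged sketch of what checking mined temporal constraints can look like (PaTeCon's actual pattern language and statistics are not reproduced; the relations and intervals below are illustrative), conflict detection over interval-stamped facts reduces to simple interval tests:

```python
from typing import NamedTuple

class TemporalFact(NamedTuple):
    subject: str
    relation: str
    obj: str
    start: int  # e.g. year the fact becomes valid
    end: int    # year it stops being valid

def violates_disjointness(f1: TemporalFact, f2: TemporalFact) -> bool:
    """A mined mutual-exclusion constraint (e.g. spouses of different people)
    is violated when the two validity intervals overlap."""
    return max(f1.start, f2.start) <= min(f1.end, f2.end)

def violates_precedence(earlier: TemporalFact, later: TemporalFact) -> bool:
    """A mined precedence constraint (e.g. educated_at before works_for)
    is violated when the supposedly later fact starts first."""
    return later.start < earlier.start

facts = [
    TemporalFact("Alice", "spouse", "Bob", 1990, 2001),
    TemporalFact("Alice", "spouse", "Carol", 1999, 2005),
]
print(violates_disjointness(facts[0], facts[1]))  # True -> report a conflict
```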
[240] LLM as a code generator in Agile Model Driven Development
Ahmed R. Sadik, Sebastian Brulin, Markus Olhofer
Main category: cs.AI
TL;DR: This paper proposes an Agile Model Driven Development (AMDD) approach using GPT4 for automatic code generation from UML models, addressing the ambiguity challenges in natural language-based code generation by incorporating OCL and FIPA ontology constraints for multi-agent unmanned vehicle fleet systems.
Details
Motivation: Natural language descriptions for software development contain inherent ambiguity that poses substantial obstacles when using Large Language Models like GPT4 for generating deployable, structured code artifacts. There is a need for a more structured approach to overcome these challenges in automatic code generation.Method: The paper employs Agile Model Driven Development (AMDD) with GPT4 as the code generator. They model a multi-agent Unmanned Vehicle Fleet (UVF) system using UML, integrate Object Constraint Language (OCL) for code structure meta-modeling, and use FIPA ontology language for communication semantics meta-modeling to reduce model ambiguity. GPT4 then generates Java and Python code compatible with JADE and PADE frameworks respectively.
Result: The auto-generated code successfully aligns with expected behaviors and shows enhanced agent interactions. When comparing code complexity, the ontology-constrained meta model produces more complex code than OCL-only constraints, but the cyclomatic complexity remains within manageable levels, staying below the high-risk threshold for complexity.
Conclusion: The AMDD approach with GPT4 effectively addresses ambiguity issues in automatic code generation by using structured modeling with OCL and FIPA ontology constraints. Additional meta-model constraints can be incorporated without exceeding complexity risk thresholds, making this approach viable for scalable and flexible code generation while maintaining agility for adapting to changes in models or deployment environments.
Abstract: Leveraging Large Language Models (LLM) like GPT4 in the auto generation of code represents a significant advancement, yet it is not without its challenges. The ambiguity inherent in natural language descriptions of software poses substantial obstacles to generating deployable, structured artifacts. This research champions Model Driven Development (MDD) as a viable strategy to overcome these challenges, proposing an Agile Model Driven Development (AMDD) approach that employs GPT4 as a code generator. This approach enhances the flexibility and scalability of the code auto generation process and offers agility that allows seamless adaptation to changes in models or deployment environments. We illustrate this by modeling a multi agent Unmanned Vehicle Fleet (UVF) system using the Unified Modeling Language (UML), significantly reducing model ambiguity by integrating the Object Constraint Language (OCL) for code structure meta modeling, and the FIPA ontology language for communication semantics meta modeling. Applying GPT4 auto generation capabilities yields Java and Python code that is compatible with the JADE and PADE frameworks, respectively. Our thorough evaluation of the auto generated code verifies its alignment with expected behaviors and identifies enhancements in agent interactions. Structurally, we assessed the complexity of code derived from a model constrained solely by OCL meta models, against that influenced by both OCL and FIPA ontology meta models. The results indicate that the ontology constrained meta model produces inherently more complex code, yet its cyclomatic complexity remains within manageable levels, suggesting that additional meta model constraints can be incorporated without exceeding the high risk threshold for complexity.
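The complexity claim above rests on cyclomatic (McCabe) complexity. As an approximate, assumption-laden illustration of how such a check could be run on generated Python code (the paper's exact tooling and threshold are not specified here; a value around 10 is a commonly quoted risk cutoff), one can count decision points with the standard `ast` module:

```python
import ast

DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.With, ast.Assert)

def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 plus the number of decision points.

    Boolean operators add (len(values) - 1) extra branches. This is a rough
    stand-in for proper tooling, not the exact metric used in the paper.
    """
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, ast.BoolOp):
            complexity += len(node.values) - 1
        elif isinstance(node, DECISION_NODES):
            complexity += 1
    return complexity

# Hypothetical snippet of generated agent code.
generated = """
def dispatch(vehicle, task):
    if vehicle.is_idle and task.priority > 3:
        return vehicle.assign(task)
    for other in task.candidates:
        if other.battery > 0.5:
            return other.assign(task)
    return None
"""
print(cyclomatic_complexity(generated))  # 5; values above ~10 are often flagged as risky
```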
[241] LEGO Co-builder: Exploring Fine-Grained Vision-Language Modeling for Multimodal LEGO Assembly Assistants
Haochen Huang, Jiahuan Pei, Mohammad Aliannejadi, Xin Sun, Moonisa Ahsan, Chuang Yu, Zhaochun Ren, Pablo Cesar, Junxiao Wang
Main category: cs.AI
TL;DR: This paper introduces LEGO Co-builder, a hybrid benchmark for evaluating vision-language models on multimodal assembly instructions, revealing that even advanced models like GPT-4o struggle with fine-grained spatial reasoning and object state detection in assembly tasks.
Details
Motivation: Vision-language models face significant challenges in understanding and following multimodal assembly instructions, particularly when tasks require fine-grained spatial reasoning and precise object state detection. Current evaluation methods lack comprehensive benchmarks that can systematically assess these capabilities in real-world assembly scenarios.Method: The authors developed LEGO Co-builder, a hybrid benchmark that combines real-world LEGO assembly logic with programmatically generated multimodal scenes. The dataset captures stepwise visual states and procedural instructions, providing a controlled environment for evaluating instruction-following, object detection, and state detection. They introduced a unified framework to assess leading VLMs including GPT-4o, Gemini, and Qwen-VL under both zero-shot and fine-tuned settings.
Result: The evaluation revealed significant limitations in current VLMs’ performance on fine-grained assembly tasks. Even the most advanced model, GPT-4o, achieved only a maximum F1 score of 40.54% on state detection tasks, demonstrating substantial gaps in fine-grained visual understanding capabilities across all tested models.
Conclusion: The study highlights critical limitations in current vision-language models’ ability to handle fine-grained spatial reasoning and object state detection in assembly tasks. The authors released the benchmark, codebase, and generation pipeline to facilitate future research on multimodal assembly assistants that can better support real-world workflows and applications.
Abstract: Vision-language models (VLMs) are facing the challenges of understanding and following multimodal assembly instructions, particularly when fine-grained spatial reasoning and precise object state detection are required. In this work, we explore LEGO Co-builder, a hybrid benchmark combining real-world LEGO assembly logic with programmatically generated multimodal scenes. The dataset captures stepwise visual states and procedural instructions, allowing controlled evaluation of instruction-following, object detection, and state detection. We introduce a unified framework and assess leading VLMs such as GPT-4o, Gemini, and Qwen-VL, under zero-shot and fine-tuned settings. Our results reveal that even advanced models like GPT-4o struggle with fine-grained assembly tasks, with a maximum F1 score of just 40.54% on state detection, highlighting gaps in fine-grained visual understanding. We release the benchmark, codebase, and generation pipeline to support future research on multimodal assembly assistants grounded in real-world workflows.
[242] Learning Neural Strategy-Proof Matching Mechanism from Examples
Ryota Maruo, Koh Takeuchi, Hisashi Kashima
Main category: cs.AI
TL;DR: This paper proposes NeuralSD, a neural network architecture that learns strategy-proof two-sided matching mechanisms from examples while handling variable numbers of agents and contextual information, based on serial dictatorship with differentiable tensor operations.
Details
Motivation: Existing learning-based matching mechanisms cannot guarantee strategy-proofness and struggle with varying agent numbers and contextual information, which are crucial limitations for real-world applications where agents might manipulate the system.Method: The authors develop NeuralSD, a neural network based on serial dictatorship (SD) that uses attention mechanisms to compute agent rankings from contextual information. They introduce tensor serial dictatorship (TSD) as a differentiable relaxation of SD using tensor operations, enabling end-to-end training while maintaining strategy-proofness.
Result: NeuralSD outperformed baseline methods in predicting matchings from examples and achieved better performance on multiple metrics measuring the quality of matching outcomes, while maintaining strategy-proofness guarantees.
Conclusion: The proposed NeuralSD framework successfully addresses key limitations of existing learning-based matching mechanisms by guaranteeing strategy-proofness, handling variable agent numbers, and incorporating contextual information, making it more suitable for practical applications.
Abstract: Designing two-sided matching mechanisms is challenging when practical demands for matching outcomes are difficult to formalize and the designed mechanism must satisfy theoretical conditions. To address this, prior work has proposed a framework that learns a matching mechanism from examples, using a parameterized family that satisfies properties such as stability. However, despite its usefulness, this framework does not guarantee strategy-proofness (SP), and cannot handle varying numbers of agents or incorporate publicly available contextual information about agents, both of which are crucial in real-world applications. In this paper, we propose a new parametrized family of matching mechanisms that always satisfy strategy-proofness, are applicable for an arbitrary number of agents, and deal with public contextual information of agents, based on the serial dictatorship (SD). This family is represented by NeuralSD, a novel neural network architecture based on SD, where agent rankings in SD are treated as learnable parameters computed from agents’ contexts using an attention-based sub-network. To enable learning, we introduce tensor serial dictatorship (TSD), a differentiable relaxation of SD using tensor operations. This allows NeuralSD to be trained end-to-end from example matchings while satisfying SP. We conducted experiments to learn a matching mechanism from matching examples while satisfying SP. We demonstrated that our method outperformed baselines in predicting matchings and on several metrics for goodness of matching outcomes.
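NeuralSD's differentiable tensor relaxation is not reproduced here, but the classical serial dictatorship it relaxes is simple to state: agents pick, in priority order, their most preferred remaining option. A minimal sketch (one-sided assignment for brevity; all names are illustrative):

```python
def serial_dictatorship(ranking, preferences, items):
    """Classical serial dictatorship: agents pick in priority order.

    ranking:     agent ids, highest priority first (NeuralSD learns this ranking
                 from agent contexts; here it is simply given)
    preferences: dict agent -> options, most preferred first
    items:       available options (e.g. positions on the other side of the market)
    """
    remaining = set(items)
    matching = {}
    for agent in ranking:
        choice = next((it for it in preferences[agent] if it in remaining), None)
        matching[agent] = choice
        if choice is not None:
            remaining.remove(choice)
    return matching

print(serial_dictatorship(
    ranking=["a1", "a2", "a3"],
    preferences={"a1": ["x", "y"], "a2": ["x", "z"], "a3": ["y", "z"]},
    items={"x", "y", "z"},
))  # {'a1': 'x', 'a2': 'z', 'a3': 'y'}
```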
[243] Balans: Multi-Armed Bandits-based Adaptive Large Neighborhood Search for Mixed-Integer Programming Problem
Junyang Cai, Serdar Kadioglu, Bistra Dilkina
Main category: cs.AI
TL;DR: Balans is an adaptive meta-solver for mixed-integer programming that uses online learning and multi-armed bandit algorithms to dynamically select neighborhood operators during search, eliminating the need for offline training while achieving better performance than default MIP solvers.
Details
Motivation: Existing learning-based MIP solving approaches suffer from heavy reliance on offline training, requiring costly data collection and training epochs while offering limited generalization to unseen or larger instances. There is a need for adaptive MIP solving methods that can learn online without supervision or prior training.Method: Balans employs adaptive large-neighborhood search operating on top of MIP solvers through successive destroy and repair neighborhood operators. The key innovation is using multi-armed bandit algorithms to guide the selection among different neighborhood definitions on-the-fly for each specific instance during the search process.
Result: Extensive experiments on hard optimization instances demonstrate that Balans provides significant performance gains over default MIP solvers, outperforms committing to any single best neighborhood, and improves upon state-of-the-art large-neighborhood search methods for MIPs.
Conclusion: Balans successfully addresses the limitations of offline training-dependent MIP solving approaches by providing an adaptive, online learning solution that achieves superior performance without requiring supervision or prior training. The authors have released it as configurable, MIP solver agnostic, open-source software.
Abstract: Mixed-integer programming (MIP) is a powerful paradigm for modeling and solving various important combinatorial optimization problems. Recently, learning-based approaches have shown a potential to speed up MIP solving via offline training that then guides important design decisions during the search. However, a significant drawback of these methods is their heavy reliance on offline training, which requires collecting training datasets and computationally costly training epochs yet offering only limited generalization to unseen (larger) instances. In this paper, we propose Balans, an adaptive meta-solver for MIPs with online learning capability that does not require any supervision or apriori training. At its core, Balans is based on adaptive large-neighborhood search, operating on top of an MIP solver by successive applications of destroy and repair neighborhood operators. During the search, the selection among different neighborhood definitions is guided on the fly for the instance at hand via multi-armed bandit algorithms. Our extensive experiments on hard optimization instances show that Balans offers significant performance gains over the default MIP solver, is better than committing to any single best neighborhood, and improves over the state-of-the-art large-neighborhood search for MIPs. Finally, we release Balans as a highly configurable, MIP solver agnostic, open-source software.
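As a rough illustration of bandit-guided operator selection (not Balans itself: the operator names, reward definition, and the use of UCB1 specifically are assumptions), the adaptive loop can be sketched as:

```python
import math
import random

class UCB1:
    """Minimal UCB1 bandit for picking a neighborhood operator on the fly."""
    def __init__(self, n_arms: int):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms

    def select(self) -> int:
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm  # try every operator once first
        total = sum(self.counts)
        return max(range(len(self.counts)),
                   key=lambda a: self.values[a]
                   + math.sqrt(2 * math.log(total) / self.counts[a]))

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Illustrative ALNS-style loop: each arm is a destroy/repair operator pair,
# and the bandit reward is the objective improvement it produced.
operators = ["random_destroy", "local_branching", "crossover_destroy"]
bandit = UCB1(len(operators))
incumbent = 100.0  # objective value of the current best MIP solution
for _ in range(50):
    arm = bandit.select()
    candidate = incumbent - random.random()      # placeholder for destroy + repair + solve
    bandit.update(arm, max(0.0, incumbent - candidate))
    incumbent = min(incumbent, candidate)
print(dict(zip(operators, [round(v, 3) for v in bandit.values])))
```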
[244] A novel approach to navigate the taxonomic hierarchy to address the Open-World Scenarios in Medicinal Plant Classification
Soumen Sinha, Tanisha Rana, Rahul Roy
Main category: cs.AI
TL;DR: The paper proposes a novel hierarchical taxonomy classification method for medicinal plants that can handle unknown species by integrating DenseNet121, Multi-Scale Self-Attention (MSSA), and cascaded classifiers, achieving good performance while being 4x smaller than existing methods.
Details
Motivation: Existing medicinal plant classification methods fail to perform hierarchical classification and accurately identify unknown species, limiting their effectiveness in comprehensive plant taxonomy. There's a need for a system that can classify plants at multiple taxonomic levels (phylum to species) and handle unknown species by assigning appropriate hierarchical labels.Method: The approach integrates three key components: DenseNet121 as the backbone network, Multi-Scale Self-Attention (MSSA) mechanism to capture both local and global contextual information across multiple scales, and cascaded classifiers for hierarchical classification from phylum to species level. The attention mechanism helps focus on important features and improves distinction between similar species.
Result: The model achieved average accuracies of 83.36% for phylum, 78.30% for class, 60.34% for order, and 43.32% for family prediction on unknown species. The model was tested on two state-of-the-art datasets with and without background artifacts. The proposed model is approximately four times smaller than existing state-of-the-art methods.
Conclusion: The proposed method provides an effective solution for hierarchical plant taxonomy classification that can handle both known and unknown species. The integration of DenseNet121, MSSA, and cascaded classifiers demonstrates superior performance in identifying species across multiple taxonomic levels while maintaining a compact model size suitable for real-world deployment.
Abstract: In this article, we propose a novel approach for plant hierarchical taxonomy classification by posing the problem as an open class problem. It is observed that existing methods for medicinal plant classification often fail to perform hierarchical classification and to accurately identify unknown species, limiting their effectiveness in comprehensive plant taxonomy classification. Thus we address the problem of unknown species classification by assigning it the best hierarchical labels. We propose a novel method, which integrates DenseNet121, Multi-Scale Self-Attention (MSSA) and cascaded classifiers for hierarchical classification. The approach systematically categorizes medicinal plants at multiple taxonomic levels, from phylum to species, ensuring detailed and precise classification. Using multi-scale self-attention, the model captures both local and global contextual information from the images, improving the distinction between similar species and the identification of new ones. It uses attention scores to focus on important features across multiple scales. The proposed method provides a solution for hierarchical classification, showcasing superior performance in identifying both known and unknown species. The model was tested on two state-of-the-art datasets, with and without background artifacts, so that it can be deployed to tackle real-world applications. We used unknown species for testing our model. For unknown species the model achieved an average accuracy of 83.36%, 78.30%, 60.34% and 43.32% for predicting the correct phylum, class, order and family respectively. Our proposed model is almost four times smaller than existing state-of-the-art methods, making it easily deployable in real-world applications.
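A hedged sketch of the backbone-plus-cascaded-heads structure described above, in PyTorch. The MSSA attention module is omitted, the per-level class counts are placeholders, and feeding each level's logits into the next head is one common cascading choice rather than the paper's confirmed wiring:

```python
import torch
import torch.nn as nn
import torchvision

class HierarchicalPlantClassifier(nn.Module):
    """DenseNet121 features with cascaded heads: phylum -> class -> order -> family -> species.

    Class counts are placeholders, and each head also sees the previous level's
    logits; the paper's MSSA attention module is not included in this sketch.
    """
    def __init__(self, n_per_level=(5, 12, 30, 60, 120)):
        super().__init__()
        backbone = torchvision.models.densenet121(weights=None)
        feat_dim = backbone.classifier.in_features  # 1024
        backbone.classifier = nn.Identity()
        self.backbone = backbone
        self.heads = nn.ModuleList()
        prev = 0
        for n_classes in n_per_level:
            self.heads.append(nn.Linear(feat_dim + prev, n_classes))
            prev = n_classes

    def forward(self, x):
        feats = self.backbone(x)
        logits, prev = [], None
        for head in self.heads:
            inp = feats if prev is None else torch.cat([feats, prev], dim=1)
            prev = head(inp)
            logits.append(prev)
        return logits  # one logit tensor per taxonomic level

model = HierarchicalPlantClassifier()
print([o.shape for o in model(torch.randn(2, 3, 224, 224))])
```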
[245] More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment
Yifan Wang, Runjin Chen, Bolian Li, David Cho, Yihe Deng, Ruqi Zhang, Tianlong Chen, Zhangyang Wang, Ananth Grama, Junyuan Hong
Main category: cs.AI
TL;DR: This study reveals that while multi-model generated preference data improves general task performance in DPO alignment, it creates safety vulnerabilities by enabling reward hacking and increasing jailbreak attack success rates, with single-model self-generated data proving safer for alignment.
Details
Motivation: The need to understand safety implications of different preference data generation strategies in Direct Preference Optimization (DPO) for aligning large language models with human values, particularly given the growing use of synthetic preference data for cost-effective alignment.Method: Comparative analysis of single-model vs multi-model preference data generation for DPO alignment, examining performance on general benchmarks (ARC, Hellaswag, MMLU, TruthfulQA, Winogrande) and safety metrics (attack success rate against jailbreaking prompts), with experiments across Llama, Mistral, and Qwen model families.
Result: Multi-model generated data improves general task performance but dramatically increases vulnerability to jailbreaking attacks due to reward hacking. Single-model self-generated responses significantly outperform multi-model configurations in safety metrics. Multi-model preference data shows high linear separability, allowing models to exploit superficial cues instead of learning robust safety constraints.
Conclusion: For safety-critical applications, single-model self-generated preference data is superior to multi-model approaches in DPO alignment, as multi-model data creates exploitable patterns that compromise safety despite improving general performance. The high linear separability in multi-model data enables reward hacking rather than genuine safety learning.
Abstract: Aligning large language models (LLMs) with human values is an increasingly critical step in post-training. Direct Preference Optimization (DPO) has emerged as a simple, yet effective alternative to reinforcement learning from human feedback (RLHF). Synthetic preference data with its low cost and high quality enable effective alignment through single- or multi-model generated preference data. Our study reveals a striking, safety-specific phenomenon associated with DPO alignment: Although multi-model generated data enhances performance on general tasks (ARC, Hellaswag, MMLU, TruthfulQA, Winogrande) by providing diverse responses, it also tends to facilitate reward hacking during training. This can lead to a high attack success rate (ASR) when models encounter jailbreaking prompts. The issue is particularly pronounced when employing stronger models like GPT-4o or larger models in the same family to generate chosen responses paired with target model self-generated rejected responses, resulting in dramatically poorer safety outcomes. Furthermore, with respect to safety, using solely self-generated responses (single-model generation) for both chosen and rejected pairs significantly outperforms configurations that incorporate responses from stronger models, whether used directly as chosen data or as part of a multi-model response pool. We demonstrate that multi-model preference data exhibits high linear separability between chosen and rejected responses, which allows models to exploit superficial cues rather than internalizing robust safety constraints. Our experiments, conducted on models from the Llama, Mistral, and Qwen families, consistently validate these findings.
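The linear-separability observation can be probed with a standard linear classifier on response embeddings. This is a generic probe, not the paper's measurement protocol, and how the embeddings are obtained is left open here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_separability(chosen_emb: np.ndarray, rejected_emb: np.ndarray) -> float:
    """Held-out accuracy of a linear probe separating chosen from rejected embeddings.

    Accuracy near 1.0 indicates the near-linear separability the paper links
    to reward hacking; near 0.5 indicates the classes are not linearly separable.
    """
    X = np.vstack([chosen_emb, rejected_emb])
    y = np.concatenate([np.ones(len(chosen_emb)), np.zeros(len(rejected_emb))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

# Toy check with random embeddings: accuracy should hover around chance (0.5).
rng = np.random.default_rng(0)
print(linear_separability(rng.normal(size=(500, 64)), rng.normal(size=(500, 64))))
```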
[246] Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving
Xinji Mai, Haotian Xu, Xing W, Weinong Wang, Jian Hu, Yingying Zhang, Wenqiang Zhang
Main category: cs.AI
TL;DR: The paper introduces ZeroTIR, a reinforcement learning approach that trains large language models to autonomously generate and execute Python code for mathematical reasoning without supervised tool-use examples, demonstrating predictable scaling relationships between training effort and tool-augmented reasoning capabilities.
Details
Motivation: Large Language Models struggle with mathematical reasoning tasks that require precise computation. While RL from outcome-based rewards can improve text-based reasoning, it's unclear how agents learn to autonomously use external tools like code execution for mathematical problem-solving without explicit supervision.Method: The authors develop ZeroTIR (Zero Tool-Integrated Reasoning), which uses reinforcement learning from outcome-based rewards to train base LLMs to spontaneously generate and execute Python code for mathematical problems. They implement a decoupled code execution environment and validate across standard RL algorithms and frameworks.
Result: Key metrics scale predictably during RL training, showing strong positive correlations between increased training steps and spontaneous code execution frequency, average response length, and final task accuracy. ZeroTIR significantly outperforms non-tool ZeroRL baselines on challenging math benchmarks, demonstrating a quantifiable relationship between computational training effort and effective tool-augmented reasoning strategies.
Conclusion: The study provides foundational understanding of how autonomous tool use emerges and scales in Agent RL systems. The predictable scaling relationships offer insights into the acquisition of tool-augmented reasoning capabilities and establish a reproducible benchmark for future research in this area.
Abstract: Large Language Models (LLMs) often struggle with mathematical reasoning tasks requiring precise, verifiable computation. While Reinforcement Learning (RL) from outcome-based rewards enhances text-based reasoning, understanding how agents autonomously learn to leverage external tools like code execution remains crucial. We investigate RL from outcome-based rewards for Tool-Integrated Reasoning, ZeroTIR, training base LLMs to spontaneously generate and execute Python code for mathematical problems without supervised tool-use examples. Our central contribution is demonstrating that as RL training progresses, key metrics scale predictably. Specifically, we observe strong positive correlations where increased training steps lead to increases in the spontaneous code execution frequency, the average response length, and, critically, the final task accuracy. This suggests a quantifiable relationship between computational effort invested in training and the emergence of effective, tool-augmented reasoning strategies. We implement a robust framework featuring a decoupled code execution environment and validate our findings across standard RL algorithms and frameworks. Experiments show ZeroTIR significantly surpasses non-tool ZeroRL baselines on challenging math benchmarks. Our findings provide a foundational understanding of how autonomous tool use is acquired and scales within Agent RL, offering a reproducible benchmark for future studies. Code is released at https://github.com/yyht/openrlhf_async_pipline.
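A minimal sketch of the spontaneous-code-execution loop that ZeroTIR-style training relies on, under stated assumptions: code is extracted from fenced blocks, run in a subprocess with a timeout, and the outcome-based reward simply checks a boxed final answer. The actual environment in the linked repository is more involved.

```python
import re
import subprocess

FENCE = "`" * 3  # literal triple backticks, built at runtime to keep this listing readable
CODE_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)

def run_code(code: str, timeout_s: int = 5) -> str:
    """Execute model-generated code in a separate process and capture stdout."""
    try:
        proc = subprocess.run(["python", "-c", code], capture_output=True,
                              text=True, timeout=timeout_s)
        return proc.stdout if proc.returncode == 0 else "[error] " + proc.stderr[-300:]
    except subprocess.TimeoutExpired:
        return "[error] timeout"

def outcome_reward(model_output: str, reference_answer: str) -> float:
    """Outcome-based reward: 1 if the final boxed answer matches the reference."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    return 1.0 if match and match.group(1).strip() == reference_answer else 0.0

rollout = "Let me compute this.\n" + FENCE + "python\nprint(17 * 24)\n" + FENCE
for code in CODE_RE.findall(rollout):
    rollout += "\nExecution output: " + run_code(code)
rollout += "\nSo the answer is \\boxed{408}."
print(outcome_reward(rollout, "408"))  # 1.0
```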
[247] Turing Test 2.0: The General Intelligence Threshold
Georgios Mappouras
Main category: cs.AI
TL;DR: This paper proposes a new “Turing test 2.0” framework to detect artificial general intelligence (AGI), arguing that traditional methods like the original Turing test are insufficient for measuring AGI in modern AI systems.
Details
Motivation: The rise of AI and large language models has sparked a race toward AGI, but there is no clear agreement on how to detect when AI systems achieve AGI. Traditional methods like the Turing test are inadequate for measuring true general intelligence in modern AI models.Method: The authors introduce two key contributions: (1) a clear definition of general intelligence (G.I.) with a G.I. Threshold (G.I.T.) to distinguish AGI-capable systems from those that are not, and (2) a new “Turing test 2.0” framework for constructing tests that can detect general intelligence in a simple, comprehensive, and clear-cut pass/fail manner.
Result: The paper demonstrates real-life examples of applying tests following the Turing test 2.0 framework on modern AI models, showing how the proposed method can be practically implemented to evaluate current AI systems.
Conclusion: The proposed Turing test 2.0 framework provides a more suitable and practical approach for detecting AGI compared to traditional methods, offering clear criteria and testing procedures that can definitively determine whether a system has achieved general intelligence.
Abstract: With the rise of artificial intelligence (A.I.) and large language models like ChatGPT, a new race for achieving artificial general intelligence (A.G.I) has started. While many speculate how and when A.I. will achieve A.G.I., there is no clear agreement on how A.G.I. can be detected in A.I. models, even when popular tools like the Turing test (and its modern variations) are used to measure their intelligence. In this work, we discuss why traditional methods like the Turing test do not suffice for measuring or detecting A.G.I. and provide a new, practical method that can be used to decide if a system (computer or any other) has reached or surpassed A.G.I. To achieve this, we make two new contributions. First, we present a clear definition for general intelligence (G.I.) and set a G.I. Threshold (G.I.T.) that can be used to distinguish between systems that achieve A.G.I. and systems that do not. Second, we present a new framework on how to construct tests that can detect if a system has achieved G.I. in a simple, comprehensive, and clear-cut fail/pass way. We call this novel framework the Turing test 2.0. We then demonstrate real-life examples of applying tests that follow our Turing test 2.0 framework on modern A.I. models.
[248] Tournament of Prompts: Evolving LLM Instructions Through Structured Debates and Elo Ratings
Anirudh Nair, Adi Banerjee, Laurent Mombaerts, Matthew Hagen, Tarik Borogovac
Main category: cs.AI
TL;DR: DEEVO is a novel framework that uses debate-driven evaluation and Elo-based selection to automatically optimize prompts for Large Language Models, eliminating the need for manual prompt engineering and predefined metrics while outperforming existing methods on both open-ended and close-ended tasks.
Details
Motivation: Prompt engineering is a critical bottleneck for leveraging LLMs effectively, requiring specialized expertise and manual intervention. Existing automated prompt optimization methods fail on subjective quality assessment tasks because they need well-defined numerical fitness functions or rely on generic templates that cannot capture complex, nuanced requirements.Method: DEEVO (DEbate-driven EVOlutionary prompt optimization) uses a debate-driven evaluation system with Elo-based selection to guide prompt evolution. The framework explores discrete prompt space while maintaining semantic coherence through intelligent crossover and strategic mutation operations that incorporate debate-based feedback, combining elements from successful and unsuccessful prompts based on identified strengths rather than arbitrary splicing.
Result: DEEVO significantly outperforms both manual prompt engineering and state-of-the-art optimization approaches on open-ended and close-ended tasks, achieving this without requiring ground truth feedback. The method successfully drives improvement while preserving valuable diversity in the prompt population using Elo ratings as a fitness proxy.
Conclusion: DEEVO represents a significant advancement in prompt optimization research by connecting LLMs’ reasoning capabilities with adaptive optimization, eliminating the need for predetermined metrics while continuously improving AI systems. This approach addresses the fundamental challenge of optimizing prompts for subjective quality assessment tasks.
Abstract: Prompt engineering represents a critical bottleneck to harness the full potential of Large Language Models (LLMs) for solving complex tasks, as it requires specialized expertise, significant trial-and-error, and manual intervention. This challenge is particularly pronounced for tasks involving subjective quality assessment, where defining explicit optimization objectives becomes fundamentally problematic. Existing automated prompt optimization methods falter in these scenarios, as they typically require well-defined task-specific numerical fitness functions or rely on generic templates that cannot capture the nuanced requirements of complex use cases. We introduce DEEVO (DEbate-driven EVOlutionary prompt optimization), a novel framework that guides prompt evolution through a debate-driven evaluation with an Elo-based selection. Contrary to prior work, DEEVO's approach enables exploration of the discrete prompt space while preserving semantic coherence through intelligent crossover and strategic mutation operations that incorporate debate-based feedback, combining elements from both successful and unsuccessful prompts based on identified strengths rather than arbitrary splicing. Using Elo ratings as a fitness proxy, DEEVO simultaneously drives improvement and preserves valuable diversity in the prompt population. Experimental results demonstrate that DEEVO significantly outperforms both manual prompt engineering and alternative state-of-the-art optimization approaches on open-ended tasks and close-ended tasks despite using no ground truth feedback. By connecting LLMs' reasoning capabilities with adaptive optimization, DEEVO represents a significant advancement in prompt optimization research by eliminating the need for predetermined metrics to continuously improve AI systems.
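The Elo bookkeeping that DEEVO uses as a fitness proxy is standard and easy to sketch; the debate-driven judging itself would be an LLM call, stubbed out below:

```python
import random

def elo_update(r_winner: float, r_loser: float, k: float = 32.0):
    """Standard Elo update after one pairwise comparison (here, a prompt debate)."""
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    return r_winner + k * (1.0 - expected), r_loser - k * (1.0 - expected)

def debate_winner(prompt_a: str, prompt_b: str) -> str:
    """Placeholder: DEEVO would decide this via an LLM-judged structured debate."""
    return random.choice([prompt_a, prompt_b])

ratings = {f"prompt_{i}": 1000.0 for i in range(4)}
for _ in range(100):
    a, b = random.sample(list(ratings), 2)
    winner = debate_winner(a, b)
    loser = b if winner == a else a
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])

# Elo acts as the fitness proxy: the highest-rated prompts seed the next generation.
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```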
[249] A Community-driven vision for a new Knowledge Resource for AI
Vinay K Chaudhri, Chaitan Baru, Brandon Bennett, Mehul Bhatt, Darion Cassel, Anthony G Cohn, Rina Dechter, Esra Erdem, Dave Ferrucci, Ken Forbus, Gregory Gelfond, Michael Genesereth, Andrew S. Gordon, Benjamin Grosof, Gopal Gupta, Jim Hendler, Sharat Israni, Tyler R. Josephson, Patrick Kyllonen, Yuliya Lierler, Vladimir Lifschitz, Clifton McFate, Hande K. McGinty, Leora Morgenstern, Alessandro Oltramari, Praveen Paritosh, Dan Roth, Blake Shepard, Cogan Shimzu, Denny Vrandečić, Mark Whiting, Michael Witbrock
Main category: cs.AI
TL;DR: This paper synthesizes findings from an AAAI workshop with 50+ researchers on creating a comprehensive, multi-purpose knowledge resource for AI, proposing a community-driven vision for new knowledge infrastructure that includes an open engineering framework with knowledge modules for practical applications.
Details
Motivation: Despite existing knowledge resources like WordNet and ConceptNet, AI still lacks verifiable, general-purpose knowledge sources. This deficiency affects large language models (knowledge gaps), robotic planning (missing world knowledge), and fact-checking (reliance on human expertise). The motivation is to address this critical AI infrastructure gap.Method: The authors conducted an AAAI workshop gathering over 50 researchers to explore what kind of knowledge resource is most needed in AI and how modern technology can shape its development. They synthesized findings from this collaborative effort to develop a community-driven vision.
Result: The workshop produced a synthesized vision for new knowledge infrastructure that leverages contemporary advances in knowledge representation and reasoning. A key promising approach identified is building an open engineering framework that can effectively exploit knowledge modules within practical applications, supported by appropriate conventions and social structures.
Conclusion: The paper concludes that AI needs a new type of knowledge infrastructure that goes beyond existing resources. This should be built as a community-driven effort using an open engineering framework that incorporates knowledge modules with established conventions and social structures to ensure effective adoption and contribution by the research community.
Abstract: The long-standing goal of creating a comprehensive, multi-purpose knowledge resource, reminiscent of the 1984 Cyc project, still persists in AI. Despite the success of knowledge resources like WordNet, ConceptNet, Wolfram|Alpha and other commercial knowledge graphs, verifiable, general-purpose widely available sources of knowledge remain a critical deficiency in AI infrastructure. Large language models struggle due to knowledge gaps; robotic planning lacks necessary world knowledge; and the detection of factually false information relies heavily on human expertise. What kind of knowledge resource is most needed in AI today? How can modern technology shape its development and evaluation? A recent AAAI workshop gathered over 50 researchers to explore these questions. This paper synthesizes our findings and outlines a community-driven vision for a new knowledge infrastructure. In addition to leveraging contemporary advances in knowledge representation and reasoning, one promising idea is to build an open engineering framework to exploit knowledge modules effectively within the context of practical applications. Such a framework should include sets of conventions and social structures that are adopted by contributors.
[250] Working with AI: Measuring the Occupational Implications of Generative AI
Kiran Tomlinson, Sonia Jaffe, Will Wang, Scott Counts, Siddharth Suri
Main category: cs.AI
TL;DR: This paper analyzes 200k conversations with Microsoft Bing Copilot to understand how people use AI for work activities and calculates AI applicability scores for different occupations, finding highest impact on knowledge work like computer/mathematical and office support roles.
Details
Motivation: Understanding the economic impact of generative AI adoption is crucial for society. The authors aim to empirically analyze how AI is actually being used in work contexts and which occupations are most affected by studying real user interactions with AI systems.Method: The researchers analyzed a dataset of 200,000 anonymized conversations between users and Microsoft Bing Copilot. They classified work activities that users seek AI help for, measured task success rates and scope of impact, then combined these metrics with occupational data to compute AI applicability scores for different job categories.
Result: The most common work activities people use AI for are information gathering and writing. AI most frequently provides information, assistance, writing, teaching, and advising. Knowledge work occupations (computer/mathematical, office/administrative support) and sales roles showed the highest AI applicability scores. The study also revealed correlations between wage/education levels and AI applicability.
Conclusion: The research provides empirical evidence that AI has the highest applicability for knowledge-based occupations, particularly those involving information processing and communication. This real-world usage data offers insights into actual AI adoption patterns compared to theoretical predictions of occupational AI impact.
Abstract: Given the rapid adoption of generative AI and its potential to impact a wide range of tasks, understanding the effects of AI on the economy is one of society’s most important questions. In this work, we take a step toward that goal by analyzing the work activities people do with AI, how successfully and broadly those activities are done, and combine that with data on what occupations do those activities. We analyze a dataset of 200k anonymized and privacy-scrubbed conversations between users and Microsoft Bing Copilot, a publicly available generative AI system. We find the most common work activities people seek AI assistance for involve gathering information and writing, while the most common activities that AI itself is performing are providing information and assistance, writing, teaching, and advising. Combining these activity classifications with measurements of task success and scope of impact, we compute an AI applicability score for each occupation. We find the highest AI applicability scores for knowledge work occupation groups such as computer and mathematical, and office and administrative support, as well as occupations such as sales whose work activities involve providing and communicating information. Additionally, we characterize the types of work activities performed most successfully, how wage and education correlate with AI applicability, and how real-world usage compares to predictions of occupational AI impact.
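The abstract does not give the scoring formula, so the following is only one illustrative way to combine the measured ingredients (activity coverage, task success, scope of impact, and occupation-level activity weights) into a single applicability score; every field name and the multiplicative form are assumptions:

```python
def applicability_score(activities: list) -> float:
    """Illustrative occupation-level score -- NOT the paper's formula.

    Each activity dict carries (all names hypothetical):
      weight   - share of the occupation's work this activity represents
      coverage - how often the activity shows up in AI conversations
      success  - measured task success rate when AI assists with it
      scope    - fraction of the activity the AI plausibly affects
    """
    total = sum(a["weight"] for a in activities)
    return sum(a["weight"] * a["coverage"] * a["success"] * a["scope"]
               for a in activities) / total

office_support = [
    {"weight": 0.5, "coverage": 0.8, "success": 0.7, "scope": 0.6},
    {"weight": 0.3, "coverage": 0.4, "success": 0.6, "scope": 0.5},
    {"weight": 0.2, "coverage": 0.1, "success": 0.5, "scope": 0.3},
]
print(round(applicability_score(office_support), 3))  # 0.207
```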
[251] Automated planning with ontologies under coherence update semantics (Extended Version)
Stefan Borgwardt, Duy Nhu, Gabriele Röger
Main category: cs.AI
TL;DR: This paper presents a new approach for automated planning that incorporates DL-Lite ontologies, combining explicit-input knowledge and action bases (eKABs) with coherence update semantics, while maintaining polynomial complexity and providing practical implementation through compilation to classical planning.
Details
Motivation: Standard automated planning uses closed-world semantics, but incorporating background knowledge through ontologies (which use open-world semantics) can enhance planning capabilities. The motivation is to bridge this gap by developing a method that effectively combines ontology-based knowledge with automated planning while maintaining computational tractability.Method: The paper develops a planning approach that combines DL-Lite ontologies with explicit-input knowledge and action bases (eKABs) for action conditions, and uses coherence update semantics for ontology-aware action effects. The method is implemented through polynomial compilation into classical planning problems.
Result: The resulting formalism maintains the same complexity as previous approaches (not higher complexity). The authors provide a practical implementation and evaluate their system on both existing and new benchmarks, examining performance across different compilation variants.
Conclusion: The paper successfully demonstrates that DL-Lite ontologies can be effectively integrated into automated planning without increasing computational complexity, providing a viable approach for incorporating background knowledge into planning systems through polynomial compilation to classical planning.
Abstract: Standard automated planning employs first-order formulas under closed-world semantics to achieve a goal with a given set of actions from an initial state. We follow a line of research that aims to incorporate background knowledge into automated planning problems, for example, by means of ontologies, which are usually interpreted under open-world semantics. We present a new approach for planning with DL-Lite ontologies that combines the advantages of ontology-based action conditions provided by explicit-input knowledge and action bases (eKABs) and ontology-aware action effects under the coherence update semantics. We show that the complexity of the resulting formalism is not higher than that of previous approaches and provide an implementation via a polynomial compilation into classical planning. An evaluation of existing and new benchmarks examines the performance of a planning system on different variants of our compilation.
[252] Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with constraints
Zhenyun Yin, Shujie Wang, Xuhong Wang, Xingjun Ma, Yinchun Wang
Main category: cs.AI
TL;DR: This paper introduces Deliberative Searcher, a framework that combines certainty calibration with retrieval-based search for open-domain question answering, using reinforcement learning to improve the reliability and trustworthiness of large language models.
Details
Motivation: Large language models need improved reliability for real-world deployment, particularly in ensuring that model confidence aligns with actual correctness to produce more trustworthy outputs.Method: The proposed Deliberative Searcher framework integrates certainty calibration with retrieval-based search, performs multi-step reflection and verification over Wikipedia data, and uses reinforcement learning optimization with accuracy targets under soft reliability constraints.
Result: Empirical results demonstrate improved alignment between model confidence and correctness, leading to more trustworthy outputs in open-domain question answering tasks.
Conclusion: The Deliberative Searcher framework successfully enhances LLM reliability by combining certainty calibration with retrieval-based search, showing promise for more trustworthy AI systems in real-world applications.
Abstract: Improving the reliability of large language models (LLMs) is critical for deploying them in real-world scenarios. In this paper, we propose Deliberative Searcher, the first framework to integrate certainty calibration with retrieval-based search for open-domain question answering. The agent performs multi-step reflection and verification over Wikipedia data and is trained with a reinforcement learning algorithm that optimizes for accuracy under a soft reliability constraint. Empirical results show that the proposed method improves alignment between model confidence and correctness, leading to more trustworthy outputs. This paper will be continuously updated.
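The paper targets alignment between confidence and correctness; a standard way to quantify that alignment (not necessarily the metric or constraint used in the paper) is expected calibration error:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: bin answers by confidence, then average |accuracy - mean confidence|
    over bins, weighted by bin size. Lower means better calibrated."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Toy check: the agent's self-reported certainty vs. whether its answer was right.
print(expected_calibration_error([0.9, 0.8, 0.6, 0.55], [1, 1, 0, 1]))
```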
cs.SD
[253] Weak Supervision Techniques towards Enhanced ASR Models in Industry-level CRM Systems
Zhongsheng Wang, Sijie Wang, Jia Wang, Yung-I Liang, Yuxi Zhang, Jiamou Liu
Main category: cs.SD
TL;DR: This paper proposes a fine-tuning approach for industry-specific automatic speech recognition (ASR) models to improve customer relationship management (CRM) systems by better identifying customer types and enabling personalized services.
Details
Motivation: General pre-trained ASR models struggle with industry-specific speech recognition tasks in CRM systems, making it difficult to accurately identify customer voices and intentions for personalized service delivery.Method: The authors developed a fine-tuning solution specifically designed for industry-specific ASR models to enhance their performance in specialized business contexts.
Result: Experimental results demonstrate substantial improvement in ASR model performance for industry applications, with the fine-tuned models significantly enhancing their auxiliary role in CRM systems.
Conclusion: The proposed fine-tuning approach successfully addresses industry-specific speech recognition challenges in CRM systems and has been successfully deployed in actual industrial applications, improving customer type identification and personalized service delivery.
Abstract: In the design of customer relationship management (CRM) systems, accurately identifying customer types and offering personalized services are key to enhancing customer satisfaction and loyalty. However, this process faces the challenge of discerning customer voices and intentions, and general pre-trained automatic speech recognition (ASR) models make it difficult to effectively address industry-specific speech recognition tasks. To address this issue, we propose a solution for fine-tuning industry-specific ASR models, which significantly improves their performance in industry applications. Experimental results show that our method substantially improves the crucial auxiliary role of the ASR model in industry CRM systems, and this approach has also been adopted in actual industrial applications.
[254] On Temporal Guidance and Iterative Refinement in Audio Source Separation
Tobias Morocutti, Jonathan Greif, Paul Primus, Florian Schmid, Gerhard Widmer
Main category: cs.SD
TL;DR: This paper presents a novel approach for spatial semantic segmentation of sound scenes (S5) that improves upon conventional two-stage pipelines by integrating fine-tuned Transformers for sound event detection with an iterative refinement mechanism, achieving second place in DCASE Challenge 2025 Task 4.
Details
Motivation: Conventional S5 systems using two-stage pipelines (audio tagging followed by label-conditioned source separation) are limited by the absence of fine-grained temporal information, which is critical for effective separation of sound sources from complex acoustic mixtures.Method: The approach involves three key components: (1) fine-tuning a pre-trained Transformer for active sound class detection, (2) using a separate instance of the fine-tuned Transformer for sound event detection (SED) to provide detailed time-varying guidance to the separation module, and (3) implementing an iterative refinement mechanism that progressively enhances separation quality by recursively reusing the separator’s output from previous iterations.
Result: The system achieved significant improvements in both audio tagging and source separation performance, securing second place in Task 4 of the DCASE Challenge 2025. The enhanced synergy between event detection and source separation stages demonstrated the effectiveness of incorporating fine-grained temporal information.
Conclusion: The proposed S5 approach successfully addresses the limitations of conventional two-stage pipelines by enhancing the integration between event detection and source separation through fine-tuned Transformers and iterative refinement, leading to superior performance in spatial semantic segmentation of sound scenes.
Abstract: Spatial semantic segmentation of sound scenes (S5) involves the accurate identification of active sound classes and the precise separation of their sources from complex acoustic mixtures. Conventional systems rely on a two-stage pipeline - audio tagging followed by label-conditioned source separation - but are often constrained by the absence of fine-grained temporal information critical for effective separation. In this work, we address this limitation by introducing a novel approach for S5 that enhances the synergy between the event detection and source separation stages. Our key contributions are threefold. First, we fine-tune a pre-trained Transformer to detect active sound classes. Second, we utilize a separate instance of this fine-tuned Transformer to perform sound event detection (SED), providing the separation module with detailed, time-varying guidance. Third, we implement an iterative refinement mechanism that progressively enhances separation quality by recursively reusing the separator’s output from previous iterations. These advancements lead to significant improvements in both audio tagging and source separation performance, as demonstrated by our system’s second-place finish in Task 4 of the DCASE Challenge 2025. Our implementation and model checkpoints are available in our GitHub repository: https://github.com/theMoro/dcase25task4 .
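The iterative refinement mechanism can be sketched as a loop that feeds the separator's previous estimate back in alongside the mixture and the SED posteriors. The toy 1-D conv model below only illustrates that feedback structure; the real system's architecture, class count, and features are not reproduced:

```python
import torch
import torch.nn as nn

class IterativeSeparator(nn.Module):
    """Toy SED-conditioned separator refined over several passes.

    A real S5 system uses a much larger spectrogram-domain model; the 1-D conv
    stack here only illustrates the feedback loop over waveform-length tensors.
    """
    def __init__(self, n_classes: int = 18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2 + n_classes, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(32, 1, kernel_size=9, padding=4),
        )

    def forward(self, mixture, sed_activity, n_iters: int = 3):
        # mixture:      (batch, 1, time)
        # sed_activity: (batch, n_classes, time) time-varying class posteriors from the SED model
        estimate = torch.zeros_like(mixture)
        for _ in range(n_iters):  # reuse the previous estimate as an extra input channel
            estimate = self.net(torch.cat([mixture, estimate, sed_activity], dim=1))
        return estimate

model = IterativeSeparator()
print(model(torch.randn(2, 1, 16000), torch.rand(2, 18, 16000)).shape)  # torch.Size([2, 1, 16000])
```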
[255] Application of Whisper in Clinical Practice: the Post-Stroke Speech Assessment during a Naming Task
Milena Davudova, Ziyuan Cai, Valentina Giunchiglia, Dragos C. Gruia, Giulia Sanguedolce, Adam Hampshire, Fatemeh Geranmayeh
Main category: cs.SD
TL;DR: This study evaluates Whisper ASR model for transcribing stroke patients’ speech in picture-naming tasks, finding that while baseline performance is poor, fine-tuning significantly improves accuracy and enables prediction of speech quality, though cross-domain generalization remains limited.
Details
Motivation: Language assessment after stroke is cognitively complex and requires intensive clinical resources, limiting timely diagnosis. ASR foundation models like Whisper could potentially augment human evaluation, but their effectiveness for speech and language impairment assessment is uncertain.Method: The researchers evaluated Whisper ASR model on stroke patients’ speech during picture-naming tasks, assessing both verbatim transcription accuracy and downstream prediction of language function. They fine-tuned the baseline model and tested generalizability on an unseen TORGO dataset.
Result: Baseline Whisper performed poorly on single-word utterances, but fine-tuning reduced Word Error Rate by 87.72% for healthy speech and 71.22% for patient speech. The model achieved average F1 Macro scores of 0.74 for healthy subjects and 0.75 for patients in speech quality prediction. However, evaluation on TORGO dataset showed limited cross-domain generalizability.
Conclusion: Fine-tuned foundation models like Whisper show potential for automated speech and language assessment in stroke rehabilitation, but challenges remain in cross-domain generalization. The models require adaptation to specific clinical populations rather than relying on zero-shot performance for clinical speech applications.
Abstract: Detailed assessment of language impairment following stroke remains a cognitively complex and clinician-intensive task, limiting timely and scalable diagnosis. Automatic Speech Recognition (ASR) foundation models offer a promising pathway to augment human evaluation through intelligent systems, but their effectiveness in the context of speech and language impairment remains uncertain. In this study, we evaluate whether Whisper, a state-of-the-art ASR foundation model, can be applied to transcribe and analyze speech from patients with stroke during a commonly used picture-naming task. We assess both verbatim transcription accuracy and the model’s ability to support downstream prediction of language function, which has major implications for outcomes after stroke. Our results show that the baseline Whisper model performs poorly on single-word speech utterances. Nevertheless, fine-tuning Whisper significantly improves transcription accuracy (reducing Word Error Rate by 87.72% in healthy speech and 71.22% in speech from patients). Further, learned representations from the model enable accurate prediction of speech quality (average F1 Macro of 0.74 for healthy, 0.75 for patients). However, evaluations on an unseen (TORGO) dataset reveal limited generalizability, highlighting the inability of Whisper to perform zero-shot transcription of single-word utterances on out-of-domain clinical speech and emphasizing the need to adapt models to specific clinical populations. While challenges remain in cross-domain generalization, these findings highlight the potential of foundation models, when appropriately fine-tuned, to advance automated speech and language assessment and rehabilitation for stroke-related impairments.
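The reported gains are relative WER reductions. A self-contained sketch of word error rate via word-level edit distance, plus the relative-reduction arithmetic (the baseline and fine-tuned WER values below are hypothetical, chosen only to land near the reported 87.72% figure):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance (subs + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def relative_reduction(baseline: float, fine_tuned: float) -> float:
    return 100.0 * (baseline - fine_tuned) / baseline

print(word_error_rate("the red ball", "the bald"))        # 2 errors / 3 words = 0.667
print(relative_reduction(baseline=0.65, fine_tuned=0.08))  # hypothetical WERs -> ~87.7% reduction
```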
[256] BoSS: Beyond-Semantic Speech
Qing Wang, Zehan Li, Hang Lv, Hongjie Chen, Yaodong Song, Jian Kang, Jie Lian, Jie Li, Yongxiang Li, Zhongjiang He, Xuelong Li
Main category: cs.SD
TL;DR: This paper introduces Beyond-Semantic Speech (BoSS) and a 5-level capability framework (L1-L5) to address the limitations of current speech technologies in capturing implicit signals and contextual cues beyond explicit semantics in human communication.
Details
Motivation: Current speech technologies like ASR and TTS fail to capture beyond-semantic dimensions of human communication, such as implicit signals, emotions, and contextual cues that are critical for shaping meaning in spoken interactions.Method: The authors propose a hierarchical Spoken Interaction System Capability Levels framework (L1-L5) and formalize Beyond-Semantic Speech (BoSS) using cognitive relevance theories and machine learning models to analyze temporal and contextual speech dynamics across multiple dimensions including affective cues, contextual dynamics, and implicit semantics.
Result: Evaluation of BoSS-related attributes across five dimensions reveals that current spoken language models (SLMs) struggle to fully interpret beyond-semantic signals, demonstrating significant limitations in understanding communicative intentions and contextual scenarios.
Conclusion: The findings highlight the critical need for advancing BoSS research to enable richer, more context-aware human-machine communication that can better capture the multidimensional nature of human speech beyond explicit semantics.
Abstract: Human communication involves more than explicit semantics, with implicit signals and contextual cues playing a critical role in shaping meaning. However, modern speech technologies, such as Automatic Speech Recognition (ASR) and Text-to-Speech (TTS), often fail to capture these beyond-semantic dimensions. To better characterize and benchmark the progression of speech intelligence, we introduce Spoken Interaction System Capability Levels (L1-L5), a hierarchical framework illustrating the evolution of spoken dialogue systems from basic command recognition to human-like social interaction. To support these advanced capabilities, we propose Beyond-Semantic Speech (BoSS), which refers to the set of information in speech communication that encompasses but transcends explicit semantics. It conveys emotions, contexts, and modifies or extends meanings through multidimensional features such as affective cues, contextual dynamics, and implicit semantics, thereby enhancing the understanding of communicative intentions and scenarios. We present a formalized framework for BoSS, leveraging cognitive relevance theories and machine learning models to analyze temporal and contextual speech dynamics. Our evaluation of BoSS-related attributes across five different dimensions reveals that current spoken language models (SLMs) struggle to fully interpret beyond-semantic signals. These findings highlight the need for advancing BoSS research to enable richer, more context-aware human-machine communication.
[257] Audio-Vision Contrastive Learning for Phonological Class Recognition
Daiqi Liu, Tomás Arias-Vergara, Jana Hutter, Andreas Maier, Paula Andrea Pérez-Toro
Main category: cs.SD
TL;DR: This paper proposes a multimodal deep learning framework combining real-time MRI and speech signals to classify articulatory-phonological features, achieving state-of-the-art performance with contrastive learning-based fusion (F1-score of 0.81).
Details
Motivation: Accurate classification of articulatory-phonological features is crucial for understanding human speech production and developing robust speech technologies, especially in clinical contexts for targeted phonemic analysis, therapy, disease diagnosis, and personalized rehabilitation.
Method: A multimodal deep learning framework that combines real-time magnetic resonance imaging (rtMRI) and speech signals using four configurations: unimodal rtMRI, unimodal audio, multimodal middle fusion, and contrastive learning-based audio-vision fusion to classify three articulatory dimensions (manner, place, and voicing) across 15 phonological classes.
Result: The contrastive learning-based approach achieved state-of-the-art performance with an average F1-score of 0.81, representing an absolute increase of 0.23 over the unimodal baseline on the USC-TIMIT dataset.
Conclusion: The results confirm the effectiveness of contrastive representation learning for multimodal articulatory analysis, demonstrating significant improvement over unimodal approaches for classifying articulatory-phonological features.
Abstract: Accurate classification of articulatory-phonological features plays a vital role in understanding human speech production and developing robust speech technologies, particularly in clinical contexts where targeted phonemic analysis and therapy can improve disease diagnosis accuracy and personalized rehabilitation. In this work, we propose a multimodal deep learning framework that combines real-time magnetic resonance imaging (rtMRI) and speech signals to classify three key articulatory dimensions: manner of articulation, place of articulation, and voicing. We perform classification on 15 phonological classes derived from the aforementioned articulatory dimensions and evaluate the system with four audio/vision configurations: unimodal rtMRI, unimodal audio signals, multimodal middle fusion, and contrastive learning-based audio-vision fusion. Experimental results on the USC-TIMIT dataset show that our contrastive learning-based approach achieves state-of-the-art performance, with an average F1-score of 0.81, representing an absolute increase of 0.23 over the unimodal baseline. The results confirm the effectiveness of contrastive representation learning for multimodal articulatory analysis. Our code and processed dataset will be made publicly available at https://github.com/DaE-plz/AC_Contrastive_Phonology to support future research.
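As a rough illustration of the contrastive fusion idea, the snippet below implements a symmetric InfoNCE-style loss that pulls paired audio and rtMRI embeddings together within a batch. The temperature and the use of in-batch negatives are generic choices, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def audio_vision_contrastive_loss(audio_emb, vision_emb, temperature=0.07):
    # audio_emb, vision_emb: (batch, dim) embeddings of paired speech segments and rtMRI frames
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(vision_emb, dim=-1)
    logits = a @ v.t() / temperature                      # pairwise cosine similarities
    targets = torch.arange(a.size(0), device=a.device)    # i-th audio matches i-th frame
    # symmetric cross-entropy over both matching directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```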
[258] Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance
Shehzeen Hussain, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Subhankar Ghosh, Mikyas T. Desta, Roy Fejgin, Rafael Valle, Jason Li
Main category: cs.SD
TL;DR: Koel-TTS introduces enhanced encoder-decoder Transformer models for text-to-speech synthesis that use preference alignment techniques and classifier-free guidance to improve controllability, reducing hallucinations while achieving better speaker similarity, intelligibility, and naturalness than state-of-the-art models despite using smaller training datasets.
Details
Motivation: Autoregressive speech token generation models suffer from lack of controllability, leading to hallucinations and undesired vocalizations that don't conform to conditioning inputs, creating a need for more controllable and reliable text-to-speech synthesis systems.
Method: The paper presents Koel-TTS, which uses enhanced encoder-decoder Transformer TTS models incorporating preference alignment techniques guided by automatic speech recognition and speaker verification models, along with classifier-free guidance to improve synthesis adherence to transcript and reference speaker audio.
Result: Koel-TTS significantly enhances target speaker similarity, intelligibility, and naturalness of synthesized speech, outperforming state-of-the-art TTS models on these metrics while being trained on a significantly smaller dataset and directly mapping text and context audio to acoustic tokens.
Conclusion: The integration of preference alignment techniques and classifier-free guidance in encoder-decoder Transformer models successfully addresses controllability issues in speech synthesis, demonstrating that smaller, well-optimized models can outperform larger state-of-the-art systems in key quality metrics.
Abstract: While autoregressive speech token generation models produce speech with remarkable variety and naturalness, their inherent lack of controllability often results in issues such as hallucinations and undesired vocalizations that do not conform to conditioning inputs. We introduce Koel-TTS, a suite of enhanced encoder-decoder Transformer TTS models that address these challenges by incorporating preference alignment techniques guided by automatic speech recognition and speaker verification models. Additionally, we incorporate classifier-free guidance to further improve synthesis adherence to the transcript and reference speaker audio. Our experiments demonstrate that these optimizations significantly enhance target speaker similarity, intelligibility, and naturalness of synthesized speech. Notably, Koel-TTS directly maps text and context audio to acoustic tokens, and on the aforementioned metrics, outperforms state-of-the-art TTS models, despite being trained on a significantly smaller dataset. Audio samples and demos are available on our website.
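Classifier-free guidance in an autoregressive token decoder typically amounts to blending conditional and unconditional logits at each decoding step. The sketch below shows that generic mechanism; the guidance scale, the way conditioning is dropped, and the model call signature are assumptions for illustration, not Koel-TTS specifics.

```python
import torch

def cfg_next_token(model, tokens, text_cond, speaker_cond, guidance_scale=2.0):
    # Hypothetical decoder interface: model(tokens, text=..., speaker=...) -> (batch, vocab) logits.
    # Run it twice, once with conditioning and once with it dropped, then push the
    # distribution toward the conditional prediction.
    cond = model(tokens, text=text_cond, speaker=speaker_cond)
    uncond = model(tokens, text=None, speaker=None)
    logits = uncond + guidance_scale * (cond - uncond)
    return torch.distributions.Categorical(logits=logits).sample()
```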
[259] HiFi-Stream: Streaming Speech Enhancement with Generative Adversarial Networks
Ekaterina Dmitrieva, Maksim Kaledin
Main category: cs.SD
TL;DR: HiFi-Stream is an optimized speech enhancement model that significantly reduces computational complexity while maintaining quality, making it suitable for low-resource devices and streaming applications.
Details
Motivation: Modern deep learning speech enhancement solutions require high computational resources, making them challenging to deploy on low-resource mobile devices and voice software applications.
Method: The authors developed HiFi-Stream, an optimized version of the HiFi++ model that reduces size and computational complexity while preserving speech enhancement quality.
Result: HiFi-Stream maintains most qualities of the original HiFi++ model while achieving improved size and computational efficiency, making it one of the smallest and fastest available models. It demonstrates superior performance compared to modern baselines in streaming settings.
Conclusion: HiFi-Stream successfully addresses the computational limitations of existing speech enhancement models, providing an efficient solution for deployment on resource-constrained devices while maintaining competitive performance quality.
Abstract: Speech enhancement techniques have become core technologies in mobile devices and voice software. Still, modern deep learning solutions often require a large amount of computational resources, which makes their use on low-resource devices challenging. We present HiFi-Stream, an optimized version of the recently published HiFi++ model. Our experiments demonstrate that HiFi-Stream preserves most of the quality of the original model while improving on its size and computational complexity, making it one of the smallest and fastest models available. The model is evaluated in a streaming setting, where it demonstrates superior performance in comparison to modern baselines.
[260] Miipher-2: A Universal Speech Restoration Model for Million-Hour Scale Data Restoration
Shigeki Karita, Yuma Koizumi, Heiga Zen, Haruko Ishikawa, Robin Scheibler, Michiel Bacchiani
Main category: cs.SD
TL;DR: Miipher-2 is a speech restoration model designed for cleaning million-hour scale training data for large generative models, achieving efficient processing with superior performance across 300+ languages using a frozen Universal Speech Model and parallel adapters.
Details
Motivation: The need for efficient training data cleaning for large-scale generative models requires speech restoration that can generalize to unseen languages, operate without explicit conditioning, and maintain computational efficiency at million-hour scale.
Method: Uses a frozen pre-trained Universal Speech Model (USM) supporting 300+ languages as a feature extractor, incorporates parallel adapters to predict clean USM features from noisy inputs, and employs WaveFit neural vocoder for waveform synthesis. Trained on 3,000 hours of multi-lingual studio-quality recordings with augmented degradations.
Result: Achieves superior or comparable performance to conventional SR models in word-error-rate, speaker similarity, and objective/subjective sound quality scores across all tested languages. Operates with real-time factor of 0.0078, enabling processing of million-hour datasets in ~3 days using 100 consumer-grade accelerators.
Conclusion: Miipher-2 successfully addresses the key challenges of large-scale speech data cleaning by combining frozen USM features with efficient parallel adapters, demonstrating both high quality restoration and remarkable computational efficiency for million-hour scale processing.
Abstract: Training data cleaning is a new application for generative model-based speech restoration (SR). This paper introduces Miipher-2, an SR model designed for million-hour scale data, for training data cleaning for large-scale generative models like large language models. Key challenges addressed include generalization to unseen languages, operation without explicit conditioning (e.g., text, speaker ID), and computational efficiency. Miipher-2 utilizes a frozen, pre-trained Universal Speech Model (USM), supporting over 300 languages, as a robust, conditioning-free feature extractor. To optimize efficiency and minimize memory, Miipher-2 incorporates parallel adapters for predicting clean USM features from noisy inputs and employs the WaveFit neural vocoder for waveform synthesis. These components were trained on 3,000 hours of multi-lingual, studio-quality recordings with augmented degradations, while USM parameters remained fixed. Experimental results demonstrate Miipher-2’s superior or comparable performance to conventional SR models in word-error-rate, speaker similarity, and both objective and subjective sound quality scores across all tested languages. Miipher-2 operates efficiently on consumer-grade accelerators, achieving a real-time factor of 0.0078, enabling the processing of a million-hour speech dataset in approximately three days using only 100 such accelerators.
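The parallel-adapter idea, a small trainable module alongside a frozen backbone trained to map noisy features toward clean ones, can be sketched as follows. Layer sizes, the bottleneck width, and the MSE objective are generic assumptions rather than Miipher-2's actual configuration.

```python
import torch
import torch.nn as nn

class ParallelAdapter(nn.Module):
    """Bottleneck MLP added in parallel to a frozen feature-extractor layer."""
    def __init__(self, dim=1024, bottleneck=256):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, features):
        return features + self.up(torch.relu(self.down(features)))

# Training sketch: the backbone (a frozen USM-like encoder) stays fixed; only the adapter
# learns to predict clean-speech features from noisy-speech features.
adapter = ParallelAdapter()
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-4)
noisy_feats = torch.randn(8, 200, 1024)   # placeholder for frozen-encoder outputs on degraded audio
clean_feats = torch.randn(8, 200, 1024)   # placeholder for outputs on the studio-quality reference
loss = torch.nn.functional.mse_loss(adapter(noisy_feats), clean_feats)
loss.backward()
optimizer.step()
```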
cs.LG
[261] Evaluating Artificial Intelligence Algorithms for the Standardization of Transtibial Prosthetic Socket Shape Design
C. H. E. Jordaan, M. van der Stelt, T. J. J. Maal, V. M. A. Stirler, R. Leijendekkers, T. Kachman, G. A. de Jong
Main category: cs.LG
TL;DR: This study develops AI approaches to standardize transtibial prosthetic socket design by training algorithms on 118 patient datasets to predict socket shapes from 3D limb scans, with random forest performing best when predicting prosthetist adaptations rather than final socket shapes directly.
Details
Motivation: The quality of transtibial prosthetic sockets depends heavily on individual prosthetist skills and manual fitting expertise, creating variability in outcomes. There is a need to standardize prosthetic socket design through AI approaches to reduce dependence on manual craftsmanship and improve consistency.
Method: The researchers collected data from 118 patients including 3D scans of residual limbs and corresponding prosthetist-designed sockets. They applied data preprocessing including alignment, standardization, and compression using Morphable Models and PCA. Three AI algorithms were developed: 3D neural network, feedforward neural network, and random forest. Two prediction approaches were tested: direct socket shape prediction and adaptation prediction based on prosthetist modifications.
Result: All algorithms performed better when predicting required adaptations rather than directly predicting final socket shapes. The random forest model for adaptation prediction achieved the best performance with median surface-to-surface distance of 1.24mm, first quartile of 1.03mm, and third quartile of 1.54mm. Performance was evaluated using surface-to-surface distance measurements and distance maps to analyze error locations.
Conclusion: AI approaches, particularly random forest models predicting prosthetist adaptations, can effectively assist in standardizing transtibial prosthetic socket design. The adaptation prediction approach outperforms direct socket shape prediction across all tested algorithms, suggesting that modeling the prosthetist’s decision-making process is more effective than attempting to directly generate final socket geometries.
Abstract: The quality of a transtibial prosthetic socket depends on the prosthetist’s skills and expertise, as the fitting is performed manually. This study investigates multiple artificial intelligence (AI) approaches to help standardize transtibial prosthetic socket design. Data from 118 patients were collected by prosthetists working in the Dutch healthcare system. This data consists of a three-dimensional (3D) scan of the residual limb and a corresponding 3D model of the prosthetist-designed socket. Multiple data pre-processing steps are performed for alignment, standardization, and optional compression using Morphable Models and Principal Component Analysis. Afterward, three different algorithms - a 3D neural network, a feedforward neural network, and a random forest - are developed to predict either 1) the final socket shape or 2) the adaptations performed by a prosthetist, in both cases from the 3D scan of the residual limb. Each algorithm’s performance was evaluated by comparing the prosthetist-designed socket with the AI-generated socket, using two metrics in combination with the error location. First, we measure the surface-to-surface distance to assess the overall surface error between the AI-generated socket and the prosthetist-designed socket. Second, distance maps between the AI-generated and prosthetist sockets are utilized to analyze the error’s location. For all algorithms, estimating the required adaptations outperformed direct prediction of the final socket shape. The random forest model applied to adaptation prediction yields the lowest error with a median surface-to-surface distance of 1.24 millimeters, a first quartile of 1.03 millimeters, and a third quartile of 1.54 millimeters.
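A minimal sketch of the better-performing "predict the adaptation" strategy with scikit-learn is given below: limb scan and socket are assumed to be compressed to PCA coefficients, and the forest regresses the prosthetist's adaptation (socket minus limb) rather than the socket itself. All shapes, hyperparameters, and the synthetic data are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_patients, n_components = 118, 30                          # placeholder sizes
X_limb = rng.normal(size=(n_patients, n_components))        # PCA coefficients of residual-limb scans
Y_socket = X_limb + rng.normal(scale=0.1, size=X_limb.shape)  # stand-in for prosthetist-designed sockets

# Target the adaptation (difference) rather than the final socket shape.
Y_adapt = Y_socket - X_limb
rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X_limb[:100], Y_adapt[:100])

# Reconstruct a predicted socket by adding the predicted adaptation back to the limb shape.
socket_pred = X_limb[100:] + rf.predict(X_limb[100:])
```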
[262] Exploring the Frontiers of kNN Noisy Feature Detection and Recovery for Self-Driving Labs
Qiuyu Shi, Kangming Li, Yao Fehlis, Daniel Persaud, Robert Black, Jason Hattrick-Simpers
Main category: cs.LG
TL;DR: This paper develops an automated workflow to detect and correct noisy features in self-driving laboratories (SDLs) for materials discovery, demonstrating that high-intensity noise and larger datasets improve detection/correction performance, while feature distribution types affect recoverability rates.
Details
Motivation: Self-driving laboratories for materials discovery face challenges with noisy input parameters that corrupt machine learning features, compromising both current and future experimental campaigns. There's a need for systematic methods to detect and correct these errors to maintain data quality and experimental precision in automated materials discovery.
Method: The authors develop an automated workflow that: (1) systematically detects noisy features, (2) determines which sample-feature pairings can be corrected, and (3) recovers correct feature values. They conduct systematic studies examining how dataset size, noise intensity, and feature value distribution affect detectability and recoverability, using kNN imputation as a model-agnostic framework.
Result: High-intensity noise and large training datasets facilitate better detection and correction of noisy features. Low-intensity noise reduces detection and recovery performance but can be compensated by larger clean training datasets. Features with continuous and dispersed distributions show greater recoverability compared to those with discrete or narrow distributions. The framework provides a tangible benchmark for kNN imputation in materials datasets.
Conclusion: The study successfully demonstrates a model-agnostic framework for rational data recovery that works across different noise conditions, dataset sizes, and feature distributions. This approach enhances data quality and experimental precision in automated materials discovery, providing valuable insights for improving self-driving laboratory performance.
Abstract: Self-driving laboratories (SDLs) have shown promise to accelerate materials discovery by integrating machine learning with automated experimental platforms. However, errors in the capture of input parameters may corrupt the features used to model system performance, compromising current and future campaigns. This study develops an automated workflow to systematically detect noisy features, determine sample-feature pairings that can be corrected, and finally recover the correct feature values. A systematic study is then performed to examine how dataset size, noise intensity, and feature value distribution affect both the detectability and recoverability of noisy features. In general, high-intensity noise and large training datasets are conducive to the detection and correction of noisy features. Low-intensity noise reduces detection and recovery but can be compensated for by larger clean training datasets. Detection and correction results vary between features, with continuous and dispersed feature distributions showing greater recoverability than discrete or narrow distributions. This systematic study not only demonstrates a model-agnostic framework for rational data recovery in the presence of noise, limited data, and differing feature distributions but also provides a tangible benchmark of kNN imputation in materials datasets. Ultimately, it aims to enhance data quality and experimental precision in automated materials discovery.
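As a concrete point of reference for the kNN-imputation benchmark, the snippet below injects noise into one feature, flags suspicious values with a simple robust-deviation rule (a stand-in for the paper's detector, not its actual method), masks them, and recovers them with scikit-learn's KNNImputer.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X_clean = rng.normal(size=(200, 5))
X_noisy = X_clean.copy()
bad_rows = rng.choice(200, size=20, replace=False)
X_noisy[bad_rows, 2] += rng.normal(scale=5.0, size=20)      # high-intensity noise in feature 2

# Detection (illustrative rule only): large robust deviation from the feature's median.
col = X_noisy[:, 2]
mad = np.median(np.abs(col - np.median(col)))
flags = np.abs(col - np.median(col)) > 6 * mad

# Recovery: mask flagged entries and let kNN fill them in from similar clean samples.
X_masked = X_noisy.copy()
X_masked[flags, 2] = np.nan
X_recovered = KNNImputer(n_neighbors=5).fit_transform(X_masked)
print(np.abs(X_recovered[bad_rows, 2] - X_clean[bad_rows, 2]).mean())   # mean recovery error
```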
[263] TD-Interpreter: Enhancing the Understanding of Timing Diagrams with Visual-Language Learning
Jie He, Vincent Theo Willem Kenbeek, Zhantao Yang, Meixun Qu, Ezio Bartocci, Dejan Ničković, Radu Grosu
Main category: cs.LG
TL;DR: TD-Interpreter is a visual question-answer tool that helps engineers understand complex timing diagrams by fine-tuning LLaVA (7B MLLM) with synthetic training data, significantly outperforming GPT-4o on benchmarks.
Details
Motivation: Engineers struggle to understand complex timing diagrams from third parties during design and verification processes, requiring a specialized tool to interpret and answer queries about these visual technical documents.
Method: Fine-tuned LLaVA (a 7B Multimodal Large Language Model) using multimodal learning and developed a synthetic data generation workflow to align visual timing diagram information with textual interpretations to overcome limited training data availability.
Result: TD-Interpreter significantly outperformed untuned GPT-4o by a large margin on evaluated benchmarks, demonstrating its effectiveness in understanding and answering questions about timing diagrams.
Conclusion: TD-Interpreter successfully provides a visual question-answer environment for timing diagram interpretation, proving that specialized fine-tuning of MLLMs with synthetic data can effectively address domain-specific engineering challenges.
Abstract: We introduce TD-Interpreter, a specialized ML tool that assists engineers in understanding complex timing diagrams (TDs), originating from a third party, during their design and verification process. TD-Interpreter is a visual question-answer environment which allows engineers to input a set of TDs and ask design and verification queries regarding these TDs. We implemented TD-Interpreter with multimodal learning by fine-tuning LLaVA, a lightweight 7B Multimodal Large Language Model (MLLM). To address limited training data availability, we developed a synthetic data generation workflow that aligns visual information with its textual interpretation. Our experimental evaluation demonstrates the usefulness of TD-Interpreter which outperformed untuned GPT-4o by a large margin on the evaluated benchmarks.
[264] Reinforcement Learning in hyperbolic space for multi-step reasoning
Tao Xu, Dung-Yang Lee, Momiao Xiong
Main category: cs.LG
TL;DR: This paper introduces a framework that integrates hyperbolic Transformers into reinforcement learning for multi-step reasoning tasks, achieving significant improvements in accuracy (32-45%) and computational efficiency (16-32%) compared to vanilla transformer-based RL approaches.
Details
Motivation: Conventional RL methods struggle with complex multi-step reasoning tasks due to credit assignment issues, high-dimensional state representations, and stability concerns. While RL shows promise for multi-step reasoning by optimizing long-term rewards, there's a need for better approaches to handle hierarchical structures in reasoning tasks.
Method: The paper proposes a framework that integrates hyperbolic Transformers into reinforcement learning. The approach leverages hyperbolic embeddings to effectively model hierarchical structures in multi-step reasoning tasks, combining recent advancements in Transformer architectures with hyperbolic geometry.
Result: The hyperbolic RL approach achieves substantial improvements over vanilla transformer-based RL: 32-44% accuracy improvement on FrontierMath benchmark, 43-45% on nonlinear optimal control benchmark, while reducing computational time by 16-32% on FrontierMath and 16-17% on nonlinear optimal control tasks.
Conclusion: The work demonstrates the significant potential of hyperbolic Transformers in reinforcement learning, particularly for multi-step reasoning tasks involving hierarchical structures. The approach successfully addresses key challenges in conventional RL methods while providing both accuracy and computational efficiency gains.
Abstract: Multi-step reasoning is a fundamental challenge in artificial intelligence, with applications ranging from mathematical problem-solving to decision-making in dynamic environments. Reinforcement Learning (RL) has shown promise in enabling agents to perform multi-step reasoning by optimizing long-term rewards. However, conventional RL methods struggle with complex reasoning tasks due to issues such as credit assignment, high-dimensional state representations, and stability concerns. Recent advancements in Transformer architectures and hyperbolic geometry have provided novel solutions to these challenges. This paper introduces a new framework that integrates hyperbolic Transformers into RL for multi-step reasoning. The proposed approach leverages hyperbolic embeddings to model hierarchical structures effectively. We present theoretical insights, algorithmic details, and experimental results that include FrontierMath and nonlinear optimal control problems. Compared to RL with a vanilla Transformer, hyperbolic RL improves accuracy by 32-44% on the FrontierMath benchmark and 43-45% on the nonlinear optimal control benchmark, while reducing computational time by 16-32% and 16-17% on the respective benchmarks. Our work demonstrates the potential of hyperbolic Transformers in reinforcement learning, particularly for multi-step reasoning tasks that involve hierarchical structures.
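The core geometric ingredient, embedding states or reasoning steps in hyperbolic space, can be illustrated with the Poincaré-ball distance used by most hyperbolic network layers. This is standard hyperbolic geometry rather than code from the paper.

```python
import torch

def poincare_distance(x, y, eps=1e-6):
    """Geodesic distance between points inside the unit Poincare ball.
       Tree-like hierarchies embed with low distortion here because volume grows
       exponentially with radius, unlike in Euclidean space."""
    x2 = (x * x).sum(dim=-1)
    y2 = (y * y).sum(dim=-1)
    diff2 = ((x - y) ** 2).sum(dim=-1)
    denom = ((1 - x2) * (1 - y2)).clamp_min(eps)
    return torch.acosh(1 + 2 * diff2 / denom)

root = torch.tensor([0.0, 0.0])
leaf = torch.tensor([0.0, 0.95])     # points near the boundary behave like deep tree nodes
print(poincare_distance(root, leaf))
```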
[265] Diffusion-Modeled Reinforcement Learning for Carbon and Risk-Aware Microgrid Optimization
Yunyi Zhao, Wei Zhang, Cheng Xiang, Hongyang Du, Dusit Niyato, Shuhua Gao
Main category: cs.LG
TL;DR: DiffCarl is a diffusion-modeled reinforcement learning algorithm that optimizes multi-microgrid energy scheduling while considering carbon emissions and operational risks, achieving 2.3-30.1% lower costs and 28.7% reduced carbon emissions compared to existing methods.
Details
Motivation: Multi-microgrid systems face significant challenges in real-time energy scheduling and optimization due to growing renewable energy integration, increasing system complexity, and operational uncertainty, requiring intelligent solutions that can handle both environmental and economic objectives.
Method: DiffCarl integrates a diffusion model into a deep reinforcement learning (DRL) framework, learning action distributions through a denoising generation process to enable adaptive energy scheduling that explicitly accounts for carbon emissions and operational risk under uncertainty.
Result: DiffCarl outperforms classic algorithms and state-of-the-art DRL solutions with 2.3-30.1% lower operational costs, achieves 28.7% lower carbon emissions compared to carbon-unaware variants, and reduces performance variability in dynamic microgrid environments.
Conclusion: DiffCarl provides a practical and forward-looking solution for intelligent microgrid operation with flexible design that enables efficient adaptation to different system configurations and objectives, supporting real-world deployment in evolving energy systems.
Abstract: This paper introduces DiffCarl, a diffusion-modeled carbon- and risk-aware reinforcement learning algorithm for intelligent operation of multi-microgrid systems. With the growing integration of renewables and increasing system complexity, microgrid communities face significant challenges in real-time energy scheduling and optimization under uncertainty. DiffCarl integrates a diffusion model into a deep reinforcement learning (DRL) framework to enable adaptive energy scheduling under uncertainty and explicitly account for carbon emissions and operational risk. By learning action distributions through a denoising generation process, DiffCarl enhances DRL policy expressiveness and enables carbon- and risk-aware scheduling in dynamic and uncertain microgrid environments. Extensive experimental studies demonstrate that it outperforms classic algorithms and state-of-the-art DRL solutions, with 2.3-30.1% lower operational cost. It also achieves 28.7% lower carbon emissions than those of its carbon-unaware variant and reduces performance variability. These results highlight DiffCarl as a practical and forward-looking solution. Its flexible design allows efficient adaptation to different system configurations and objectives to support real-world deployment in evolving energy systems.
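To make the "actions via denoising" idea concrete, here is a generic DDPM-style reverse loop that turns Gaussian noise into an action vector conditioned on the microgrid state. The noise schedule, step count, and the eps_model interface are assumptions for illustration, not DiffCarl's implementation.

```python
import torch

@torch.no_grad()
def sample_action(eps_model, state, action_dim, num_steps=50):
    """Generic denoising sampler: eps_model(a_t, state, t) predicts the noise in a_t."""
    betas = torch.linspace(1e-4, 2e-2, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    a = torch.randn(state.size(0), action_dim)              # start from pure noise
    for t in reversed(range(num_steps)):
        t_batch = torch.full((state.size(0),), t, dtype=torch.long)
        eps = eps_model(a, state, t_batch)
        a = (a - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            a = a + betas[t].sqrt() * torch.randn_like(a)
    return a   # e.g. dispatch set-points, later clipped to feasible ranges and scored by the critic
```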
[266] Navigation through Non-Compact Symmetric Spaces: a mathematical perspective on Cartan Neural Networks
Pietro Giuseppe Fré, Federico Milanesio, Guido Sanguinetti, Matteo Santoro
Main category: cs.LG
TL;DR: This paper develops the mathematical foundations for Cartan Neural Networks, which use non-compact symmetric spaces U/H to create geometrically consistent and interpretable neural networks with group-theoretic structures.
Details
Motivation: Existing neural networks lack geometric consistency and interpretability. The authors are motivated to develop a geometrically principled approach to neural networks using homogeneous manifolds and group theory to achieve better theoretical understanding and interpretability.
Method: The paper employs non-compact symmetric spaces U/H as the mathematical framework, focusing on the geometric properties of neural network layers and how maps between layers interact with these structures to ensure covariance and geometric interpretability in Cartan Neural Networks.
Result: The paper establishes the mathematical foundations that make Cartan Neural Networks covariant and geometrically interpretable, expanding on the geometric properties and layer interactions within this framework.
Conclusion: This work, together with its twin paper, represents a foundational step toward developing a fully geometrically interpretable theory of neural networks that exploits group-theoretic structures, providing both theoretical rigor and practical feasibility for geometric neural network architectures.
Abstract: Recent work has identified non-compact symmetric spaces U/H as a promising class of homogeneous manifolds to develop a geometrically consistent theory of neural networks. An initial implementation of these concepts has been presented in a twin paper under the moniker of Cartan Neural Networks, showing both the feasibility and the performance of these geometric concepts in a machine learning context. The current paper expands on the mathematical structures underpinning Cartan Neural Networks, detailing the geometric properties of the layers and how the maps between layers interact with such structures to make Cartan Neural Networks covariant and geometrically interpretable. Together, these twin papers constitute a first step towards a fully geometrically interpretable theory of neural networks exploiting group-theoretic structures.
[267] Confidence Optimization for Probabilistic Encoding
Pengjiu Xia, Yidian Huang, Wenchao Wei, Yuwen Tan
Main category: cs.LG
TL;DR: The paper proposes Confidence optimization Probabilistic Encoding (CPE) to improve neural network classification by addressing distance measurement issues caused by Gaussian noise in probabilistic encoding through confidence-aware mechanisms and L2 regularization.
Details
Motivation: Probabilistic encoding with Gaussian noise enhances generalization but distorts point-based distance measurements in classification tasks, reducing the reliability of distance calculations and affecting performance.
Method: CPE method with two key strategies: (1) confidence-aware mechanism to adjust distance calculations for consistency and reliability, and (2) replacing KL divergence-based variance regularization with simpler L2 regularization to directly constrain variance without unreliable prior assumptions.
Result: Extensive experiments on natural language classification tasks show significant performance and generalization improvements on both BERT and RoBERTa models, demonstrating the model-agnostic nature of the approach.
Conclusion: CPE successfully addresses the distance measurement distortion problem in probabilistic encoding, providing a model-agnostic solution that enhances both performance and generalization capabilities in neural network classification tasks.
Abstract: Probabilistic encoding introduces Gaussian noise into neural networks, enabling a smooth transition from deterministic to uncertain states and enhancing generalization ability. However, the randomness of Gaussian noise distorts point-based distance measurements in classification tasks. To mitigate this issue, we propose a confidence optimization probabilistic encoding (CPE) method that improves distance reliability and enhances representation learning. Specifically, we refine probabilistic encoding with two key strategies: First, we introduce a confidence-aware mechanism to adjust distance calculations, ensuring consistency and reliability in probabilistic encoding classification tasks. Second, we replace the conventional KL divergence-based variance regularization, which relies on unreliable prior assumptions, with a simpler L2 regularization term to directly constrain variance. The method we proposed is model-agnostic, and extensive experiments on natural language classification tasks demonstrate that our method significantly improves performance and generalization on both the BERT and the RoBERTa model.
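The second strategy, swapping KL-based variance regularization for a direct L2 penalty, is easy to express: the sketch below shows a reparameterized probabilistic encoder with that penalty. The weighting and head shapes are illustrative guesses, not the authors' exact objective.

```python
import torch
import torch.nn.functional as F

def probabilistic_encode(h, mu_head, logvar_head):
    # h: deterministic sentence representation, e.g. a BERT/RoBERTa [CLS] vector
    mu, logvar = mu_head(h), logvar_head(h)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterized Gaussian sample
    return z, logvar

def cpe_style_loss(logits, targets, logvar, lam=1e-3):
    # cross-entropy on predictions made from z, plus a plain L2 constraint on the variance
    # in place of a KL term toward a fixed prior
    return F.cross_entropy(logits, targets) + lam * logvar.exp().pow(2).mean()
```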
[268] SplitMeanFlow: Interval Splitting Consistency in Few-Step Generative Modeling
Yi Guo, Wei Wang, Zhihang Yuan, Rong Cao, Kuan Chen, Zhengyang Chen, Yuanyuan Huo, Yang Zhang, Yuping Wang, Shouda Liu, Yuxuan Wang
Main category: cs.LG
TL;DR: SplitMeanFlow introduces a novel algebraic identity called “Interval Splitting Consistency” to learn average velocity fields for one-step generation, replacing the differential formulation in MeanFlow with a more efficient and general approach that eliminates JVP computations and achieves 20x speedups in production speech synthesis.
Details
Motivation: Flow Matching models achieve excellent performance but suffer from computationally expensive iterative sampling. While MeanFlow addresses this with few-step generation using differential identities to learn average velocity fields, this differential formulation is limiting and computationally expensive due to JVP computations.
Method: The paper derives “Interval Splitting Consistency,” a purely algebraic identity based on the additivity property of definite integrals that establishes self-referential relationships for average velocity fields across time intervals. SplitMeanFlow enforces this algebraic consistency directly as a learning objective, eliminating differential operators and JVP computations.
Result: SplitMeanFlow provides simpler implementation, more stable training, and broader hardware compatibility compared to MeanFlow. The method has been successfully deployed in large-scale speech synthesis products (Doubao) with one-step and two-step models achieving 20x speedups in production.
Conclusion: The algebraic approach of SplitMeanFlow establishes a more general and efficient foundation for learning average velocity fields than differential methods. The formal proof shows MeanFlow’s differential identity is recovered as a limiting case, while practical benefits include computational efficiency and successful real-world deployment.
Abstract: Generative models like Flow Matching have achieved state-of-the-art performance but are often hindered by a computationally expensive iterative sampling process. To address this, recent work has focused on few-step or one-step generation by learning the average velocity field, which directly maps noise to data. MeanFlow, a leading method in this area, learns this field by enforcing a differential identity that connects the average and instantaneous velocities. In this work, we argue that this differential formulation is a limiting special case of a more fundamental principle. We return to the first principles of average velocity and leverage the additivity property of definite integrals. This leads us to derive a novel, purely algebraic identity we term Interval Splitting Consistency. This identity establishes a self-referential relationship for the average velocity field across different time intervals without resorting to any differential operators. Based on this principle, we introduce SplitMeanFlow, a new training framework that enforces this algebraic consistency directly as a learning objective. We formally prove that the differential identity at the core of MeanFlow is recovered by taking the limit of our algebraic consistency as the interval split becomes infinitesimal. This establishes SplitMeanFlow as a direct and more general foundation for learning average velocity fields. From a practical standpoint, our algebraic approach is significantly more efficient, as it eliminates the need for JVP computations, resulting in simpler implementation, more stable training, and broader hardware compatibility. One-step and two-step SplitMeanFlow models have been successfully deployed in large-scale speech synthesis products (such as Doubao), achieving speedups of 20x.
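The interval-splitting identity follows directly from additivity of the integral of the instantaneous velocity: (t2 - t1) u(x, t1, t2) = (s - t1) u(x, t1, s) + (t2 - s) u(x, s, t2) for any split point s in [t1, t2]. A training loss that enforces it might look like the sketch below, where the time conventions, the stop-gradient choice, and the sampling of the split point are assumptions rather than the paper's exact recipe.

```python
import torch

def interval_splitting_loss(u_model, x, t1, t2):
    """u_model(x, a, b) returns the average velocity over [a, b]; t1, t2 have shape (batch, 1)."""
    lam = torch.rand_like(t1)
    s = t1 + lam * (t2 - t1)                               # random split point inside the interval
    lhs = (t2 - t1) * u_model(x, t1, t2)
    with torch.no_grad():                                  # treat the two sub-interval averages as targets
        rhs = (s - t1) * u_model(x, t1, s) + (t2 - s) * u_model(x, s, t2)
    return ((lhs - rhs) ** 2).mean()
```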
[269] Joint Pedestrian and Vehicle Traffic Optimization in Urban Environments using Reinforcement Learning
Bibek Poudel, Xuan Wang, Weizi Li, Lei Zhu, Kevin Heaslip
Main category: cs.LG
TL;DR: This paper presents a deep reinforcement learning framework for adaptive traffic signal control that jointly optimizes both pedestrian and vehicular efficiency, achieving significant reductions in wait times for both user groups compared to traditional fixed-time signals.
Details
Motivation: Existing RL-based traffic signal control methods focus primarily on vehicle-centric optimization, leaving pedestrian mobility needs and safety challenges unaddressed. There is a need for adaptive traffic control systems that serve all road users effectively.
Method: A deep reinforcement learning framework using a single-agent policy trained on real-world pedestrian and vehicle demand data derived from Wi-Fi logs and video analysis to control eight traffic signals along an urban corridor.
Result: The method achieved up to 67% reduction in average wait times per pedestrian and 52% per vehicle, with total wait time reductions of up to 67% for pedestrians and 53% for vehicles compared to traditional fixed-time signals. The approach also demonstrated generalization capabilities across varying traffic demands, including unseen conditions.
Conclusion: The deep RL framework successfully demonstrates the potential for developing adaptive transportation systems that jointly optimize for both pedestrian and vehicular efficiency, showing significant performance improvements and generalization capabilities that validate RL’s potential for inclusive traffic management.
Abstract: Reinforcement learning (RL) holds significant promise for adaptive traffic signal control. While existing RL-based methods demonstrate effectiveness in reducing vehicular congestion, their predominant focus on vehicle-centric optimization leaves pedestrian mobility needs and safety challenges unaddressed. In this paper, we present a deep RL framework for adaptive control of eight traffic signals along a real-world urban corridor, jointly optimizing both pedestrian and vehicular efficiency. Our single-agent policy is trained using real-world pedestrian and vehicle demand data derived from Wi-Fi logs and video analysis. The results demonstrate significant performance improvements over traditional fixed-time signals, reducing average wait times per pedestrian and per vehicle by up to 67% and 52% respectively, while simultaneously decreasing total wait times for both groups by up to 67% and 53%. Additionally, our results demonstrate generalization capabilities across varying traffic demands, including conditions entirely unseen during training, validating RL’s potential for developing transportation systems that serve all road users.
[270] SiLQ: Simple Large Language Model Quantization-Aware Training
Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, Dharmendra S. Modha
Main category: cs.LG
TL;DR: A simple quantization-aware training approach that achieves superior performance over existing methods with minimal training overhead (less than 0.1% increase) and no additional operations beyond quantization itself.
Details
Motivation: Large language models need quantization to reduce inference latency, model size, and energy consumption for better user experience at lower cost, but existing methods either sacrifice accuracy or require mechanisms incompatible with specialized inference accelerators.
Method: An end-to-end quantization-aware training approach that can be applied to activations, cache, and weights without introducing additional operations beyond quantization itself, and generalizes across different model architectures.
Result: The method outperforms leading published quantization methods by large margins on several modern benchmarks for both base and instruct model variants, while requiring only a minimal increase (less than 0.1%) in total model training budget.
Conclusion: The proposed quantization-aware training approach provides an effective solution for model quantization that maintains high accuracy with minimal computational overhead and broad compatibility with inference accelerators.
Abstract: Large language models can be quantized to reduce inference time latency, model size, and energy consumption, thereby delivering a better user experience at lower cost. A challenge exists to deliver quantized models with minimal loss of accuracy in reasonable time, and in particular to do so without requiring mechanisms incompatible with specialized inference accelerators. Here, we demonstrate a simple, end-to-end quantization-aware training approach that, with an increase in total model training budget of less than 0.1%, outperforms the leading published quantization methods by large margins on several modern benchmarks, with both base and instruct model variants. The approach easily generalizes across different model architectures, can be applied to activations, cache, and weights, and requires the introduction of no additional operations to the model other than the quantization itself.
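Quantization-aware training of this kind usually relies on "fake quantization" with a straight-through estimator, so the rounding appears in the forward pass while gradients flow as if it were absent. The helper below is a generic version of that building block, not the specific SiLQ quantizer.

```python
import torch

def fake_quantize(x, num_bits=8):
    """Uniform symmetric fake quantization with a straight-through estimator.
       Applicable during training to weights, activations, or KV-cache tensors."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max().clamp_min(1e-8) / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    # forward pass uses the quantized value; backward pass ignores the rounding
    return x + (x_q - x).detach()
```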
[271] Hierarchical Reinforcement Learning Framework for Adaptive Walking Control Using General Value Functions of Lower-Limb Sensor Signals
Sonny T. Jones, Grange M. Simpson, Patrick M. Pilarski, Ashley N. Dalrymple
Main category: cs.LG
TL;DR: This paper develops a Hierarchical Reinforcement Learning approach for adaptive lower-limb exoskeleton control that uses General Value Functions to predict sensor signals and improve terrain classification accuracy during walking.
Details
Motivation: To enhance mobility and autonomy for individuals with motor impairments by developing adaptive control strategies for lower-limb exoskeletons that can effectively handle varied terrains and reduce misclassification during uncertain walking conditions.
Method: The researchers employed Hierarchical Reinforcement Learning with a two-level framework: higher-level terrain strategy adaptation and lower-level predictive information provision through continual learning of General Value Functions (GVFs). GVFs generated temporal abstractions from multiple wearable sensors (electromyography, pressure insoles, goniometers) to predict future signal values and improve policy network decision-making.
Result: The addition of GVF predictions significantly increased overall network accuracy. Terrain-specific performance improvements were observed across multiple walking conditions including even ground, uneven ground, ramps (up and down), and turns - terrains that were frequently misclassified without predictive information.
Conclusion: Predictive information from GVFs can effectively aid decision-making during uncertainty, particularly for terrains with high misclassification rates. This work provides valuable insights into HRL applications and advances the development of safer exoskeletons capable of facilitating smooth transitions across diverse walking environments.
Abstract: Rehabilitation technology is a natural setting to study the shared learning and decision-making of human and machine agents. In this work, we explore the use of Hierarchical Reinforcement Learning (HRL) to develop adaptive control strategies for lower-limb exoskeletons, aiming to enhance mobility and autonomy for individuals with motor impairments. Inspired by prominent models of biological sensorimotor processing, our investigated HRL approach breaks down the complex task of exoskeleton control adaptation into a higher-level framework for terrain strategy adaptation and a lower-level framework for providing predictive information; this latter element is implemented via the continual learning of general value functions (GVFs). GVFs generated temporal abstractions of future signal values from multiple wearable lower-limb sensors, including electromyography, pressure insoles, and goniometers. We investigated two methods for incorporating actual and predicted sensor signals into a policy network with the intent to improve the decision-making capacity of the control system of a lower-limb exoskeleton during ambulation across varied terrains. As a key result, we found that the addition of predictions made from GVFs increased overall network accuracy. Terrain-specific performance increases were seen while walking on even ground, uneven ground, up and down ramps, and turns, terrains that are often misclassified without predictive information. This suggests that predictive information can aid decision-making during uncertainty, e.g., on terrains that have a high chance of being misclassified. This work, therefore, contributes new insights into the nuances of HRL and the future development of exoskeletons to facilitate safe transitioning and traversing across different walking environments.
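A general value function of a sensor signal is just a TD-learned prediction whose "reward" is the signal itself. A minimal linear TD(0) update is sketched below; the features, cumulant choice, and step sizes are placeholders rather than the paper's configuration.

```python
import numpy as np

def gvf_td0_update(w, phi_t, phi_t1, cumulant, gamma=0.95, alpha=0.05):
    """One TD(0) step for a General Value Function with linear function approximation.
       phi_t, phi_t1: feature vectors of the sensor stream at steps t and t+1;
       cumulant: the signal being predicted (e.g. an EMG or pressure-insole channel)."""
    delta = cumulant + gamma * w @ phi_t1 - w @ phi_t
    return w + alpha * delta * phi_t

# The resulting predictions w @ phi_t can be appended to the policy network's input,
# which is how predictive information enters the terrain-classification decision here.
```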
[272] PyG 2.0: Scalable Learning on Real World Graphs
Matthias Fey, Jinu Sunil, Akihiro Nitta, Rishi Puri, Manan Shah, Blaž Stojanovič, Ramona Bendias, Alexandria Barghi, Vid Kocijan, Zecheng Zhang, Xinwei He, Jan Eric Lenssen, Jure Leskovec
Main category: cs.LG
TL;DR: PyG 2.0 is a major update to PyTorch Geometric that significantly improves scalability and real-world application capabilities for Graph Neural Networks, with enhanced support for heterogeneous/temporal graphs and optimizations for large-scale graph learning problems.
Details
Motivation: The need to enhance PyTorch Geometric's capabilities to handle large-scale graph learning problems more efficiently and support diverse real-world applications, including heterogeneous and temporal graphs that are common in practical scenarios.
Method: Comprehensive framework update introducing enhanced architecture with support for heterogeneous and temporal graphs, scalable feature/graph stores, and various optimizations to improve performance and usability for large-scale applications.
Result: PyG 2.0 successfully enables researchers and practitioners to tackle large-scale graph learning problems efficiently, with demonstrated applications across various domains including relational deep learning and large language modeling.
Conclusion: PyG 2.0 establishes itself as a leading framework for Graph Neural Networks by providing substantial improvements in scalability and real-world application support, making it more accessible and efficient for handling diverse graph learning challenges across multiple application areas.
Abstract: PyG (PyTorch Geometric) has evolved significantly since its initial release, establishing itself as a leading framework for Graph Neural Networks. In this paper, we present PyG 2.0 (and its subsequent minor versions), a comprehensive update that introduces substantial improvements in scalability and real-world application capabilities. We detail the framework’s enhanced architecture, including support for heterogeneous and temporal graphs, scalable feature/graph stores, and various optimizations, enabling researchers and practitioners to tackle large-scale graph learning problems efficiently. Over recent years, PyG has supported graph learning in a large variety of application areas, which we summarize, while providing a deep dive into the important areas of relational deep learning and large language modeling.
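The heterogeneous-graph support mentioned above centers on the HeteroData container and the to_hetero transform in PyG 2.x. A minimal example, with made-up node types and sizes, might look like this:

```python
import torch
from torch_geometric.data import HeteroData
from torch_geometric.nn import SAGEConv, to_hetero

data = HeteroData()
data['user'].x = torch.randn(100, 16)                    # per-type node features
data['item'].x = torch.randn(50, 32)
src = torch.randint(0, 100, (300,))
dst = torch.randint(0, 50, (300,))
data['user', 'buys', 'item'].edge_index = torch.stack([src, dst])
data['item', 'rev_buys', 'user'].edge_index = torch.stack([dst, src])   # reverse edges so both types update

class GNN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = SAGEConv((-1, -1), 64)               # lazy input sizes, resolved per edge type
    def forward(self, x, edge_index):
        return self.conv(x, edge_index)

model = to_hetero(GNN(), data.metadata())                # one message-passing module per edge type
out = model(data.x_dict, data.edge_index_dict)           # dict of per-type node embeddings
```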
[273] Should Bias Always be Eliminated? A Principled Framework to Use Data Bias for OOD Generation
Yan Li, Guangyi Chen, Yunlong Deng, Zijian Li, Zeyu Tang, Anpeng Wu, Kun Zhang
Main category: cs.LG
TL;DR: This paper challenges the conventional wisdom of eliminating bias in out-of-distribution domain adaptation, instead proposing a framework that strategically leverages bias alongside invariant representations to improve model performance across different domains.
Details
Motivation: Most existing OOD domain adaptation methods focus on eliminating biased features through invariant representation learning, but this approach may discard potentially useful information. The authors question whether bias should always be eliminated and explore when and how bias can be beneficially retained and leveraged.
Method: The paper introduces a novel framework with two key components: (1) using invariance as guidance to extract predictive ingredients from bias, and (2) exploiting identified bias to estimate environmental conditions and explore appropriate bias-aware predictors to bridge environment gaps. The approach strategically combines bias with invariant representations during inference.
Result: Experiments on synthetic datasets and standard domain generalization benchmarks consistently show that the proposed method outperforms existing approaches, demonstrating improved robustness and adaptability across different domains.
Conclusion: The study demonstrates that bias can be strategically leveraged rather than eliminated in domain adaptation tasks. The proposed framework successfully combines bias-aware predictors with invariant representations, leading to superior performance compared to traditional bias-elimination approaches.
Abstract: Most existing methods for adapting models to out-of-distribution (OOD) domains rely on invariant representation learning to eliminate the influence of biased features. However, should bias always be eliminated – and if not, when should it be retained, and how can it be leveraged? To address these questions, we first present a theoretical analysis that explores the conditions under which biased features can be identified and effectively utilized. Building on this theoretical foundation, we introduce a novel framework that strategically leverages bias to complement invariant representations during inference. The framework comprises two key components that leverage bias in both direct and indirect ways: (1) using invariance as guidance to extract predictive ingredients from bias, and (2) exploiting identified bias to estimate the environmental condition and then use it to explore appropriate bias-aware predictors to alleviate environment gaps. We validate our approach through experiments on both synthetic datasets and standard domain generalization benchmarks. Results consistently demonstrate that our method outperforms existing approaches, underscoring its robustness and adaptability.
[274] laplax – Laplace Approximations with JAX
Tobias Weber, Bálint Mucsányi, Lenard Rommel, Thomas Christie, Lars Kasüschke, Marvin Pförtner, Philipp Hennig
Main category: cs.LG
TL;DR: The paper introduces laplax, a new open-source Python package built with JAX for performing Laplace approximations in deep neural networks, designed to enable Bayesian uncertainty quantification with a modular, functional architecture for research applications.
Details
Motivation: To provide a scalable and efficient tool for quantifying weight-space uncertainty in deep neural networks using Laplace approximations, enabling Bayesian methods like predictive uncertainty and model selection, while offering a researcher-friendly framework for rapid prototyping in Bayesian neural networks and uncertainty quantification research.
Method: Development of laplax, an open-source Python package implemented with JAX featuring a modular and purely functional architecture with minimal external dependencies, specifically designed for performing Laplace approximations in deep neural networks.
Result: A flexible and researcher-friendly framework that enables the application of Bayesian tools such as predictive uncertainty estimation and model selection via Occam’s razor in deep neural networks through efficient Laplace approximations.
Conclusion: The laplax package successfully provides an accessible and efficient platform for Bayesian neural network research, uncertainty quantification in deep learning, and the development of improved Laplace approximation techniques, facilitating rapid prototyping and experimentation in these areas.
Abstract: The Laplace approximation provides a scalable and efficient means of quantifying weight-space uncertainty in deep neural networks, enabling the application of Bayesian tools such as predictive uncertainty and model selection via Occam’s razor. In this work, we introduce laplax, a new open-source Python package for performing Laplace approximations with jax. Designed with a modular and purely functional architecture and minimal external dependencies, laplax offers a flexible and researcher-friendly framework for rapid prototyping and experimentation. Its goal is to facilitate research on Bayesian neural networks, uncertainty quantification for deep learning, and the development of improved Laplace approximation techniques.
[275] Causal Graph Fuzzy LLMs: A First Introduction and Applications in Time Series Forecasting
Omid Orang, Patricia O. Lucas, Gabriel I. F. Paiva, Petronio C. L. Silva, Felipe Augusto Rocha da Silva, Adriano Alonso Veloso, Frederico Gadelha Guimaraes
Main category: cs.LG
TL;DR: This paper introduces CGF-LLM, a novel architecture that combines GPT-2 with fuzzy time series and causal graphs to forecast multivariate time series by converting numerical data into interpretable textual representations.
Details
Motivation: The growing interest in applying Large Language Models to time series forecasting, coupled with the need for more interpretable approaches that can handle complex multivariate time series dynamics through semantic understanding and structural insights.
Method: The proposed CGF-LLM framework uses parallel application of fuzzification and causal analysis to convert numerical time series into interpretable textual forms, which are then fed into a pretrained GPT-2 model for forecasting multivariate time series.
Result: The model demonstrated effectiveness across four different multivariate time series datasets, showing that the textual representation approach provides a more interpretable view of complex time series dynamics.
Conclusion: CGF-LLM represents the first architecture combining GPT-2, fuzzy time series, and causal graphs for time series forecasting, opening promising future directions for LLM-based time series forecasting using fuzzy time series approaches.
Abstract: In recent years, the application of Large Language Models (LLMs) to time series forecasting (TSF) has garnered significant attention among researchers. This study presents a new LLM-based framework named CGF-LLM, which combines GPT-2 with fuzzy time series (FTS) and causal graphs to predict multivariate time series, marking the first such architecture in the literature. The key objective is to convert numerical time series into interpretable forms through the parallel application of fuzzification and causal analysis, enabling both semantic understanding and structural insight as input for the pretrained GPT-2 model. The resulting textual representation offers a more interpretable view of the complex dynamics underlying the original time series. The reported results confirm the effectiveness of our proposed LLM-based time series forecasting model, as demonstrated across four different multivariate time series datasets. This initiative paves the way for promising future directions in the domain of TSF using LLMs based on FTS.
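The fuzzification half of the pipeline can be pictured as mapping each numeric observation to a linguistic term that GPT-2 can read as text. The toy function below uses equal-width partitions and crisp membership; the actual FTS partitioning and the causal-graph construction in CGF-LLM are more elaborate.

```python
import numpy as np

def fuzzify(series, labels=("very low", "low", "medium", "high", "very high")):
    """Map numeric values to linguistic terms over equal-width partitions of the series range."""
    lo, hi = float(series.min()), float(series.max())
    edges = np.linspace(lo, hi, len(labels) + 1)
    idx = np.clip(np.digitize(series, edges[1:-1]), 0, len(labels) - 1)
    return [labels[i] for i in idx]

temps = np.array([12.0, 14.5, 21.0, 27.5, 30.0])
print("temperature was " + ", then ".join(fuzzify(temps)))
# -> a textual sequence that a pretrained language model can consume for forecasting
```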
[276] BiLO: Bilevel Local Operator Learning for PDE Inverse Problems. Part II: Efficient Uncertainty Quantification with Low-Rank Adaptation
Ray Zirui Zhang, Christopher E. Miles, Xiaohui Xie, John S. Lowengrub
Main category: cs.LG
TL;DR: This paper extends Bilevel Local Operator Learning (BiLO) to Bayesian inference for PDE-constrained optimization, using gradient-based MCMC and low-rank adaptation to efficiently sample PDE parameters from posterior distributions while maintaining strong PDE constraints for improved accuracy.
Details
Motivation: Uncertainty quantification and inverse problems governed by PDEs are crucial in scientific and engineering applications, but existing Bayesian neural network methods face challenges with high-dimensional weight sampling and require prior distributions on neural network solutions.
Method: The approach uses a bilevel optimization framework: at the lower level, a neural network approximates the local solution operator by minimizing local operator loss; at the upper level, PDE parameters are sampled from posterior distributions using gradient-based MCMC methods and low-rank adaptation (LoRA), with uncertainty propagating naturally through PDE constraints.
Result: The method demonstrates accurate parameter inference and uncertainty quantification across various PDE models while maintaining high computational efficiency. The authors analyze both dynamic error in MCMC gradient sampling and static error in posterior distribution due to inexact lower-level minimization.
Conclusion: By enforcing strong PDE constraints and bypassing high-dimensional neural network weight sampling, the proposed BiLO extension provides a computationally efficient framework for Bayesian inference in PDE-constrained problems with improved accuracy in both parameter inference and uncertainty quantification.
Abstract: Uncertainty quantification and inverse problems governed by partial differential equations (PDEs) are central to a wide range of scientific and engineering applications. In this second part of a two part series, we extend Bilevel Local Operator Learning (BiLO) for PDE-constrained optimization problems developed in Part 1 to the Bayesian inference framework. At the lower level, we train a network to approximate the local solution operator by minimizing the local operator loss with respect to the weights of the neural network. At the upper level, we sample the PDE parameters from the posterior distribution. We achieve efficient sampling through gradient-based Markov Chain Monte Carlo (MCMC) methods and low-rank adaptation (LoRA). Compared with existing methods based on Bayesian neural networks, our approach bypasses the challenge of sampling in the high-dimensional space of neural network weights and does not require specifying a prior distribution on the neural network solution. Instead, uncertainty propagates naturally from the data through the PDE constraints. By enforcing strong PDE constraints, the proposed method improves the accuracy of both parameter inference and uncertainty quantification. We analyze the dynamic error of the gradient in the MCMC sampler and the static error in the posterior distribution due to inexact minimization of the lower level problem and demonstrate a direct link between the tolerance for solving the lower level problem and the accuracy of the resulting uncertainty quantification. Through numerical experiments across a variety of PDE models, we demonstrate that our method delivers accurate inference and quantification of uncertainties while maintaining high computational efficiency.
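Gradient-based sampling of the PDE parameters can be as simple as Langevin dynamics on the log-posterior, with the PDE constraint supplied by the lower-level network. The step below is a generic unadjusted Langevin update, not the paper's exact sampler, and the log_posterior callable is a placeholder.

```python
import torch

def langevin_step(theta, log_posterior, step_size=1e-3):
    """One unadjusted Langevin update on PDE parameters theta.
       log_posterior(theta) is assumed to combine the data likelihood with the
       PDE constraint enforced by the (LoRA-adapted) lower-level solution operator."""
    theta = theta.detach().requires_grad_(True)
    grad, = torch.autograd.grad(log_posterior(theta), theta)
    with torch.no_grad():
        noise = torch.randn_like(theta)
        return theta + 0.5 * step_size * grad + step_size ** 0.5 * noise
```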
[277] Pragmatic Policy Development via Interpretable Behavior Cloning
Anton Matsson, Yaochen Rao, Heather J. Litman, Fredrik D. Johansson
Main category: cs.LG
TL;DR: This paper proposes an interpretable alternative to offline reinforcement learning by deriving treatment policies from the most frequently chosen actions in patient states using tree-based models, addressing interpretability and evaluation challenges in safety-critical healthcare domains.
Details
Motivation: Offline reinforcement learning faces significant challenges in safety-critical domains due to lack of interpretability (black-box nature) and unreliable evaluation methods (sensitivity to deviations from behavior policy in off-policy evaluation), limiting its practical application in healthcare settings.
Method: The authors propose using tree-based models to estimate behavior policies and derive treatment policies from the most frequently chosen actions in each patient state. The tree structure provides natural state grouping and interpretability by design, while controlling the number of actions considered allows for reliable off-policy evaluation by maintaining overlap with the behavior policy.
Result: The proposed approach was validated on real-world healthcare datasets for rheumatoid arthritis and sepsis care, demonstrating that the derived policies can outperform current clinical practice while maintaining interpretability and standardizing frequent treatment patterns that capture collective clinical judgment.
Conclusion: This pragmatic framework offers an interpretable alternative to traditional offline RL methods for deriving treatment policies in healthcare, successfully balancing performance improvement with the critical requirements of interpretability and reliable evaluation in safety-critical domains.
Abstract: Offline reinforcement learning (RL) holds great promise for deriving optimal policies from observational data, but challenges related to interpretability and evaluation limit its practical use in safety-critical domains. Interpretability is hindered by the black-box nature of unconstrained RL policies, while evaluation – typically performed off-policy – is sensitive to large deviations from the data-collecting behavior policy, especially when using methods based on importance sampling. To address these challenges, we propose a simple yet practical alternative: deriving treatment policies from the most frequently chosen actions in each patient state, as estimated by an interpretable model of the behavior policy. By using a tree-based model, which is specifically designed to exploit patterns in the data, we obtain a natural grouping of states with respect to treatment. The tree structure ensures interpretability by design, while varying the number of actions considered controls the degree of overlap with the behavior policy, enabling reliable off-policy evaluation. This pragmatic approach to policy development standardizes frequent treatment patterns, capturing the collective clinical judgment embedded in the data. Using real-world examples in rheumatoid arthritis and sepsis care, we demonstrate that policies derived under this framework can outperform current practice, offering interpretable alternatives to those obtained via offline RL.
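A minimal sketch of the core recipe, assuming synthetic state and action data: fit an interpretable tree to the observed treatment choices, then read the policy off as the k most frequent actions within each leaf. Scikit-learn's `DecisionTreeClassifier` stands in for the paper's behavior-policy model; the `leaf_policy` helper and the feature construction are illustrative.

```python
# Minimal sketch: interpretable behavior cloning with a decision tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))                               # patient-state features
A = (X[:, 0] > 0).astype(int) + (X[:, 1] > 1).astype(int)    # observed treatments (0-2)

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=200).fit(X, A)

def leaf_policy(tree, X, A, k=1):
    """Group states by tree leaf; keep the k most frequent actions per leaf."""
    leaves = tree.apply(X)
    policy = {}
    for leaf in np.unique(leaves):
        acts, counts = np.unique(A[leaves == leaf], return_counts=True)
        policy[leaf] = acts[np.argsort(counts)[::-1][:k]].tolist()
    return policy

policy = leaf_policy(tree, X, A, k=2)
new_state = rng.normal(size=(1, 4))
leaf = tree.apply(new_state)[0]
print("allowed actions for this state:", policy[leaf])
```

Restricting each leaf to its k most frequent actions is what keeps the derived policy close to the behavior policy, which is why off-policy evaluation remains reliable.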
[278] Risk In Context: Benchmarking Privacy Leakage of Foundation Models in Synthetic Tabular Data Generation
Jessup Byun, Xiaofeng Lin, Joshua Ward, Guang Cheng
Main category: cs.LG
TL;DR: This paper benchmarks foundation models (GPT-4o-mini, LLaMA 3.3 70B, TabPFN v2) for synthetic tabular data generation in low-data settings, revealing significant privacy risks through membership inference attacks, and proposes simple prompt modifications to improve the privacy-utility tradeoff.
Details
Motivation: State-of-the-art generative models require large datasets and can overfit in low-data settings. While foundation models using in-context learning (ICL) avoid retraining with few examples, they risk verbatim repetition of seed rows, creating privacy concerns that haven't been systematically studied for tabular data where single rows may identify individuals.
Method: The authors conduct a comprehensive benchmark comparing three foundation models against four baselines on 35 real-world tables from health, finance, and policy domains. They evaluate statistical fidelity, downstream utility, and membership inference leakage, then perform a factorial study testing prompt modifications including batch size, temperature, and summary statistics usage.
Result: Foundation models consistently show the highest privacy risk, with LLaMA 3.3 70B achieving up to 54 percentage points higher true-positive rate at 1% FPR than the safest baseline. CTGAN and GPT-4o-mini offer better privacy-utility tradeoffs. Three simple prompt tweaks can reduce worst-case AUC by 14 points and rare-class leakage by up to 39 points while maintaining over 90% fidelity.
Conclusion: The study provides the first systematic evaluation of privacy risks in foundation model-based tabular synthesis, demonstrating significant vulnerabilities but also identifying practical mitigation strategies through prompt engineering that can substantially improve privacy protection while preserving data utility.
Abstract: Synthetic tabular data is essential for machine learning workflows, especially for expanding small or imbalanced datasets and enabling privacy-preserving data sharing. However, state-of-the-art generative models (GANs, VAEs, diffusion models) rely on large datasets with thousands of examples. In low-data settings, often the primary motivation for synthetic data, these models can overfit, leak sensitive records, and require frequent retraining. Recent work uses large pre-trained transformers to generate rows via in-context learning (ICL), which needs only a few seed examples and no parameter updates, avoiding retraining. But ICL repeats seed rows verbatim, introducing a new privacy risk that has only been studied in text. The severity of this risk in tabular synthesis-where a single row may identify a person-remains unclear. We address this gap with the first benchmark of three foundation models (GPT-4o-mini, LLaMA 3.3 70B, TabPFN v2) against four baselines on 35 real-world tables from health, finance, and policy. We evaluate statistical fidelity, downstream utility, and membership inference leakage. Results show foundation models consistently have the highest privacy risk. LLaMA 3.3 70B reaches up to 54 percentage points higher true-positive rate at 1% FPR than the safest baseline. GPT-4o-mini and TabPFN are also highly vulnerable. We plot the privacy-utility frontier and show that CTGAN and GPT-4o-mini offer better tradeoffs. A factorial study finds that three zero-cost prompt tweaks-small batch size, low temperature, and using summary statistics-can reduce worst-case AUC by 14 points and rare-class leakage by up to 39 points while maintaining over 90% fidelity. Our benchmark offers a practical guide for safer low-data synthesis with foundation models.
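The headline leakage numbers above are true-positive rates of a membership-inference attack at a fixed 1% false-positive rate. The sketch below shows how that metric is computed from attack scores; the scores here are synthetic, whereas in the benchmark they would come from an attack on the generated tables (e.g., distance-to-closest-record style scores).

```python
# Hedged sketch of the TPR @ 1% FPR leakage metric with synthetic scores.
import numpy as np

def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.01):
    # Threshold chosen so that only target_fpr of non-members exceed it.
    thresh = np.quantile(nonmember_scores, 1.0 - target_fpr)
    return float(np.mean(member_scores > thresh))

rng = np.random.default_rng(0)
members = rng.normal(loc=0.8, scale=0.3, size=2000)      # seed (training) rows
nonmembers = rng.normal(loc=0.0, scale=0.3, size=2000)   # holdout rows
print(f"TPR @ 1% FPR: {tpr_at_fpr(members, nonmembers):.3f}")
```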
[279] Advancing Robustness in Deep Reinforcement Learning with an Ensemble Defense Approach
Adithya Mohan, Dominik Rößle, Daniel Cremers, Torsten Schön
Main category: cs.LG
TL;DR: This paper proposes an ensemble-based defense architecture to protect Deep Reinforcement Learning (DRL) models in autonomous driving from adversarial attacks, achieving significant improvements in robustness with over 213% increase in mean reward and 82% reduction in collision rates compared to baseline methods.
Details
Motivation: While DRL has shown success across various domains including autonomous driving, there is a critical research gap regarding the robustness of DRL models against adversarial attacks, particularly the lack of defenses that integrate multiple mechanisms designed specifically for autonomous driving scenarios.
Method: The authors propose a novel ensemble-based defense architecture that integrates multiple defense mechanisms to mitigate adversarial attacks in autonomous driving DRL models, going beyond existing standalone defense strategies like adversarial training and distillation.
Result: The proposed ensemble method demonstrates significant performance improvements under FGSM attacks: mean reward increased from 5.87 to 18.38 (213% improvement) and mean collision rate decreased from 0.50 to 0.09 (82% reduction) in highway and merge scenarios, outperforming all standalone defense strategies.
Conclusion: The ensemble-based defense architecture successfully enhances the robustness of DRL models in autonomous driving against adversarial attacks, providing superior protection compared to individual defense mechanisms and addressing a critical security gap in autonomous vehicle systems.
Abstract: Recent advancements in Deep Reinforcement Learning (DRL) have demonstrated its applicability across various domains, including robotics, healthcare, energy optimization, and autonomous driving. However, a critical question remains: How robust are DRL models when exposed to adversarial attacks? While existing defense mechanisms such as adversarial training and distillation enhance the resilience of DRL models, there remains a significant research gap regarding the integration of multiple defenses in autonomous driving scenarios specifically. This paper addresses this gap by proposing a novel ensemble-based defense architecture to mitigate adversarial attacks in autonomous driving. Our evaluation demonstrates that the proposed architecture significantly enhances the robustness of DRL models. Compared to the baseline under FGSM attacks, our ensemble method improves the mean reward from 5.87 to 18.38 (over 213% increase) and reduces the mean collision rate from 0.50 to 0.09 (an 82% decrease) in the highway scenario and merge scenario, outperforming all standalone defense strategies.
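The paper does not spell out its defense members in this summary, so the sketch below is only illustrative: it shows an FGSM perturbation applied to the agent's observation and a simple majority vote over an ensemble of policy networks, one plausible reading of an ensemble defense. The network shapes, attack budget, and voting rule are assumptions.

```python
# Illustrative sketch: FGSM on observations + ensemble majority vote.
import torch

def fgsm_observation(policy, obs, eps=0.05):
    obs = obs.clone().requires_grad_(True)
    logits = policy(obs)
    loss = -logits.max()          # push the agent away from its preferred action
    loss.backward()
    return (obs + eps * obs.grad.sign()).detach()

def ensemble_action(policies, obs):
    votes = [p(obs).argmax(dim=-1) for p in policies]
    return torch.mode(torch.stack(votes), dim=0).values

obs_dim, n_actions = 25, 5
policies = [torch.nn.Sequential(torch.nn.Linear(obs_dim, 64), torch.nn.ReLU(),
                                torch.nn.Linear(64, n_actions))
            for _ in range(3)]
obs = torch.randn(1, obs_dim)
adv_obs = fgsm_observation(policies[0], obs)   # attacker targets one ensemble member
print("clean vote:", ensemble_action(policies, obs).item())
print("adversarial vote:", ensemble_action(policies, adv_obs).item())
```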
[280] Sensor Drift Compensation in Electronic-Nose-Based Gas Recognition Using Knowledge Distillation
Juntao Lin, Xianghao Zhan
Main category: cs.LG
TL;DR: This paper proposes using Knowledge Distillation (KD) to address sensor drift in electronic nose systems for gas classification, demonstrating superior performance over existing methods with up to 18% accuracy improvement through rigorous statistical validation on the UCI Gas Sensor Array Drift Dataset.
Details
Motivation: Electronic nose systems suffer from sensor drift due to environmental changes and sensor aging, which degrades gas classification performance in real-world deployments. Previous drift compensation studies lacked robust statistical validation and risked overcompensating, leading to loss of class-related variance.
Method: The authors designed two domain adaptation tasks using the UCI Gas Sensor Array Drift Dataset and systematically tested three methods: a novel Knowledge Distillation (KD) approach, the benchmark Domain Regularized Component Analysis (DRCA), and a hybrid KD-DRCA method across 30 random test set partitions for statistical rigor.
Result: Knowledge Distillation consistently outperformed both DRCA and KD-DRCA methods, achieving up to 18% improvement in accuracy and 15% improvement in F1-score across the experimental validation, demonstrating superior effectiveness in sensor drift compensation.
Conclusion: This represents the first application of Knowledge Distillation for electronic nose drift mitigation, significantly outperforming the previous state-of-the-art DRCA method and enhancing the reliability of sensor drift compensation for real-world electronic nose deployments.
Abstract: Due to environmental changes and sensor aging, sensor drift challenges the performance of electronic nose systems in gas classification during real-world deployment. Previous studies using the UCI Gas Sensor Array Drift Dataset reported promising drift compensation results but lacked robust statistical experimental validation and may overcompensate for sensor drift, losing class-related variance. To address these limitations and improve sensor drift compensation with statistical rigor, we first designed two domain adaptation tasks based on the same electronic nose dataset: using the first batch to predict the remaining batches, simulating a controlled laboratory setting; and predicting the next batch using all prior batches, simulating continuous training data updates for online training. We then systematically tested three methods: our proposed novel Knowledge Distillation (KD) method, the benchmark method Domain Regularized Component Analysis (DRCA), and a hybrid method KD-DRCA, across 30 random test set partitions on the UCI dataset. We showed that KD consistently outperformed both DRCA and KD-DRCA, achieving up to an 18% improvement in accuracy and 15% in F1-score, demonstrating KD’s superior effectiveness in drift compensation. This is the first application of KD for electronic nose drift mitigation, significantly outperforming the previous state-of-the-art DRCA method and enhancing the reliability of sensor drift compensation in real-world environments.
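As a rough illustration of the distillation idea, the sketch below combines a temperature-softened KL term with an optional hard-label term; the teacher is assumed to be trained on earlier, non-drifted batches and the student adapted on later drifted data. Feature dimensions, class count, and loss weighting are assumptions, not the paper's settings.

```python
# Minimal knowledge-distillation sketch for drift adaptation (illustrative).
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, hard_labels=None, T=4.0, alpha=0.7):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    if hard_labels is None:
        return soft
    return alpha * soft + (1 - alpha) * F.cross_entropy(student_logits, hard_labels)

# Toy usage on random "sensor feature" vectors (128-dim, 6 gas classes).
teacher = torch.nn.Linear(128, 6)   # assumed trained on earlier batches
student = torch.nn.Linear(128, 6)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
x_drifted = torch.randn(32, 128)
with torch.no_grad():
    t_logits = teacher(x_drifted)
loss = kd_loss(student(x_drifted), t_logits)
loss.backward()
opt.step()
print("distillation loss:", loss.item())
```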
[281] ZORMS-LfD: Learning from Demonstrations with Zeroth-Order Random Matrix Search
Olivia Dry, Timothy L. Molloy, Wanxin Jin, Iman Shames
Main category: cs.LG
TL;DR: This paper introduces ZORMS-LfD, a zeroth-order optimization method for learning optimal control problems from expert demonstrations that doesn’t require gradient computation and works for both continuous and discrete time systems with constraints.
Details
Motivation: Existing first-order methods for learning from demonstrations require gradients of the costs, constraints, dynamics, and learning loss, which may not always exist or be computable. Additionally, most methods focus on discrete-time problems, with limited attention to constrained continuous-time problems.
Method: The authors propose Zeroth-Order Random Matrix Search for Learning from Demonstrations (ZORMS-LfD), which uses gradient-free optimization to learn costs, constraints, and dynamics of constrained optimal control problems from expert demonstrations without requiring smoothness assumptions.
Result: ZORMS-LfD matches or exceeds state-of-the-art methods in learning loss and compute time across various benchmarks. For unconstrained continuous-time problems, it achieves similar performance with over 80% reduction in compute time. For constrained continuous-time problems, it outperforms gradient-free methods like Nelder-Mead.
Conclusion: ZORMS-LfD provides an effective gradient-free alternative for learning from demonstrations in optimal control, particularly excelling in continuous-time constrained problems where specialized methods are lacking, while maintaining competitive performance and significantly reducing computational requirements.
Abstract: We propose Zeroth-Order Random Matrix Search for Learning from Demonstrations (ZORMS-LfD). ZORMS-LfD enables the costs, constraints, and dynamics of constrained optimal control problems, in both continuous and discrete time, to be learned from expert demonstrations without requiring smoothness of the learning-loss landscape. In contrast, existing state-of-the-art first-order methods require the existence and computation of gradients of the costs, constraints, dynamics, and learning loss with respect to states, controls and/or parameters. Most existing methods are also tailored to discrete time, with constrained problems in continuous time receiving only cursory attention. We demonstrate that ZORMS-LfD matches or surpasses the performance of state-of-the-art methods in terms of both learning loss and compute time across a variety of benchmark problems. On unconstrained continuous-time benchmark problems, ZORMS-LfD achieves similar loss performance to state-of-the-art first-order methods with an over $80$% reduction in compute time. On constrained continuous-time benchmark problems where there is no specialized state-of-the-art method, ZORMS-LfD is shown to outperform the commonly used gradient-free Nelder-Mead optimization method.
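A hedged sketch of one zeroth-order random-matrix search step: perturb the parameter matrix along a random direction, estimate the directional derivative of the learning loss by a finite difference, and step against it. The quadratic `learning_loss` is a stand-in; in ZORMS-LfD evaluating the loss would require solving the learner's optimal control problem at the perturbed parameters and comparing against the demonstrations.

```python
# Hedged sketch of a zeroth-order random-matrix search update.
import numpy as np

rng = np.random.default_rng(0)
theta_true = rng.normal(size=(3, 3))

def learning_loss(theta):
    # Placeholder for: solve the OCP with parameters theta, compare to demos.
    return float(np.sum((theta - theta_true) ** 2))

def zorms_step(theta, mu=1e-2, step=5e-2):
    U = rng.normal(size=theta.shape)
    U /= np.linalg.norm(U)                                   # random matrix direction
    g = (learning_loss(theta + mu * U) - learning_loss(theta)) / mu
    return theta - step * g * U                              # gradient-free update

theta = np.zeros((3, 3))
for k in range(500):
    theta = zorms_step(theta)
print("final loss:", learning_loss(theta))
```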
[282] Reinforcement Learning Fine-Tunes a Sparse Subnetwork in Large Language Models
Andrii Balashov
Main category: cs.LG
TL;DR: This paper discovers that reinforcement learning fine-tuning of large language models naturally creates sparse parameter updates, modifying only 5-30% of weights while leaving most parameters unchanged, and shows this sparse subnetwork can achieve full model performance.
Details
Motivation: The authors challenge the common assumption that RL fine-tuning requires updating most of a model's parameters, investigating whether RL actually modifies the entire parameter space or only specific subnetworks during alignment with human preferences and complex tasks.
Method: The researchers analyze parameter update patterns across multiple RL algorithms (PPO, DPO, SimPO, PRIME) and model families (OpenAI, Meta, open-source LLMs), examining which weights change during fine-tuning and testing whether fine-tuning only the sparse subnetwork can recover full performance.
Result: RL fine-tuning consistently modifies only 5-30% of model weights across different algorithms and models, with substantial overlap in updated subnetworks across different seeds, datasets, and algorithms. Fine-tuning only this sparse subnetwork recovers full model performance and yields nearly identical parameters to fully fine-tuned models.
Conclusion: RL adapts models by focusing training on small, consistently updated subnetworks rather than shifting all weights, which occurs because RL operates near the model’s original distribution requiring only targeted changes. This insight enables more efficient RL methods and connects to the lottery ticket hypothesis framework.
Abstract: Reinforcement learning (RL) is a key post-pretraining step for aligning large language models (LLMs) with complex tasks and human preferences. While it is often assumed that RL fine-tuning requires updating most of a model’s parameters, we challenge this assumption with a surprising finding: RL fine-tuning consistently modifies only a small subnetwork (typically 5-30% of weights), leaving most parameters unchanged. We call this phenomenon RL-induced parameter update sparsity. It arises naturally, without any sparsity constraints or parameter-efficient tuning, and appears across multiple RL algorithms (e.g., PPO, DPO, SimPO, PRIME) and model families (e.g., OpenAI, Meta, and open-source LLMs). Moreover, the subnetworks updated by RL show substantial overlap across different seeds, datasets, and algorithms, far exceeding chance, suggesting a partially transferable structure in the pretrained model. We show that fine-tuning only this sparse subnetwork recovers full model performance and yields parameters nearly identical to the fully fine-tuned model. Our analysis suggests this sparsity emerges because RL operates near the model’s original distribution, requiring only targeted changes. KL penalties, gradient clipping, and on-policy dynamics have limited effect on the sparsity pattern. These findings shed new light on how RL adapts models: not by shifting all weights, but by focusing training on a small, consistently updated subnetwork. This insight enables more efficient RL methods and reframes sparsity through the lens of the lottery ticket hypothesis.
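The finding is easy to probe on any pair of checkpoints. The sketch below compares a base and a fine-tuned state dict, reports the fraction of weights that moved, and returns the corresponding update masks; the toy checkpoints and the 10% row perturbation are assumptions used only to make the example runnable.

```python
# Sketch of the update-sparsity diagnostic on toy checkpoints.
import torch

def update_sparsity(base_state, tuned_state, tol=0.0):
    changed, total, masks = 0, 0, {}
    for name, w0 in base_state.items():
        w1 = tuned_state[name]
        mask = (w1 - w0).abs() > tol        # which entries actually moved
        masks[name] = mask
        changed += int(mask.sum())
        total += mask.numel()
    return changed / total, masks

base = torch.nn.Linear(256, 256)
tuned = torch.nn.Linear(256, 256)
tuned.load_state_dict(base.state_dict())
with torch.no_grad():
    # Pretend fine-tuning only touched ~10% of the rows of the weight matrix.
    rows = torch.randperm(256)[:26]
    tuned.weight[rows] += 0.01 * torch.randn(26, 256)

frac, masks = update_sparsity(base.state_dict(), tuned.state_dict())
print(f"fraction of parameters updated: {frac:.3f}")
```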
[283] Probabilistic Graphical Models: A Concise Tutorial
Jacqueline Maasch, Willie Neiswanger, Stefano Ermon, Volodymyr Kuleshov
Main category: cs.LG
TL;DR: A tutorial paper introducing probabilistic graphical modeling, covering the mathematical foundations combining probability and graph theory, representation methods, learning algorithms, and inference techniques for modeling uncertainty and making predictions.
Details
Motivation: To provide a comprehensive yet concise introduction to the probabilistic graphical modeling framework, which bridges probability theory and graph theory for handling uncertainty in machine learning applications.
Method: A tutorial approach covering three main areas: (1) visual representation of multivariate probability distributions using graphs, (2) algorithms for learning model parameters and graph structures from data, and (3) exact and approximate inference algorithms.
Result: A structured educational resource that presents the theoretical foundations, practical algorithms, and applications of probabilistic graphical models in an accessible format.
Conclusion: Probabilistic graphical modeling provides an elegant and powerful framework for probabilistic reasoning by combining probability distributions with graph-based representations, enabling compact modeling of complex multivariate distributions and supporting both learning and inference tasks.
Abstract: Probabilistic graphical modeling is a branch of machine learning that uses probability distributions to describe the world, make predictions, and support decision-making under uncertainty. Underlying this modeling framework is an elegant body of theory that bridges two mathematical traditions: probability and graph theory. This framework provides compact yet expressive representations of joint probability distributions, yielding powerful generative models for probabilistic reasoning. This tutorial provides a concise introduction to the formalisms, methods, and applications of this modeling framework. After a review of basic probability and graph theory, we explore three dominant themes: (1) the representation of multivariate distributions in the intuitive visual language of graphs, (2) algorithms for learning model parameters and graphical structures from data, and (3) algorithms for inference, both exact and approximate.
[284] Computer Vision for Real-Time Monkeypox Diagnosis on Embedded Systems
Jacob M. Delgado-López, Ricardo A. Morell-Rodriguez, Sebastián O. Espinosa-Del Rosario, Wilfredo E. Lugo-Beauchamp
Main category: cs.LG
TL;DR: Researchers developed an AI-powered monkeypox diagnostic tool using MobileNetV2 on NVIDIA Jetson Orin Nano, achieving 93.07% F1-Score with TensorRT optimization that reduced power consumption by half while maintaining accuracy, deployed via Wi-Fi hotspot for use in resource-constrained healthcare settings.
Details
Motivation: The need for rapid diagnosis of infectious diseases like monkeypox in resource-constrained environments where effective containment and treatment are crucial, particularly in underserved regions with limited healthcare resources.
Method: The authors used a pre-trained MobileNetV2 architecture for binary classification, trained on the open-source Monkeypox Skin Lesion Dataset, optimized it with the TensorRT framework for FP32, FP16, and INT8 formats with post-training quantization, and deployed it on an NVIDIA Jetson Orin Nano with a Wi-Fi Access Point hotspot and a web-based interface.
Result: Achieved 93.07% F1-Score with well-balanced precision and recall performance. TensorRT optimization reduced model size, increased inference speed, and lowered power consumption by approximately a factor of two while maintaining original accuracy. The system enables direct image upload and analysis through mobile devices via web interface.
Conclusion: The diagnostic tool represents an efficient, scalable, and energy-conscious solution for addressing diagnosis challenges in underserved regions, with optimizations making it suitable for deployment in resource-constrained environments and paving the way for broader adoption in low-resource healthcare settings.
Abstract: The rapid diagnosis of infectious diseases, such as monkeypox, is crucial for effective containment and treatment, particularly in resource-constrained environments. This study presents an AI-driven diagnostic tool developed for deployment on the NVIDIA Jetson Orin Nano, leveraging the pre-trained MobileNetV2 architecture for binary classification. The model was trained on the open-source Monkeypox Skin Lesion Dataset, achieving a 93.07% F1-Score, which reflects a well-balanced performance in precision and recall. To optimize the model, the TensorRT framework was used to accelerate inference for FP32 and to perform post-training quantization for FP16 and INT8 formats. TensorRT’s mixed-precision capabilities enabled these optimizations, which reduced the model size, increased inference speed, and lowered power consumption by approximately a factor of two, all while maintaining the original accuracy. Power consumption analysis confirmed that the optimized models used significantly less energy during inference, reinforcing their suitability for deployment in resource-constrained environments. The system was deployed with a Wi-Fi Access Point (AP) hotspot and a web-based interface, enabling users to upload and analyze images directly through connected devices such as mobile phones. This setup ensures simple access and seamless connectivity, making the tool practical for real-world applications. These advancements position the diagnostic tool as an efficient, scalable, and energy-conscious solution to address diagnosis challenges in underserved regions, paving the way for broader adoption in low-resource healthcare settings.
[285] Model Compression Engine for Wearable Devices Skin Cancer Diagnosis
Jacob M. Delgado-López, Andrea P. Seda-Hernandez, Juan D. Guadalupe-Rosado, Luis E. Fernandez Ramirez, Miguel Giboyeaux-Camilo, Wilfredo E. Lugo-Beauchamp
Main category: cs.LG
TL;DR: This study develops an AI-powered skin cancer detection system using MobileNetV2 and TensorRT optimization for deployment on NVIDIA Jetson Orin Nano, achieving 87.18% F1-score while maintaining energy efficiency for resource-limited healthcare settings.
Details
Motivation: Address the challenge of early skin cancer detection in resource-limited settings where access to specialized healthcare is scarce, by developing an AI diagnostic tool that can run efficiently on embedded systems.
Method: The authors used transfer learning with the MobileNetV2 architecture for binary classification of skin lesions, employed the TensorRT framework for model compression and optimization, and deployed the model on an NVIDIA Jetson Orin Nano to balance performance with energy efficiency.
Result: The model achieved an F1-Score of 87.18% with 93.18% precision and 81.91% recall. Post-compression results showed model-size reductions of up to 0.41, improved inference speed and throughput, and a decrease in energy consumption of up to 0.93 in INT8 precision.
Conclusion: The study validates the feasibility of deploying high-performing, energy-efficient diagnostic tools on resource-constrained edge devices, demonstrating potential to revolutionize healthcare diagnostics and bridge the technology gap in underserved regions.
Abstract: Skin cancer is one of the most prevalent and preventable types of cancer, yet its early detection remains a challenge, particularly in resource-limited settings where access to specialized healthcare is scarce. This study proposes an AI-driven diagnostic tool optimized for embedded systems to address this gap. Using transfer learning with the MobileNetV2 architecture, the model was adapted for binary classification of skin lesions into “Skin Cancer” and “Other.” The TensorRT framework was employed to compress and optimize the model for deployment on the NVIDIA Jetson Orin Nano, balancing performance with energy efficiency. Comprehensive evaluations were conducted across multiple benchmarks, including model size, inference speed, throughput, and power consumption. The optimized models maintained their performance, achieving an F1-Score of 87.18% with a precision of 93.18% and recall of 81.91%. Post-compression results showed reductions in model size of up to 0.41, along with improvements in inference speed and throughput, and a decrease in energy consumption of up to 0.93 in INT8 precision. These findings validate the feasibility of deploying high-performing, energy-efficient diagnostic tools on resource-constrained edge devices. Beyond skin cancer detection, the methodologies applied in this research have broader applications in other medical diagnostics and domains requiring accessible, efficient AI solutions. This study underscores the potential of optimized AI systems to revolutionize healthcare diagnostics, thereby bridging the divide between advanced technology and underserved regions.
[286] Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance
Yufei He, Ruoyu Li, Alex Chen, Yue Liu, Yulin Chen, Yuan Sui, Cheng Chen, Yi Zhu, Luca Luo, Frank Yang, Bryan Hooi
Main category: cs.LG
TL;DR: ARIA is an LLM agent framework that continuously learns and adapts to changing domain knowledge during operation through self-assessment, human interaction, and knowledge repository updates, demonstrating superior performance in dynamic environments like regulatory compliance.
Details
Motivation: Current LLM agents struggle in environments with frequently changing rules and domain knowledge (like regulatory compliance) because offline fine-tuning and standard prompting cannot effectively adapt to new knowledge during actual operation, creating a need for continuous learning capabilities.
Method: ARIA uses structured self-dialogue to assess uncertainty and identify knowledge gaps, proactively requests explanations from human experts, maintains a timestamped internal knowledge repository, and resolves conflicting information through comparisons and clarification queries to enable continuous learning at test time.
Result: ARIA showed significant improvements in adaptability and accuracy compared to standard offline fine-tuning and existing self-improving agents on customer due diligence name screening tasks and dynamic knowledge benchmarks, with successful deployment serving over 150 million monthly active users on TikTok Pay.
Conclusion: ARIA successfully addresses the limitation of LLM agents in dynamic environments by enabling continuous learning during operation, proving its practical effectiveness through real-world deployment and demonstrating superior performance over existing approaches in adapting to evolving domain knowledge.
Abstract: Large language model (LLM) agents often struggle in environments where rules and required domain knowledge frequently change, such as regulatory compliance and user risk screening. Current approaches, like offline fine-tuning and standard prompting, are insufficient because they cannot effectively adapt to new knowledge during actual operation. To address this limitation, we propose the Adaptive Reflective Interactive Agent (ARIA), an LLM agent framework designed specifically to continuously learn updated domain knowledge at test time. ARIA assesses its own uncertainty through structured self-dialogue, proactively identifying knowledge gaps and requesting targeted explanations or corrections from human experts. It then systematically updates an internal, timestamped knowledge repository with provided human guidance, detecting and resolving conflicting or outdated knowledge through comparisons and clarification queries. We evaluate ARIA on the realistic customer due diligence name screening task on TikTok Pay, alongside publicly available dynamic knowledge tasks. Results demonstrate significant improvements in adaptability and accuracy compared to baselines using standard offline fine-tuning and existing self-improving agents. ARIA is deployed within TikTok Pay serving over 150 million monthly active users, confirming its practicality and effectiveness for operational use in rapidly evolving environments.
[287] SADA: Stability-guided Adaptive Diffusion Acceleration
Ting Jiang, Yixiao Wang, Hancheng Ye, Zishan Shao, Jingwei Sun, Jingyang Zhang, Zekai Chen, Jianyi Zhang, Yiran Chen, Hai Li
Main category: cs.LG
TL;DR: The paper proposes SADA (Stability-guided Adaptive Diffusion Acceleration), a novel method that accelerates diffusion models by adaptively applying sparsity decisions based on sampling trajectory stability, achieving 1.8x+ speedups with minimal quality loss across multiple models and modalities.
Details
Motivation: Diffusion models have high computational costs due to iterative sampling and quadratic attention costs. Existing training-free acceleration methods reduce sampling time but suffer from low faithfulness because they don't consider varying denoising trajectories across prompts and ignore the underlying ODE formulation and numerical solutions.
Method: SADA unifies step-wise and token-wise sparsity decisions using a single stability criterion. It adaptively allocates sparsity based on the sampling trajectory and introduces principled approximation schemes that leverage precise gradient information from numerical ODE solvers to accelerate ODE-based generative models including diffusion and flow-matching.
Result: Comprehensive evaluations on SD-2, SDXL, and Flux with EDM and DPM++ solvers show consistent ≥1.8x speedups with minimal fidelity degradation (LPIPS ≤0.10 and FID ≤4.5). SADA significantly outperforms prior methods and adapts seamlessly to other pipelines, accelerating ControlNet without modifications and speeding up MusicLDM by 1.8x with ~0.01 spectrogram LPIPS.
Conclusion: SADA successfully addresses the computational bottlenecks of diffusion models by providing a unified framework that maintains high fidelity while achieving significant speedups. The method’s adaptability across different models and modalities demonstrates its practical value for accelerating generative AI applications.
Abstract: Diffusion models have achieved remarkable success in generative tasks but suffer from high computational costs due to their iterative sampling process and quadratic attention costs. Existing training-free acceleration strategies that reduce per-step computation cost, while effectively reducing sampling time, demonstrate low faithfulness compared to the original baseline. We hypothesize that this fidelity gap arises because (a) different prompts correspond to varying denoising trajectory, and (b) such methods do not consider the underlying ODE formulation and its numerical solution. In this paper, we propose Stability-guided Adaptive Diffusion Acceleration (SADA), a novel paradigm that unifies step-wise and token-wise sparsity decisions via a single stability criterion to accelerate sampling of ODE-based generative models (Diffusion and Flow-matching). For (a), SADA adaptively allocates sparsity based on the sampling trajectory. For (b), SADA introduces principled approximation schemes that leverage the precise gradient information from the numerical ODE solver. Comprehensive evaluations on SD-2, SDXL, and Flux using both EDM and DPM++ solvers reveal consistent $\ge 1.8\times$ speedups with minimal fidelity degradation (LPIPS $\leq 0.10$ and FID $\leq 4.5$) compared to unmodified baselines, significantly outperforming prior methods. Moreover, SADA adapts seamlessly to other pipelines and modalities: It accelerates ControlNet without any modifications and speeds up MusicLDM by $1.8\times$ with $\sim 0.01$ spectrogram LPIPS.
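The exact stability criterion is specific to the paper, so the following is only a simplified stand-in: in an Euler ODE sampling loop, if the denoiser output changed little between consecutive fresh evaluations, the next evaluation is skipped and the cached output reused (step-wise sparsity only; token-wise decisions are not shown). The toy linear `denoiser` and the threshold are assumptions.

```python
# Illustrative caching rule in the spirit of a stability-guided criterion.
import torch

def sample(denoiser, x, sigmas, stability_tol=0.05):
    prev_eps, eps, reuse_next, evals = None, None, False, 0
    for i in range(len(sigmas) - 1):
        if reuse_next and eps is not None:
            reuse_next = False                        # reuse cached output once
        else:
            new_eps = denoiser(x, sigmas[i]); evals += 1
            if prev_eps is not None:
                rel = (new_eps - prev_eps).norm() / (prev_eps.norm() + 1e-8)
                reuse_next = bool(rel < stability_tol)  # stable: skip next eval
            prev_eps = new_eps
            eps = new_eps
        x = x + (sigmas[i + 1] - sigmas[i]) * eps     # explicit Euler ODE step
    return x, evals

# Toy denoiser so the sketch runs end to end.
net = torch.nn.Linear(16, 16)
denoiser = lambda x, s: net(x) * s
x, evals = sample(denoiser, torch.randn(1, 16), torch.linspace(1.0, 0.01, 50))
print("network evaluations:", evals, "of", 49)
```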
[288] PICore: Physics-Informed Unsupervised Coreset Selection for Data Efficient Neural Operator Training
Anirudh Satheesh, Anant Khandelwal, Mucong Ding, Radu Balan
Main category: cs.LG
TL;DR: PICore is an unsupervised coreset selection framework that identifies the most informative training samples for neural operators without requiring ground-truth PDE solutions, achieving up to 78% increase in training efficiency while maintaining accuracy.
Details
Motivation: Neural operators for solving PDEs face two major bottlenecks: they require large amounts of training data and expensive labeled data generated through numerical simulations. Current methods need significant computational resources for both data generation and training.
Method: PICore uses a physics-informed loss function to select the most informative unlabeled input samples without needing ground-truth solutions. Only the selected compact subset of inputs is then simulated using numerical solvers to generate labels, followed by training the neural operator on this reduced labeled dataset.
Result: Across four diverse PDE benchmarks and multiple coreset selection strategies, PICore achieves up to 78% average increase in training efficiency compared to supervised coreset selection methods while maintaining minimal changes in accuracy.
Conclusion: PICore successfully addresses both data efficiency and computational cost issues in neural operator training by intelligently selecting informative samples before expensive simulation, leading to significant improvements in training efficiency without sacrificing performance.
Abstract: Neural operators offer a powerful paradigm for solving partial differential equations (PDEs) that cannot be solved analytically by learning mappings between function spaces. However, there are two main bottlenecks in training neural operators: they require a significant amount of training data to learn these mappings, and this data needs to be labeled, which can only be accessed via expensive simulations with numerical solvers. To alleviate both of these issues simultaneously, we propose PICore, an unsupervised coreset selection framework that identifies the most informative training samples without requiring access to ground-truth PDE solutions. PICore leverages a physics-informed loss to select unlabeled inputs by their potential contribution to operator learning. After selecting a compact subset of inputs, only those samples are simulated using numerical solvers to generate labels, reducing annotation costs. We then train the neural operator on the reduced labeled dataset, significantly decreasing training time as well. Across four diverse PDE benchmarks and multiple coreset selection strategies, PICore achieves up to 78% average increase in training efficiency relative to supervised coreset selection methods with minimal changes in accuracy. We provide code at https://github.com/Asatheesh6561/PICore.
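A minimal sketch of the selection step under an assumed 1D Poisson problem u'' = f: each unlabeled input is scored by the physics-informed residual of the current operator network, and only the highest-scoring inputs are sent to the numerical solver for labeling. Scoring by largest residual is one plausible strategy; the paper evaluates several coreset selection strategies.

```python
# Hedged sketch: physics-informed scoring of unlabeled PDE inputs.
import torch

net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 1))

def physics_loss(f_coeff, n_col=32):
    # Collocation residual of u''(x) = f_coeff for one candidate input.
    x = torch.linspace(0, 1, n_col).unsqueeze(-1).requires_grad_(True)
    inp = torch.cat([x, torch.full_like(x, f_coeff)], dim=-1)
    u = net(inp)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x)[0]
    return ((d2u - f_coeff) ** 2).mean().item()

candidates = torch.linspace(-2, 2, 100)           # unlabeled PDE inputs
scores = torch.tensor([physics_loss(f.item()) for f in candidates])
budget = 10
coreset = candidates[scores.topk(budget).indices]  # send these to the solver
print("inputs to simulate and label:", coreset.tolist())
```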
[289] DNT: a Deeply Normalized Transformer that can be trained by Momentum SGD
Xianbiao Qi, Marco Chen, Wenjie Xiao, Jiaquan Ye, Yelin He, Chun-Guang Li, Zhouchen Lin
Main category: cs.LG
TL;DR: The paper introduces Deeply Normalized Transformer (DNT) that enables effective training with vanilla momentum SGD instead of requiring adaptive optimizers like AdamW, through strategic normalization placement to concentrate gradient distributions.
Details
Motivation: Transformers typically require advanced adaptive optimizers like AdamW rather than simpler momentum SGD due to heavy-tailed gradient distributions, limiting training flexibility and efficiency.
Method: The authors develop DNT by strategically integrating normalization techniques at proper positions within Transformers to modulate Jacobian matrices, balance weight and activation influences, and concentrate gradient distributions to enable training with vanilla momentum SGD.
Result: DNT outperforms standard Transformer counterparts (ViT and GPT) and can be effectively trained with vanilla momentum SGD while achieving comparable performance to Transformers trained with AdamW.
Conclusion: The strategic placement of normalization in DNT successfully addresses the gradient distribution issues in Transformers, enabling efficient training with simpler optimizers while maintaining competitive performance.
Abstract: Transformers have become the de facto backbone of modern deep learning, yet their training typically demands an advanced optimizer with adaptive learning rate like AdamW, rather than a momentum SGDW (mSGDW). Previous works show that it is mainly due to a heavy-tailed distribution of the gradients. In this paper, we introduce a Deeply Normalized Transformer (DNT), which is meticulously engineered to overcome this limitation, enabling seamless training with vanilla mSGDW while yielding comparable performance to the Transformers trained via AdamW. To be specific, in DNT, we strategically integrate normalization techniques at proper positions in the Transformers to effectively modulate the Jacobian matrices of each layer, balance the influence of weights, activations, and their interactions, and thus enable the distributions of gradients to concentrate. We provide both theoretical justifications of the normalization technique used in our DNT and extensive empirical evaluation on two popular Transformer architectures to validate that: a) DNT outperforms its counterparts (i.e., ViT and GPT), and b) DNT can be effectively trained with vanilla mSGDW.
[290] Tabular Diffusion based Actionable Counterfactual Explanations for Network Intrusion Detection
Vinura Galwaduge, Jagath Samarabandu
Main category: cs.LG
TL;DR: This paper proposes a novel diffusion-based counterfactual explanation framework for network intrusion detection systems (NIDS) that provides actionable explanations for attack detection decisions, enabling better understanding and more effective countermeasures against network intrusions.
Details
Motivation: Modern NIDS use complex deep learning models that act as "black boxes," making it difficult to understand detection decisions, trust the results, and develop timely countermeasures. Existing explainable AI (XAI) methods don't provide explanations that can be easily converted into actionable countermeasures for network security.
Method: The authors develop a diffusion-based counterfactual explanation framework that generates minimal and diverse counterfactual explanations for network intrusion attacks. They also create global rules by summarizing counterfactual explanations that work at both instance and global levels for intrusion detection.
Result: The proposed method outperformed existing counterfactual explanation algorithms on 3 modern network intrusion datasets by providing more minimal and diverse explanations in a more time-efficient manner. The global counterfactual rules effectively filtered out incoming attack queries, demonstrating practical utility for intrusion detection and defense.
Conclusion: The diffusion-based counterfactual explanation framework successfully addresses the opacity problem in deep learning-based NIDS by providing actionable explanations that can be converted into practical countermeasures. The method offers both instance-level and global-level actionable insights that are crucial for efficient intrusion detection and defense mechanisms.
Abstract: Modern network intrusion detection systems (NIDS) frequently utilize the predictive power of complex deep learning models. However, the “black-box” nature of such deep learning methods adds a layer of opaqueness that hinders proper understanding of detection decisions, undermines trust in those decisions, and prevents timely countermeasures against such attacks. Explainable AI (XAI) methods provide a solution to this problem by providing insights into the causes of the predictions. The majority of the existing XAI methods provide explanations which are not convenient to convert into actionable countermeasures. In this work, we propose a novel diffusion-based counterfactual explanation framework that can provide actionable explanations for network intrusion attacks. We evaluated our proposed algorithm against several other publicly available counterfactual explanation algorithms on 3 modern network intrusion datasets. To the best of our knowledge, this work also presents the first comparative analysis of existing counterfactual explanation algorithms within the context of network intrusion detection systems. Our proposed method provides the most minimal and diverse counterfactual explanations among the tested algorithms, and does so more efficiently by reducing the time required to generate explanations. We also demonstrate how counterfactual explanations can provide actionable explanations by summarizing them to create a set of global rules. These rules are actionable not only at the instance level but also at the global level for intrusion attacks. These global counterfactual rules show the ability to effectively filter out incoming attack queries, which is crucial for efficient intrusion detection and defense mechanisms.
[291] Met$^2$Net: A Decoupled Two-Stage Spatio-Temporal Forecasting Model for Complex Meteorological Systems
Shaohan Li, Hao Yang, Min Chen, Xiaolin Qin
Main category: cs.LG
TL;DR: This paper proposes Met2Net, an implicit two-stage training method for weather prediction that uses separate encoders/decoders for each weather variable and a translator to capture inter-variable interactions, achieving state-of-the-art performance with 28.82% and 23.39% MSE reduction for temperature and humidity predictions respectively.
Details
Motivation: Current end-to-end weather prediction methods face representation inconsistency in multivariable integration and struggle to capture dependencies between variables in complex weather systems. Existing two-stage training approaches from multimodal models show suboptimal results due to a mismatch between the training tasks of the two stages.
Method: The paper introduces an implicit two-stage training approach with separate encoders and decoders for each weather variable. Stage 1: the Translator is frozen while the Encoders and Decoders learn a shared latent space. Stage 2: the Encoders and Decoders are frozen while the Translator captures inter-variable interactions for prediction. A self-attention mechanism is added for multivariable fusion in the latent space.
Result: Extensive experiments demonstrate state-of-the-art performance, with MSE reductions of 28.82% for near-surface air temperature predictions and 23.39% for relative humidity predictions compared to existing methods.
Conclusion: The proposed implicit two-stage training method effectively addresses representation inconsistency and variable dependency issues in weather prediction, achieving significant performance improvements over existing approaches through its novel architecture design and training strategy.
Abstract: The increasing frequency of extreme weather events due to global climate change demands accurate weather prediction. Recently, great advances have been made by the \textbf{end-to-end methods}, thanks to deep learning techniques, but they face limitations of \textit{representation inconsistency} in multivariable integration and struggle to effectively capture the dependency between variables, which is required in complex weather systems. Treating different variables as distinct modalities and applying a \textbf{two-stage training approach} from multimodal models can partially alleviate this issue, but due to the mismatch between the training tasks of the two stages, the results are often suboptimal. To address these challenges, we propose an implicit two-stage training method, configuring separate encoders and decoders for each variable. In detail, in the first stage, the Translator is frozen while the Encoders and Decoders learn a shared latent space; in the second stage, the Encoders and Decoders are frozen, and the Translator captures inter-variable interactions for prediction. Moreover, by introducing a self-attention mechanism for multivariable fusion in the latent space, performance improves further. Empirically, extensive experiments show the state-of-the-art performance of our method. Specifically, it reduces the MSE for near-surface air temperature and relative humidity predictions by 28.82% and 23.39%, respectively. The source code is available at https://github.com/ShremG/Met2Net.
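The freezing schedule can be written down compactly. In the hedged sketch below, per-variable encoders/decoders and a shared translator are toy linear layers; stage 1 freezes the translator and learns the shared latent space, stage 2 freezes the encoders/decoders and trains the translator. Shapes, losses, and step counts are illustrative assumptions, not the paper's architecture.

```python
# Sketch of the alternating freezing schedule with toy modules.
import torch
import torch.nn as nn

variables = ["temperature", "humidity"]
encoders = nn.ModuleDict({v: nn.Linear(32, 16) for v in variables})
decoders = nn.ModuleDict({v: nn.Linear(16, 32) for v in variables})
translator = nn.Linear(16 * len(variables), 16 * len(variables))

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def forward(batch):
    z = torch.cat([encoders[v](batch[v]) for v in variables], dim=-1)
    z = translator(z)
    chunks = z.chunk(len(variables), dim=-1)
    return {v: decoders[v](c) for v, c in zip(variables, chunks)}

batch = {v: torch.randn(8, 32) for v in variables}
target = {v: torch.randn(8, 32) for v in variables}

for stage in (1, 2):
    # Stage 1: freeze the translator, learn the shared latent space.
    # Stage 2: freeze encoders/decoders, let the translator learn interactions.
    set_trainable(translator, stage == 2)
    for m in list(encoders.values()) + list(decoders.values()):
        set_trainable(m, stage == 1)
    params = [p for p in list(encoders.parameters()) + list(decoders.parameters())
              + list(translator.parameters()) if p.requires_grad]
    opt = torch.optim.Adam(params, lr=1e-3)
    for _ in range(50):
        opt.zero_grad()
        out = forward(batch)
        loss = sum(nn.functional.mse_loss(out[v], target[v]) for v in variables)
        loss.backward()
        opt.step()
    print(f"stage {stage} loss: {loss.item():.4f}")
```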
[292] Filter-And-Refine: A MLLM Based Cascade System for Industrial-Scale Video Content Moderation
Zixuan Wang, Jinghao Shi, Hanzhong Liang, Xiang Shen, Vera Wen, Zhiqian Chen, Yifan Wu, Zhixin Zhang, Hongyu Xiong
Main category: cs.LG
TL;DR: This paper presents a cost-effective method to deploy multimodal large language models (MLLMs) for video content moderation by transforming generative models into classifiers and using a router-ranking cascade system to reduce computational costs while improving performance.
Details
Motivation: Traditional video classification models struggle with complex content moderation scenarios involving implicit harmful content and contextual ambiguity. While MLLMs offer superior cross-modal reasoning capabilities, their high computational costs and the challenge of adapting generative models for classification tasks prevent industrial adoption.
Method: The authors propose two key innovations: (1) an efficient method to transform generative MLLMs into multimodal classifiers using minimal discriminative training data, and (2) a router-ranking cascade system that integrates MLLMs with lightweight router models to enable scalable deployment.
Result: Offline experiments show 66.50% improvement in F1 score over traditional classifiers while requiring only 2% of fine-tuning data. Online evaluations demonstrate 41% increase in automatic content moderation volume, with the cascade system reducing computational cost to just 1.5% of direct full-scale deployment.
Conclusion: The proposed approach successfully addresses the dual challenges of MLLM deployment in content moderation - achieving superior performance with significantly reduced computational costs and minimal training data requirements, making MLLMs practically viable for industrial-scale video content moderation.
Abstract: Effective content moderation is essential for video platforms to safeguard user experience and uphold community standards. While traditional video classification models effectively handle well-defined moderation tasks, they struggle with complicated scenarios such as implicit harmful content and contextual ambiguity. Multimodal large language models (MLLMs) offer a promising solution to these limitations with their superior cross-modal reasoning and contextual understanding. However, two key challenges hinder their industrial adoption. First, the high computational cost of MLLMs makes full-scale deployment impractical. Second, adapting generative models for discriminative classification remains an open research problem. In this paper, we first introduce an efficient method to transform a generative MLLM into a multimodal classifier using minimal discriminative training data. To enable industry-scale deployment, we then propose a router-ranking cascade system that integrates MLLMs with a lightweight router model. Offline experiments demonstrate that our MLLM-based approach improves F1 score by 66.50% over traditional classifiers while requiring only 2% of the fine-tuning data. Online evaluations show that our system increases automatic content moderation volume by 41%, while the cascading deployment reduces computational cost to only 1.5% of direct full-scale deployment.
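A filter-and-refine cascade of this kind can be summarized in a few lines. In the sketch below, a cheap router score clears confidently benign items and flags confidently harmful ones, and only the ambiguous middle band is escalated to the expensive MLLM classifier; the thresholds, the stand-in `light_router` and `mllm_classifier` functions, and the synthetic traffic are all assumptions.

```python
# Hedged sketch of a filter-and-refine moderation cascade.
import random

def light_router(item):            # stand-in for the lightweight router model
    return item["router_score"]

def mllm_classifier(item):         # stand-in for the adapted MLLM classifier
    return item["true_harmful"]

def moderate(items, low=0.2, high=0.8):
    decisions, mllm_calls = [], 0
    for it in items:
        s = light_router(it)
        if s < low:
            decisions.append(("allow", it))
        elif s > high:
            decisions.append(("remove", it))
        else:                      # ambiguous band: refine with the MLLM
            mllm_calls += 1
            decisions.append(("remove" if mllm_classifier(it) else "allow", it))
    return decisions, mllm_calls

random.seed(0)
items = [{"router_score": random.random(), "true_harmful": random.random() > 0.9}
         for _ in range(1000)]
decisions, calls = moderate(items)
print(f"MLLM invoked on {calls / len(items):.1%} of items")
```

The cost saving comes entirely from how narrow the ambiguous band is, which mirrors the paper's reported reduction to a small fraction of full-scale MLLM deployment.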
[293] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Bing Liu, Sean Hendryx
Main category: cs.LG
TL;DR: The paper introduces “Rubrics as Rewards” (RaR), a framework that uses structured checklist-style rubrics as interpretable reward signals for training language models, achieving up to 28% improvement over traditional preference-based methods while maintaining interpretability.
Details
Motivation: Real-world reinforcement learning tasks often lack a single ground truth and reliable reward signals. Traditional preference-based methods use opaque reward functions that are difficult to interpret and prone to spurious correlations, making it challenging to effectively train language models for complex tasks.
Method: The authors propose Rubrics as Rewards (RaR), which uses structured, checklist-style rubrics as interpretable reward signals for on-policy training with GRPO (Group Relative Policy Optimization). This approach replaces opaque reward functions with transparent, structured evaluation criteria.
Result: RaR achieves up to 28% relative improvement on HealthBench-1k compared to simple Likert-based approaches. The method matches or surpasses performance of reward signals derived from expert-written references while enabling smaller-scale judge models to better align with human preferences across different model scales.
Conclusion: Structured rubrics can serve as effective and interpretable reward signals for reinforcement learning with language models. This approach provides better alignment with human preferences while maintaining transparency and robustness across different model scales, offering a practical solution for real-world applications where ground truth is ambiguous.
Abstract: Extending Reinforcement Learning with Verifiable Rewards (RLVR) to real-world tasks often requires balancing objective and subjective evaluation criteria. However, many such tasks lack a single, unambiguous ground truth, making it difficult to define reliable reward signals for post-training language models. While traditional preference-based methods offer a workaround, they rely on opaque reward functions that are difficult to interpret and prone to spurious correlations. We introduce $\textbf{Rubrics as Rewards}$ (RaR), a framework that uses structured, checklist-style rubrics as interpretable reward signals for on-policy training with GRPO. Our best RaR method yields up to a $28\%$ relative improvement on HealthBench-1k compared to simple Likert-based approaches, while matching or surpassing the performance of reward signals derived from expert-written references. By treating rubrics as structured reward signals, we show that RaR enables smaller-scale judge models to better align with human preferences and sustain robust performance across model scales.
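To illustrate how a checklist-style rubric becomes a scalar reward, the sketch below scores a response against weighted rubric items and returns the weighted fraction satisfied. In RaR each item would be judged by a model rather than the keyword predicates assumed here, and the resulting scalar would be fed to GRPO as the reward; the rubric content is invented for illustration.

```python
# Minimal sketch: turning a checklist rubric into a scalar reward.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RubricItem:
    description: str
    check: Callable[[str], bool]   # judge-model call in practice; predicate here
    weight: float = 1.0

def rubric_reward(response: str, rubric: List[RubricItem]) -> float:
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric if item.check(response))
    return earned / total if total else 0.0

rubric = [
    RubricItem("Mentions consulting a clinician", lambda r: "clinician" in r.lower()),
    RubricItem("States the recommended dose range", lambda r: "mg" in r.lower(), 2.0),
    RubricItem("Avoids definitive diagnosis", lambda r: "definitely" not in r.lower()),
]
response = "Typical dosing is 200-400 mg; please confirm with your clinician."
print("rubric reward:", rubric_reward(response, rubric))   # used as the GRPO reward
```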
[294] Dataset Distillation as Data Compression: A Rate-Utility Perspective
Youneng Bao, Yiping Liu, Zhuo Chen, Yongsheng Liang, Mu Li, Kede Ma
Main category: cs.LG
TL;DR: This paper proposes a joint rate-utility optimization method for dataset distillation that achieves up to 170× greater compression than standard methods while maintaining comparable accuracy by parameterizing synthetic samples as optimizable latent codes and trading off storage cost (Shannon entropy) with distillation performance via Lagrange multipliers.
Details
Motivation: Existing dataset distillation methods either focus on maximizing performance under fixed storage budgets or pursue suitable synthetic data representations for redundancy removal, but fail to jointly optimize both storage efficiency and utility. The "scale-is-everything" paradigm in modern ML demands increasingly large datasets and models with prohibitive computational and storage requirements, necessitating better compression techniques.
Method: The authors parameterize synthetic samples as optimizable latent codes that are decoded by extremely lightweight networks. They estimate the Shannon entropy of quantized latents as the rate measure and use existing distillation loss as the utility measure, trading them off via a Lagrange multiplier. They also introduce "bits per class" (bpc) as a precise storage metric that accounts for sample, label, and decoder parameter costs.
Result: On CIFAR-10, CIFAR-100, and ImageNet-128 datasets, the method achieves up to 170× greater compression than standard distillation methods at comparable accuracy. The approach consistently establishes better rate-utility trade-offs across diverse bpc budgets, distillation losses, and backbone architectures.
Conclusion: The joint rate-utility optimization framework successfully addresses the dual objectives of storage efficiency and performance preservation in dataset distillation. The proposed method demonstrates superior compression capabilities while maintaining competitive accuracy, offering a more efficient solution for dataset distillation across various experimental settings.
Abstract: Driven by the ``scale-is-everything’’ paradigm, modern machine learning increasingly demands ever-larger datasets and models, yielding prohibitive computational and storage requirements. Dataset distillation mitigates this by compressing an original dataset into a small set of synthetic samples, while preserving its full utility. Yet, existing methods either maximize performance under fixed storage budgets or pursue suitable synthetic data representations for redundancy removal, without jointly optimizing both objectives. In this work, we propose a joint rate-utility optimization method for dataset distillation. We parameterize synthetic samples as optimizable latent codes decoded by extremely lightweight networks. We estimate the Shannon entropy of quantized latents as the rate measure and plug any existing distillation loss as the utility measure, trading them off via a Lagrange multiplier. To enable fair, cross-method comparisons, we introduce bits per class (bpc), a precise storage metric that accounts for sample, label, and decoder parameter costs. On CIFAR-10, CIFAR-100, and ImageNet-128, our method achieves up to $170\times$ greater compression than standard distillation at comparable accuracy. Across diverse bpc budgets, distillation losses, and backbone architectures, our approach consistently establishes better rate-utility trade-offs.
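A conceptual sketch of the joint objective: synthetic samples are latent codes decoded by a tiny network, the rate term is a differentiable estimate of the bits needed to store the quantized latents (here under an assumed factorized Gaussian entropy model with an additive-noise relaxation), and the utility term is a placeholder for any distillation loss; the two are traded off with a Lagrange multiplier `lam`. The entropy model, decoder, and weighting are assumptions, not the paper's exact components.

```python
# Conceptual sketch of the rate-utility objective for dataset distillation.
import math
import torch
import torch.nn as nn

latents = nn.Parameter(torch.randn(10, 64))      # optimizable synthetic latent codes
decoder = nn.Linear(64, 3 * 8 * 8)               # extremely lightweight decoder
log_scale = nn.Parameter(torch.zeros(64))        # entropy-model parameter

def rate_bits(z, log_scale):
    # Additive-noise relaxation of quantization; rate ~ -log2 p(quantized z).
    z_noisy = z + torch.rand_like(z) - 0.5
    scale = log_scale.exp()
    nll = 0.5 * (z_noisy / scale) ** 2 + log_scale + 0.5 * math.log(2 * math.pi)
    return nll.sum() / math.log(2)               # nats -> bits

def utility_loss(images):
    # Placeholder for any distillation loss (e.g., gradient/trajectory matching).
    return images.pow(2).mean()

lam = 1e-3                                       # Lagrange multiplier (rate weight)
opt = torch.optim.Adam([latents, log_scale] + list(decoder.parameters()), lr=1e-2)
for step in range(100):
    opt.zero_grad()
    images = decoder(latents)
    loss = utility_loss(images) + lam * rate_bits(latents, log_scale)
    loss.backward()
    opt.step()
print("approx. storage cost (bits):", rate_bits(latents, log_scale).item())
```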
[295] P3SL: Personalized Privacy-Preserving Split Learning on Heterogeneous Edge Devices
Wei Fan, JinYi Yoon, Xiaochang Li, Huajie Shao, Bo Ji
Main category: cs.LG
TL;DR: P3SL is a personalized privacy-preserving split learning framework that enables heterogeneous edge devices to participate in federated training while maintaining customized privacy protection and optimal resource utilization without sharing sensitive device information with servers.
Details
Motivation: Existing split learning approaches for heterogeneous environments fail to address personalized privacy requirements and local model customization under varying environmental conditions, while often requiring devices to share sensitive information about their computational resources and privacy needs with servers.
Method: The paper proposes P3SL with two key components: (1) a personalized sequential split learning pipeline that allows customized privacy protection and personalized local models, and (2) a bi-level optimization technique that enables clients to determine optimal split points locally without sharing private information with the server.
Result: P3SL was evaluated on a testbed of 7 heterogeneous devices (4 Jetson Nano P3450, 2 Raspberry Pis, 1 laptop) using diverse model architectures and datasets under varying environmental conditions, demonstrating effective balance between energy consumption, privacy protection, and model accuracy.
Conclusion: P3SL successfully addresses the limitations of existing heterogeneous split learning by providing personalized privacy-preserving capabilities while maintaining high model performance across diverse edge computing environments without compromising sensitive device information.
Abstract: Split Learning (SL) is an emerging privacy-preserving machine learning technique that enables resource constrained edge devices to participate in model training by partitioning a model into client-side and server-side sub-models. While SL reduces computational overhead on edge devices, it encounters significant challenges in heterogeneous environments where devices vary in computing resources, communication capabilities, environmental conditions, and privacy requirements. Although recent studies have explored heterogeneous SL frameworks that optimize split points for devices with varying resource constraints, they often neglect personalized privacy requirements and local model customization under varying environmental conditions. To address these limitations, we propose P3SL, a Personalized Privacy-Preserving Split Learning framework designed for heterogeneous, resource-constrained edge device systems. The key contributions of this work are twofold. First, we design a personalized sequential split learning pipeline that allows each client to achieve customized privacy protection and maintain personalized local models tailored to their computational resources, environmental conditions, and privacy needs. Second, we adopt a bi-level optimization technique that empowers clients to determine their own optimal personalized split points without sharing private sensitive information (i.e., computational resources, environmental conditions, privacy requirements) with the server. This approach balances energy consumption and privacy leakage risks while maintaining high model accuracy. We implement and evaluate P3SL on a testbed consisting of 7 devices including 4 Jetson Nano P3450 devices, 2 Raspberry Pis, and 1 laptop, using diverse model architectures and datasets under varying environmental conditions.
[296] Eco-Friendly AI: Unleashing Data Power for Green Federated Learning
Mattia Sabella, Monica Vitali
Main category: cs.LG
TL;DR: This paper proposes a data-centric approach to Green Federated Learning that reduces environmental impact by optimizing data selection and federated node choices, demonstrating effectiveness in time series classification tasks.
Details
Motivation: The widespread adoption of AI/ML has significant environmental impact through energy consumption and carbon emissions. While Federated Learning offers privacy and transmission cost benefits, it introduces new challenges related to data heterogeneity and environmental impact that need to be addressed.
Method: The methodology involves: (1) analyzing characteristics of federated datasets, (2) selecting optimal data subsets based on quality metrics, (3) choosing federated nodes with lowest environmental impact, and (4) developing an interactive recommendation system that optimizes FL configurations through data reduction.
Result: The approach demonstrated promising results in reducing environmental impact of FL tasks when applied to time series classification, showing that data-centric optimization can effectively minimize carbon emissions during training.
Conclusion: A data-centric approach to Green Federated Learning can successfully reduce environmental impact by optimizing data volume and quality while maintaining training performance, contributing to the advancement of Green AI through practical methodology and recommendation systems.
Abstract: The widespread adoption of Artificial Intelligence (AI) and Machine Learning (ML) comes with a significant environmental impact, particularly in terms of energy consumption and carbon emissions. This pressing issue highlights the need for innovative solutions to mitigate AI’s ecological footprint. One of the key factors influencing the energy consumption of ML model training is the size of the training dataset. ML models are often trained on vast amounts of data continuously generated by sensors and devices distributed across multiple locations. To reduce data transmission costs and enhance privacy, Federated Learning (FL) enables model training without the need to move or share raw data. While FL offers these advantages, it also introduces challenges due to the heterogeneity of data sources (related to volume and quality), computational node capabilities, and environmental impact. This paper contributes to the advancement of Green AI by proposing a data-centric approach to Green Federated Learning. Specifically, we focus on reducing FL’s environmental impact by minimizing the volume of training data. Our methodology involves the analysis of the characteristics of federated datasets, the selection of an optimal subset of data based on quality metrics, and the choice of the federated nodes with the lowest environmental impact. We develop a comprehensive methodology that examines the influence of data-centric factors, such as data quality and volume, on FL training performance and carbon emissions. Building on these insights, we introduce an interactive recommendation system that optimizes FL configurations through data reduction, minimizing environmental impact during training. Applying this methodology to time series classification has demonstrated promising results in reducing the environmental impact of FL tasks.
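A toy sketch of the data-and-node selection step follows; the quality scores, carbon-intensity values, and retention fractions are placeholder assumptions rather than the paper's actual metrics.

```python
import numpy as np

def select_for_green_fl(node_sample_quality, node_carbon_intensity,
                        data_fraction=0.5, node_fraction=0.5):
    """Keep the lowest-carbon nodes and, on each, the highest-quality samples.

    node_sample_quality: list of 1-D arrays, one quality score per local sample.
    node_carbon_intensity: 1-D array, one carbon-intensity value per node.
    Returns {node_index: indices of retained samples}.
    """
    n_nodes = max(1, int(node_fraction * len(node_carbon_intensity)))
    green_nodes = np.argsort(node_carbon_intensity)[:n_nodes]   # greenest nodes first
    selected = {}
    for node in green_nodes:
        quality = np.asarray(node_sample_quality[node])
        k = max(1, int(data_fraction * len(quality)))
        selected[int(node)] = np.argsort(quality)[-k:]          # keep top-quality samples
    return selected
```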
[297] DistrAttention: An Efficient and Flexible Self-Attention Mechanism on Modern GPUs
Haolin Jin, Mengbai Xiao, Yuan Yuan, Xiao Zhang, Dongxiao Yu, Guanghui Zhang, Haoliang Wang
Main category: cs.LG
TL;DR: DistrAttention is a novel self-attention mechanism that achieves linear time complexity while maintaining full contextual information by grouping data using locality-sensitive hashing, resulting in 37% faster computation than FlashAttention-2 with minimal accuracy loss.
Details
Motivation: The quadratic time complexity of self-attention in Transformers limits scalability for long sequences. Existing optimization approaches either sacrifice full-contextual information or lack flexibility, creating a need for an efficient mechanism that maintains both performance and context completeness.
Method: DistrAttention groups data along the embedding dimension using locality-sensitive hashing (LSH) to identify similar data points. It employs a lightweight sampling and fusion method with a block-wise grouping framework to minimize LSH-induced errors. The method optimizes block sizes for integration with FlashAttention-2 to achieve high GPU performance.
Result: DistrAttention achieves 37% faster self-attention computation compared to FlashAttention-2. In Vision Transformer (ViT) inference, it demonstrates the best combination of speed and accuracy among approximate self-attention mechanisms. For Llama3-1B, it achieves the lowest inference time with only 1% accuracy degradation.
Conclusion: DistrAttention successfully addresses the scalability limitations of Transformer self-attention by providing an efficient, flexible mechanism that maintains full contextual information while significantly improving computational speed across different model architectures and tasks.
Abstract: The Transformer architecture has revolutionized deep learning, delivering state-of-the-art performance in areas such as natural language processing, computer vision, and time series prediction. However, its core component, self-attention, has quadratic time complexity relative to input sequence length, which hinders the scalability of Transformers. The existing approaches to optimizing self-attention either discard full-contextual information or lack flexibility. In this work, we design DistrAttention, an efficient and flexible self-attention mechanism with the full context. DistrAttention achieves this by grouping data on the embedding dimensionality, usually referred to as $d$. We realize DistrAttention with a lightweight sampling and fusion method that exploits locality-sensitive hashing to group similar data. A block-wise grouping framework is further designed to limit the errors introduced by locality-sensitive hashing. By optimizing the selection of block sizes, DistrAttention can be easily integrated with FlashAttention-2, gaining high performance on modern GPUs. We evaluate DistrAttention with extensive experiments. The results show that our method is 37% faster than FlashAttention-2 on calculating self-attention. In ViT inference, DistrAttention is the fastest and the most accurate among approximate self-attention mechanisms. In Llama3-1B, DistrAttention still achieves the lowest inference time with only 1% accuracy loss.
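The grouping-along-$d$ idea can be caricatured as follows: hash the embedding dimensions of K with random hyperplanes, fuse dimensions that fall into the same bucket, and compute attention scores in the reduced space. This is only a schematic illustration under those assumptions; it omits the paper's sampling step, block-wise grouping, and FlashAttention-2 integration.

```python
import numpy as np

def lsh_grouped_attention(Q, K, V, n_bits=4, seed=0):
    """Toy approximation of attention with LSH-grouped embedding dimensions.

    Q, K, V: (n, d) arrays. Dimensions of K with the same random-hyperplane sign
    pattern are merged: their Q columns are summed and K columns averaged, so
    Q @ K.T is approximated with far fewer inner-product terms. V is untouched.
    """
    n, d = Q.shape
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n, n_bits))          # hyperplanes in token space
    codes = (K.T @ planes > 0).astype(int)             # (d, n_bits) sign pattern per dim
    buckets = codes @ (1 << np.arange(n_bits))         # integer bucket id per dimension
    Qg, Kg = [], []
    for b in np.unique(buckets):
        mask = buckets == b
        Qg.append(Q[:, mask].sum(axis=1))               # sum_j q_j k_j^T ~ (sum_j q_j) k_bar^T
        Kg.append(K[:, mask].mean(axis=1))
    Qg, Kg = np.stack(Qg, axis=1), np.stack(Kg, axis=1)
    scores = Qg @ Kg.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V
```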
[298] Rethinking VAE: From Continuous to Discrete Representations Without Probabilistic Assumptions
Songxuan Shi
Main category: cs.LG
TL;DR: This paper investigates the generative capabilities of Autoencoders and proposes a new VAE-like training method that connects VAEs and VQ-VAEs through clustering centers, revealing insights about encoding space compactness in generative modeling.
Details
Motivation: Traditional Autoencoders have limited generative capabilities due to undefined regions in their encoding space. The authors aim to enhance AE generative potential and establish theoretical connections between VAEs and VQ-VAEs while addressing the compactness issues in latent representations.
Method: The authors propose a reformulated training framework that introduces clustering centers to enhance data compactness in latent space without using traditional KL divergence or reparameterization techniques. They extend this approach to multiple learnable vectors to create a progression toward VQ-VAE-like models in continuous space.
Result: Experiments on MNIST, CelebA, and FashionMNIST datasets demonstrate smooth interpolative transitions, though blurriness remains an issue. The approach naturally progresses toward VQ-VAE-like behavior, but when encoders output multiple vectors, the model degenerates into a discrete Autoencoder that combines image fragments without learning semantic representations.
Conclusion: The study highlights the critical importance of encoding space compactness and dispersion in generative modeling. It provides new insights into the intrinsic connections between VAEs and VQ-VAEs, offering a fresh perspective on their design principles and inherent limitations in generative tasks.
Abstract: This paper explores the generative capabilities of Autoencoders (AEs) and establishes connections between Variational Autoencoders (VAEs) and Vector Quantized-Variational Autoencoders (VQ-VAEs) through a reformulated training framework. We demonstrate that AEs exhibit generative potential via latent space interpolation and perturbation, albeit limited by undefined regions in the encoding space. To address this, we propose a new VAE-like training method that introduces clustering centers to enhance data compactness and ensure well-defined latent spaces without relying on traditional KL divergence or reparameterization techniques. Experimental results on MNIST, CelebA, and FashionMNIST datasets show smooth interpolative transitions, though blurriness persists. Extending this approach to multiple learnable vectors, we observe a natural progression toward a VQ-VAE-like model in continuous space. However, when the encoder outputs multiple vectors, the model degenerates into a discrete Autoencoder (VQ-AE), which combines image fragments without learning semantic representations. Our findings highlight the critical role of encoding space compactness and dispersion in generative modeling and provide insights into the intrinsic connections between VAEs and VQ-VAEs, offering a new perspective on their design and limitations.
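A minimal PyTorch sketch of the clustering-center idea follows, assuming an MLP autoencoder on flattened images and an arbitrary weight on the compactness term; it is meant only to show how the KL term and reparameterization are replaced by a pull toward learnable centers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClusterAE(nn.Module):
    """Autoencoder whose latents are pulled toward learnable clustering centers."""
    def __init__(self, in_dim=784, latent_dim=16, n_centers=10):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))
        self.centers = nn.Parameter(torch.randn(n_centers, latent_dim))

    def loss(self, x):
        z = self.enc(x)
        x_hat = self.dec(z)
        recon = F.mse_loss(x_hat, x)
        # distance of every latent to its nearest center keeps the code space compact
        compact = torch.cdist(z, self.centers).min(dim=1).values.pow(2).mean()
        return recon + 0.1 * compact   # 0.1 is an illustrative weight, not from the paper
```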
[299] Leveraging Knowledge Graphs and LLM Reasoning to Identify Operational Bottlenecks for Warehouse Planning Assistance
Rishi Parekh, Saisubramaniam Gopalakrishnan, Zishan Ahmad, Anirudh Deodhar
Main category: cs.LG
TL;DR: This paper presents a framework that combines Knowledge Graphs (KGs) and Large Language Model (LLM)-based agents to automatically analyze complex warehouse simulation data, identifying bottlenecks and inefficiencies through iterative reasoning and self-correction, achieving near-perfect performance in operational analysis.
Details
Motivation: Analyzing large, complex output datasets from Discrete Event Simulations (DES) of warehouse operations to identify bottlenecks and inefficiencies is a critical yet challenging task that often demands significant manual effort or specialized analytical tools, creating a need for automated, intuitive analysis methods.
Method: The framework transforms raw DES data into semantically rich Knowledge Graphs capturing relationships between simulation events and entities, then employs LLM-based agents that use iterative reasoning to generate interdependent sub-questions, create Cypher queries for KG interaction, extract information, and self-reflect to correct errors in an adaptive process.
Result: The approach outperforms baseline methods in warehouse bottleneck identification tests with equipment breakdowns and process irregularities, achieving near-perfect pass rates for operational questions in pinpointing inefficiencies and demonstrating superior diagnostic ability for complex investigative questions involving subtle, interconnected issues.
Conclusion: This work successfully bridges simulation modeling and AI (KG+LLM) technologies, offering a more intuitive method for generating actionable insights, reducing time-to-insight, and enabling automated warehouse inefficiency evaluation and diagnosis compared to traditional manual analysis approaches.
Abstract: Analyzing large, complex output datasets from Discrete Event Simulations (DES) of warehouse operations to identify bottlenecks and inefficiencies is a critical yet challenging task, often demanding significant manual effort or specialized analytical tools. Our framework integrates Knowledge Graphs (KGs) and Large Language Model (LLM)-based agents to analyze complex Discrete Event Simulation (DES) output data from warehouse operations. It transforms raw DES data into a semantically rich KG, capturing relationships between simulation events and entities. An LLM-based agent uses iterative reasoning, generating interdependent sub-questions. For each sub-question, it creates Cypher queries for KG interaction, extracts information, and self-reflects to correct errors. This adaptive, iterative, and self-correcting process identifies operational issues mimicking human analysis. Our DES approach for warehouse bottleneck identification, tested with equipment breakdowns and process irregularities, outperforms baseline methods. For operational questions, it achieves near-perfect pass rates in pinpointing inefficiencies. For complex investigative questions, we demonstrate its superior diagnostic ability to uncover subtle, interconnected issues. This work bridges simulation modeling and AI (KG+LLM), offering a more intuitive method for actionable insights, reducing time-to-insight, and enabling automated warehouse inefficiency evaluation and diagnosis.
[300] Decentralized Federated Learning of Probabilistic Generative Classifiers
Aritz Pérez, Carlos Echegoyen, Guzmán Santafé
Main category: cs.LG
TL;DR: A decentralized federated learning approach for probabilistic generative classifiers that enables nodes to collaborate directly without a central server by sharing local statistics with neighbors to iteratively converge to a global model.
Details
Motivation: Traditional federated learning relies on central servers, but decentralized architectures are needed where users can collaborate directly without sharing private data. There's a need for effective methods to learn probabilistic generative classifiers in such distributed settings across heterogeneous networks.
Method: A decentralized framework where nodes in a communication network share local statistics with neighboring nodes. Each node aggregates neighbors’ information and uses local updating rules to iteratively learn its own classifier, which progressively converges to a global model without requiring a central server.
Result: Extensive experiments show the algorithm consistently converges to globally competitive models across various network topologies, network sizes, local dataset sizes, and extreme non-i.i.d. data distributions, demonstrating robustness and effectiveness.
Conclusion: The proposed decentralized federated learning approach successfully enables collaborative learning of probabilistic generative classifiers without central coordination, achieving convergence to competitive global models while maintaining privacy and handling diverse network conditions and data distributions.
Abstract: Federated learning is a paradigm of increasing relevance in real-world applications, aimed at building a global model across a network of heterogeneous users without requiring the sharing of private data. We focus on model learning over decentralized architectures, where users collaborate directly to update the global model without relying on a central server. In this context, the current paper proposes a novel approach to collaboratively learn probabilistic generative classifiers with a parametric form. The framework is composed of a communication network over a set of local nodes, each of which has its own local data, and a local updating rule. The proposal involves sharing local statistics with neighboring nodes, where each node aggregates the neighbors’ information and iteratively learns its own local classifier, which progressively converges to a global model. Extensive experiments demonstrate that the algorithm consistently converges to a globally competitive model across a wide range of network topologies, network sizes, local dataset sizes, and extreme non-i.i.d. data distributions.
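A toy version of the statistics-sharing loop might look like the following, using a diagonal-Gaussian generative classifier and plain averaging of neighbors' sufficient statistics; the parametric family and the aggregation rule shown are assumptions for illustration, not the paper's exact update.

```python
import numpy as np

class Node:
    """One network node holding per-class sufficient statistics of its local data."""
    def __init__(self, X, y, n_classes):
        self.counts = np.array([(y == c).sum() for c in range(n_classes)], dtype=float)
        self.sums = np.array([X[y == c].sum(axis=0) for c in range(n_classes)])
        self.sqsums = np.array([(X[y == c] ** 2).sum(axis=0) for c in range(n_classes)])

    def stats(self):
        return self.counts, self.sums, self.sqsums

    def aggregate(self, neighbor_stats):
        """Gossip-style update: average own statistics with the neighbors'."""
        all_stats = [self.stats()] + list(neighbor_stats)
        self.counts = np.mean([s[0] for s in all_stats], axis=0)
        self.sums = np.mean([s[1] for s in all_stats], axis=0)
        self.sqsums = np.mean([s[2] for s in all_stats], axis=0)

    def predict(self, X):
        """Diagonal-Gaussian generative classification with the current statistics."""
        cnt = np.maximum(self.counts, 1e-9)[:, None]
        mu = self.sums / cnt
        var = self.sqsums / cnt - mu ** 2 + 1e-6
        log_prior = np.log(np.maximum(self.counts, 1e-9) / self.counts.sum())
        ll = -0.5 * (((X[:, None, :] - mu) ** 2) / var + np.log(var)).sum(-1) + log_prior
        return ll.argmax(axis=1)
```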
[301] R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning
Zhuokun Chen, Zeren Chen, Jiahao He, Mingkui Tan, Jianfei Cai, Bohan Zhuang
Main category: cs.LG
TL;DR: R-Stitch is a token-level, confidence-based hybrid decoding framework that accelerates Chain-of-Thought (CoT) reasoning by dynamically switching between small and large language models during inference, achieving up to 85% latency reduction with minimal accuracy loss.
Details
Motivation: Chain-of-thought reasoning in large language models introduces substantial computational overhead due to autoregressive decoding over long sequences. Existing acceleration methods like speculative decoding have limited effectiveness when small-large model agreement is low and fail to leverage small models' ability to produce concise reasoning.
Method: R-Stitch uses a confidence-based hybrid approach where a small language model generates tokens by default, and switches to a large language model only when the small model’s confidence falls below a threshold. This avoids full-sequence rollback and selectively uses the large model only for uncertain reasoning steps.
Result: Experiments on math reasoning benchmarks show R-Stitch achieves up to 85% reduction in inference latency with negligible accuracy drop. The framework is model-agnostic, training-free, and compatible with standard decoding pipelines.
Conclusion: R-Stitch effectively accelerates CoT reasoning by intelligently combining small and large models based on confidence levels, demonstrating practical effectiveness in reducing computational overhead while preserving reasoning quality in mathematical problem-solving tasks.
Abstract: Chain-of-thought (CoT) reasoning enhances the problem-solving capabilities of large language models by encouraging step-by-step intermediate reasoning during inference. While effective, CoT introduces substantial computational overhead due to its reliance on autoregressive decoding over long token sequences. Existing acceleration strategies either reduce sequence length through early stopping or compressive reward designs, or improve decoding speed via speculative decoding with smaller models. However, speculative decoding suffers from limited speedup when the agreement between small and large models is low, and fails to exploit the potential advantages of small models in producing concise intermediate reasoning. In this paper, we present R-Stitch, a token-level, confidence-based hybrid decoding framework that accelerates CoT inference by switching between a small language model (SLM) and a large language model (LLM) along the reasoning trajectory. R-Stitch uses the SLM to generate tokens by default and delegates to the LLM only when the SLM’s confidence falls below a threshold. This design avoids full-sequence rollback and selectively invokes the LLM on uncertain steps, preserving both efficiency and answer quality. R-Stitch is model-agnostic, training-free, and compatible with standard decoding pipelines. Experiments on math reasoning benchmarks demonstrate that R-Stitch achieves up to 85% reduction in inference latency with negligible accuracy drop, highlighting its practical effectiveness in accelerating CoT reasoning.
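The confidence-gated switching can be pictured with the schematic decoding loop below; it assumes HuggingFace-style causal LMs returning `.logits`, batch size 1, and greedy decoding, none of which is specified by the paper.

```python
import torch

@torch.no_grad()
def stitch_decode(slm, llm, input_ids, max_new_tokens=256, tau=0.7, eos_id=None):
    """Token-level hybrid decoding: the small model proposes each token and the
    large model is queried only for steps where the small model is uncertain.
    Assumes input_ids has batch size 1."""
    ids = input_ids
    for _ in range(max_new_tokens):
        probs = slm(ids).logits[:, -1, :].softmax(dim=-1)
        conf, token = probs.max(dim=-1)
        if conf.item() < tau:                          # low confidence -> delegate this step
            token = llm(ids).logits[:, -1, :].argmax(dim=-1)
        ids = torch.cat([ids, token.unsqueeze(-1)], dim=-1)
        if eos_id is not None and token.item() == eos_id:
            break
    return ids
```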
[302] Confounded Causal Imitation Learning with Instrumental Variables
Yan Zeng, Shenglan Nie, Feng Xie, Libo Huang, Peng Wu, Zhi Geng
Main category: cs.LG
TL;DR: This paper proposes C2L (Confounded Causal Imitation Learning), a two-stage framework that uses instrumental variables to address unmeasured confounders in imitation learning, enabling unbiased policy estimation from demonstrations.
Details
Motivation: Traditional imitation learning suffers from confounding effects of unmeasured variables that influence both states and actions, leading to biased policy estimation. Existing methods are limited to immediate temporal dependencies and cannot handle confounders that affect actions across multiple timesteps.
Method: A two-stage framework: (1) First stage constructs a testing criterion based on pseudo-variables to identify valid instrumental variables (IV) with sufficient and necessary identifiability conditions; (2) Second stage uses the identified IV for policy optimization through either simulator-based or offline learning approaches.
Result: Extensive experiments demonstrated the effectiveness of both identifying valid instrumental variables and learning policies. The method successfully handles confounders across multiple timesteps rather than just immediate dependencies.
Conclusion: The C2L model effectively breaks the confounding gap in imitation learning by leveraging instrumental variables, providing a principled approach to handle unmeasured confounders that affect actions across multiple timesteps and enabling more accurate policy learning from demonstrations.
Abstract: Imitation learning from demonstrations usually suffers from the confounding effects of unmeasured variables (i.e., unmeasured confounders) on the states and actions. Ignoring them entails a biased estimate of the policy. To close this confounding gap, in this paper we leverage the power of instrumental variables (IV) and propose a Confounded Causal Imitation Learning (C2L) model. This model accommodates confounders that influence actions across multiple timesteps, rather than being restricted to immediate temporal dependencies. We develop a two-stage imitation learning framework for valid IV identification and policy optimization. In particular, in the first stage, we construct a testing criterion based on the defined pseudo-variable, with which we identify a valid IV for the C2L models. Such a criterion entails the sufficient and necessary identifiability conditions for IV validity. In the second stage, with the identified IV, we propose two candidate policy learning approaches: one is based on a simulator, while the other is offline. Extensive experiments verify the effectiveness of both identifying the valid IV and learning the policy.
[303] EarthLink: Interpreting Climate Signals with Self-Evolving AI Agents
Zijie Guo, Jiong Wang, Xiaoyu Yue, Wangxu Wei, Zhe Jiang, Wanghan Xu, Ben Fei, Wenlong Zhang, Xinyu Gu, Lijing Cheng, Jing-Jia Luo, Chao Li, Yaqiang Wang, Tao Chen, Wanli Ouyang, Fenghua Ling, Lei Bai
Main category: cs.LG
TL;DR: EarthLink is the first AI agent designed as an interactive copilot for Earth scientists that automates end-to-end research workflows and demonstrates analytical competency comparable to human junior researchers in climate change studies.
Details
Motivation: Modern Earth science faces significant bottlenecks due to vast, fragmented, and complex Earth system data coupled with increasingly sophisticated analytical demands that hinder rapid scientific discovery.
Method: Development of EarthLink, an AI agent that automates end-to-end research workflows including planning, code generation, and multi-scenario analysis, with the ability to learn from user interaction through dynamic feedback loops and provide transparent, auditable workflows via natural language interface.
Result: EarthLink successfully performed core scientific tasks in climate change research, including model-observation comparisons and diagnosis of complex phenomena. Multi-expert evaluation showed it produced scientifically sound analyses with analytical competency rated as comparable to specific aspects of human junior researcher workflows.
Conclusion: EarthLink represents a pivotal step towards an efficient, trustworthy, and collaborative paradigm for Earth system research, enabling scientists to shift from manual execution to strategic oversight and hypothesis generation in an era of accelerating global change.
Abstract: Modern Earth science is at an inflection point. The vast, fragmented, and complex nature of Earth system data, coupled with increasingly sophisticated analytical demands, creates a significant bottleneck for rapid scientific discovery. Here we introduce EarthLink, the first AI agent designed as an interactive copilot for Earth scientists. It automates the end-to-end research workflow, from planning and code generation to multi-scenario analysis. Unlike static diagnostic tools, EarthLink can learn from user interaction, continuously refining its capabilities through a dynamic feedback loop. We validated its performance on a number of core scientific tasks of climate change, ranging from model-observation comparisons to the diagnosis of complex phenomena. In a multi-expert evaluation, EarthLink produced scientifically sound analyses and demonstrated an analytical competency that was rated as comparable to specific aspects of a human junior researcher’s workflow. Additionally, its transparent, auditable workflows and natural language interface empower scientists to shift from laborious manual execution to strategic oversight and hypothesis generation. EarthLink marks a pivotal step towards an efficient, trustworthy, and collaborative paradigm for Earth system research in an era of accelerating global change.
[304] A Learning-based Domain Decomposition Method
Rui Wu, Nikola Kovachki, Burigede Liu
Main category: cs.LG
TL;DR: The paper proposes a learning-based domain decomposition method (L-DDM) that uses a single pre-trained neural operator to efficiently solve PDEs on large, complex geometries by breaking them into simpler subdomains, achieving better performance than existing methods while maintaining resolution-invariance and generalization capabilities.
Details
Motivation: Traditional numerical methods like the Finite Element Method struggle with computational cost and scalability for large, geometrically complex problems in mechanical, aerospace, and structural engineering. Existing neural network approaches are limited to simple domains, making them impractical for real-world PDEs with complex geometries.
Method: The authors develop a learning-based domain decomposition method (L-DDM) that employs a single pre-trained neural operator (originally trained on simple domains) as a surrogate model within a domain decomposition framework. They use a physics-pretrained neural operator (PPNO) and provide theoretical results on the existence of neural operator approximations in domain decomposition contexts.
Result: The method successfully approximates solutions to elliptic PDEs with discontinuous microstructures in complex geometries. It outperforms current state-of-the-art methods on challenging problems while demonstrating resolution-invariance and strong generalization to unseen microstructural patterns during training.
Conclusion: L-DDM effectively bridges the gap between neural operator efficiency and complex geometry handling by leveraging domain decomposition. The approach offers a scalable solution for large-scale engineering problems while maintaining accuracy and generalization capabilities beyond the training data.
Abstract: Recent developments in mechanical, aerospace, and structural engineering have driven a growing need for efficient ways to model and analyse structures at much larger and more complex scales than before. While established numerical methods like the Finite Element Method remain reliable, they often struggle with computational cost and scalability when dealing with large and geometrically intricate problems. In recent years, neural network-based methods have shown promise because of their ability to efficiently approximate nonlinear mappings. However, most existing neural approaches are still largely limited to simple domains, which makes them difficult to apply to real-world PDEs involving complex geometries. In this paper, we propose a learning-based domain decomposition method (L-DDM) that addresses this gap. Our approach uses a single pre-trained neural operator, originally trained on simple domains, as a surrogate model within a domain decomposition scheme, allowing us to tackle large and complicated domains efficiently. We provide a general theoretical result on the existence of neural operator approximations in the context of domain decomposition solution of abstract PDEs. We then demonstrate our method by accurately approximating solutions to elliptic PDEs with discontinuous microstructures in complex geometries, using a physics-pretrained neural operator (PPNO). Our results show that this approach not only outperforms current state-of-the-art methods on these challenging problems, but also offers resolution-invariance and strong generalization to microstructural patterns unseen during training.
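The domain decomposition loop itself is classical; the sketch below runs an overlapping alternating-Schwarz iteration on a 1-D Poisson problem, with a plain finite-difference solve standing in for the pre-trained neural operator that L-DDM would call on each subdomain.

```python
import numpy as np

def local_solve(f, left_bc, right_bc, h):
    """Subdomain solver for -u'' = f with Dirichlet boundary values.
    In L-DDM this role would be played by the pre-trained neural operator."""
    m = len(f)
    A = (np.diag(np.full(m, 2.0)) - np.diag(np.ones(m - 1), 1)
         - np.diag(np.ones(m - 1), -1)) / h**2
    rhs = f.copy().astype(float)
    rhs[0] += left_bc / h**2
    rhs[-1] += right_bc / h**2
    return np.linalg.solve(A, rhs)

def schwarz_1d(f, overlap=10, iters=50):
    """Two overlapping subdomains on [0, 1] with zero global boundary conditions."""
    n = len(f)
    h = 1.0 / (n + 1)
    u = np.zeros(n)
    mid = n // 2
    for _ in range(iters):
        # left subdomain: nodes 0 .. mid+overlap-1, right boundary value taken from u
        u[:mid + overlap] = local_solve(f[:mid + overlap], 0.0, u[mid + overlap], h)
        # right subdomain: nodes mid-overlap .. n-1, left boundary value from updated u
        u[mid - overlap:] = local_solve(f[mid - overlap:], u[mid - overlap - 1], 0.0, h)
    return u

# Example: -u'' = 1 on (0, 1); the iterate approaches u(x) = x(1 - x) / 2.
u = schwarz_1d(np.ones(100))
```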
[305] DeCo-SGD: Joint Optimization of Delay Staleness and Gradient Compression Ratio for Distributed SGD
Rongwei Lu, Jingyan Jiang, Chunyang Li, Haotian Dong, Xingguang Wei, Delin Cai, Zhi Wang
Main category: cs.LG
TL;DR: This paper addresses the problem of distributed machine learning in challenging network environments by proposing DeCo-SGD, an adaptive algorithm that dynamically adjusts gradient compression and synchronization delay based on real-time network conditions, achieving significant speedups over existing approaches.
Details
Motivation: Distributed SGD suffers from severe throughput degradation in high-latency, low-bandwidth networks. Existing approaches use gradient compression and delayed aggregation but rely on static heuristic strategies without theoretical guidance, creating a complex three-way trade-off among compression ratio, staleness, and convergence rate that prevents optimal performance under varying network conditions.
Method: The authors introduce a new theoretical framework that decomposes the joint optimization problem into traditional convergence rate analysis with multiple analyzable noise terms. They develop DeCo-SGD, which dynamically adjusts compression ratio and staleness based on real-time network conditions and training tasks by integrating convergence rate analysis with network-aware time minimization conditions.
Result: DeCo-SGD achieves up to 5.07x speedup over distributed SGD and 1.37x speedup over static strategies in high-latency and low, varying bandwidth networks. The theoretical analysis reveals that staleness exponentially amplifies the negative impact of gradient compression on training performance.
Conclusion: The paper successfully bridges the theoretical gap in understanding how compressed and delayed gradients affect distributed training, providing the first adaptive solution that can dynamically balance compression ratio and staleness for optimal performance under varying network conditions.
Abstract: Distributed machine learning in high end-to-end latency and low, varying bandwidth network environments undergoes severe throughput degradation. Due to its low communication requirements, distributed SGD (D-SGD) remains the mainstream optimizer in such challenging networks, but it still suffers from significant throughput reduction. To mitigate these limitations, existing approaches typically employ gradient compression and delayed aggregation to alleviate low bandwidth and high latency, respectively. To address both challenges simultaneously, these strategies are often combined, introducing a complex three-way trade-off among compression ratio, staleness (delayed synchronization steps), and model convergence rate. To achieve the balance under varying bandwidth conditions, an adaptive policy is required to dynamically adjust these parameters. Unfortunately, existing works rely on static heuristic strategies due to the lack of theoretical guidance, which prevents them from achieving this goal. This study fills in this theoretical gap by introducing a new theoretical tool, decomposing the joint optimization problem into a traditional convergence rate analysis with multiple analyzable noise terms. We are the first to reveal that staleness exponentially amplifies the negative impact of gradient compression on training performance, filling a critical gap in understanding how compressed and delayed gradients affect training. Furthermore, by integrating the convergence rate with a network-aware time minimization condition, we propose DeCo-SGD, which dynamically adjusts the compression ratio and staleness based on the real-time network condition and training task. DeCo-SGD achieves up to 5.07x and 1.37x speed-ups over D-SGD and the static strategy in high-latency and low, varying bandwidth networks, respectively.
[306] TOC-UCO: a comprehensive repository of tabular ordinal classification datasets
Rafael Ayllón-Gavilán, David Guijo-Rubio, Antonio Manuel Gómez-Orellana, David Guijo-Rubio, Francisco Bérchez-Moreno, Víctor Manuel Vargas-Yun, Pedro A. Gutiérrez
Main category: cs.LG
TL;DR: The University of Córdoba presents TOC-UCO, a comprehensive repository of 46 preprocessed tabular datasets specifically designed for benchmarking ordinal classification methods, addressing the field’s lack of standardized evaluation datasets.
Details
Motivation: The ordinal classification field lacks a comprehensive set of standardized datasets for benchmarking novel approaches, hindering robust validation and comparison of new methodologies in this important area of machine learning.
Method: The authors created the TOC-UCO repository by collecting, preprocessing, and standardizing 46 tabular ordinal datasets under a common framework, ensuring reasonable pattern numbers and appropriate class distributions, with provided sources, preprocessing steps, and 30 randomized train-test partitions for reproducibility.
Result: A publicly available repository (TOC-UCO) containing 46 preprocessed tabular ordinal datasets with standardized formatting, documented preprocessing steps, and predefined train-test splits to enable consistent benchmarking of ordinal classification approaches.
Conclusion: The TOC-UCO repository fills a critical gap in the ordinal classification field by providing researchers with a standardized, comprehensive dataset collection that enables robust validation and fair comparison of novel ordinal classification methodologies.
Abstract: An ordinal classification (OC) problem corresponds to a special type of classification characterised by the presence of a natural order relationship among the classes. This type of problem can be found in a number of real-world applications, motivating the design and development of many ordinal methodologies over the last years. However, it is important to highlight that the development of the OC field suffers from one main disadvantage: the lack of a comprehensive set of datasets on which novel approaches from the literature can be benchmarked. In order to address this issue, this manuscript from the University of Córdoba (UCO), which has previous experience in the OC field, provides the literature with a publicly available repository of tabular data for a robust validation of novel OC approaches, namely TOC-UCO (Tabular Ordinal Classification repository of the UCO). Specifically, this repository includes a set of $46$ tabular ordinal datasets, preprocessed under a common framework and ensured to have a reasonable number of patterns and an appropriate class distribution. We also provide the sources and preprocessing steps of each dataset, along with details on how to benchmark a novel approach using the TOC-UCO repository. For this, indices for $30$ different randomised train-test partitions are provided to facilitate the reproducibility of the experiments.
[307] DynaSearcher: Dynamic Knowledge Graph Augmented Search Agent via Multi-Reward Reinforcement Learning
Chuzhan Hao, Wenfeng Feng, Yuewei Zhang, Hao Wang
Main category: cs.LG
TL;DR: DynaSearcher is a multi-step search agent that uses dynamic knowledge graphs and multi-reward reinforcement learning to improve factual consistency and efficiency in complex information retrieval tasks, achieving state-of-the-art performance with smaller models.
Details
Motivation: Multi-step agentic retrieval systems face significant challenges including generating factually inconsistent intermediate queries and inefficient search trajectories, leading to reasoning deviations and redundant computations in practical applications.
Method: The paper proposes DynaSearcher, which combines: (1) dynamic knowledge graphs as external structured knowledge to guide search processes and ensure factual consistency by modeling entity relationships, and (2) a multi-reward reinforcement learning framework for fine-grained control over training objectives including retrieval accuracy, efficiency, and response quality.
Result: DynaSearcher achieves state-of-the-art answer accuracy on six multi-hop question answering datasets, matching performance of frontier LLMs while using only small-scale models and limited computational resources. The approach also demonstrates strong generalization and robustness across diverse retrieval environments and larger-scale models.
Conclusion: The integration of dynamic knowledge graphs and multi-reward reinforcement learning effectively addresses key challenges in multi-step retrieval systems, enabling high-performance information search with improved factual consistency and efficiency while maintaining broad applicability across different scales and environments.
Abstract: Multi-step agentic retrieval systems based on large language models (LLMs) have demonstrated remarkable performance in complex information search tasks. However, these systems still face significant challenges in practical applications, particularly in generating factually inconsistent intermediate queries and inefficient search trajectories, which can lead to reasoning deviations or redundant computations. To address these issues, we propose DynaSearcher, an innovative search agent enhanced by dynamic knowledge graphs and multi-reward reinforcement learning (RL). Specifically, our system leverages knowledge graphs as external structured knowledge to guide the search process by explicitly modeling entity relationships, thereby ensuring factual consistency in intermediate queries and mitigating biases from irrelevant information. Furthermore, we employ a multi-reward RL framework for fine-grained control over training objectives such as retrieval accuracy, efficiency, and response quality. This framework promotes the generation of high-quality intermediate queries and comprehensive final answers, while discouraging unnecessary exploration and minimizing information omissions or redundancy. Experimental results demonstrate that our approach achieves state-of-the-art answer accuracy on six multi-hop question answering datasets, matching frontier LLMs while using only small-scale models and limited computational resources. Furthermore, our approach demonstrates strong generalization and robustness across diverse retrieval environments and larger-scale models, highlighting its broad applicability.
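A minimal sketch of combining several reward terms for the RL objective is shown below; the particular terms, normalization, and weights are placeholders rather than the paper's reward design.

```python
def combined_reward(answer_f1: float, n_search_calls: int, well_formatted: bool,
                    w_acc: float = 1.0, w_eff: float = 0.2, w_fmt: float = 0.1,
                    max_calls: int = 8) -> float:
    """Weighted sum of answer accuracy, search efficiency, and formatting rewards."""
    efficiency = max(0.0, 1.0 - n_search_calls / max_calls)   # fewer retrievals score higher
    return w_acc * answer_f1 + w_eff * efficiency + w_fmt * (1.0 if well_formatted else 0.0)
```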
[308] ViRN: Variational Inference and Distribution Trilateration for Long-Tailed Continual Representation Learning
Hao Dai, Chong Tang, Jagmohan Chauhan
Main category: cs.LG
TL;DR: ViRN is a continual learning framework that combines variational inference with distributional trilateration to handle long-tailed data distributions, achieving 10.24% accuracy improvement over existing methods on classification benchmarks.
Details
Motivation: Real-world AI systems face the challenge of continual learning with long-tailed data distributions, where models must sequentially adapt to new classes while retaining old knowledge despite severe class imbalance. Existing methods struggle to balance stability and plasticity, often failing under extreme sample scarcity conditions.
Method: The paper proposes ViRN, which integrates two key components: (1) Variational Autoencoder modeling of class-conditional distributions to mitigate bias toward head classes, and (2) Wasserstein distance-based neighborhood retrieval with geometric fusion to reconstruct tail-class distributions for sample-efficient alignment of tail-class representations.
Result: ViRN was evaluated on six long-tailed classification benchmarks covering both speech (rare acoustic events, accents) and image tasks, demonstrating a 10.24% average accuracy gain over state-of-the-art continual learning methods.
Conclusion: ViRN successfully addresses the continual learning challenge in long-tailed scenarios by effectively combining variational inference with distributional trilateration, providing a robust solution that significantly outperforms existing approaches across diverse classification tasks.
Abstract: Continual learning (CL) with long-tailed data distributions remains a critical challenge for real-world AI systems, where models must sequentially adapt to new classes while retaining knowledge of old ones, despite severe class imbalance. Existing methods struggle to balance stability and plasticity, often collapsing under extreme sample scarcity. To address this, we propose ViRN, a novel CL framework that integrates variational inference (VI) with distributional trilateration for robust long-tailed learning. First, we model class-conditional distributions via a Variational Autoencoder to mitigate bias toward head classes. Second, we reconstruct tail-class distributions via Wasserstein distance-based neighborhood retrieval and geometric fusion, enabling sample-efficient alignment of tail-class representations. Evaluated on six long-tailed classification benchmarks, including speech (e.g., rare acoustic events, accents) and image tasks, ViRN achieves a 10.24% average accuracy gain over state-of-the-art methods.
[309] Continual Generalized Category Discovery: Learning and Forgetting from a Bayesian Perspective
Hao Dai, Jagmohan Chauhan
Main category: cs.LG
TL;DR: This paper proposes VB-CGCD, a variational Bayesian framework for continual generalized category discovery that addresses catastrophic forgetting by aligning class covariances and using stochastic variational updates to suppress pseudo-label noise.
Details
Motivation: Existing Continual Generalized Category Discovery (C-GCD) methods suffer from catastrophic forgetting when incrementally learning new classes from unlabeled data streams while preserving knowledge of old classes, particularly when unlabeled data contains mixed known and novel categories.
Method: The authors analyze C-GCD’s forgetting dynamics through a Bayesian perspective and propose VB-CGCD, which integrates variational inference with covariance-aware nearest-class-mean classification. The method adaptively aligns class distributions and suppresses pseudo-label noise via stochastic variational updates.
Result: VB-CGCD achieves +15.21% improvement in overall accuracy compared to prior methods on standard benchmarks. On a new challenging benchmark with only 10% labeled data, VB-CGCD reaches 67.86% final accuracy versus 38.55% for state-of-the-art methods.
Conclusion: The paper demonstrates that covariance misalignment between old and new classes is a key driver of performance degradation in C-GCD, and that variational Bayesian approaches with covariance-aware classification can effectively address this issue, showing robust performance across diverse scenarios.
Abstract: Continual Generalized Category Discovery (C-GCD) faces a critical challenge: incrementally learning new classes from unlabeled data streams while preserving knowledge of old classes. Existing methods struggle with catastrophic forgetting, especially when unlabeled data mixes known and novel categories. We address this by analyzing C-GCD’s forgetting dynamics through a Bayesian lens, revealing that covariance misalignment between old and new classes drives performance degradation. Building on this insight, we propose Variational Bayes C-GCD (VB-CGCD), a novel framework that integrates variational inference with covariance-aware nearest-class-mean classification. VB-CGCD adaptively aligns class distributions while suppressing pseudo-label noise via stochastic variational updates. Experiments show VB-CGCD surpasses prior art by +15.21% in overall accuracy in the final session on standard benchmarks. We also introduce a new, more challenging benchmark with only 10% labeled data and extended online phases, on which VB-CGCD achieves a 67.86% final accuracy, significantly higher than the state of the art (38.55%), demonstrating its robust applicability across diverse scenarios. Code is available at: https://github.com/daihao42/VB-CGCD
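The covariance-aware nearest-class-mean rule at the heart of the classifier can be sketched as a per-class Mahalanobis distance, as below; how VB-CGCD estimates and aligns these distributions with stochastic variational updates is not reproduced here.

```python
import numpy as np

def fit_class_gaussians(X, y, reg=1e-3):
    """Per-class mean and inverse covariance (with diagonal regularization)."""
    stats = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        cov = np.cov(Xc, rowvar=False) + reg * np.eye(X.shape[1])
        stats[int(c)] = (mu, np.linalg.inv(cov))
    return stats

def covariance_aware_ncm(X, stats):
    """Assign each sample to the class with the smallest Mahalanobis distance."""
    classes = sorted(stats)
    dists = np.stack([np.einsum('nd,dk,nk->n', X - stats[c][0], stats[c][1], X - stats[c][0])
                      for c in classes], axis=1)
    return np.array(classes)[dists.argmin(axis=1)]
```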
[310] A Comprehensive Evaluation on Quantization Techniques for Large Language Models
Yutong Liu, Cairong Zhao, Guosheng Hu
Main category: cs.LG
TL;DR: This paper provides a comprehensive review and fair comparison of post-training quantization methods for large language models, decoupling them into pre-quantization transformation and quantization error mitigation components, and evaluates the latest MXFP4 data format.
Details
Motivation: The quantization field lacks fair comparisons since methods contain multiple components and are tested on different grounds. Additionally, theoretical connections among existing methods need deeper analysis for better understanding of the field.
Method: The authors conduct an extensive literature review, perform comprehensive evaluations on the same experimental ground, and decouple quantization methods into two key steps: pre-quantization transformation (preprocessing to reduce outlier impact) and quantization error mitigation (techniques to offset quantization errors).
Result: Optimized rotation and scaling achieve best performance for pre-quantization transformation. Combining low-rank compensation with GPTQ occasionally outperforms GPTQ alone for error mitigation. The optimal pre-quantization strategy for INT4 does not generalize well to MXFP4 format.
Conclusion: The paper establishes a fair evaluation framework for quantization methods and reveals that different quantization formats may require different optimization strategies, highlighting the need for format-specific approaches in quantization research.
Abstract: For large language models (LLMs), post-training quantization (PTQ) can significantly reduce memory footprint and computational overhead. Model quantization is a rapidly evolving research field. Though many papers have reported breakthrough performance, they may not conduct experiments on the same ground since one quantization method usually contains multiple components. In addition, analyzing the theoretical connections among existing methods is crucial for in-depth understanding. To bridge these gaps, we conduct an extensive review of state-of-the-art methods and perform comprehensive evaluations on the same ground to ensure fair comparisons. To our knowledge, this fair and extensive investigation remains critically important yet underexplored. To better understand the theoretical connections, we decouple the published quantization methods into two steps: pre-quantization transformation and quantization error mitigation. We define the former as a preprocessing step applied before quantization to reduce the impact of outliers, making the data distribution flatter and more suitable for quantization. Quantization error mitigation involves techniques that offset the errors introduced during quantization, thereby enhancing model performance. We evaluate and analyze the impact of different components of quantization methods. Additionally, we analyze and evaluate the latest MXFP4 data format and its performance. Our experimental results demonstrate that optimized rotation and scaling yield the best performance for pre-quantization transformation, and combining low-rank compensation with GPTQ occasionally outperforms using GPTQ alone for quantization error mitigation. Furthermore, we explore the potential of the latest MXFP4 quantization and reveal that the optimal pre-quantization transformation strategy for INT4 does not generalize well to MXFP4, inspiring further investigation.
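To illustrate the two-step decomposition, the sketch below applies a per-channel scaling transformation before symmetric 4-bit weight quantization; the scaling rule (a SmoothQuant-like square-root balance) and bit width are assumptions used only to make the pipeline concrete, not any specific published method.

```python
import torch

def scale_transform(W, X, alpha=0.5):
    """Per-input-channel scaling that balances weight and activation magnitudes.
    X @ W.T is preserved exactly because the scales cancel: (X/s) @ (W*s).T."""
    s = X.abs().amax(dim=0).pow(alpha) / (W.abs().amax(dim=0).pow(1 - alpha) + 1e-8)
    return W * s, X / s

def quantize_int4(W):
    """Symmetric per-row 4-bit quantization followed by dequantization."""
    scale = W.abs().amax(dim=1, keepdim=True) / 7.0
    return torch.clamp(torch.round(W / scale), -8, 7) * scale

# Measure the output error of 4-bit weights after the scaling transformation.
torch.manual_seed(0)
W = torch.randn(64, 64)                                     # (out_features, in_features)
X = torch.randn(128, 64) * torch.linspace(0.1, 5.0, 64)     # uneven activation channel scales
W_s, X_s = scale_transform(W, X)
err = (X_s @ quantize_int4(W_s).T - X @ W.T).abs().mean()
```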
[311] Persistent Patterns in Eye Movements: A Topological Approach to Emotion Recognition
Arsha Niksa, Hooman Zare, Ali Shahrabi, Hanieh Hatami, Mohammadreza Razvan
Main category: cs.LG
TL;DR: This paper presents a novel topological approach for emotion recognition from eye-tracking data using persistent homology analysis of gaze trajectories, achieving 75.6% accuracy in classifying four emotion classes.
Details
Motivation: Traditional emotion recognition methods may not effectively capture the complex dynamics and geometric patterns in gaze trajectories. The authors aim to leverage topological data analysis to better understand and classify emotional states from eye-tracking data by analyzing the shape and structure of gaze patterns.
Method: The authors develop a topological pipeline that: (1) creates delay embeddings of gaze trajectories from eye-tracking data, (2) applies persistent homology to analyze these embeddings, (3) extracts shape-based features from persistence diagrams including mean persistence, maximum persistence, and entropy, and (4) trains a random forest classifier on these topological features.
Result: The proposed method achieves up to 75.6% accuracy in classifying four emotion classes corresponding to the quadrants of the Circumplex Model of Affect. The persistence diagram geometry successfully encodes discriminative information about gaze dynamics for emotion recognition.
Conclusion: The study demonstrates that topological features derived from persistent homology can effectively capture discriminative gaze dynamics for emotion classification. This suggests that topological approaches hold promise for affective computing and human behavior analysis applications.
Abstract: We present a topological pipeline for automated multiclass emotion recognition from eye-tracking data. Delay embeddings of gaze trajectories are analyzed using persistent homology. From the resulting persistence diagrams, we extract shape-based features such as mean persistence, maximum persistence, and entropy. A random forest classifier trained on these features achieves up to 75.6% accuracy on four emotion classes, which correspond to the quadrants of the Circumplex Model of Affect. The results demonstrate that persistence diagram geometry effectively encodes discriminative gaze dynamics, suggesting a promising topological approach for affective computing and human behavior analysis.
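An abbreviated version of such a pipeline is sketched below, assuming the `ripser` and `scikit-learn` packages and a single 1-D gaze coordinate per trial; the exact embedding parameters and diagram features used in the paper may differ.

```python
import numpy as np
from ripser import ripser                                 # pip install ripser
from sklearn.ensemble import RandomForestClassifier

def delay_embed(signal, dim=3, tau=5):
    """Takens-style delay embedding of a 1-D signal (e.g. one gaze coordinate)."""
    n = len(signal) - (dim - 1) * tau
    return np.stack([signal[i:i + n] for i in range(0, dim * tau, tau)], axis=1)

def persistence_features(points, maxdim=1):
    """Mean persistence, max persistence, and persistence entropy per homology dim."""
    feats = []
    for dgm in ripser(points, maxdim=maxdim)['dgms']:
        finite = dgm[np.isfinite(dgm[:, 1])]
        life = finite[:, 1] - finite[:, 0]
        if len(life) == 0:
            feats += [0.0, 0.0, 0.0]
            continue
        p = life / life.sum()
        feats += [life.mean(), life.max(), float(-(p * np.log(p + 1e-12)).sum())]
    return np.array(feats)

def train_emotion_classifier(gaze_signals, labels):
    """gaze_signals: list of 1-D arrays; labels: emotion quadrant per trial."""
    X = np.stack([persistence_features(delay_embed(g)) for g in gaze_signals])
    return RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)
```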
[312] Efficient Neural Network Verification via Order Leading Exploration of Branch-and-Bound Trees
Guanqin Zhang, Kota Fukuda, Zhenya Zhang, H. M. N. Dilum Bandara, Shiping Chen, Jianjun Zhao, Yulei Sui
Main category: cs.LG
TL;DR: This paper introduces Oliva, a novel neural network verification framework that improves branch-and-bound (BaB) efficiency by intelligently ordering sub-problems based on their likelihood of containing counterexamples, achieving up to 25X speedup on MNIST and 80X on CIFAR10.
Details
Motivation: Existing branch-and-bound verification methods explore sub-problems in a naive "first-come-first-serve" manner, leading to inefficiency in reaching verification conclusions. There's a need to intelligently prioritize sub-problems to accelerate the verification process.
Method: The paper proposes the Oliva framework with two variants: (1) Oliva^GR - a greedy strategy that always prioritizes sub-problems most likely to contain counterexamples, and (2) Oliva^SA - a balanced strategy inspired by simulated annealing that gradually shifts from exploration to exploitation to find globally optimal sub-problems.
Result: Experimental evaluation on 690 verification problems across 5 models with MNIST and CIFAR10 datasets shows significant speedups: up to 25X faster on MNIST and up to 80X faster on CIFAR10 compared to state-of-the-art approaches.
Conclusion: The intelligent ordering of sub-problems based on counterexample likelihood significantly improves neural network verification efficiency without performance degradation, making formal verification more practical for real-world applications.
Abstract: The vulnerability of neural networks to adversarial perturbations has necessitated formal verification techniques that can rigorously certify the quality of neural networks. As the state-of-the-art, branch and bound (BaB) is a “divide-and-conquer” strategy that applies off-the-shelf verifiers to sub-problems for which they perform better. While BaB can identify the sub-problems that are necessary to be split, it explores the space of these sub-problems in a naive “first-come-first-serve” manner, thereby suffering from inefficiency in reaching a verification conclusion. To bridge this gap, we introduce an order over the different sub-problems produced by BaB, based on their different likelihoods of containing counterexamples. Based on this order, we propose a novel verification framework, Oliva, that explores the sub-problem space by prioritizing those sub-problems that are more likely to contain counterexamples, in order to efficiently reach the conclusion of the verification. Even if no counterexample can be found in any sub-problem, this only changes the order in which different sub-problems are visited and so does not lead to performance degradation. Specifically, Oliva has two variants: $Oliva^{GR}$, a greedy strategy that always prioritizes the sub-problems that are more likely to contain counterexamples, and $Oliva^{SA}$, a balanced strategy inspired by simulated annealing that gradually shifts from exploration to exploitation to locate the globally optimal sub-problems. We experimentally evaluate the performance of Oliva on 690 verification problems spanning 5 models with the MNIST and CIFAR10 datasets. Compared to the state-of-the-art approaches, we demonstrate speedups of Oliva of up to 25X on MNIST and up to 80X on CIFAR10.
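The ordering idea reduces to replacing the FIFO worklist with a priority queue keyed by an estimated likelihood of containing a counterexample, as in the schematic below; the `score`, `split`, and `verify` callbacks are placeholders, and the simulated-annealing variant is not shown.

```python
import heapq

def prioritized_bab(root, score, split, verify, max_iters=10_000):
    """Branch and bound that visits the sub-problems most likely to contain a
    counterexample first. verify() returns 'safe', 'unsafe', or 'unknown'."""
    heap = [(-score(root), 0, root)]                 # max-heap via negated scores
    tie = 1                                          # tie-breaker so heapq never compares problems
    while heap and max_iters:
        max_iters -= 1
        _, _, prob = heapq.heappop(heap)
        status = verify(prob)
        if status == 'unsafe':
            return 'unsafe', prob                    # counterexample found, stop early
        if status == 'unknown':
            for child in split(prob):                # split and re-queue with fresh scores
                heapq.heappush(heap, (-score(child), tie, child))
                tie += 1
        # 'safe' sub-problems are simply discarded
    return ('safe', None) if not heap else ('unknown', None)
```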
[313] C3RL: Rethinking the Combination of Channel-independence and Channel-mixing from Representation Learning
Shusen Ma, Yun-Bo Zhao, Yu Kang
Main category: cs.LG
TL;DR: C3RL is a novel representation learning framework for multivariate time series forecasting that combines channel-mixing (CM) and channel-independence (CI) strategies using contrastive learning and siamese network architecture to achieve better forecasting performance.
Details
Motivation: Existing multivariate time series forecasting approaches face limitations: channel-mixing (CM) strategy captures inter-variable dependencies but misses variable-specific temporal patterns, while channel-independence (CI) strategy improves temporal patterns but fails to exploit cross-variable dependencies. Hybrid strategies based on feature fusion offer limited generalization and interpretability.
Method: C3RL uses a siamese network architecture that treats CM and CI strategy inputs as transposed views, inspired by contrastive learning in computer vision. One strategy serves as the backbone while the other complements it. The framework jointly optimizes contrastive and prediction losses with adaptive weighting to balance representation learning and forecasting performance.
Result: Extensive experiments on seven models demonstrate that C3RL significantly improves performance: boosting best-case performance rate to 81.4% for CI strategy-based models and to 76.3% for CM strategy-based models, showing strong generalization and effectiveness across different model architectures.
Conclusion: C3RL successfully addresses the limitations of existing approaches by effectively combining the strengths of both CM and CI strategies through contrastive learning, achieving substantial performance improvements while maintaining good generalization across different model types.
Abstract: Multivariate time series forecasting has drawn increasing attention due to its practical importance. Existing approaches typically adopt either channel-mixing (CM) or channel-independence (CI) strategies. CM strategy can capture inter-variable dependencies but fails to discern variable-specific temporal patterns. CI strategy improves this aspect but fails to fully exploit cross-variable dependencies like CM. Hybrid strategies based on feature fusion offer limited generalization and interpretability. To address these issues, we propose C3RL, a novel representation learning framework that jointly models both CM and CI strategies. Motivated by contrastive learning in computer vision, C3RL treats the inputs of the two strategies as transposed views and builds a siamese network architecture: one strategy serves as the backbone, while the other complements it. By jointly optimizing contrastive and prediction losses with adaptive weighting, C3RL balances representation and forecasting performance. Extensive experiments on seven models show that C3RL boosts the best-case performance rate to 81.4% for models based on CI strategy and to 76.3% for models based on CM strategy, demonstrating strong generalization and effectiveness. The code will be available once the paper is accepted.
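A minimal PyTorch sketch of the joint objective follows. It treats the CM and CI inputs as transposed views fed to a backbone and a complement encoder, and combines an InfoNCE-style contrastive term with a forecasting term. The module names, the temperature, and the fixed weight `w` are assumptions standing in for C3RL's adaptive weighting.

```python
# Sketch of a joint contrastive + prediction objective over CM/CI "transposed views" (assumed design).
import torch
import torch.nn.functional as F

def c3rl_step(backbone, complement, head, x, y, w=0.5):
    # x: (batch, channels, time). The CM view mixes channels; the CI view sees the transposed tensor.
    z_cm = backbone(x)                      # (batch, d) representation from the channel-mixing view
    z_ci = complement(x.transpose(1, 2))    # (batch, d) representation from the transposed (CI) view
    z_cm, z_ci = F.normalize(z_cm, dim=-1), F.normalize(z_ci, dim=-1)

    # InfoNCE-style contrastive loss: the two views of the same series should match
    logits = z_cm @ z_ci.t() / 0.1
    targets = torch.arange(x.size(0), device=x.device)
    loss_con = F.cross_entropy(logits, targets)

    loss_pred = F.mse_loss(head(z_cm), y)   # forecasting loss on the backbone representation
    return w * loss_con + (1 - w) * loss_pred   # C3RL uses adaptive rather than fixed weighting
```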
[314] BGM-HAN: A Hierarchical Attention Network for Accurate and Fair Decision Assessment on Semi-Structured Profiles
Junhua Liu, Roy Ka-Wei Lee, Kwan Hui Lim
Main category: cs.LG
TL;DR: This paper presents BGM-HAN, an enhanced hierarchical attention network that improves university admissions decision-making by modeling semi-structured applicant data more effectively than existing methods, addressing cognitive biases while maintaining fairness and interpretability.
Details
Motivation: Human decision-making in high-stakes domains like university admissions is vulnerable to hard-to-detect cognitive biases that threaten fairness and long-term outcomes, despite relying on expertise and heuristics. There's a need for AI systems that can augment decision-making while maintaining structure, context, and fairness.
Method: The authors propose BGM-HAN (Byte-Pair Encoded, Gated Multi-head Hierarchical Attention Network), which integrates hierarchical learning with various enhancements to model semi-structured applicant data. The network captures multi-level representations through hierarchical attention mechanisms for nuanced assessment of university admission applications.
Result: Experimental results on real admissions data show that BGM-HAN significantly outperforms state-of-the-art baselines ranging from traditional machine learning methods to large language models, demonstrating improvements in both interpretability and predictive performance.
Conclusion: BGM-HAN offers a promising framework for augmenting decision-making in high-stakes domains where structure, context, and fairness are critical. The model successfully addresses cognitive biases in human decision-making while providing superior performance compared to existing approaches.
Abstract: Human decision-making in high-stakes domains often relies on expertise and heuristics, but is vulnerable to hard-to-detect cognitive biases that threaten fairness and long-term outcomes. This work presents a novel approach to enhancing complex decision-making workflows through the integration of hierarchical learning alongside various enhancements. Focusing on university admissions as a representative high-stakes domain, we propose BGM-HAN, an enhanced Byte-Pair Encoded, Gated Multi-head Hierarchical Attention Network, designed to effectively model semi-structured applicant data. BGM-HAN captures multi-level representations that are crucial for nuanced assessment, improving both interpretability and predictive performance. Experimental results on real admissions data demonstrate that our proposed model significantly outperforms state-of-the-art baselines, ranging from traditional machine learning to large language models, offering a promising framework for augmenting decision-making in domains where structure, context, and fairness matter. Source code is available at: https://github.com/junhua/bgm-han.
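To make the "gated hierarchical attention over semi-structured profiles" idea concrete, here is a simplified PyTorch reading of such an architecture: token-level attention within each field, field-level attention across fields, and a sigmoid gate before pooling. This is an illustrative interpretation, not the released BGM-HAN code; dimensions and pooling choices are assumptions.

```python
# Simplified gated hierarchical attention over a profile of (fields x tokens) embeddings.
import torch
import torch.nn as nn

class GatedHierAttention(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_classes=2):
        super().__init__()
        self.token_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.field_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, x):                      # x: (batch, fields, tokens, d_model) embeddings
        b, f, t, d = x.shape
        tokens = x.reshape(b * f, t, d)
        field_repr, _ = self.token_attn(tokens, tokens, tokens)
        field_repr = field_repr.mean(dim=1).reshape(b, f, d)    # one vector per field
        profile, _ = self.field_attn(field_repr, field_repr, field_repr)
        profile = (self.gate(profile) * profile).mean(dim=1)    # gated pooling over fields
        return self.classifier(profile)
```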
[315] HOTA: Hamiltonian framework for Optimal Transport Advection
Nazar Buzun, Daniil Shlenskii, Maxim Bobrin, Dmitry V. Dylov
Main category: cs.LG
TL;DR: HOTA is a new method for optimal transport that uses Hamilton-Jacobi-Bellman equations to optimize trajectories through Kantorovich potentials, avoiding density estimation and handling non-smooth cost functions better than existing approaches.
Details
Motivation: Current generative models using optimal transport assume trivial geometry and rely on strong density-estimation assumptions, resulting in trajectories that don't respect true optimality principles in the underlying manifold. There's a need for methods that can handle complex geometries and avoid explicit density modeling.
Method: Hamiltonian Optimal Transport Advection (HOTA) uses a Hamilton-Jacobi-Bellman based approach to tackle the dual dynamical optimal transport problem explicitly through Kantorovich potentials, enabling efficient and scalable trajectory optimization without requiring explicit density modeling.
Result: HOTA outperforms all baseline methods on standard benchmarks and custom datasets with non-differentiable costs, demonstrating superior performance in both feasibility and optimality metrics. The method works effectively even when cost functionals are non-smooth.
Conclusion: HOTA provides an effective solution for optimal transport problems by avoiding density estimation requirements and handling non-smooth cost functions, making it more practical and robust than existing methods while achieving better empirical performance.
Abstract: Optimal transport (OT) has become a natural framework for guiding probability flows. Yet, the majority of recent generative models assume trivial geometry (e.g., Euclidean) and rely on strong density-estimation assumptions, yielding trajectories that do not respect the true principles of optimality in the underlying manifold. We present Hamiltonian Optimal Transport Advection (HOTA), a Hamilton-Jacobi-Bellman based method that tackles the dual dynamical OT problem explicitly through Kantorovich potentials, enabling efficient and scalable trajectory optimization. Our approach effectively evades the need for explicit density modeling, performing well even when the cost functionals are non-smooth. Empirically, HOTA outperforms all baselines in standard benchmarks, as well as in custom datasets with non-differentiable costs, both in terms of feasibility and optimality.
[316] Generalized Low-Rank Matrix Contextual Bandits with Graph Information
Yao Wang, Jiannan Li, Yue Kang, Shanxing Gao, Zhenxin Xiao
Main category: cs.LG
TL;DR: The paper proposes a novel matrix contextual bandit framework that combines low-rank structure with graph information using UCB algorithm, achieving better cumulative regret bounds than existing methods.
Details
Motivation: Existing matrix contextual bandit methods only utilize low-rank structure but fail to exploit additional graph information that naturally exists in real-world scenarios like online advertising and recommender systems, where similarity relationships among users/items can be captured through graph connectivity.
Method: The authors develop a matrix CB algorithmic framework based on the upper confidence bound (UCB) approach that integrates both low-rank structure and graph information by solving a joint nuclear norm and matrix Laplacian regularization problem, followed by implementing a graph-based generalized linear UCB algorithm.
Result: Theoretical analysis shows the proposed method achieves better cumulative regret bounds compared to popular alternatives due to effective utilization of graph information. Synthetic and real-world experiments further demonstrate the superior performance of the proposed procedure.
Conclusion: The paper successfully fills the gap in matrix contextual bandit literature by incorporating graph information alongside low-rank structure, resulting in improved decision-making policies with better theoretical guarantees and empirical performance.
Abstract: The matrix contextual bandit (CB), as an extension of the well-known multi-armed bandit, is a powerful framework that has been widely applied in sequential decision-making scenarios involving low-rank structure. In many real-world scenarios, such as online advertising and recommender systems, additional graph information often exists beyond the low-rank structure, that is, similarity relationships among users/items can be naturally captured through the connectivity among nodes in the corresponding graphs. However, existing matrix CB methods fail to exploit such graph information, making it difficult for them to generate effective decision-making policies. To fill this void, we propose in this paper a novel matrix CB algorithmic framework that builds upon the classical upper confidence bound (UCB) framework. This new framework can effectively integrate both the low-rank structure and graph information in a unified manner. Specifically, it involves first solving a joint nuclear norm and matrix Laplacian regularization problem, followed by the implementation of a graph-based generalized linear version of the UCB algorithm. Rigorous theoretical analysis demonstrates that our procedure outperforms several popular alternatives in terms of cumulative regret bound, owing to the effective utilization of graph information. A series of synthetic and real-world data experiments are conducted to further illustrate the merits of our procedure.
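The first stage of the framework, a joint nuclear-norm and graph-Laplacian regularized estimate of the reward matrix, can be approached with proximal gradient descent and singular-value thresholding. The sketch below uses that standard solver as an illustration; the variable names and step sizes are assumptions, and the authors' actual solver and UCB construction may differ.

```python
# Proximal-gradient sketch of a nuclear-norm + graph-Laplacian regularized estimate.
import numpy as np

def svt(M, tau):
    """Singular-value thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def estimate_theta(X, y, L, lam_nuc=0.1, lam_graph=0.1, lr=0.01, iters=500):
    # X: (n, d1, d2) arm feature matrices, y: (n,) rewards, L: (d1, d1) graph Laplacian
    theta = np.zeros(X.shape[1:])
    for _ in range(iters):
        resid = np.einsum("nij,ij->n", X, theta) - y          # prediction residuals
        grad = np.einsum("n,nij->ij", resid, X) / len(y)      # least-squares gradient
        grad += lam_graph * (L + L.T) @ theta                  # gradient of tr(Theta^T L Theta)
        theta = svt(theta - lr * grad, lr * lam_nuc)           # proximal (nuclear-norm) step
    return theta
```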
[317] Generalized Advantage Estimation for Distributional Policy Gradients
Shahil Shaik, Jonathon M. Smereka, Yue Wang
Main category: cs.LG
TL;DR: This paper proposes Distributional Generalized Advantage Estimation (DGAE), which extends traditional GAE to handle value distributions in distributional reinforcement learning using optimal transport theory and a Wasserstein-like directional metric.
Details
Motivation: Traditional GAE is not designed to handle value distributions in distributional RL, which can better capture system stochasticity and provide more robustness to system noises compared to point estimates.
Method: The authors introduce a Wasserstein-like directional metric using optimal transport theory that measures both distance and directional discrepancies between probability distributions. They then use exponentially weighted estimation to derive DGAE, which provides low-variance advantage estimates for distributional RL.
Result: DGAE was integrated into three different policy gradient methods and evaluated on various OpenAI Gym environments, showing improved performance compared to baselines using traditional GAE.
Conclusion: DGAE successfully extends GAE to distributional RL settings, providing controlled bias and low-variance advantage estimates that are well-suited for policy gradient algorithms in stochastic environments.
Abstract: Generalized Advantage Estimation (GAE) has been used to mitigate the computational complexity of reinforcement learning (RL) by employing an exponentially weighted estimation of the advantage function to reduce the variance in policy gradient estimates. Despite its effectiveness, GAE is not designed to handle value distributions integral to distributional RL, which can capture the inherent stochasticity in systems and is hence more robust to system noises. To address this gap, we propose a novel approach that utilizes optimal transport theory to introduce a Wasserstein-like directional metric, which measures both the distance and the directional discrepancies between probability distributions. Using the exponentially weighted estimation, we leverage this Wasserstein-like directional metric to derive distributional GAE (DGAE). Similar to traditional GAE, our proposed DGAE provides a low-variance advantage estimate with controlled bias, making it well-suited for policy gradient algorithms that rely on advantage estimation for policy updates. We integrated DGAE into three different policy gradient methods. Algorithms were evaluated across various OpenAI Gym environments and compared with baselines using traditional GAE to assess performance.
[318] Federated Majorize-Minimization: Beyond Parameter Aggregation
Aymeric Dieuleveut, Gersende Fort, Mahmoud Hegazy, Hoi-To Wai
Main category: cs.LG
TL;DR: This paper proposes a unified framework for federated learning optimization algorithms based on Majorize-Minimization (MM) problems, introducing SSMM for centralized settings and QSMM for federated settings that aggregates surrogate function information rather than parameters.
Details
Motivation: Existing federated learning algorithms face challenges with data heterogeneity, partial participation, and communication constraints. The authors aim to develop a unified approach that can robustly scale stochastic optimization algorithms to federated settings while addressing these common bottlenecks.
Method: The paper studies Majorize-Minimization (MM) problems with linearly parameterized majorizing surrogate functions. They develop Stochastic Approximation Stochastic Surrogate MM (SSMM) as a unifying algorithm for centralized settings, then extend it to QSMM for federated learning. The key innovation is aggregating information about the surrogate majorizing function locally rather than aggregating original parameters.
Result: The framework encompasses various existing algorithms including proximal gradient methods, Expectation Maximization, and variational surrogate MM as special cases. QSMM successfully handles federated learning challenges while maintaining the theoretical properties of the MM framework. The methodology is demonstrated through an application to federated optimal transport map computation.
Conclusion: The proposed unified MM-based framework provides a robust and flexible approach for federated optimization that can handle common federated learning challenges. The key insight of aggregating surrogate function information rather than parameters offers a novel perspective for federated algorithm design, with broad applicability demonstrated through the optimal transport example.
Abstract: This paper proposes a unified approach for designing stochastic optimization algorithms that robustly scale to the federated learning setting. Our work studies a class of Majorize-Minimization (MM) problems, which possesses a linearly parameterized family of majorizing surrogate functions. This framework encompasses (proximal) gradient-based algorithms for (regularized) smooth objectives, the Expectation Maximization algorithm, and many problems seen as variational surrogate MM. We show that our framework motivates a unifying algorithm called Stochastic Approximation Stochastic Surrogate MM (SSMM), which includes previous stochastic MM procedures as special instances. We then extend SSMM to the federated setting, while taking into consideration common bottlenecks such as data heterogeneity, partial participation, and communication constraints; this yields QSMM. The originality of QSMM is to learn locally and then aggregate information characterizing the \textit{surrogate majorizing function}, contrary to classical algorithms which learn and aggregate the \textit{original parameter}. Finally, to showcase the flexibility of this methodology beyond our theoretical setting, we use it to design an algorithm for computing optimal transport maps in the federated setting.
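The key distinction, aggregating quantities that characterize the local surrogate rather than locally updated parameters, is easiest to see in an EM-style special case. The toy sketch below uses a one-dimensional Gaussian mixture, where the surrogate is summarized by sufficient statistics that clients send to the server; it is a schematic illustration of the idea, not the QSMM algorithm.

```python
# Toy: clients send surrogate (EM sufficient) statistics; the server aggregates those, not parameters.
import numpy as np

def local_e_step(x, means, weights, var=1.0):
    """Client-side: responsibilities and the surrogate's sufficient statistics."""
    log_p = -(x[:, None] - means[None, :]) ** 2 / (2 * var) + np.log(weights)
    resp = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    return resp.sum(axis=0), resp.T @ x            # (N_k, S_k): per-component counts and weighted sums

def server_m_step(stats):
    """Server-side: aggregate the surrogate statistics from all clients, then maximize once."""
    N = sum(s[0] for s in stats)
    S = sum(s[1] for s in stats)
    return S / N, N / N.sum()                       # new component means and mixture weights

# one round over client datasets xs = [x_1, ..., x_C]:
# stats = [local_e_step(x, means, weights) for x in xs]
# means, weights = server_m_step(stats)
```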
[319] Enhancing Quantum Federated Learning with Fisher Information-Based Optimization
Amandeep Singh Bhatia, Sabre Kais
Main category: cs.LG
TL;DR: This paper proposes a Quantum Federated Learning (QFL) algorithm that uses Fisher information to identify and preserve critical parameters during model aggregation, addressing challenges like communication costs and data heterogeneity in federated learning with quantum circuits.
Details
Motivation: Federated Learning faces challenges including high communication costs, heterogeneous client data, prolonged processing times, and privacy vulnerabilities. While quantum federated learning shows promise for healthcare and finance applications, existing approaches need better methods to handle parameter aggregation in heterogeneous quantum federated settings.
Method: The authors develop a QFL algorithm that leverages Fisher information computed on local client models to identify critical parameters that significantly influence quantum model performance. These critical parameters are then preserved during the aggregation process across heterogeneous data partitions.
Result: Experimental evaluation on ADNI and MNIST datasets shows that the proposed Fisher information-based QFL approach achieves better performance and robustness compared to quantum federated averaging methods and other QFL variants.
Conclusion: The Fisher information-based approach effectively addresses key challenges in quantum federated learning by preserving critical parameters during aggregation, leading to improved performance and robustness in heterogeneous federated settings while maintaining data privacy.
Abstract: Federated Learning (FL) has become increasingly popular across different sectors, offering a way for clients to work together to train a global model without sharing sensitive data. It involves multiple rounds of communication between the global model and participating clients, which introduces several challenges like high communication costs, heterogeneous client data, prolonged processing times, and increased vulnerability to privacy threats. In recent years, the convergence of federated learning and parameterized quantum circuits has sparked significant research interest, with promising implications for fields such as healthcare and finance. By enabling decentralized training of quantum models, it allows clients or institutions to collaboratively enhance model performance and outcomes while preserving data privacy. Recognizing that Fisher information can quantify the amount of information that a quantum state carries under parameter changes, thereby providing insight into its geometric and statistical properties, we intend to leverage this property to address the aforementioned challenges. In this work, we propose a Quantum Federated Learning (QFL) algorithm that makes use of the Fisher information computed on local client models, with data distributed across heterogeneous partitions. This approach identifies the critical parameters that significantly influence the quantum model’s performance, ensuring they are preserved during the aggregation process. Our research assessed the effectiveness and feasibility of QFL by comparing its performance against other variants, and exploring the benefits of incorporating Fisher information in QFL settings. Experimental results on ADNI and MNIST datasets demonstrate the effectiveness of our approach in achieving better performance and robustness compared to the quantum federated averaging method.
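One simple way to realize "preserve parameters with high Fisher information during aggregation" is to weight each parameter's contribution by its local Fisher estimate. The snippet below is a hedged sketch of that idea using a diagonal Fisher approximation; the weighting rule is an assumption, not necessarily the aggregation rule in the paper.

```python
# Sketch of Fisher-information-aware federated aggregation (diagonal-Fisher assumption).
import numpy as np

def fisher_aggregate(client_params, client_fishers, eps=1e-8):
    """client_params / client_fishers: lists of 1-D arrays of equal length (one entry per client)."""
    params = np.stack(client_params)        # (clients, n_params)
    fisher = np.stack(client_fishers)       # (clients, n_params) diagonal Fisher estimates
    weights = fisher / (fisher.sum(axis=0, keepdims=True) + eps)
    return (weights * params).sum(axis=0)   # per-parameter, Fisher-weighted average
```

Parameters with near-zero Fisher information on every client reduce to a plain average, while parameters that strongly influence some client's quantum model are dominated by that client's value.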
[320] XStacking: Explanation-Guided Stacked Ensemble Learning
Moncef Garouani, Ayah Barhrhouj, Olivier Teste
Main category: cs.LG
TL;DR: XStacking is a novel ensemble machine learning framework that combines stacking techniques with Shapley additive explanations to create models that maintain high predictive accuracy while being inherently interpretable, addressing the common criticism that ensemble methods lack explainability.
Details
Motivation: Ensemble Machine Learning techniques like stacking improve predictive performance by combining multiple base models, but they suffer from a critical limitation: lack of interpretability. This creates a barrier for adoption in domains where model explanations are crucial for trust and decision-making.
Method: The paper introduces XStacking, a framework that integrates dynamic feature transformation with model-agnostic Shapley additive explanations. This approach enables stacked ensemble models to retain their predictive accuracy while becoming inherently explainable through the incorporation of interpretability mechanisms directly into the stacking process.
Result: The framework was evaluated on 29 datasets and demonstrated improvements in both predictive effectiveness of the learning space and interpretability of the resulting models. XStacking successfully maintained the performance benefits of ensemble methods while adding meaningful explainability.
Conclusion: XStacking provides a practical and scalable solution for responsible machine learning by successfully addressing the interpretability limitations of ensemble methods. The framework offers a way to achieve both high predictive performance and model explainability, making it suitable for applications where both accuracy and transparency are required.
Abstract: Ensemble Machine Learning (EML) techniques, especially stacking, have been shown to improve predictive performance by combining multiple base models. However, they are often criticized for their lack of interpretability. In this paper, we introduce XStacking, an effective and inherently explainable framework that addresses this limitation by integrating dynamic feature transformation with model-agnostic Shapley additive explanations. This enables stacked models to retain their predictive accuracy while becoming inherently explainable. We demonstrate the effectiveness of the framework on 29 datasets, achieving improvements in both the predictive effectiveness of the learning space and the interpretability of the resulting models. XStacking offers a practical and scalable solution for responsible ML.
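A rough way to picture explanation-guided stacking is to let the meta-learner see base-model predictions alongside their SHAP attributions. The sketch below does this for a regression task with scikit-learn and the `shap` package; the specific base models, meta-learner, and feature layout are assumptions and differ from XStacking's dynamic transformation scheme.

```python
# Sketch: a meta-learner trained on base predictions concatenated with SHAP attributions.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge

def fit_explained_stack(X_train, y_train):
    bases = [RandomForestRegressor(n_estimators=100), GradientBoostingRegressor()]
    meta_features = []
    for model in bases:
        model.fit(X_train, y_train)
        preds = model.predict(X_train)[:, None]                       # (n, 1) base prediction
        shap_vals = shap.TreeExplainer(model).shap_values(X_train)    # (n, d) per-feature attributions
        meta_features.append(np.hstack([preds, shap_vals]))
    meta = Ridge().fit(np.hstack(meta_features), y_train)
    return bases, meta
```

In practice one would compute out-of-fold predictions and attributions for the meta-learner to avoid leakage; the in-sample version above is only meant to show the shape of the learning space.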
[321] How Should We Meta-Learn Reinforcement Learning Algorithms?
Alexander David Goldie, Zilin Wang, Jakob Nicolaus Foerster, Shimon Whiteson
Main category: cs.LG
TL;DR: This paper empirically compares different meta-learning approaches (evolution, LLMs) for automatically designing reinforcement learning algorithms, evaluating their performance, interpretability, and efficiency to provide guidelines for future meta-learned RL algorithm development.
Details
Motivation: There is growing interest in meta-learning algorithms from data rather than manual design to improve ML performance, especially for RL where algorithms are often suboptimally adapted from supervised learning. However, there has been a severe lack of empirical comparison between different meta-learning approaches like evolutionary optimization and LLM-based code generation.
Method: The authors conduct an empirical comparison of different meta-learning algorithms (such as evolution-based optimization over black-box functions and LLMs for code proposal) when applied to meta-learned algorithms targeting various parts of the RL pipeline. They evaluate multiple factors including meta-train/meta-test performance, interpretability, sample cost, and training time.
Result: The paper provides empirical findings comparing the effectiveness of different meta-learning approaches for RL algorithm design across multiple evaluation criteria including performance metrics, interpretability, computational efficiency, and sample complexity.
Conclusion: Based on their empirical comparison, the authors propose several guidelines for meta-learning new RL algorithms that will help ensure future learned algorithms achieve optimal performance.
Abstract: The process of meta-learning algorithms from data, instead of relying on manual design, is growing in popularity as a paradigm for improving the performance of machine learning systems. Meta-learning shows particular promise for reinforcement learning (RL), where algorithms are often adapted from supervised or unsupervised learning despite their suboptimality for RL. However, until now there has been a severe lack of comparison between different meta-learning algorithms, such as using evolution to optimise over black-box functions or LLMs to propose code. In this paper, we carry out this empirical comparison of the different approaches when applied to a range of meta-learned algorithms which target different parts of the RL pipeline. In addition to meta-train and meta-test performance, we also investigate factors including the interpretability, sample cost and train time for each meta-learning algorithm. Based on these findings, we propose several guidelines for meta-learning new RL algorithms which will help ensure that future learned algorithms are as performant as possible.
[322] Generalized Dual Discriminator GANs
Penukonda Naga Chandana, Tejas Srivastava, Gowtham R. Kurri, V. Lalitha
Main category: cs.LG
TL;DR: This paper introduces dual discriminator α-GANs (D2α-GANs) and generalizes them to address mode collapse in GANs by using two discriminators with tunable loss functions, showing that the optimization reduces to minimizing combinations of f-divergences.
Details
Motivation: The paper is motivated by the mode collapse problem in standard GANs, where the generator fails to capture the full diversity of the data distribution. While dual discriminator GANs (D2GANs) were previously introduced to address this issue, there was a need for more flexible loss functions and a broader theoretical framework.
Method: The authors introduce D2α-GANs that combine dual discriminators with α-loss functions for tunable flexibility. They further generalize this to arbitrary functions on positive reals, creating generalized dual discriminator GANs. The approach uses two discriminators: one favoring real data samples and another favoring generated samples.
Result: The theoretical analysis shows that the min-max optimization in these models reduces to minimizing a linear combination of f-divergence and reverse f-divergence, generalizing the known result for D2-GANs (KL and reverse KL divergences). Experiments on 2D synthetic data demonstrate the advantages using multiple performance metrics.
Conclusion: The proposed generalized dual discriminator framework provides both theoretical insights and practical improvements for addressing mode collapse in GANs. The reduction to f-divergence combinations offers a unified understanding of the optimization landscape, while experimental results validate the effectiveness of the approach.
Abstract: Dual discriminator generative adversarial networks (D2 GANs) were introduced to mitigate the problem of mode collapse in generative adversarial networks. In D2 GANs, two discriminators are employed alongside a generator: one discriminator rewards high scores for samples from the true data distribution, while the other favors samples from the generator. In this work, we first introduce dual discriminator $\alpha$-GANs (D2 $\alpha$-GANs), which combines the strengths of dual discriminators with the flexibility of a tunable loss function, $\alpha$-loss. We further generalize this approach to arbitrary functions defined on positive reals, leading to a broader class of models we refer to as generalized dual discriminator generative adversarial networks. For each of these proposed models, we provide theoretical analysis and show that the associated min-max optimization reduces to the minimization of a linear combination of an $f$-divergence and a reverse $f$-divergence. This generalizes the known simplification for D2-GANs, where the objective reduces to a linear combination of the KL-divergence and the reverse KL-divergence. Finally, we perform experiments on 2D synthetic data and use multiple performance metrics to capture various advantages of our GANs.
[323] Towards Effective Open-set Graph Class-incremental Learning
Jiazhen Chen, Zheng Ma, Sichao Fu, Mingbin Feng, Tony S. Wirjanto, Weihua Ou
Main category: cs.LG
TL;DR: This paper proposes OGCIL, a framework for open-set graph class-incremental learning that enables graph neural networks to learn new classes while retaining old knowledge and detecting unknown classes during inference, using pseudo-sample generation and prototypical hypersphere classification.
Details
Motivation: Existing graph class-incremental learning methods assume a closed-set scenario where all test samples belong to known classes, which is unrealistic in real-world applications where unknown classes naturally emerge during inference. This limitation restricts their practical applicability and creates challenges in handling both catastrophic forgetting and open-set recognition simultaneously.
Method: The OGCIL framework uses: (1) a prototypical conditional variational autoencoder to synthesize node embeddings for old classes enabling knowledge replay without storing raw graph data, (2) a mixing-based strategy to generate out-of-distribution samples from pseudo in-distribution and current node embeddings, and (3) a novel prototypical hypersphere classification loss that anchors in-distribution embeddings to class prototypes while repelling OOD embeddings away.
Result: Extensive experiments on five benchmarks demonstrate that OGCIL outperforms existing graph class-incremental learning and open-set GNN methods, effectively addressing both catastrophic forgetting and unknown class detection in the challenging open-set incremental learning scenario.
Conclusion: The proposed OGCIL framework successfully tackles the dual challenges of catastrophic forgetting and inadequate open-set recognition in graph class-incremental learning by using pseudo-sample embedding generation and prototypical hypersphere classification, providing a robust solution for real-world graph analytical tasks where unknown classes emerge during inference.
Abstract: Graph class-incremental learning (GCIL) allows graph neural networks (GNNs) to adapt to evolving graph analytical tasks by incrementally learning new class knowledge while retaining knowledge of old classes. Existing GCIL methods primarily focus on a closed-set assumption, where all test samples are presumed to belong to previously known classes. Such an assumption restricts their applicability in real-world scenarios, where unknown classes naturally emerge during inference, and are absent during training. In this paper, we explore a more challenging open-set graph class-incremental learning scenario with two intertwined challenges: catastrophic forgetting of old classes, which impairs the detection of unknown classes, and inadequate open-set recognition, which destabilizes the retention of learned knowledge. To address the above problems, a novel OGCIL framework is proposed, which utilizes pseudo-sample embedding generation to effectively mitigate catastrophic forgetting and enable robust detection of unknown classes. To be specific, a prototypical conditional variational autoencoder is designed to synthesize node embeddings for old classes, enabling knowledge replay without storing raw graph data. To handle unknown classes, we employ a mixing-based strategy to generate out-of-distribution (OOD) samples from pseudo in-distribution and current node embeddings. A novel prototypical hypersphere classification loss is further proposed, which anchors in-distribution embeddings to their respective class prototypes, while repelling OOD embeddings away. Instead of assigning all unknown samples into one cluster, our proposed objective function explicitly models them as outliers through prototype-aware rejection regions, ensuring a robust open-set recognition. Extensive experiments on five benchmarks demonstrate the effectiveness of OGCIL over existing GCIL and open-set GNN methods.
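The prototypical hypersphere objective can be sketched as two hinge terms: pull in-distribution embeddings inside a radius around their class prototype, and push pseudo-OOD embeddings outside a margin around every prototype. The radius and margin values below, and the exact hinge form, are illustrative assumptions rather than OGCIL's precise loss.

```python
# Hedged PyTorch sketch of a prototypical hypersphere loss with an OOD rejection region.
import torch
import torch.nn.functional as F

def hypersphere_loss(z_id, y_id, z_ood, prototypes, radius=0.5, margin=1.5):
    # z_id: (n, d) in-distribution embeddings with labels y_id; z_ood: (m, d) pseudo-OOD embeddings
    # prototypes: (K, d) class prototypes
    d_id = torch.norm(z_id - prototypes[y_id], dim=1)
    pull = F.relu(d_id - radius).pow(2).mean()             # stay within the class hypersphere
    d_ood = torch.cdist(z_ood, prototypes)                 # (m, K) distances to every prototype
    push = F.relu(margin - d_ood).pow(2).mean()            # stay outside every rejection region
    return pull + push
```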
[324] Joint Asymmetric Loss for Learning with Noisy Labels
Jialiang Wang, Xianming Liu, Xiong Zhou, Gangfeng Hu, Deming Zhai, Junjun Jiang, Xiangyang Ji
Main category: cs.LG
TL;DR: This paper proposes Joint Asymmetric Loss (JAL), a novel robust loss framework that combines asymmetric losses with the Active Passive Loss optimization framework to better handle noisy labels in deep neural network training, addressing the underfitting issues of symmetric losses while leveraging the superior theoretical properties of asymmetric losses.
Details
Motivation: Existing symmetric robust loss functions suffer from underfitting due to overly strict constraints, while theoretically superior asymmetric losses are incompatible with advanced optimization frameworks like Active Passive Loss (APL), creating a gap between theoretical advantages and practical applicability in noisy label learning.
Method: The authors extend asymmetric loss to the passive loss scenario and propose Asymmetric Mean Square Error (AMSE), establishing necessary and sufficient conditions for asymmetric properties. They then substitute the traditional symmetric passive loss in APL with AMSE to create the Joint Asymmetric Loss (JAL) framework.
Result: Extensive experiments demonstrate that the proposed JAL method effectively mitigates label noise and outperforms existing approaches, successfully combining the theoretical advantages of asymmetric losses with practical optimization frameworks.
Conclusion: The paper successfully bridges the gap between asymmetric loss theory and practical optimization by proposing JAL, which leverages the superior properties of asymmetric losses while maintaining compatibility with advanced optimization frameworks, leading to improved performance in noisy label learning scenarios.
Abstract: Learning with noisy labels is a crucial task for training accurate deep neural networks. To mitigate label noise, prior studies have proposed various robust loss functions, particularly symmetric losses. Nevertheless, symmetric losses usually suffer from the underfitting issue due to the overly strict constraint. To address this problem, the Active Passive Loss (APL) jointly optimizes an active and a passive loss to mutually enhance the overall fitting ability. Within APL, symmetric losses have been successfully extended, yielding advanced robust loss functions. Despite these advancements, emerging theoretical analyses indicate that asymmetric losses, a new class of robust loss functions, possess superior properties compared to symmetric losses. However, existing asymmetric losses are not compatible with advanced optimization frameworks such as APL, limiting their potential and applicability. Motivated by this theoretical gap and the prospect of asymmetric losses, we extend the asymmetric loss to the more complex passive loss scenario and propose the Asymmetric Mean Square Error (AMSE), a novel asymmetric loss. We rigorously establish the necessary and sufficient condition under which AMSE satisfies the asymmetric condition. By substituting the traditional symmetric passive loss in APL with our proposed AMSE, we introduce a novel robust loss framework termed Joint Asymmetric Loss (JAL). Extensive experiments demonstrate the effectiveness of our method in mitigating label noise. Code available at: https://github.com/cswjl/joint-asymmetric-loss
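For readers unfamiliar with the Active Passive Loss structure that JAL builds on, the sketch below shows the active-plus-passive combination on which the new passive term is swapped in. The passive term here is a plain squared-error stand-in over non-target classes; the paper's AMSE has a specific asymmetric form that is not reproduced here.

```python
# Schematic active + passive loss combination (the passive term is a placeholder, not AMSE).
import torch
import torch.nn.functional as F

def active_passive_loss(logits, targets, alpha=1.0, beta=1.0):
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(targets, num_classes=logits.size(1)).float()
    active = F.cross_entropy(logits, targets)                        # active: fit the labeled class
    passive = ((probs * (1 - one_hot)) ** 2).sum(dim=1).mean()        # passive: suppress the other classes
    return alpha * active + beta * passive
```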
[325] HydraOpt: Navigating the Efficiency-Performance Trade-off of Adapter Merging
Taha Ceritli, Ondrej Bohdal, Mete Ozay, Jijoong Moon, Kyeng-Hun Lee, Hyeonmok Ko, Umberto Michieli
Main category: cs.LG
TL;DR: HydraOpt is a novel model merging technique for low-rank adapters in LLMs that reduces storage requirements by 48% while maintaining competitive performance, outperforming existing merging methods.
Details
Motivation: Large language models require separate adapters for each downstream task, leading to significant memory requirements that pose challenges for resource-constrained environments like mobile devices. Existing model merging techniques reduce storage but cause substantial performance degradation.
Method: HydraOpt leverages inherent similarities between matrices of low-rank adapters to enable flexible navigation of the efficiency-performance spectrum, unlike existing methods that provide fixed trade-offs between storage size and performance.
Result: HydraOpt achieves 48% reduction in storage size compared to storing all adapters while maintaining competitive performance with only 0.2-1.8% performance drop. It outperforms existing merging techniques at similar or slightly worse storage efficiency levels.
Conclusion: HydraOpt successfully addresses the storage challenge of adapter-based LLMs by providing a flexible merging technique that maintains strong performance while significantly reducing memory requirements, making it suitable for resource-constrained deployment scenarios.
Abstract: Large language models (LLMs) often leverage adapters, such as low-rank-based adapters, to achieve strong performance on downstream tasks. However, storing a separate adapter for each task significantly increases memory requirements, posing a challenge for resource-constrained environments such as mobile devices. Although model merging techniques can reduce storage costs, they typically result in substantial performance degradation. In this work, we introduce HydraOpt, a new model merging technique that capitalizes on the inherent similarities between the matrices of low-rank adapters. Unlike existing methods that produce a fixed trade-off between storage size and performance, HydraOpt allows us to navigate this spectrum of efficiency and performance. Our experiments show that HydraOpt significantly reduces storage size (48% reduction) compared to storing all adapters, while achieving competitive performance (0.2-1.8% drop). Furthermore, it outperforms existing merging techniques in terms of performance at the same or slightly worse storage efficiency.
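The abstract does not spell out the HydraOpt algorithm, so the sketch below shows only the naive baseline it improves on: averaging per-task LoRA factors into a single adapter. It illustrates the storage/performance trade-off (one adapter stored instead of many) that HydraOpt navigates more flexibly; it is not HydraOpt itself.

```python
# Naive LoRA-adapter merging baseline: average the low-rank factors across tasks.
import torch

def average_lora_adapters(adapters):
    """adapters: list of dicts mapping layer name -> (A, B) low-rank factor tensors, one dict per task."""
    merged = {}
    for name in adapters[0]:
        A = torch.stack([ad[name][0] for ad in adapters]).mean(dim=0)
        B = torch.stack([ad[name][1] for ad in adapters]).mean(dim=0)
        merged[name] = (A, B)     # a single adapter stored instead of len(adapters) adapters
    return merged
```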
[326] On the Interaction of Compressibility and Adversarial Robustness
Melih Barsbey, Antônio H. Ribeiro, Umut Şimşekli, Tolga Birdal
Main category: cs.LG
TL;DR: This paper investigates the fundamental trade-off between neural network compression and adversarial robustness, showing that compressed models create vulnerable directions that adversaries can exploit, leading to reduced security regardless of the compression method used.
Details
Motivation: While neural networks need to be simultaneously accurate, generalizable, efficient, and robust to adversarial attacks, the relationship between compressibility and robustness remains poorly understood despite extensive individual study of each property.
Method: The authors develop a principled theoretical framework to analyze how different compression forms (neuron-level sparsity and spectral compressibility) affect adversarial robustness, deriving robustness bounds that reveal the impact on L∞ and L2 robustness through learned representations.
Result: The analysis shows that compression induces highly sensitive directions in representation space that adversaries can exploit, with vulnerabilities persisting across different compression methods (regularization, architectural bias, implicit dynamics), adversarial training, and transfer learning scenarios, and contributing to universal adversarial perturbations.
Conclusion: There exists a fundamental tension between structured compressibility and robustness in neural networks, suggesting the need for new approaches to design models that can achieve both efficiency and security simultaneously.
Abstract: Modern neural networks are expected to simultaneously satisfy a host of desirable properties: accurate fitting to training data, generalization to unseen inputs, parameter and computational efficiency, and robustness to adversarial perturbations. While compressibility and robustness have each been studied extensively, a unified understanding of their interaction still remains elusive. In this work, we develop a principled framework to analyze how different forms of compressibility - such as neuron-level sparsity and spectral compressibility - affect adversarial robustness. We show that these forms of compression can induce a small number of highly sensitive directions in the representation space, which adversaries can exploit to construct effective perturbations. Our analysis yields a simple yet instructive robustness bound, revealing how neuron and spectral compressibility impact $L_\infty$ and $L_2$ robustness via their effects on the learned representations. Crucially, the vulnerabilities we identify arise irrespective of how compression is achieved - whether via regularization, architectural bias, or implicit learning dynamics. Through empirical evaluations across synthetic and realistic tasks, we confirm our theoretical predictions, and further demonstrate that these vulnerabilities persist under adversarial training and transfer learning, and contribute to the emergence of universal adversarial perturbations. Our findings show a fundamental tension between structured compressibility and robustness, and suggest new pathways for designing models that are both efficient and secure.
[327] Flow Matching Meets Biology and Life Science: A Survey
Zihao Li, Zhichen Zeng, Xiao Lin, Feihao Fang, Yanru Qu, Zhe Xu, Zhining Liu, Xuying Ning, Tianxin Wei, Ge Liu, Hanghang Tong, Jingrui He
Main category: cs.LG
TL;DR: This paper presents the first comprehensive survey of flow matching methods and their applications in biological domains, covering foundations, variants, and three major application areas: biological sequence modeling, molecule generation, and protein generation.
Details
Motivation: Flow matching has emerged as a powerful and efficient alternative to diffusion-based generative modeling with growing interest in biology applications. However, there was no comprehensive survey reviewing flow matching developments and biological applications, creating a need for systematic categorization and analysis of this rapidly evolving field.
Method: The authors conduct a systematic literature survey by: (1) reviewing foundations and variants of flow matching methods, (2) categorizing biological applications into three major areas (biological sequence modeling, molecule generation/design, peptide/protein generation), (3) providing in-depth reviews of recent progress in each area, and (4) summarizing commonly used datasets and software tools.
Result: The survey successfully categorizes and reviews flow matching applications across biological domains, identifies key datasets and software tools used in the field, and provides a curated resource repository. The paper establishes a comprehensive framework for understanding current progress in flow matching for biology applications.
Conclusion: Flow matching represents a significant advancement in generative modeling for biological applications, offering efficient alternatives to diffusion models. The survey identifies substantial progress across biological sequence modeling, molecule generation, and protein design, while highlighting potential future research directions in this rapidly growing interdisciplinary field.
Abstract: Over the past decade, advances in generative modeling, such as generative adversarial networks, masked autoencoders, and diffusion models, have significantly transformed biological research and discovery, enabling breakthroughs in molecule design, protein generation, drug discovery, and beyond. At the same time, biological applications have served as valuable testbeds for evaluating the capabilities of generative models. Recently, flow matching has emerged as a powerful and efficient alternative to diffusion-based generative modeling, with growing interest in its application to problems in biology and life sciences. This paper presents the first comprehensive survey of recent developments in flow matching and its applications in biological domains. We begin by systematically reviewing the foundations and variants of flow matching, and then categorize its applications into three major areas: biological sequence modeling, molecule generation and design, and peptide and protein generation. For each, we provide an in-depth review of recent progress. We also summarize commonly used datasets and software tools, and conclude with a discussion of potential future directions. The corresponding curated resources are available at https://github.com/Violet24K/Awesome-Flow-Matching-Meets-Biology.
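As background for the surveyed methods, the standard linear-interpolant (conditional) flow matching objective regresses a velocity field onto the constant velocity of a straight path between noise and data. The sketch below shows that vanilla objective; it is generic background, not any single surveyed paper's variant, and `velocity_net` is a placeholder model.

```python
# Minimal conditional flow matching loss with linear interpolation paths.
import torch

def cfm_loss(velocity_net, x0, x1):
    """x0: noise samples, x1: data samples, both (batch, dim); velocity_net(x_t, t) -> (batch, dim)."""
    t = torch.rand(x0.size(0), 1, device=x0.device)
    x_t = (1 - t) * x0 + t * x1                 # point on the straight path at time t
    target = x1 - x0                            # constant velocity of the linear interpolant
    return ((velocity_net(x_t, t) - target) ** 2).mean()
```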
[328] Large Learning Rates Simultaneously Achieve Robustness to Spurious Correlations and Compressibility
Melih Barsbey, Lucas Prieto, Stefanos Zafeiriou, Tolga Birdal
Main category: cs.LG
TL;DR: This paper shows that using large learning rates during training can simultaneously achieve model robustness to spurious correlations and network compressibility, while also producing beneficial representation properties like invariant features and activation sparsity.
Details
Motivation: Modern machine learning models need to be both robust and resource-efficient, but achieving these two properties together is challenging. The authors investigate whether high learning rates can serve as a simple solution to obtain both robustness to spurious correlations and network compressibility simultaneously.
Method: The authors systematically evaluate the effect of large learning rates across diverse spurious correlation datasets, different model architectures, and various optimizers. They analyze representation properties including invariant feature utilization, class separation, and activation sparsity to understand how large learning rates contribute to both robustness and compressibility.
Result: Large learning rates consistently outperform other hyperparameters and regularization methods in achieving both robustness and compressibility. They produce desirable representation properties such as invariant feature utilization, better class separation, and activation sparsity. The results are consistent across multiple datasets, models, and optimizers.
Conclusion: Large learning rates can effectively serve as a facilitator for jointly achieving robustness to spurious correlations and network compressibility. The previously documented success of large learning rates in standard classification may be attributed to their ability to address hidden spurious correlations in training data, suggesting a broader utility beyond what was previously understood.
Abstract: Robustness and resource-efficiency are two highly desirable properties for modern machine learning models. However, achieving them jointly remains a challenge. In this paper, we position high learning rates as a facilitator for simultaneously achieving robustness to spurious correlations and network compressibility. We demonstrate that large learning rates also produce desirable representation properties such as invariant feature utilization, class separation, and activation sparsity. Importantly, our findings indicate that large learning rates compare favorably to other hyperparameters and regularization methods, in consistently satisfying these properties in tandem. In addition to demonstrating the positive effect of large learning rates across diverse spurious correlation datasets, models, and optimizers, we also present strong evidence that the previously documented success of large learning rates in standard classification tasks is likely due to its effect on addressing hidden/rare spurious correlations in the training dataset.
[329] LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning
Gabriel J. Perin, Runjin Chen, Xuxi Chen, Nina S. T. Hirata, Zhangyang Wang, Junyuan Hong
Main category: cs.LG
TL;DR: This paper proposes Low-Rank Extrapolation (LoX), a training-free method that enhances LLM safety robustness by extrapolating safety-critical subspaces to reduce vulnerability to fine-tuning attacks that can undermine safety protections.
Details
Motivation: Large Language Models face significant safety concerns as their safety protections can be undermined by subsequent fine-tuning, even with benign training data. This vulnerability stems from the sensitivity of safety-critical low-rank subspaces in LLM parameters to fine-tuning modifications.
Method: The authors propose Low-Rank Extrapolation (LoX), a training-free method that enhances safety robustness by extrapolating the safety subspace of aligned LLMs. The method works by moving LLM parameters to flatter zones that are less sensitive to perturbations from fine-tuning.
Result: LoX demonstrates significant improvements in robustness against both benign and malicious fine-tuning attacks, achieving 11% to 54% absolute reductions in attack success rates (ASR) while preserving the model’s adaptability to new tasks. The method successfully moves parameters to less sensitive regions.
Conclusion: LoX effectively enhances LLM safety robustness through parameter extrapolation that reduces sensitivity to fine-tuning perturbations. The method provides a practical solution for maintaining safety protections while preserving model adaptability, with the success attributed to moving parameters to flatter, more robust zones.
Abstract: Large Language Models (LLMs) have become indispensable in real-world applications. However, their widespread adoption raises significant safety concerns, particularly in responding to socially harmful questions. Despite substantial efforts to improve model safety through alignment, aligned models can still have their safety protections undermined by subsequent fine-tuning - even when the additional training data appears benign. In this paper, we empirically demonstrate that this vulnerability stems from the sensitivity of safety-critical low-rank subspaces in LLM parameters to fine-tuning. Building on this insight, we propose a novel training-free method, termed Low-Rank Extrapolation (LoX), to enhance safety robustness by extrapolating the safety subspace of an aligned LLM. Our experimental results confirm the effectiveness of LoX, demonstrating significant improvements in robustness against both benign and malicious fine-tuning attacks while preserving the model’s adaptability to new tasks. For instance, LoX leads to 11% to 54% absolute reductions in attack success rates (ASR) under benign or malicious fine-tuning attacks. By investigating the ASR landscape of parameters, we attribute the success of LoX to the fact that the extrapolation moves LLM parameters to a flatter zone, which is less sensitive to perturbations. The code is available at github.com/VITA-Group/LoX.
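One illustrative reading of "extrapolating the safety subspace" is to take the low-rank part of the alignment update (aligned minus base weights) and push further along it. The sketch below follows that reading per weight matrix; the rank, extrapolation factor, and the exact construction are assumptions and may differ from the paper's formulation.

```python
# Hedged sketch of low-rank extrapolation of the alignment update for one weight matrix.
import torch

def low_rank_extrapolate(w_base, w_aligned, rank=8, alpha=0.5):
    delta = w_aligned - w_base                                          # alignment update
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    low_rank_delta = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank]     # assumed safety-relevant subspace
    return w_aligned + alpha * low_rank_delta                           # extrapolate further along it
```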
[330] ACMP: Allen-Cahn Message Passing with Attractive and Repulsive Forces for Graph Neural Networks
Yuelin Wang, Kai Yi, Xinliang Liu, Yu Guang Wang, Shi Jin
Main category: cs.LG
TL;DR: This paper proposes Allen-Cahn Message Passing (ACMP) for Graph Neural Networks, which models neural message passing as an interacting particle system with Allen-Cahn forces to enable very deep GNNs (up to 100 layers) while avoiding oversmoothing problems.
Details
Motivation: Traditional Graph Neural Networks suffer from oversmoothing problems when depth increases, limiting their ability to capture complex graph structures. The authors aim to develop a deep GNN model that can handle many layers without losing node distinctiveness.
Method: The method models neural message passing as an interacting particle system with attractive and repulsive forces plus Allen-Cahn forces from phase transition modeling. This creates a reaction-diffusion process that separates particles without explosion. The solution is implemented using neural ODE solvers for message passing propagation.
Result: ACMP enables GNNs to scale up to 100+ layers with theoretically proven strictly positive lower bound of Dirichlet energy. The method achieves state-of-the-art performance on real-world node classification tasks for both homophilic and heterophilic datasets.
Conclusion: Allen-Cahn Message Passing successfully addresses the oversmoothing problem in deep GNNs by providing a principled way to maintain node separation through particle dynamics, enabling much deeper networks with superior performance on node classification benchmarks.
Abstract: Neural message passing is a basic feature extraction unit for graph-structured data, considering neighboring node features in network propagation from one layer to the next. We model this process by an interacting particle system with attractive and repulsive forces and the Allen-Cahn force arising in the modeling of phase transition. The dynamics of the system is a reaction-diffusion process which can separate particles without blowing up. This induces an Allen-Cahn message passing (ACMP) for graph neural networks where the numerical iteration for the particle system solution constitutes the message passing propagation. ACMP, which has a simple implementation with a neural ODE solver, can propel the network depth up to one hundred layers with a theoretically proven strictly positive lower bound on the Dirichlet energy. It thus provides a deep model of GNNs circumventing the common GNN problem of oversmoothing. GNNs with ACMP achieve state-of-the-art performance for real-world node classification tasks on both homophilic and heterophilic datasets. Codes are available at https://github.com/ykiiiiii/ACMP.
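A simplified Euler discretization of the described dynamics has two terms: a signed neighbor-coupling force (attractive or repulsive) and an Allen-Cahn double-well force u - u^3 that keeps node states separated without blow-up. The coefficients and the exact coupling form below are illustrative assumptions, not the released ACMP implementation.

```python
# Simplified Euler step of attractive/repulsive message passing with an Allen-Cahn force.
import torch

def acmp_step(u, adj, alpha, beta, dt=0.1):
    # u: (nodes, d) features; adj: (nodes, nodes) adjacency; alpha: (nodes, nodes) signed coupling weights
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
    coupling = (alpha * adj) @ u / deg - u          # attractive (>0) or repulsive (<0) neighbor forces
    allen_cahn = beta * (u - u ** 3)                # double-well force drives states apart without blow-up
    return u + dt * (coupling + allen_cahn)
```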
[331] Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia
Main category: cs.LG
TL;DR: This survey paper reviews efficient serving methodologies for large language models (LLMs) from a machine learning systems perspective, addressing computational and memory challenges in deployment scenarios requiring low latency and high throughput.
Details
Motivation: The computational intensity and memory consumption of deploying generative LLMs present substantial challenges for serving efficiency, particularly in scenarios demanding low latency and high throughput. There is an imperative need to bridge the gap between advanced AI innovations and practical system optimizations for effective LLM deployment.
Method: The paper provides a comprehensive survey methodology that covers a spectrum of solutions from algorithmic modifications to system design changes. The analysis is conducted from a machine learning system (MLSys) research perspective, examining various approaches to efficient LLM serving.
Result: The survey provides in-depth analysis and comprehensive understanding of current state-of-the-art solutions for efficient LLM serving, covering both algorithmic and system-level optimizations that address deployment challenges.
Conclusion: The survey offers valuable insights for researchers and practitioners to overcome barriers in effective LLM deployment, providing future directions that could reshape AI applications by making LLM serving more efficient and practical.
Abstract: In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However, the computational intensity and memory consumption of deploying these models present substantial challenges in terms of serving efficiency, particularly in scenarios demanding low latency and high throughput. This survey addresses the imperative need for efficient LLM serving methodologies from a machine learning system (MLSys) research perspective, standing at the crux of advanced AI innovations and practical system optimizations. We provide in-depth analysis, covering a spectrum of solutions, ranging from cutting-edge algorithmic modifications to groundbreaking changes in system designs. The survey aims to provide a comprehensive understanding of the current state and future directions in efficient LLM serving, offering valuable insights for researchers and practitioners in overcoming the barriers of effective LLM deployment, thereby reshaping the future of AI.
[332] The FIX Benchmark: Extracting Features Interpretable to eXperts
Helen Jin, Shreya Havaldar, Chaehyeon Kim, Anton Xue, Weiqiu You, Helen Qu, Marco Gatti, Daniel A Hashimoto, Bhuvnesh Jain, Amin Madani, Masao Sako, Lyle Ungar, Eric Wong
Main category: cs.LG
TL;DR: The paper introduces FIX, a benchmark for measuring how well feature-based explanations align with expert knowledge, revealing that current explanation methods poorly match expert understanding across multiple domains.
Details
Motivation: Feature-based explanation methods assume interpretable features are readily available, but this is often not true for high-dimensional data where even domain experts struggle to specify important features mathematically. There's a need to automatically extract feature collections that align with expert knowledge.
Method: The authors developed FIX (Features Interpretable to eXperts) benchmark and FIXScore, a unified expert alignment measure. They collaborated with domain experts across cosmology, psychology, and medicine to create evaluations spanning vision, language, and time series data modalities.
Result: Popular feature-based explanation methods showed poor alignment with expert-specified knowledge across the tested domains and data modalities, demonstrating significant gaps between current methods and expert understanding.
Conclusion: Current feature-based explanation methods inadequately align with expert knowledge, highlighting the urgent need for new methods that can better identify features that are truly interpretable to domain experts.
Abstract: Feature-based methods are commonly used to explain model predictions, but these methods often implicitly assume that interpretable features are readily available. However, this is often not the case for high-dimensional data, and it can be hard even for domain experts to mathematically specify which features are important. Can we instead automatically extract collections or groups of features that are aligned with expert knowledge? To address this gap, we present FIX (Features Interpretable to eXperts), a benchmark for measuring how well a collection of features aligns with expert knowledge. In collaboration with domain experts, we propose FIXScore, a unified expert alignment measure applicable to diverse real-world settings across cosmology, psychology, and medicine domains in vision, language, and time series data modalities. With FIXScore, we find that popular feature-based explanation methods have poor alignment with expert-specified knowledge, highlighting the need for new methods that can better identify features interpretable to experts.
[333] Language model developers should report train-test overlap
Andy K Zhang, Kevin Klyman, Yifan Mai, Yoav Levine, Yian Zhang, Rishi Bommasani, Percy Liang
Main category: cs.LG
TL;DR: This paper investigates the lack of transparency around train-test overlap in language model evaluations, finding that only 9 out of 30 model developers provide adequate information about whether their models were trained on test data, and advocates for mandatory disclosure of train-test overlap statistics to improve evaluation trustworthiness.
Details
Motivation: Language model evaluation results are difficult to interpret correctly without knowing the extent of train-test overlap (whether models were trained on their test data). The public lacks adequate information about this overlap since most developers don't report it and third parties can't measure it without access to training data, undermining trust in model evaluations.
Method: The authors documented and analyzed the practices of 30 language model developers regarding train-test overlap disclosure. They categorized developers based on whether they report overlap statistics, release training data, or publish methodology. They also engaged directly with some developers to obtain additional information about their train-test overlap practices.
Result: Only 9 out of 30 developers provide adequate train-test overlap information: 4 release open-source training data enabling direct measurement, and 5 publish their overlap methodology and statistics. Through direct engagement, the authors obtained novel train-test overlap information for 3 additional developers. The majority of developers (21 out of 30) provide insufficient transparency about train-test overlap.
Conclusion: Language model developers should publish train-test overlap statistics and/or training data whenever reporting evaluation results on public test sets. Increased transparency around train-test overlap is essential for building community-wide trust in model evaluations and properly interpreting evaluation results.
Abstract: Language models are extensively evaluated, but correctly interpreting evaluation results requires knowledge of train-test overlap which refers to the extent to which the language model is trained on the very data it is being tested on. The public currently lacks adequate information about train-test overlap: most models have no public train-test overlap statistics, and third parties cannot directly measure train-test overlap since they do not have access to the training data. To make this clear, we document the practices of 30 model developers, finding that just 9 developers report train-test overlap: 4 developers release training data under open-source licenses, enabling the community to directly measure train-test overlap, and 5 developers publish their train-test overlap methodology and statistics. By engaging with language model developers, we provide novel information about train-test overlap for three additional developers. Overall, we take the position that language model developers should publish train-test overlap statistics and/or training data whenever they report evaluation results on public test sets. We hope our work increases transparency into train-test overlap to increase the community-wide trust in model evaluations.
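The abstract does not fix a single measurement procedure, but one common way to operationalize train-test overlap when training data is available is word n-gram matching between test items and the training corpus. The sketch below is an assumed illustration of that idea, not any developer's published methodology; the n-gram length and flagging rule are arbitrary choices.

```python
def ngram_set(text, n=8):
    """Set of word n-grams for a document."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_fraction(test_examples, train_corpus, n=8):
    """Fraction of test examples sharing at least one n-gram with the training corpus."""
    train_ngrams = set()
    for doc in train_corpus:
        train_ngrams |= ngram_set(doc, n)
    flagged = sum(1 for ex in test_examples if ngram_set(ex, n) & train_ngrams)
    return flagged / max(len(test_examples), 1)

# Example: one of the two test items shares an 8-gram with the training data
train = ["the quick brown fox jumps over the lazy dog in the park"]
test = ["the quick brown fox jumps over the lazy dog today",
        "completely unrelated sentence here"]
print(overlap_fraction(test, train, n=8))  # 0.5
```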
[334] Optimizing Privacy-Utility Trade-off in Decentralized Learning with Generalized Correlated Noise
Angelo Rodio, Zheng Chen, Erik G. Larsson
Main category: cs.LG
TL;DR: CorN-DSGD is a novel framework that generates correlated privacy noise across agents in decentralized learning to achieve better noise cancellation while maintaining privacy guarantees, improving model performance compared to existing methods.
Details
Motivation: In decentralized learning, agents need to share local models for collaborative training, but this exposes private information about local datasets to adversaries. Traditional privacy protection methods inject random noise, but this degrades model utility due to cumulated artificial noise effects.
Method: The paper introduces CorN-DSGD, a covariance-based framework that generates correlated privacy noise across agents. It leverages network topology and mixing weights to optimize noise covariance for achieving network-wide noise cancellation, unifying several state-of-the-art methods as special cases.
Result: Experimental results demonstrate that CorN-DSGD cancels more noise than existing pairwise correlation schemes, leading to improved model performance while maintaining formal privacy guarantees.
Conclusion: CorN-DSGD successfully addresses the utility-privacy trade-off in decentralized learning by optimizing correlated noise generation, achieving better noise cancellation and model performance compared to existing approaches while preserving privacy protection.
Abstract: Decentralized learning enables distributed agents to collaboratively train a shared machine learning model without a central server, through local computation and peer-to-peer communication. Although each agent retains its dataset locally, sharing local models can still expose private information about the local training datasets to adversaries. To mitigate privacy attacks, a common strategy is to inject random artificial noise at each agent before exchanging local models between neighbors. However, this often leads to utility degradation due to the negative effects of cumulated artificial noise on the learning algorithm. In this work, we introduce CorN-DSGD, a novel covariance-based framework for generating correlated privacy noise across agents, which unifies several state-of-the-art methods as special cases. By leveraging network topology and mixing weights, CorN-DSGD optimizes the noise covariance to achieve network-wide noise cancellation. Experimental results show that CorN-DSGD cancels more noise than existing pairwise correlation schemes, improving model performance under formal privacy guarantees.
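To make the idea of network-wide noise cancellation concrete, here is a toy sketch (not the paper's optimizer): each agent's Gaussian perturbation is negatively correlated with the others' so that the perturbations cancel once models are averaged, while each individual perturbation keeps nearly full variance for privacy. This toy assumes all-to-all averaging; CorN-DSGD instead optimizes the covariance for a given topology and mixing weights.

```python
import numpy as np

def correlated_noise(n_agents, dim, sigma=1.0, rng=None):
    """Toy correlated privacy noise: each agent's perturbation is Gaussian,
    but the perturbations are negatively correlated so their sum is exactly
    zero (perfect cancellation under uniform averaging)."""
    if rng is None:
        rng = np.random.default_rng(0)
    z = rng.normal(0.0, sigma, size=(n_agents, dim))
    return z - z.mean(axis=0, keepdims=True)   # columns sum to zero across agents

noise = correlated_noise(n_agents=5, dim=3)
print(np.allclose(noise.sum(axis=0), 0.0))   # True: nothing left after averaging
print(round(noise.std(), 3))                 # each agent still carries near-unit noise
```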
[335] On the Lipschitz Constant of Deep Networks and Double Descent
Matteo Gamba, Hossein Azizpour, Mårten Björkman
Main category: cs.LG
TL;DR: This paper studies the empirical Lipschitz constant of deep networks during double descent, revealing non-monotonic trends that correlate with test error and identifying loss landscape curvature and parameter distance from initialization as key factors controlling generalization.
Details
Motivation: Existing generalization bounds for deep networks assume smooth or bounded input dependence but fail to investigate the practical mechanisms controlling these factors. The authors aim to understand what actually controls generalization in practice beyond theoretical assumptions.
Method: The authors conduct extensive experimental studies of the empirical Lipschitz constant in deep networks experiencing double descent. They establish connections between parameter-space and input-space gradients for SGD around critical points to isolate key controlling factors.
Result: The study reveals non-monotonic trends in empirical Lipschitz constants that strongly correlate with test error. Two critical factors are identified: loss landscape curvature (controlling optimization dynamics around critical points) and distance of parameters from initialization (bounding model function complexity beyond training data).
Conclusion: The work provides novel insights into implicit regularization through overparameterization and effective model complexity for practically trained networks, demonstrating that generalization is controlled by specific geometric properties of the loss landscape and parameter trajectories.
Abstract: Existing bounds on the generalization error of deep networks assume some form of smooth or bounded dependence on the input variable, falling short of investigating the mechanisms controlling such factors in practice. In this work, we present an extensive experimental study of the empirical Lipschitz constant of deep networks undergoing double descent, and highlight non-monotonic trends strongly correlating with the test error. Building a connection between parameter-space and input-space gradients for SGD around a critical point, we isolate two important factors – namely loss landscape curvature and distance of parameters from initialization – respectively controlling optimization dynamics around a critical point and bounding model function complexity, even beyond the training data. Our study presents novel insights on implicit regularization via overparameterization, and effective model complexity for networks trained in practice.
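A common empirical proxy for the quantity studied here is the largest input-gradient norm observed over a sample of data points. The PyTorch sketch below is an assumed illustration of that proxy, not the authors' exact measurement protocol; the toy model and data are placeholders.

```python
import torch

def empirical_lipschitz(model, inputs, target_class=None):
    """Lower-bound estimate of the Lipschitz constant of `model`: the maximum
    input-gradient norm over a batch of inputs."""
    max_norm = 0.0
    for x in inputs:
        x = x.clone().detach().requires_grad_(True)
        out = model(x.unsqueeze(0))
        # Differentiate the maximum logit (or a chosen class) with respect to the input.
        scalar = out.max() if target_class is None else out[0, target_class]
        (grad,) = torch.autograd.grad(scalar, x)
        max_norm = max(max_norm, grad.norm().item())
    return max_norm

# Usage with a toy model
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 3))
data = torch.randn(16, 10)
print(empirical_lipschitz(model, data))
```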
[336] Gathering and Exploiting Higher-Order Information when Training Large Structured Models
Pierre Wolinski
Main category: cs.LG
TL;DR: This paper presents a method for efficiently computing projections of Hessian and higher-order derivatives on parameter subspaces for neural network optimization, enabling better learning rates, second-order optimization with Hessian information, and third-order regularization while maintaining computational feasibility.
Details
Motivation: Training large neural networks faces computational challenges with second-order optimization methods due to the prohibitive cost of computing full Hessians and higher-order derivatives. Existing methods like quasi-Newton and K-FAC bypass this by using first-order information, but this limits the optimization quality and misses important higher-order information that could improve training.
Method: The authors propose computing exact projections of Hessian and higher-order derivatives on carefully chosen parameter subspaces defined by parameter partitions. They develop tensors representing “higher-order derivatives according to the partition” that can be computed at reasonable cost when the number of partition subsets is small. These tensors are then used for: (1) computing subset-specific learning rates, (2) constructing second-order optimization using Hessian information, and (3) third-order regularization using third derivatives.
Result: The resulting optimization method demonstrates several key properties: it captures long-range interactions between neural network layers (unlike methods such as K-FAC), provides layer-wise affine reparameterization invariance, and enables practical use of higher-order derivative information for improved optimization while maintaining computational tractability.
Conclusion: The paper successfully develops a computationally feasible approach to incorporate exact higher-order derivative information into neural network optimization. By focusing on projections onto parameter subspaces rather than full derivatives, the method achieves better optimization properties than first-order methods while avoiding the computational burden of full second and third-order methods.
Abstract: When training large models, such as neural networks, the full derivatives of order 2 and beyond are usually inaccessible, due to their computational cost. Therefore, among the second-order optimization methods, it is common to bypass the computation of the Hessian by using first-order information, such as the gradient of the parameters (e.g., quasi-Newton methods) or the activations (e.g., K-FAC). In this paper, we focus on the exact and explicit computation of projections of the Hessian and higher-order derivatives on well-chosen subspaces relevant for optimization. Namely, for a given partition of the set of parameters, we compute tensors that can be seen as “higher-order derivatives according to the partition”, at a reasonable cost as long as the number of subsets of the partition remains small. Then, we give some examples of how these tensors can be used. First, we show how to compute a learning rate per subset of parameters, which can be used for hyperparameter tuning. Second, we show how to use these tensors at order 2 to construct an optimization method that uses information contained in the Hessian. Third, we show how to use these tensors at order 3 (information contained in the third derivative of the loss) to regularize this optimization method. The resulting training step has several interesting properties, including: it takes into account long-range interactions between the layers of the trained neural network, which is usually not the case in similar methods (e.g., K-FAC); the trajectory of the optimization is invariant under affine layer-wise reparameterization.
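As a hedged illustration of "derivatives according to a partition": when parameters are grouped into subsets, a second-order interaction term between two subsets can be obtained from one gradient and one Hessian-vector product, never forming the full Hessian. The PyTorch sketch below shows only that primitive; the paper's actual tensors, and their use for per-subset learning rates and third-order regularization, are more elaborate. Function and variable names are illustrative.

```python
import torch

def cross_group_curvature(loss, group_a, group_b, dir_a, dir_b):
    """Bilinear curvature term dir_a^T H[group_a, group_b] dir_b, via one
    gradient and one Hessian-vector product (double backward)."""
    grads_a = torch.autograd.grad(loss, group_a, create_graph=True)
    dot = sum((g * d).sum() for g, d in zip(grads_a, dir_a))          # <grad_a, dir_a>
    hvp_b = torch.autograd.grad(dot, group_b, retain_graph=True, allow_unused=True)
    return sum((h * d).sum() for h, d in zip(hvp_b, dir_b) if h is not None)

# Usage: the two "subsets" are the two linear layers of a tiny MLP
model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1))
x, y = torch.randn(16, 4), torch.randn(16, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
layer0, layer2 = list(model[0].parameters()), list(model[2].parameters())
d0 = [torch.randn_like(p) for p in layer0]
d2 = [torch.randn_like(p) for p in layer2]
print(cross_group_curvature(loss, layer0, layer2, d0, d2))  # a long-range layer interaction term
```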
[337] Sampling-enabled scalable manifold learning unveils discriminative cluster structure of high-dimensional data
Dehua Peng, Zhipeng Gui, Wenzhang Wei, Fa Li, Jie Gui, Huayi Wu, Jianya Gong
Main category: cs.LG
TL;DR: The paper proposes SUDE, a sampling-based scalable manifold learning technique that uses landmarks and constrained locally linear embedding to handle large-scale high-dimensional data while preserving cluster structure and global patterns.
Details
Motivation: Existing manifold learning techniques suffer from extensive distortions of cluster structure that hinder pattern understanding, and have scalability issues that limit their applicability to large-scale data processing.
Method: SUDE employs a two-step approach: first selecting landmarks to construct a low-dimensional skeleton of the data, then incorporating non-landmarks into the learned space using constrained locally linear embedding (CLLE).
Result: SUDE demonstrates superior scalability with respect to data size and embedding dimension, shows promising performance in cluster separation, integrity, and global structure preservation, and exhibits notable robustness in embedding quality even as sampling rates decrease.
Conclusion: SUDE effectively addresses the scalability and cluster distortion problems of traditional manifold learning methods, making it suitable for large-scale applications including single-cell data analysis and ECG anomaly detection.
Abstract: As a pivotal branch of machine learning, manifold learning uncovers the intrinsic low-dimensional structure within complex nonlinear manifolds in high-dimensional space for visualization, classification, clustering, and gaining key insights. Although existing techniques have achieved remarkable successes, they suffer from extensive distortions of cluster structure, which hinders the understanding of underlying patterns. Scalability issues also limit their applicability for handling large-scale data. We hence propose a sampling-based Scalable manifold learning technique that enables Uniform and Discriminative Embedding, namely SUDE, for large-scale and high-dimensional data. It starts by seeking a set of landmarks to construct the low-dimensional skeleton of the entire data, and then incorporates the non-landmarks into the learned space based on the constrained locally linear embedding (CLLE). We empirically validated the effectiveness of SUDE on synthetic datasets and real-world benchmarks, and applied it to analyze single-cell data and detect anomalies in electrocardiogram (ECG) signals. SUDE exhibits a distinct advantage in scalability with respect to data size and embedding dimension, and has promising performance in cluster separation, integrity, and global structure preservation. The experiments also demonstrate notable robustness in embedding quality as the sampling rate decreases.
[338] Attention-Based Multiscale Temporal Fusion Network for Uncertain-Mode Fault Diagnosis in Multimode Processes
Guangqiang Li, M. Amine Atoui, Xiangshun Li
Main category: cs.LG
TL;DR: This paper proposes an attention-based multiscale temporal fusion network (AMTFNet) for fault diagnosis in multimode industrial processes, which addresses the challenge of distributional differences across modes by extracting shared features and focusing on critical time points with cross-mode information.
Details
Motivation: Fault diagnosis in multimode processes faces the significant challenge that distributional differences among monitoring data from multiple modes make it difficult for models to extract shared feature representations related to system health conditions, which is critical for ensuring safe operation of industrial systems.
Method: The method employs multiscale depthwise convolution and gated recurrent units to extract multiscale contextual local features and long-short-term features, applies instance normalization to suppress mode-specific information, and designs a temporal attention mechanism to focus on critical time points with higher cross-mode shared information.
Result: Experiments on Tennessee Eastman process dataset and three-phase flow facility dataset demonstrate that the proposed model achieves superior diagnostic performance while maintaining a small model size compared to existing approaches.
Conclusion: The attention-based multiscale temporal fusion network effectively addresses the distributional differences challenge in multimode fault diagnosis by extracting shared features across modes and focusing on critical temporal information, resulting in improved diagnostic accuracy with computational efficiency.
Abstract: Fault diagnosis in multimode processes plays a critical role in ensuring the safe operation of industrial systems across multiple modes. It faces a great challenge yet to be addressed: the significant distributional differences among monitoring data from multiple modes make it difficult for the models to extract shared feature representations related to system health conditions. In response to this problem, this paper introduces a novel method called attention-based multiscale temporal fusion network. The multiscale depthwise convolution and gated recurrent unit are employed to extract multiscale contextual local features and long-short-term features. Instance normalization is applied to suppress mode-specific information. Furthermore, a temporal attention mechanism is designed to focus on critical time points with higher cross-mode shared information, thereby enhancing the accuracy of fault diagnosis. The proposed model is applied to the Tennessee Eastman process dataset and the three-phase flow facility dataset. The experiments demonstrate that the proposed model achieves superior diagnostic performance and maintains a small model size. The source code will be available on GitHub at https://github.com/GuangqiangLi/AMTFNet.
[339] Trusted Multi-view Learning under Noisy Supervision
Yilin Zhang, Cai Xu, Han Jiang, Ziyu Guan, Wei Zhao, Xiaofei He, Murat Sensoy
Main category: cs.LG
TL;DR: This paper proposes TMNR and TMNR^2 methods for trusted multi-view learning that can handle noisy labels by modeling uncertainty through evidential deep neural networks and noise correlation matrices, achieving 7% accuracy improvement on datasets with 50% label noise.
Details
Motivation: Existing multi-view learning methods focus on accuracy but neglect decision uncertainty, limiting their use in safety-critical scenarios. While trusted multi-view learning methods exist, they require high-quality ground-truth labels, creating a need for reliable multi-view learning models that can handle noisy labels.
Method: The paper proposes TMNR (Trusted Multi-view Noise Refining) using evidential deep neural networks to construct view-specific opinions capturing beliefs and uncertainty, with noise correlation matrices constrained by sample uncertainty. TMNR^2 (improved version) disentangles the co-training problem by establishing different training objectives - using clean samples for evidential networks and noisy samples for noise correlation matrices, while generating pseudo-labels from neighboring information.
Result: TMNR^2 significantly outperforms baseline methods with average accuracy improvements of 7% on datasets containing 50% label noise, demonstrating effective handling of noisy supervision in multi-view learning scenarios.
Conclusion: The proposed TMNR^2 method successfully addresses the challenge of trusted multi-view learning under noisy labels by effectively modeling uncertainty and noise correlation, achieving stable training and significant performance improvements over existing methods.
Abstract: Multi-view learning methods often focus on improving decision accuracy while neglecting the decision uncertainty, which significantly restricts their applications in safety-critical scenarios. To address this, trusted multi-view learning methods estimate prediction uncertainties by learning class distributions from each instance. However, these methods heavily rely on high-quality ground-truth labels. This motivates us to delve into a new problem: how to develop a reliable multi-view learning model under the guidance of noisy labels? We propose the Trusted Multi-view Noise Refining (TMNR) method to address this challenge by modeling label noise arising from low-quality data features and easily-confused classes. TMNR employs evidential deep neural networks to construct view-specific opinions that capture both beliefs and uncertainty. These opinions are then transformed through noise correlation matrices to align with the noisy supervision, where matrix elements are constrained by sample uncertainty to reflect label reliability. Furthermore, considering the challenge of jointly optimizing the evidence network and noise correlation matrices under noisy supervision, we further propose Trusted Multi-view Noise Re-Refining (TMNR^2), which disentangles this complex co-training problem by establishing different training objectives for distinct modules. TMNR^2 identifies potentially mislabeled samples through evidence-label consistency and generates pseudo-labels from neighboring information. By assigning clean samples to optimize evidential networks and noisy samples to guide noise correlation matrices, respectively, TMNR^2 reduces mapping interference and stabilizes training. Experimental results demonstrate that TMNR^2 significantly outperforms baseline methods, with average accuracy improvements of 7% on datasets with 50% label noise.
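For context on the evidential building block the method relies on, below is a hedged sketch of how evidence is typically turned into per-class beliefs and a single uncertainty mass under the standard evidential deep learning recipe (subjective-logic opinions from a Dirichlet). The noise correlation matrices and the TMNR^2 training scheme themselves are not reproduced here.

```python
import numpy as np

def opinion_from_logits(logits):
    """Standard evidential-deep-learning style opinion: non-negative evidence,
    Dirichlet concentration alpha = evidence + 1, per-class beliefs, and one
    uncertainty mass, so that beliefs + uncertainty sum to one."""
    evidence = np.log1p(np.exp(logits))        # softplus keeps evidence >= 0
    alpha = evidence + 1.0
    strength = alpha.sum()                     # Dirichlet strength S
    beliefs = evidence / strength
    uncertainty = logits.shape[0] / strength   # K / S
    return beliefs, uncertainty

beliefs, u = opinion_from_logits(np.array([2.0, 0.5, -1.0]))
print(beliefs, u, beliefs.sum() + u)           # the last value is 1.0
```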
[340] Federated Behavioural Planes: Explaining the Evolution of Client Behaviour in Federated Learning
Dario Fenoglio, Gabriele Dominici, Pietro Barbiero, Alberto Tonda, Martin Gjoreski, Marc Langheinrich
Main category: cs.LG
TL;DR: This paper introduces Federated Behavioural Planes (FBPs) to analyze and visualize client behavior in federated learning systems, and proposes Federated Behavioural Shields as a robust aggregation technique to detect malicious clients and improve FL security.
Details
Motivation: Current federated learning systems lack effective methods to understand and monitor client behavior evolution, which is crucial for building human trust and control. There's a need to identify whether clients contribute beneficially or detrimentally to the training process while maintaining privacy.
Method: The authors propose Federated Behavioural Planes (FBPs) that analyze client dynamics through two behavioral spaces: error behavioral space (focusing on predictive performance) and counterfactual behavioral space (focusing on decision-making processes). Based on FBP patterns, they develop Federated Behavioural Shields for robust aggregation to detect malicious or noisy clients.
Result: FBPs successfully provide informative trajectories describing client states and contributions to the global model, enabling identification of client clusters with similar behaviors. The proposed Federated Behavioural Shields defense mechanism outperforms existing state-of-the-art FL defense methods in detecting malicious clients and enhancing system security.
Conclusion: FBPs offer an effective solution for analyzing and visualizing federated learning client behavior, improving system transparency and trust. The resulting Federated Behavioural Shields technique provides superior defense against malicious clients compared to existing methods, enhancing the overall security and robustness of federated learning systems.
Abstract: Federated Learning (FL), a privacy-aware approach in distributed deep learning environments, enables many clients to collaboratively train a model without sharing sensitive data, thereby reducing privacy risks. However, enabling human trust and control over FL systems requires understanding the evolving behaviour of clients, whether beneficial or detrimental for the training, which still represents a key challenge in the current literature. To address this challenge, we introduce Federated Behavioural Planes (FBPs), a novel method to analyse, visualise, and explain the dynamics of FL systems, showing how clients behave under two different lenses: predictive performance (error behavioural space) and decision-making processes (counterfactual behavioural space). Our experiments demonstrate that FBPs provide informative trajectories describing the evolving states of clients and their contributions to the global model, thereby enabling the identification of clusters of clients with similar behaviours. Leveraging the patterns identified by FBPs, we propose a robust aggregation technique named Federated Behavioural Shields to detect malicious or noisy client models, thereby enhancing security and surpassing the efficacy of existing state-of-the-art FL defense mechanisms. Our code is publicly available on GitHub.
[341] Enhancing supply chain security with automated machine learning
Haibo Wang, Lutfu S. Sua, Bahram Alidaee
Main category: cs.LG
TL;DR: This paper presents an automated machine learning framework for supply chain security that detects fraud, predicts maintenance needs, and forecasts material backorders, achieving high accuracy rates (88-93.4%) across different applications using techniques like XGBoost and LightGBM.
Details
Motivation: Global supply chains face increasing challenges including disruptions, material shortages, and inflation due to their growing scale and complexity. The availability of vast amounts of data presents an opportunity to apply machine learning methods to tackle these challenges more efficiently than traditional solutions.
Method: The authors developed an automated ML framework incorporating multiple techniques including Random Forest, XGBoost, LightGBM, and Neural Networks. The framework includes streamlined data preprocessing, feature selection, hyperparameter tuning, model optimization, and inference deployment to address three key supply chain problems: fraud detection, maintenance prediction, and material backorder forecasting.
Result: The framework achieved strong performance across all applications: fraud detection reached 88% accuracy using sampling methods, machine failure prediction achieved 93.4% accuracy, and material backorder prediction reached 89.3% accuracy. Hyperparameter tuning significantly improved model performance, with XGBoost and LightGBM achieving up to 100% precision in certain supervised tasks.
Conclusion: The automated ML framework successfully enhances supply chain security by addressing critical operational challenges. The research demonstrates that machine learning techniques can effectively streamline supply chain operations, improve security through fraud detection, optimize maintenance scheduling, and enhance inventory management through accurate backorder prediction, ultimately boosting operational efficiency.
Abstract: The increasing scale and complexity of global supply chains have led to new challenges spanning various fields, such as supply chain disruptions due to long waiting lines at the ports, material shortages, and inflation. Coupled with the size of supply chains and the availability of vast amounts of data, efforts towards tackling such challenges have led to an increasing interest in applying machine learning methods in many aspects of supply chains. Unlike other solutions, ML techniques, including Random Forest, XGBoost, LightGBM, and Neural Networks, make predictions and approximate optimal solutions faster. This paper presents an automated ML framework to enhance supply chain security by detecting fraudulent activities, predicting maintenance needs, and forecasting material backorders. Using datasets of varying sizes, results show that fraud detection achieves an 88% accuracy rate using sampling methods, machine failure prediction reaches 93.4% accuracy, and material backorder prediction achieves 89.3% accuracy. Hyperparameter tuning significantly improved the performance of these models, with certain supervised techniques like XGBoost and LightGBM reaching up to 100% precision. This research contributes to supply chain security by streamlining data preprocessing, feature selection, model optimization, and inference deployment, addressing critical challenges and boosting operational efficiency.
[342] Unmasking Trees for Tabular Data
Calvin McCarter
Main category: cs.LG
TL;DR: UnmaskingTrees is a simple gradient-boosted decision tree method for tabular data imputation and generation that outperforms advanced deep learning methods on benchmarks, using incremental feature unmasking and a novel conditional generation approach called BaltoBot.
Details
Motivation: Traditional methods continue to outperform advanced deep learning and generative modeling techniques on tabular data imputation benchmarks, indicating a need for simpler yet more effective approaches that can handle the unique challenges of tabular data including missingness and mixed data types.
Method: The paper presents UnmaskingTrees, which uses gradient-boosted decision trees to incrementally unmask individual features for imputation and generation. For conditional generation, they propose BaltoBot, a tabular probabilistic prediction method that fits a balanced tree of boosted tree classifiers without parametric assumptions on conditional distributions.
Result: On 27 small tabular datasets, UnmaskingTrees achieved leading performance on imputation, state-of-the-art performance on generation with missing data, and competitive performance on vanilla generation without missingness. The method offers fast sampling, closed-form density estimation, and flexible handling of discrete variables compared to diffusion methods.
Conclusion: The study demonstrates that simple tree-based methods can outperform complex deep learning approaches for tabular data tasks. The proposed UnmaskingTrees and BaltoBot methods provide effective solutions for tabular imputation and generation while offering practical advantages like fast sampling and flexible data type handling.
Abstract: Despite much work on advanced deep learning and generative modeling techniques for tabular data generation and imputation, traditional methods have continued to win on imputation benchmarks. We herein present UnmaskingTrees, a simple method for tabular imputation (and generation) employing gradient-boosted decision trees which are used to incrementally unmask individual features. On a benchmark for out-of-the-box performance on 27 small tabular datasets, UnmaskingTrees offers leading performance on imputation; state-of-the-art performance on generation given data with missingness; and competitive performance on vanilla generation given data without missingness. To solve the conditional generation subproblem, we propose a tabular probabilistic prediction method, BaltoBot, which fits a balanced tree of boosted tree classifiers. Unlike older methods, it requires no parametric assumption on the conditional distribution, accommodating features with multimodal distributions; unlike newer diffusion methods, it offers fast sampling, closed-form density estimation, and flexible handling of discrete variables. We finally consider our two approaches as meta-algorithms, demonstrating in-context learning-based generative modeling with TabPFN.
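A minimal sketch of the feature-by-feature unmasking idea for imputation, using scikit-learn's gradient-boosted trees: features are visited in a random order and each missing entry is predicted from the columns observed (or already filled) so far. The model choice, single pass, and handling of discrete columns are simplifications of the paper's procedure, and BaltoBot's balanced classifier tree is not shown.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

def impute_by_unmasking(X, rng=None):
    """Impute missing entries one feature at a time, in a random 'unmasking' order.
    Each feature is predicted by a gradient-boosted tree fit on the rows where it
    is observed; HistGradientBoosting handles remaining NaNs in the inputs natively."""
    if rng is None:
        rng = np.random.default_rng(0)
    X = X.copy()
    for j in rng.permutation(X.shape[1]):
        missing = np.isnan(X[:, j])
        if not missing.any():
            continue
        other = np.delete(np.arange(X.shape[1]), j)
        model = HistGradientBoostingRegressor().fit(X[~missing][:, other], X[~missing, j])
        X[missing, j] = model.predict(X[missing][:, other])
    return X

# Toy usage: impute ~10% missing entries in a random matrix
X = np.random.default_rng(1).normal(size=(200, 5))
mask = np.random.default_rng(2).random(X.shape) < 0.1
print(np.isnan(impute_by_unmasking(np.where(mask, np.nan, X))).sum())  # 0
```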
[343] EnsemW2S: Enhancing Weak-to-Strong Generalization with Large Language Model Ensembles
Aakriti Agrawal, Mucong Ding, Zora Che, Chenghao Deng, Anirudh Satheesh, Bang An, Bayan Bruss, John Langford, Furong Huang
Main category: cs.LG
TL;DR: The paper proposes EnsemW2S, a token-level ensemble method that combines multiple weak expert models to better supervise stronger student models, addressing the weak-to-strong generalization challenge where smaller human-level models need to effectively guide more powerful LLMs.
Details
Motivation: As Large Language Models approach or exceed human-level performance, there's a critical need to develop methods that allow smaller, human-level models (trained only on human-level data) to effectively supervise and enhance these more powerful models. This weak-to-strong generalization challenge is imperative for safely controlling super-human AI systems.
Method: EnsemW2S uses a token-level ensemble strategy that iteratively combines multiple weak expert models. The method systematically identifies and addresses shortcomings from previous iterations, continuously refining weak models to enhance their collective supervisory capability for stronger student models. The ensemble approach enables better generalization to complex, super-human-level tasks.
Result: The method shows significant improvements across both in-distribution (ID) and out-of-distribution (OOD) datasets. Specifically, it achieves 4% and 3.2% improvements on ID datasets, and up to 6% and 2.28% improvements on OOD datasets for expert and student models respectively. The evaluation includes question difficulty as an additional dimension for measuring distributional shifts.
Conclusion: EnsemW2S effectively addresses the weak-to-strong generalization challenge by demonstrating that ensemble methods can significantly improve the ability of weak expert models to supervise stronger student models. The empirical results across various datasets validate the effectiveness of the proposed approach in advancing weak-to-strong generalization capabilities.
Abstract: With Large Language Models (LLMs) rapidly approaching and potentially surpassing human-level performance, it has become imperative to develop approaches capable of effectively supervising and enhancing these powerful models using smaller, human-level models exposed to only human-level data. We address this critical weak-to-strong (W2S) generalization challenge by proposing a novel method aimed at improving weak experts, by training on the same limited human-level data, enabling them to generalize to complex, super-human-level tasks. Our approach, called EnsemW2S, employs a token-level ensemble strategy that iteratively combines multiple weak experts, systematically addressing the shortcomings identified in preceding iterations. By continuously refining these weak models, we significantly enhance their collective ability to supervise stronger student models. We extensively evaluate the generalization performance of both the ensemble of weak experts and the subsequent strong student model across in-distribution (ID) and out-of-distribution (OOD) datasets. For OOD, we specifically introduce question difficulty as an additional dimension for defining distributional shifts. Our empirical results demonstrate notable improvements, achieving 4% and 3.2% improvements on ID datasets and up to 6% and 2.28% on OOD datasets for expert and student models, respectively, underscoring the effectiveness of our proposed method in advancing W2S generalization.
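The core mechanism, a token-level ensemble of weak experts, can be illustrated as follows. This is a hedged sketch only: the paper's iterative expert refinement and its training loop are omitted, and the weighting scheme below is a placeholder; the point is simply that the next-token distributions of several weak models are combined before they supervise the student.

```python
import numpy as np

def ensemble_next_token(expert_logits, weights=None):
    """Combine per-expert next-token logits into one distribution.

    expert_logits : (n_experts, vocab_size) logits for the next token
    weights       : optional per-expert weights (e.g., derived from earlier-round errors)
    """
    n_experts, _ = expert_logits.shape
    weights = np.ones(n_experts) / n_experts if weights is None else np.asarray(weights)
    # Softmax each expert, then mix the probability distributions.
    probs = np.exp(expert_logits - expert_logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    mixed = weights @ probs
    return mixed / mixed.sum()

logits = np.random.randn(3, 10)            # 3 weak experts, vocabulary of 10 tokens
print(ensemble_next_token(logits).round(3))
```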
[344] Challenges learning from imbalanced data using tree-based models: Prevalence estimates systematically depend on hyperparameters and can be upwardly biased
Nathan Phelps, Daniel J. Lizotte, Douglas G. Woolford
Main category: cs.LG
TL;DR: This paper investigates issues with analytical calibration of random forests trained on undersampled imbalanced datasets, finding that it can lead to upwardly biased prevalence estimates and revealing that decision trees can actually be biased towards minority classes contrary to common belief.
Details
Motivation: Imbalanced binary classification is common across many fields, and while undersampling the majority class is a standard approach, the resulting bias in model predictions needs to be corrected. Current analytical calibration methods may not work well for all machine learning models, particularly random forests.
Method: The authors analyze the effects of analytical calibration on random forests trained with undersampled majority class data, examining how prevalence estimates are affected by the number of predictors considered at each split and the sampling rate used during training.
Result: Analytical calibration of random forests leads to upwardly biased prevalence estimates that depend on both the number of predictors per split and the sampling rate. The study also discovered that decision trees can be biased towards minority classes, contradicting the widespread belief that they favor majority classes.
Conclusion: Random forests should not be calibrated using standard analytical methods when trained on undersampled data, as this introduces systematic bias. Additionally, the common assumption that decision trees are biased towards majority classes is incorrect - they can actually favor minority classes.
Abstract: Imbalanced binary classification problems arise in many fields of study. When using machine learning models for these problems, it is common to subsample the majority class (i.e., undersampling) to create a (more) balanced dataset for model training. This biases the model’s predictions because the model learns from a dataset that does not follow the same data generating process as new data. One way of accounting for this bias is to analytically map the resulting predictions to new values based on the sampling rate for the majority class, which was used to create the training dataset. While this approach may work well for some machine learning models, we show that calibrating a random forest this way has unintended negative consequences, including prevalence estimates that can be upwardly biased. These prevalence estimates depend on both i) the number of predictors considered at each split in the random forest; and ii) the sampling rate used. We explain the former using known properties of random forests and analytical calibration. However, in investigating the latter issue, we made a surprising discovery - contrary to the widespread belief that decision trees are biased towards the majority class, they actually can be biased towards the minority class.
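The analytical mapping the abstract refers to is, in its commonly used form, the following correction of a score p_s produced by a model trained after keeping only a fraction beta of the majority (negative) class. This sketch states that standard correction, which may differ in detail from the paper's setup; the paper's point is that applying it to a random forest can still leave prevalence estimates upwardly biased.

```python
import numpy as np

def calibrate_undersampled(p_s, beta):
    """Map scores from a model trained on undersampled data back to the
    original class prior.

    p_s  : predicted minority-class probabilities from the undersampled model
    beta : fraction of majority-class examples kept when undersampling
    """
    p_s = np.asarray(p_s, dtype=float)
    return beta * p_s / (beta * p_s - p_s + 1.0)

# Example: scores look inflated after keeping only 10% of the majority class
scores = np.array([0.2, 0.5, 0.9])
print(calibrate_undersampled(scores, beta=0.10))   # approx. [0.024, 0.091, 0.474]
```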
[345] A Coalition Game for On-demand Multi-modal 3D Automated Delivery System
Farzan Moosavi, Bilal Farooq
Main category: cs.LG
TL;DR: A multi-modal autonomous delivery optimization framework using coalition game theory for UAVs and ADRs that leverages deep reinforcement learning with graph attention networks to solve last-mile delivery problems in urban environments with operational constraints.
Details
Motivation: Address last-mile delivery challenges in urban environments, particularly high-density areas and time-critical applications, by enabling strategic collaboration between UAVs and ADRs (Autonomous Delivery Robots) to improve overall routing efficiency while handling real-world operational constraints like battery limitations and building obstructions.
Method: A coalition game theory framework combined with generalized reinforcement learning that uses an end-to-end deep multi-agent policy gradient method augmented by a novel spatio-temporal adjacency neighbourhood graph attention network with heterogeneous edge-enhanced attention model and transformer architecture to evaluate cost-sharing and learn cooperative behavior between UAVs and ADRs.
Result: Numerical experiments in Mississauga city showed the model achieves high-quality solutions compared to existing transformer-based and classical methods, performs well on non-homogeneous data distribution, generalizes across different scales and configurations, and demonstrates robust cooperative performance under stochastic scenarios with effective coalition analysis and cost allocation.
Conclusion: The proposed framework successfully addresses realistic operational constraints in multi-modal autonomous delivery systems and demonstrates the significant advantages of cooperation between UAVs and ADRs, showing superior performance compared to existing methods while maintaining robustness across various scenarios and configurations.
Abstract: We introduce a multi-modal autonomous delivery optimization framework as a coalition game for a fleet of UAVs and ADRs operating in two overlaying networks to address last-mile delivery in urban environments, including high-density areas and time-critical applications. The problem is defined as multiple depot pickup and delivery with time windows constrained over operational restrictions, such as vehicle battery limitation, precedence time window, and building obstruction. Utilizing the coalition game theory, we investigate cooperation structures among the modes to capture how strategic collaboration can improve overall routing efficiency. To do so, a generalized reinforcement learning model is designed to evaluate the cost-sharing and allocation to different modes to learn the cooperative behaviour with respect to various realistic scenarios. Our methodology leverages an end-to-end deep multi-agent policy gradient method augmented by a novel spatio-temporal adjacency neighbourhood graph attention network using a heterogeneous edge-enhanced attention model and transformer architecture. Several numerical experiments on last-mile delivery applications have been conducted; results from the case study in the city of Mississauga show that despite the incorporation of an extensive two-mode network in the graph and a complex training structure, the model addresses realistic operational constraints and achieves high-quality solutions compared with existing transformer-based and classical methods. It performs well on non-homogeneous data distributions, generalizes well across different scales and configurations, and demonstrates robust cooperative performance under stochastic scenarios across various tasks, as reflected by the coalition analysis and cost allocation, which signify the advantage of cooperation.
[346] Unified Sparse-Matrix Representations for Diverse Neural Architectures
Yuzhou Zhu
Main category: cs.LG
TL;DR: This paper introduces a unified matrix-order framework that represents convolutional, recurrent, and self-attention operations as sparse matrix multiplications, proving algebraic equivalence with standard CNN/RNN/Transformer layers while achieving comparable performance across vision, sequential, and language tasks.
Details
Motivation: The proliferation of specialized neural network architectures for different tasks (vision, sequential, language) obscures their underlying commonalities, creating a need for a unified mathematical framework that can reveal shared principles and enable more principled architecture design.
Method: The authors develop a matrix-order framework that casts different neural operations as sparse matrix multiplications: convolution as upper-triangular matrix first-order transformations, recurrence as lower-triangular matrix stepwise updates, and attention as third-order tensor factorization. They prove algebraic isomorphism with standard layers under mild assumptions.
Result: Empirical evaluations across multiple domains (image classification on MNIST/CIFAR-10/100/Tiny ImageNet, time-series forecasting on ETTh1/Electricity Load, and language tasks on AG News/WikiText-2/Penn Treebank) demonstrate that sparse-matrix formulations match or exceed native model performance while converging in comparable or fewer epochs.
Conclusion: The matrix perspective reduces architecture design to sparse pattern selection, aligns with GPU parallelism, leverages mature algebraic optimization tools, and establishes a mathematically rigorous substrate for diverse neural architectures, opening avenues for principled and hardware-aware network design.
Abstract: Deep neural networks employ specialized architectures for vision, sequential and language tasks, yet this proliferation obscures their underlying commonalities. We introduce a unified matrix-order framework that casts convolutional, recurrent and self-attention operations as sparse matrix multiplications. Convolution is realized via an upper-triangular weight matrix performing first-order transformations; recurrence emerges from a lower-triangular matrix encoding stepwise updates; attention arises naturally as a third-order tensor factorization. We prove algebraic isomorphism with standard CNN, RNN and Transformer layers under mild assumptions. Empirical evaluations on image classification (MNIST, CIFAR-10/100, Tiny ImageNet), time-series forecasting (ETTh1, Electricity Load Diagrams) and language modeling/classification (AG News, WikiText-2, Penn Treebank) confirm that sparse-matrix formulations match or exceed native model performance while converging in comparable or fewer epochs. By reducing architecture design to sparse pattern selection, our matrix perspective aligns with GPU parallelism and leverages mature algebraic optimization tools. This work establishes a mathematically rigorous substrate for diverse neural architectures and opens avenues for principled, hardware-aware network design.
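To make the "convolution is a sparse matrix multiplication" claim concrete, the sketch below shows that a 1-D convolution (cross-correlation) over a length-n signal equals multiplication by a banded Toeplitz matrix. The triangular and third-order-tensor constructions for recurrence and attention follow the same spirit but are not reproduced here; the helper name is illustrative.

```python
import numpy as np

def conv_as_matrix(kernel, n):
    """Banded matrix M such that M @ x equals the 'valid' 1-D cross-correlation
    of a length-n signal x with `kernel`."""
    k = len(kernel)
    M = np.zeros((n - k + 1, n))
    for i in range(n - k + 1):
        M[i, i:i + k] = kernel   # each row is the kernel shifted by one position
    return M

x = np.arange(8, dtype=float)
kernel = np.array([1.0, 0.0, -1.0])
matrix_result = conv_as_matrix(kernel, len(x)) @ x
direct_result = np.correlate(x, kernel, mode="valid")
print(np.allclose(matrix_result, direct_result))   # True
```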
[347] GenMol: A Drug Discovery Generalist with Discrete Diffusion
Seul Lee, Karsten Kreis, Srimukh Prasad Veccham, Meng Liu, Danny Reidenbach, Yuxing Peng, Saee Paliwal, Weili Nie, Arash Vahdat
Main category: cs.LG
TL;DR: GenMol is a unified molecular generative model that uses a single discrete diffusion framework to handle multiple drug discovery tasks, achieving state-of-the-art performance across de novo generation, fragment-constrained generation, hit generation, and lead optimization.
Details
Motivation: Existing molecular generative models can only tackle specific tasks in drug discovery, lacking versatility to handle the complex, multi-stage drug discovery process that requires a unified approach capable of addressing diverse scenarios within a single framework.
Method: GenMol employs a single discrete diffusion model that generates SAFE sequences through non-autoregressive bidirectional parallel decoding, uses fragments as molecular building blocks, implements fragment remasking for chemical space exploration, and introduces molecular context guidance (MCG) specifically designed for masked discrete diffusion.
Result: GenMol significantly outperforms previous GPT-based models in de novo generation and fragment-constrained generation, and achieves state-of-the-art performance in goal-directed hit generation and lead optimization tasks, demonstrating superior versatility across multiple drug discovery scenarios.
Conclusion: GenMol successfully provides a unified and versatile approach for molecular design that can tackle a wide range of drug discovery tasks using a single framework, representing a significant advancement in computational drug discovery methodology.
Abstract: Drug discovery is a complex process that involves multiple stages and tasks. However, existing molecular generative models can only tackle some of these tasks. We present Generalist Molecular generative model (GenMol), a versatile framework that uses only a single discrete diffusion model to handle diverse drug discovery scenarios. GenMol generates Sequential Attachment-based Fragment Embedding (SAFE) sequences through non-autoregressive bidirectional parallel decoding, thereby allowing the utilization of a molecular context that does not rely on the specific token ordering while having better sampling efficiency. GenMol uses fragments as basic building blocks for molecules and introduces fragment remasking, a strategy that optimizes molecules by regenerating masked fragments, enabling effective exploration of chemical space. We further propose molecular context guidance (MCG), a guidance method tailored for masked discrete diffusion of GenMol. GenMol significantly outperforms the previous GPT-based model in de novo generation and fragment-constrained generation, and achieves state-of-the-art performance in goal-directed hit generation and lead optimization. These results demonstrate that GenMol can tackle a wide range of drug discovery tasks, providing a unified and versatile approach for molecular design. Our code is available at https://github.com/NVIDIA-Digital-Bio/genmol.
[348] The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training
Fabian Schaipp, Alexander Hägele, Adrien Taylor, Umut Simsekli, Francis Bach
Main category: cs.LG
TL;DR: This paper demonstrates that learning-rate schedules for large language models follow optimization theory bounds and uses this insight to improve training by extending schedules and transferring optimal learning rates across different schedules.
Details
Motivation: The authors observed that learning-rate schedules for large model training behave surprisingly similar to performance bounds from non-smooth convex optimization theory, motivating them to explore whether this theoretical connection could be exploited for practical learning-rate tuning improvements.
Method: The researchers provide theoretical bounds for constant schedules with linear cooldown and demonstrate the close match between optimization theory and practice. They then exploit this connection by (1) extending training schedules with optimal learning rates for continued training and (2) transferring optimal learning rates across different scheduling strategies.
Result: The method achieved noticeable improvements when training 124M and 210M parameter Llama-type models. The theoretical bound reflects the practical benefit of cooldown through the absence of logarithmic terms, and the close match between optimization theory and practice enables effective learning-rate transfer and schedule extension.
Conclusion: Learning-rate schedules for large model training align closely with non-smooth convex optimization theory bounds, and this theoretical insight can be practically exploited to improve training performance through optimal learning-rate scheduling, extension, and transfer strategies.
Abstract: We show that learning-rate schedules for large model training behave surprisingly similar to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant schedule with linear cooldown; in particular, the practical benefit of cooldown is reflected in the bound due to the absence of logarithmic terms. Further, we show that this surprisingly close match between optimization theory and practice can be exploited for learning-rate tuning: we achieve noticeable improvements for training 124M and 210M Llama-type models by (i) extending the schedule for continued training with optimal learning-rate, and (ii) transferring the optimal learning-rate across schedules.
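The schedule the bound concerns, a constant learning rate followed by a linear cooldown to zero, is simple to state. The sketch below is just that schedule; the base rate, total steps, and cooldown fraction are assumed example values, not the paper's settings.

```python
def constant_with_linear_cooldown(step, total_steps, base_lr, cooldown_frac=0.2):
    """Constant learning rate, then a linear decay to zero over the final
    `cooldown_frac` of training."""
    cooldown_start = int(total_steps * (1.0 - cooldown_frac))
    if step < cooldown_start:
        return base_lr
    remaining = total_steps - cooldown_start
    return base_lr * max(0.0, (total_steps - step) / remaining)

schedule = [constant_with_linear_cooldown(s, total_steps=100, base_lr=3e-4) for s in range(100)]
print(schedule[0], schedule[79], schedule[90], schedule[99])  # constant, then linearly decaying
```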
[349] EXGnet: a single-lead explainable-AI guided multiresolution network with train-only quantitative features for trustworthy ECG arrhythmia classification
Tushar Talukder Showrav, Soyabul Islam Lincoln, Md. Kamrul Hasan
Main category: cs.LG
TL;DR: EXGnet is a novel ECG arrhythmia classification network that combines high accuracy (98.762% and 96.932% on benchmark datasets), explainability through XAI supervision, and edge device compatibility for single-lead ECG signals, addressing the clinical adoption challenges of deep learning models.
Details
Motivation: Deep learning models for ECG arrhythmia classification face challenges in clinical adoption due to lack of interpretability and difficulty deploying on resource-constrained edge devices, creating a need for models that balance high accuracy, explainability, and edge compatibility.
Method: EXGnet integrates XAI supervision during training using normalized cross-correlation loss to focus on clinically relevant ECG regions, incorporates automatically generated ground truth via heart rate variability-based approach, uses quantitative ECG features during training (excluded at inference), and employs an innovative multiresolution block for efficient feature capture.
Result: EXGnet achieved average five-fold accuracies of 98.762% and 96.932% on Chapman and Ningbo datasets respectively, with F1-scores of 97.910% and 95.527%. Ablation studies and interpretability assessments confirmed the effectiveness of XAI guidance in enhancing model focus and trustworthiness.
Conclusion: EXGnet successfully combines high-performance arrhythmia classification with interpretability, setting a new benchmark and paving the way for more trustworthy and accessible portable ECG-based health monitoring systems suitable for edge deployment.
Abstract: Deep learning has significantly propelled the performance of ECG arrhythmia classification, yet its clinical adoption remains hindered by challenges in interpretability and deployment on resource-constrained edge devices. To bridge this gap, we propose EXGnet, a novel and reliable ECG arrhythmia classification network tailored for single-lead signals, specifically designed to balance high accuracy, explainability, and edge compatibility. EXGnet integrates XAI supervision during training via a normalized cross-correlation based loss, directing the model’s attention to clinically relevant ECG regions, similar to a cardiologist’s focus. This supervision is driven by automatically generated ground truth, derived through an innovative heart rate variability-based approach, without the need for manual annotation. To enhance classification accuracy without compromising deployment simplicity, we incorporate quantitative ECG features during training. These enrich the model with multi-domain knowledge but are excluded during inference, keeping the model lightweight for edge deployment. Additionally, we introduce an innovative multiresolution block to efficiently capture both short and long-term signal features while maintaining computational efficiency. Rigorous evaluation on the Chapman and Ningbo benchmark datasets validates the supremacy of EXGnet, which achieves average five-fold accuracies of 98.762% and 96.932%, and F1-scores of 97.910% and 95.527%, respectively. Comprehensive ablation studies and both quantitative and qualitative interpretability assessment confirm that the XAI guidance is pivotal, demonstrably enhancing the model’s focus and trustworthiness. Overall, EXGnet sets a new benchmark by combining high-performance arrhythmia classification with interpretability, paving the way for more trustworthy and accessible portable ECG based health monitoring systems.
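A hedged sketch of the kind of normalized cross-correlation term described, aligning a model-produced relevance map with an automatically generated relevance mask. The exact attention maps, the heart-rate-variability-based mask construction, and the loss weighting used in EXGnet are not reproduced; shapes and names are illustrative.

```python
import torch

def ncc_loss(relevance_map, target_mask, eps=1e-8):
    """1 minus the normalized cross-correlation between the model's relevance
    map and a ground-truth relevance mask (both of shape (batch, length)).
    NCC is 1 when the two are perfectly linearly aligned, giving a loss of 0."""
    a = relevance_map - relevance_map.mean(dim=-1, keepdim=True)
    t = target_mask - target_mask.mean(dim=-1, keepdim=True)
    ncc = (a * t).sum(dim=-1) / (a.norm(dim=-1) * t.norm(dim=-1) + eps)
    return (1.0 - ncc).mean()

relevance = torch.rand(4, 500)    # per-sample relevance over an ECG segment
mask = torch.zeros(4, 500)
mask[:, 200:260] = 1.0            # e.g., a region flagged as clinically relevant
print(ncc_loss(relevance, mask))
```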
[350] MIRA: Medical Time Series Foundation Model for Real-World Health Data
Hao Li, Bowen Deng, Chang Xu, Zhiyuan Feng, Viktor Schlegel, Yu-Hao Huang, Yizheng Sun, Jingyuan Sun, Kailai Yang, Yiyao Yu, Jiang Bian
Main category: cs.LG
TL;DR: MIRA is a unified foundation model for medical time series forecasting that handles irregular intervals, heterogeneous sampling rates, and missing values through continuous-time encoding, frequency-specific mixture-of-experts, and neural ODE-based dynamics modeling, achieving 10% and 7% error reductions in out-of-distribution and in-distribution scenarios respectively.
Details
Motivation: Existing generalist time series foundation models struggle with medical time series data due to irregular intervals, heterogeneous sampling rates, and frequent missing values. There's a need for a unified foundation model that can reduce annotation burdens, minimize model customization, and enable robust transfer across clinical institutions, modalities, and tasks, especially in data-scarce or privacy-constrained environments.
Method: MIRA incorporates three key components: (1) a Continuous-Time Rotary Positional Encoding for fine-grained modeling of variable time intervals, (2) a frequency-specific mixture-of-experts layer that routes computation across latent frequency regimes for temporal specialization, and (3) a Continuous Dynamics Extrapolation Block based on Neural ODE that models the continuous trajectory of latent states for forecasting at arbitrary timestamps. The model is pretrained on over 454 billion time points from publicly available medical datasets.
Result: MIRA achieves reductions in forecasting errors by an average of 10% in out-of-distribution scenarios and 7% in in-distribution scenarios compared to other zero-shot and fine-tuned baselines. The authors also introduce a comprehensive benchmark spanning multiple downstream clinical tasks.
Conclusion: MIRA successfully addresses the challenges of medical time series modeling through its specialized architecture, demonstrating superior performance over existing baselines and establishing a foundation for future research in medical time series modeling with the introduction of a comprehensive benchmark.
Abstract: A unified foundation model for medical time series – pretrained on open access and ethics board-approved medical corpora – offers the potential to reduce annotation burdens, minimize model customization, and enable robust transfer across clinical institutions, modalities, and tasks, particularly in data-scarce or privacy-constrained environments. However, existing generalist time series foundation models struggle to handle medical time series data due to their inherent challenges, including irregular intervals, heterogeneous sampling rates, and frequent missing values. To address these challenges, we introduce MIRA, a unified foundation model specifically designed for medical time series forecasting. MIRA incorporates a Continuous-Time Rotary Positional Encoding that enables fine-grained modeling of variable time intervals, a frequency-specific mixture-of-experts layer that routes computation across latent frequency regimes to further promote temporal specialization, and a Continuous Dynamics Extrapolation Block based on Neural ODE that models the continuous trajectory of latent states, enabling accurate forecasting at arbitrary target timestamps. Pretrained on a large-scale and diverse medical corpus comprising over 454 billion time points collect from publicly available datasets, MIRA achieves reductions in forecasting errors by an average of 10% and 7% in out-of-distribution and in-distribution scenarios, respectively, when compared to other zero-shot and fine-tuned baselines. We also introduce a comprehensive benchmark spanning multiple downstream clinical tasks, establishing a foundation for future research in medical time series modeling.
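The continuous-time rotary positional encoding can be pictured as standard RoPE with rotation angles driven by real-valued timestamps rather than integer positions, so irregular sampling intervals are encoded directly; the sketch below (frequency base, tensor shapes) reflects that reading and is not MIRA's actual code.

```python
# Illustrative sketch (an interpretation, not MIRA's implementation) of a
# continuous-time rotary positional encoding: rotation angles come from
# real-valued timestamps, so irregular sampling intervals are encoded directly.
import torch

def continuous_time_rope(x: torch.Tensor, timestamps: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (batch, seq, dim) with even dim; timestamps: (batch, seq), e.g. in hours."""
    half = x.shape[-1] // 2
    freqs = 1.0 / (base ** (torch.arange(half, dtype=x.dtype) / half))  # (half,)
    angles = timestamps.unsqueeze(-1) * freqs                            # (batch, seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Example: irregularly spaced observations still receive consistent phase offsets.
out = continuous_time_rope(torch.randn(2, 5, 64), torch.tensor([[0.0, 0.5, 3.2, 3.3, 10.0]] * 2))
```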
[351] Data-Driven Exploration for a Class of Continuous-Time Indefinite Linear–Quadratic Reinforcement Learning Problems
Yilie Huang, Xun Yu Zhou
Main category: cs.LG
TL;DR: This paper proposes an adaptive exploration mechanism for reinforcement learning in continuous-time stochastic linear-quadratic control problems that adjusts entropy regularization and policy variance during learning, achieving sublinear regret bounds while improving convergence speed compared to fixed exploration schedules.
Details
Motivation: Existing model-free RL methods for continuous-time stochastic LQ control problems use constant or deterministic exploration schedules that require extensive tuning and ignore learning progress during iterations, leading to inefficient learning performance.
Method: A model-free, data-driven exploration mechanism that adaptively adjusts entropy regularization by the critic and policy variance by the actor, rather than using fixed exploration schedules throughout the learning process.
Result: The adaptive exploration method achieves sublinear regret bounds matching the best-known model-free results for LQ problems, while numerical experiments show accelerated convergence and improved regret performance compared to non-adaptive model-free and model-based approaches.
Conclusion: Adaptive exploration mechanisms can significantly improve learning efficiency in continuous-time stochastic LQ control problems while maintaining theoretical guarantees, offering a practical advantage over fixed exploration schedules with minimal tuning requirements.
Abstract: We study reinforcement learning (RL) for the same class of continuous-time stochastic linear–quadratic (LQ) control problems as in \cite{huang2024sublinear}, where volatilities depend on both states and controls while states are scalar-valued and running control rewards are absent. We propose a model-free, data-driven exploration mechanism that adaptively adjusts entropy regularization by the critic and policy variance by the actor. Unlike the constant or deterministic exploration schedules employed in \cite{huang2024sublinear}, which require extensive tuning for implementations and ignore learning progresses during iterations, our adaptive exploratory approach boosts learning efficiency with minimal tuning. Despite its flexibility, our method achieves a sublinear regret bound that matches the best-known model-free results for this class of LQ problems, which were previously derived only with fixed exploration schedules. Numerical experiments demonstrate that adaptive explorations accelerate convergence and improve regret performance compared to the non-adaptive model-free and model-based counterparts.
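As a purely schematic illustration of exploration that adapts to learning progress rather than following a fixed schedule, the toy class below shrinks a Gaussian policy's sampling variance and its entropy weight whenever the critic's error improves; the specific update rules are placeholders for intuition, not the paper's mechanism or its regret analysis.

```python
# Schematic illustration only: exploration parameters adapted from learning
# progress. The update rules below are placeholders, not the paper's scheme.
import numpy as np

class AdaptiveExploration:
    def __init__(self, init_variance=1.0, init_entropy_weight=0.1,
                 min_variance=1e-3, decay=0.9):
        self.variance = init_variance
        self.entropy_weight = init_entropy_weight
        self.min_variance = min_variance
        self.decay = decay
        self.prev_critic_error = None

    def update(self, critic_error: float) -> None:
        """Shrink exploration when the critic's error is improving; keep it otherwise."""
        if self.prev_critic_error is not None and critic_error < self.prev_critic_error:
            self.variance = max(self.min_variance, self.decay * self.variance)
            self.entropy_weight *= self.decay
        self.prev_critic_error = critic_error

    def sample_action(self, mean_action: float) -> float:
        return float(np.random.normal(mean_action, np.sqrt(self.variance)))
```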
[352] The Impact of Feature Scaling In Machine Learning: Effects on Regression and Classification Tasks
João Manoel Herrera Pinheiro, Suzana Vilas Boas de Oliveira, Thiago Henrique Segreto Silva, Pedro Antonio Rabelo Saraiva, Enzo Ferreira de Souza, Ricardo V. Godoy, Leonardo André Ambrosio, Marcelo Becker
Main category: cs.LG
TL;DR: This study comprehensively evaluates 12 feature scaling techniques across 14 ML algorithms and 16 datasets, finding that ensemble methods are robust to scaling choices while other models like Logistic Regression and SVMs show significant performance variations depending on the scaler used.
Details
Motivation: There is a critical lack of comprehensive studies on feature scaling effects across different machine learning algorithms, making it difficult for practitioners to choose optimal scaling techniques for their specific models and datasets.
Method: Systematic evaluation of 12 scaling techniques (including less common transformations) across 14 different ML algorithms and 16 datasets for both classification and regression tasks, analyzing predictive performance metrics (accuracy, MAE, MSE, R²) and computational costs (training time, inference time, memory usage).
Result: Ensemble methods (Random Forest, XGBoost, CatBoost, LightGBM) show robust performance largely independent of feature scaling, while other widely used models (Logistic Regression, SVMs, TabNet, MLPs) exhibit significant performance variations highly dependent on the chosen scaler.
Conclusion: The study provides model-specific guidance for practitioners on optimal feature scaling selection, with all source code and experimental results made publicly available for transparency and reproducibility.
Abstract: This research addresses the critical lack of comprehensive studies on feature scaling by systematically evaluating 12 scaling techniques - including several less common transformations - across 14 different Machine Learning algorithms and 16 datasets for classification and regression tasks. We meticulously analyzed impacts on predictive performance (using metrics such as accuracy, MAE, MSE, and $R^2$) and computational costs (training time, inference time, and memory usage). Key findings reveal that while ensemble methods (such as Random Forest and gradient boosting models like XGBoost, CatBoost and LightGBM) demonstrate robust performance largely independent of scaling, other widely used models such as Logistic Regression, SVMs, TabNet, and MLPs show significant performance variations highly dependent on the chosen scaler. This extensive empirical analysis, with all source code, experimental results, and model parameters made publicly available to ensure complete transparency and reproducibility, offers model-specific crucial guidance to practitioners on the need for an optimal selection of feature scaling techniques.
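The kind of grid the study runs, scalers crossed with models under cross-validation, can be reproduced in miniature with scikit-learn pipelines; the snippet below uses a small subset of scalers and models on a toy dataset purely for illustration.

```python
# Minimal illustration of crossing feature scalers with models via scikit-learn
# pipelines; this is a small subset of the 12 scalers and 14 models in the study.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)
scalers = {"standard": StandardScaler(), "minmax": MinMaxScaler(), "robust": RobustScaler()}
models = {"logreg": LogisticRegression(max_iter=5000), "rf": RandomForestClassifier(n_estimators=200)}

for s_name, scaler in scalers.items():
    for m_name, model in models.items():
        pipe = Pipeline([("scale", scaler), ("model", model)])
        acc = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()
        print(f"{s_name:>8} + {m_name:<6}: {acc:.3f}")
```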
[353] Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning
Prajwal Koirala, Cody Fleming
Main category: cs.LG
TL;DR: The paper proposes Single-Step Completion Policy (SSCP), a generative policy that uses flow-matching to enable one-shot action generation, combining the expressiveness of generative models with the efficiency of unimodal policies for reinforcement learning.
Details
Motivation: Existing generative models like diffusion and flow-matching for offline RL suffer from high inference costs and training instability due to iterative sampling and gradient propagation across multiple sampling steps, despite their ability to capture rich, multimodal action distributions.
Method: SSCP uses an augmented flow-matching objective to train a generative policy that predicts direct completion vectors from intermediate flow samples, enabling accurate one-shot action generation. It operates within an off-policy actor-critic framework and extends to goal-conditioned RL settings.
Result: SSCP achieves strong performance across standard offline RL and behavior cloning benchmarks, demonstrating substantial gains in speed and adaptability over diffusion-based baselines. It scales effectively to offline, offline-to-online, and online RL settings.
Conclusion: SSCP successfully combines the expressiveness of generative models with the training and inference efficiency of unimodal policies, positioning it as a versatile, expressive, and efficient framework for deep RL and sequential decision-making without requiring long backpropagation chains.
Abstract: Generative models such as diffusion and flow-matching offer expressive policies for offline reinforcement learning (RL) by capturing rich, multimodal action distributions, but their iterative sampling introduces high inference costs and training instability due to gradient propagation across sampling steps. We propose the \textit{Single-Step Completion Policy} (SSCP), a generative policy trained with an augmented flow-matching objective to predict direct completion vectors from intermediate flow samples, enabling accurate, one-shot action generation. In an off-policy actor-critic framework, SSCP combines the expressiveness of generative models with the training and inference efficiency of unimodal policies, without requiring long backpropagation chains. Our method scales effectively to offline, offline-to-online, and online RL settings, offering substantial gains in speed and adaptability over diffusion-based baselines. We further extend SSCP to goal-conditioned RL, enabling flat policies to exploit subgoal structures without explicit hierarchical inference. SSCP achieves strong results across standard offline RL and behavior cloning benchmarks, positioning it as a versatile, expressive, and efficient framework for deep RL and sequential decision-making.
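One way to read the "direct completion vector" objective is sketched below: from an interpolated point on the straight-line flow between noise and the data action, the network is trained to predict the vector that finishes the path in a single step. This is an illustrative interpretation under simple assumptions (linear interpolation path, MSE loss), not SSCP's exact training objective.

```python
# Hedged sketch of a single-step completion objective for a generative policy:
# from an intermediate flow sample x_t, predict the vector that completes the
# path to the action. An illustrative reading, not SSCP's exact loss.
import torch
import torch.nn as nn

def completion_loss(policy_net: nn.Module, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    noise = torch.randn_like(action)                        # x_0
    t = torch.rand(action.shape[0], 1, device=action.device)
    x_t = (1.0 - t) * noise + t * action                    # point on the straight-line flow
    pred_completion = policy_net(torch.cat([state, x_t, t], dim=-1))
    target_completion = action - x_t                        # finishes the path in one step
    return ((pred_completion - target_completion) ** 2).mean()

# Toy usage; at inference a single forward pass yields action_hat = x_t + predicted completion.
state_dim, act_dim = 17, 6
policy = nn.Sequential(nn.Linear(state_dim + act_dim + 1, 256), nn.ReLU(), nn.Linear(256, act_dim))
loss = completion_loss(policy, torch.randn(32, state_dim), torch.randn(32, act_dim))
```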
[354] Leveraging RAG-LLMs for Urban Mobility Simulation and Analysis
Yue Ding, Conor McCarthy, Kevin O’Shea, Mingming Liu
Main category: cs.LG
TL;DR: This paper presents a cloud-based e-mobility platform that uses Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) to provide personalized route recommendations through a mobile app, achieving high query accuracy for both system operators and users.
Details
Motivation: With the growing demand for smart mobility and shared e-mobility services, there is a need for comprehensive end-to-end solutions that can provide intelligent decision-making, user interaction, and real-time traffic analysis to deliver personalized mobility experiences.
Method: The authors developed a cloud-based platform integrating LLMs with a RAG framework and XiYanSQL for schema-level operations. The system includes a mobile application for route recommendations and an optimization module that considers travel time and cost across different traffic scenarios.
Result: The LLM-powered RAG framework with XiYanSQL achieved an average execution accuracy of 0.81 for system operator queries and 0.98 for user queries. The optimization module was successfully evaluated across different traffic scenarios for travel time and cost optimization.
Conclusion: The proposed cloud-based, LLM-powered shared e-mobility platform successfully demonstrates the potential of integrating advanced AI technologies with mobility services, achieving high accuracy in query processing and providing effective personalized route recommendations for users.
Abstract: With the rise of smart mobility and shared e-mobility services, numerous advanced technologies have been applied to this field. Cloud-based traffic simulation solutions have flourished, offering increasingly realistic representations of the evolving mobility landscape. LLMs have emerged as pioneering tools, providing robust support for various applications, including intelligent decision-making, user interaction, and real-time traffic analysis. As user demand for e-mobility continues to grow, delivering comprehensive end-to-end solutions has become crucial. In this paper, we present a cloud-based, LLM-powered shared e-mobility platform, integrated with a mobile application for personalized route recommendations. The optimization module is evaluated based on travel time and cost across different traffic scenarios. Additionally, the LLM-powered RAG framework is evaluated at the schema level for different users, using various evaluation methods. Schema-level RAG with XiYanSQL achieves an average execution accuracy of 0.81 on system operator queries and 0.98 on user queries.
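Execution accuracy for generated SQL is typically computed by running the predicted and reference queries against the same database and comparing result sets; a minimal sqlite-based version is shown below, with the database path and queries as placeholders (a generic metric sketch, not the paper's evaluation code).

```python
# Minimal illustration of execution accuracy for generated SQL: run the predicted
# and reference queries and compare result sets. Database and queries are placeholders.
import sqlite3

def execution_match(db_path: str, predicted_sql: str, reference_sql: str) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        pred = conn.execute(predicted_sql).fetchall()
        ref = conn.execute(reference_sql).fetchall()
    except sqlite3.Error:
        return False
    finally:
        conn.close()
    return sorted(map(tuple, pred)) == sorted(map(tuple, ref))

def execution_accuracy(db_path: str, pairs) -> float:
    """pairs: iterable of (predicted_sql, reference_sql) strings."""
    pairs = list(pairs)
    hits = sum(execution_match(db_path, p, r) for p, r in pairs)
    return hits / len(pairs) if pairs else 0.0
```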
[355] Fake or Real: The Impostor Hunt in Texts for Space Operations
Agata Kaczmarek, Dawid Płudowski, Piotr Wilczyński, Krzysztof Kotowski, Ramez Shendy, Evridiki Ntagiou, Jakub Nalepa, Artur Janicki, Przemysław Biecek
Main category: cs.LG
TL;DR: A Kaggle competition focused on detecting malicious modifications in Large Language Models by distinguishing between proper LLM outputs and outputs generated under adversarial conditions, addressing real AI security threats in space domain applications.
Details
Motivation: The competition addresses two critical AI security threats identified in the "Assurance for Space Domain AI Applications" project: data poisoning and overreliance in Large Language Models. These are real-life security concerns that require new detection techniques due to limited existing research in this area.
Method: Participants are challenged to develop new techniques or adapt existing methods to distinguish between legitimate LLM outputs and outputs generated from maliciously modified LLMs. The competition format allows for innovative approaches to tackle this under-researched problem.
Result: This is a competition announcement rather than a research paper with results. The competition is part of a series related to space domain AI assurance, hosted on Kaggle as the “Fake or Real” impostor hunt challenge.
Conclusion: The competition represents an important step in addressing AI security vulnerabilities in space applications, specifically targeting the detection of malicious LLM modifications. It calls for innovative solutions to an under-researched but critical security problem.
Abstract: The “Fake or Real” competition hosted on Kaggle (https://www.kaggle.com/competitions/fake-or-real-the-impostor-hunt ) is the second part of a series of follow-up competitions and hackathons related to the “Assurance for Space Domain AI Applications” project funded by the European Space Agency (https://assurance-ai.space-codev.org/ ). The competition idea is based on two real-life AI security threats identified within the project – data poisoning and overreliance in Large Language Models. The task is to distinguish between the proper output from LLM and the output generated under malicious modification of the LLM. As this problem was not extensively researched, participants are required to develop new techniques to address this issue or adjust already existing ones to this problem’s statement.
[356] Artificial Intelligence for Green Hydrogen Yield Prediction and Site Suitability using SHAP-Based Composite Index: Focus on Oman
Obumneme Zimuzor Nwafor, Mohammed Abdul Majeed Al Hooti
Main category: cs.LG
TL;DR: This study develops an AI framework using machine learning and SHAP analysis to identify optimal locations for green hydrogen production in solar-rich arid regions, achieving 98% predictive accuracy and revealing that water proximity, elevation, and seasonal variation are the most critical factors for site suitability.
Details
Motivation: Nations seeking sustainable alternatives to fossil fuels face challenges in identifying optimal locations for green hydrogen production due to complex environmental factors and limited direct hydrogen yield data, particularly in solar-rich arid regions with green hydrogen potential.
Method: A novel AI framework consisting of a multi-stage pipeline: (1) unsupervised multi-variable clustering, (2) supervised machine learning classifier, and (3) SHAP (SHapley Additive exPlanations) algorithm for computing green hydrogen yield and site suitability index using integrated meteorological, topographic, and temporal datasets.
Result: The framework achieved 98% model predictive accuracy and identified distinct spatial patterns of suitability. Water proximity (2.470891), elevation (2.376296), and seasonal variation (1.273216) emerged as the most influential factors for green hydrogen site suitability in Oman based on mean absolute SHAP values.
Conclusion: The study provides an objective and reproducible alternative to subjective expert weightings for green hydrogen site selection, offering industry stakeholders and policymakers a replicable and scalable tool for infrastructure planning and decision-making in data-scarce regions with green hydrogen ambitions.
Abstract: As nations seek sustainable alternatives to fossil fuels, green hydrogen has emerged as a promising strategic pathway toward decarbonisation, particularly in solar-rich arid regions. However, identifying optimal locations for hydrogen production requires the integration of complex environmental, atmospheric, and infrastructural factors, often compounded by limited availability of direct hydrogen yield data. This study presents a novel Artificial Intelligence (AI) framework for computing green hydrogen yield and site suitability index using mean absolute SHAP (SHapley Additive exPlanations) values. This framework consists of a multi-stage pipeline of unsupervised multi-variable clustering, supervised machine learning classifier and SHAP algorithm. The pipeline trains on an integrated meteorological, topographic and temporal dataset and the results revealed distinct spatial patterns of suitability and relative influence of the variables. With model predictive accuracy of 98%, the result also showed that water proximity, elevation and seasonal variation are the most influential factors determining green hydrogen site suitability in Oman with mean absolute shap values of 2.470891, 2.376296 and 1.273216 respectively. Given limited or absence of ground-truth yield data in many countries that have green hydrogen prospects and ambitions, this study offers an objective and reproducible alternative to subjective expert weightings, thus allowing the data to speak for itself and potentially discover novel latent groupings without pre-imposed assumptions. This study offers industry stakeholders and policymakers a replicable and scalable tool for green hydrogen infrastructure planning and other decision making in data-scarce regions.
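The three-stage pipeline (unsupervised clustering, supervised classifier, SHAP ranking) can be sketched as follows; the synthetic dataframe, feature names, and hyperparameters are placeholders, and the study's actual data and preprocessing are not reproduced.

```python
# Sketch of the clustering -> classifier -> SHAP pipeline; the dataframe, feature
# names, and hyperparameters are placeholders, not the study's actual setup.
import numpy as np
import pandas as pd
import shap
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

features = ["solar_irradiance", "water_proximity", "elevation", "seasonal_variation"]
df = pd.DataFrame(np.random.rand(500, len(features)), columns=features)  # placeholder data

# Stage 1: unsupervised clustering provides pseudo-labels for suitability classes.
X = StandardScaler().fit_transform(df[features])
df["suitability_class"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Stage 2: a supervised classifier learns to reproduce the cluster assignments.
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(df[features], df["suitability_class"])

# Stage 3: mean absolute SHAP values rank the drivers of site suitability.
sv = np.array(shap.TreeExplainer(clf).shap_values(df[features]))
# Collapse all axes except the feature axis; handles shap versions that return
# either a list of per-class arrays or a single multi-dimensional array.
feat_axis = [i for i, s in enumerate(sv.shape) if s == len(features)][-1]
importance = np.abs(sv).mean(axis=tuple(i for i in range(sv.ndim) if i != feat_axis))
print(dict(zip(features, importance)))
```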
[357] Deep RL Dual Sourcing Inventory Management with Supply and Capacity Risk Awareness
Defeng Liu, Ying Liu, Carson Eisenach
Main category: cs.LG
TL;DR: This paper proposes using reinforcement learning with intervention models to solve large-scale stochastic optimization problems, specifically applied to multi-sourcing multi-period inventory management in supply chains by breaking down complex processes into composable deep learning modules.
Details
Motivation: Traditional approaches to large-scale stochastic optimization problems in supply chain management struggle with complex physical constraints and stochastic processes. There is a need for more efficient methods that can better explore solution spaces and handle the complexity of real-world supply chain optimization problems.
Method: The approach leverages reinforcement learning with intervention models and pre-trained deep learning models to simulate and compose stochastic processes. Key components include: (1) Deep RL models for learning and forecasting stochastic supply chain processes, (2) A constraint coordination mechanism to forecast dual costs for cross-product constraints, and (3) Breaking down complex supply chain processes into scalable and composable DL modules instead of directly modeling physical constraints into the RL optimization problem.
Result: The proposed methodology demonstrates improved performance on large real-world datasets for the multi-sourcing multi-period inventory management problem. The modular approach successfully handles the complexity of supply chain optimization by decomposing the problem into manageable components.
Conclusion: The paper shows that decomposing complex stochastic optimization problems into scalable and composable deep learning modules can lead to better performance than traditional monolithic approaches. The authors identify this as a promising direction and outline open problems for future research to further investigate the efficacy of such modular RL-based models.
Abstract: In this work, we study how to efficiently apply reinforcement learning (RL) for solving large-scale stochastic optimization problems by leveraging intervention models. The key of the proposed methodology is to better explore the solution space by simulating and composing the stochastic processes using pre-trained deep learning (DL) models. We demonstrate our approach on a challenging real-world application, the multi-sourcing multi-period inventory management problem in supply chain optimization. In particular, we employ deep RL models for learning and forecasting the stochastic supply chain processes under a range of assumptions. Moreover, we also introduce a constraint coordination mechanism, designed to forecast dual costs given the cross-products constraints in the inventory network. We highlight that instead of directly modeling the complex physical constraints into the RL optimization problem and solving the stochastic problem as a whole, our approach breaks down those supply chain processes into scalable and composable DL modules, leading to improved performance on large real-world datasets. We also outline open problems for future research to further investigate the efficacy of such models.
[358] HyDRA: A Hybrid-Driven Reasoning Architecture for Verifiable Knowledge Graphs
Adrian Kaiser, Claudiu Leoveanu-Condrei, Ryan Gold, Marius-Constantin Dinu, Markus Hofmarcher
Main category: cs.LG
TL;DR: HyDRA is a hybrid neurosymbolic architecture that automates Knowledge Graph construction by first building ontologies through collaborative agents and competency questions, then using design-by-contract principles to guide LLM-based triplet extraction, with symbolic verification for ensuring functional correctness.
Details
Motivation: Automated Knowledge Graph construction faces critical challenges including output reliability, consistency, and verifiability issues that manifest as structural inconsistencies like isolated data islands and incorrect conflation of abstract classes with specific instances, creating a bottleneck for advancing neurosymbolic AI.
Method: HyDRA employs a two-stage approach: (1) collaborative neurosymbolic agents construct domain ontologies by agreeing on competency questions that define scope and requirements, (2) the resulting ontology graph guides automated triplet extraction from documents using design-by-contract principles as control mechanisms for Large Language Models.
Result: The approach produces verifiable Knowledge Graphs with improved reliability and structural consistency. An evaluation framework using symbolic verifications from the SymbolicAI framework demonstrates the functional correctness of generated KGs beyond standard benchmarks.
Conclusion: HyDRA successfully addresses key challenges in automated KG construction by combining neurosymbolic collaboration for ontology design with contract-driven LLM guidance, providing both improved reliability and novel evaluation methods for measuring functional integrity of generated Knowledge Graphs.
Abstract: The synergy between symbolic knowledge, often represented by Knowledge Graphs (KGs), and the generative capabilities of neural networks is central to advancing neurosymbolic AI. A primary bottleneck in realizing this potential is the difficulty of automating KG construction, which faces challenges related to output reliability, consistency, and verifiability. These issues can manifest as structural inconsistencies within the generated graphs, such as the formation of disconnected $\textit{isolated islands}$ of data or the inaccurate conflation of abstract classes with specific instances. To address these challenges, we propose HyDRA, a $\textbf{Hy}$brid-$\textbf{D}$riven $\textbf{R}$easoning $\textbf{A}$rchitecture designed for verifiable KG automation. Given a domain or an initial set of documents, HyDRA first constructs an ontology via a panel of collaborative neurosymbolic agents. These agents collaboratively agree on a set of competency questions (CQs) that define the scope and requirements the ontology must be able to answer. Given these CQs, we build an ontology graph that subsequently guides the automated extraction of triplets for KG generation from arbitrary documents. Inspired by design-by-contracts (DbC) principles, our method leverages verifiable contracts as the primary control mechanism to steer the generative process of Large Language Models (LLMs). To verify the output of our approach, we extend beyond standard benchmarks and propose an evaluation framework that assesses the functional correctness of the resulting KG by leveraging symbolic verifications as described by the neurosymbolic AI framework, $\textit{SymbolicAI}$. This work contributes a hybrid-driven architecture for improving the reliability of automated KG construction and the exploration of evaluation methods for measuring the functional integrity of its output. The code is publicly available.
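The design-by-contract idea can be illustrated with a simple post-condition check on extracted triples: a triple is accepted only if its predicate is declared in the agreed ontology and its subject and object types match the predicate's domain and range. The data structures below are hypothetical placeholders, not HyDRA's contract format.

```python
# Toy illustration of a design-by-contract style check on LLM-extracted triples:
# reject triples whose predicate or entity types are not licensed by the ontology.
# All data structures here are hypothetical placeholders.
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str

ONTOLOGY = {
    "classes": {"Drug", "Disease"},
    "predicates": {"treats": ("Drug", "Disease")},  # predicate -> (domain, range)
}

def satisfies_contract(triple: Triple, entity_types: dict) -> bool:
    """Post-condition: the predicate is declared and domain/range classes match."""
    if triple.predicate not in ONTOLOGY["predicates"]:
        return False
    domain, rng = ONTOLOGY["predicates"][triple.predicate]
    return entity_types.get(triple.subject) == domain and entity_types.get(triple.obj) == rng

# Example: an accepted versus a rejected extraction.
types = {"aspirin": "Drug", "headache": "Disease"}
print(satisfies_contract(Triple("aspirin", "treats", "headache"), types))  # True
print(satisfies_contract(Triple("headache", "treats", "aspirin"), types))  # False
```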
[359] RIS-aided Latent Space Alignment for Semantic Channel Equalization
Tomás Hüttebräucker, Mario Edoardo Pandolfo, Simone Fiorellino, Emilio Calvanese Strinati, Paolo Di Lorenzo
Main category: cs.LG
TL;DR: This paper proposes a joint physical and semantic channel equalization framework using Reconfigurable Intelligent Surfaces (RIS) to address semantic mismatches in MIMO semantic communication systems, where independently trained AI devices may have divergent latent representations that impede mutual understanding.
Details
Motivation: In semantic communication systems using Deep Neural Networks, multi-user settings with independently trained agents can lead to semantic mismatches due to divergent latent representations across AI-native devices, causing communication failures even without traditional transmission errors. This problem needs to be addressed to enable effective semantic communication in practical multi-user scenarios.
Method: The authors propose a joint physical and semantic channel equalization framework leveraging RIS in MIMO channels. The semantic equalization consists of three stages: (i) pre-equalization at the transmitter, (ii) propagation through the RIS-aided channel, and (iii) post-equalization at the receiver. They formulate this as a constrained MMSE optimization problem and develop two solutions: a linear semantic equalization chain and a non-linear DNN-based semantic equalizer, both operating under semantic compression and power constraints.
Result: Extensive evaluations demonstrate that the proposed joint equalization strategies consistently outperform conventional disjoint approaches to physical and semantic channel equalization across various scenarios and wireless channel conditions, showing improved performance in addressing semantic mismatches while maintaining communication efficiency.
Conclusion: The joint physical and semantic channel equalization framework with RIS successfully addresses semantic mismatch problems in multi-user semantic communication systems, providing superior performance compared to traditional separated approaches and enabling more effective semantic communication in practical wireless environments.
Abstract: Semantic communication systems introduce a new paradigm in wireless communications, focusing on transmitting the intended meaning rather than ensuring strict bit-level accuracy. These systems often rely on Deep Neural Networks (DNNs) to learn and encode meaning directly from data, enabling more efficient communication. However, in multi-user settings where interacting agents are trained independently-without shared context or joint optimization-divergent latent representations across AI-native devices can lead to semantic mismatches, impeding mutual understanding even in the absence of traditional transmission errors. In this work, we address semantic mismatch in Multiple-Input Multiple-Output (MIMO) channels by proposing a joint physical and semantic channel equalization framework that leverages the presence of Reconfigurable Intelligent Surfaces (RIS). The semantic equalization is implemented as a sequence of transformations: (i) a pre-equalization stage at the transmitter; (ii) propagation through the RIS-aided channel; and (iii) a post-equalization stage at the receiver. We formulate the problem as a constrained Minimum Mean Squared Error (MMSE) optimization and propose two solutions: (i) a linear semantic equalization chain, and (ii) a non-linear DNN-based semantic equalizer. Both methods are designed to operate under semantic compression in the latent space and adhere to transmit power constraints. Through extensive evaluations, we show that the proposed joint equalization strategies consistently outperform conventional, disjoint approaches to physical and semantic channel equalization across a broad range of scenarios and wireless channel conditions.
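The linear semantic equalization solution can be illustrated with the classical MMSE estimator: given paired samples of transmitted and received latent vectors, the linear equalizer is the cross-covariance times the inverse of the received covariance. The random channel below is a toy stand-in for the RIS-aided MIMO link and ignores the power and compression constraints handled in the paper.

```python
# Toy numpy illustration of a linear MMSE (post-)equalizer in latent space:
# W = R_xy @ inv(R_yy). The random channel stands in for the RIS-aided link.
import numpy as np

rng = np.random.default_rng(0)
d_tx, d_rx, n = 8, 8, 10_000

X = rng.standard_normal((n, d_tx))                       # transmitted latent vectors
H = rng.standard_normal((d_rx, d_tx)) / np.sqrt(d_tx)    # placeholder effective channel
noise = 0.1 * rng.standard_normal((n, d_rx))
Y = X @ H.T + noise                                      # received latent vectors

R_xy = X.T @ Y / n                                       # cross-covariance E[x y^T]
R_yy = Y.T @ Y / n                                       # received covariance E[y y^T]
W = R_xy @ np.linalg.inv(R_yy)                           # linear MMSE equalizer

X_hat = Y @ W.T
print(f"per-dimension MSE after equalization: {np.mean((X_hat - X) ** 2):.4f}")
```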
cs.MA
[360] Budget Allocation Policies for Real-Time Multi-Agent Path Finding
Raz Beck, Roni Stern
Main category: cs.MA
TL;DR: This paper addresses Real-Time Multi-Agent Pathfinding (RT-MAPF) by exploring different policies for allocating limited planning budgets across agents, showing that distributing budgets among agents outperforms shared budget approaches in constrained scenarios.
Details
Motivation: Existing RT-MAPF solutions use windowed MAPF algorithms without explicitly considering planning budget constraints. Real-world scenarios require agents to commit to movements within fixed time limits, but current approaches don't effectively allocate computational resources when planning budgets are limited.
Method: The authors explore different policies for allocating planning budgets in windowed versions of standard MAPF algorithms (Prioritized Planning and MAPF-LNS2). They compare baseline shared budget approaches against policies that distribute the planning budget across individual agents.
Result: The baseline approach where all agents share a common planning budget pool performs poorly in over-constrained situations. Budget distribution policies that allocate planning time among individual agents can solve more problems and achieve a smaller makespan compared to shared budget approaches.
Conclusion: Distributing planning budgets across agents is more effective than shared budget pooling for RT-MAPF in constrained environments. This approach enables better problem-solving capabilities and improved path quality (smaller makespan) when computational resources are limited.
Abstract: Multi-Agent Pathfinding (MAPF) is the problem of finding paths for a set of agents such that each agent reaches its desired destination while avoiding collisions with the other agents. Many MAPF solvers are designed to run offline, that is, first generate paths for all agents and then execute them. Real-Time MAPF (RT-MAPF) embodies a realistic MAPF setup in which one cannot wait until a complete path for each agent has been found before they start to move. Instead, planning and execution are interleaved, where the agents must commit to a fixed number of steps in a constant amount of computation time, referred to as the planning budget. Existing solutions to RT-MAPF iteratively call windowed versions of MAPF algorithms in every planning period, without explicitly considering the size of the planning budget. We address this gap and explore different policies for allocating the planning budget in windowed versions of standard MAPF algorithms, namely Prioritized Planning (PrP) and MAPF-LNS2. Our exploration shows that the baseline approach in which all agents draw from a shared planning budget pool is ineffective in over-constrained situations. Instead, policies that distribute the planning budget over the agents are able to solve more problems with a smaller makespan.
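The two budget-allocation families compared in the paper can be contrasted with a toy sketch: a shared pool that agents draw from in turn versus per-agent slices of the budget. Here `windowed_plan` is a stub standing in for a windowed planner such as PrP or MAPF-LNS2, and the numbers are placeholders.

```python
# Toy sketch contrasting a shared planning-budget pool with per-agent slices.
# `windowed_plan` is a stub standing in for a windowed planner (PrP, MAPF-LNS2).
def windowed_plan(agent_id: int, budget_s: float) -> float:
    """Plan the agent's next window within `budget_s`; return the time used (stub)."""
    return min(budget_s, 0.02)  # placeholder planning cost per agent

def shared_pool_policy(agents, total_budget_s: float) -> None:
    remaining = total_budget_s
    for agent in agents:                      # earlier agents can exhaust the pool
        remaining -= windowed_plan(agent, remaining)

def equal_split_policy(agents, total_budget_s: float) -> None:
    slice_s = total_budget_s / len(agents)    # every agent gets a dedicated slice
    for agent in agents:
        windowed_plan(agent, slice_s)

def weighted_split_policy(agents, total_budget_s: float, weights) -> None:
    # Distribution policies can also weight slices, e.g. by how constrained an agent is.
    total_w = sum(weights)
    for agent, w in zip(agents, weights):
        windowed_plan(agent, total_budget_s * w / total_w)
```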
[361] Parallelism Meets Adaptiveness: Scalable Documents Understanding in Multi-Agent LLM Systems
Chengxuan Xia, Qianye Wu, Sixuan Tian, Yilun Hao
Main category: cs.MA
TL;DR: This paper presents a coordination framework for LLM agents that improves collaborative task completion through dynamic task routing, bidirectional feedback, and parallel agent evaluation, showing significant improvements over static multi-agent systems.
Details
Motivation: Existing multi-agent LLM frameworks rely on static workflows, fixed roles, and limited inter-agent communication, which reduces their effectiveness in open-ended, high-complexity domains. There is a need for more adaptive coordination mechanisms that can handle complex collaborative tasks more effectively.
Method: The framework introduces three core mechanisms: (1) dynamic task routing that allows agents to reallocate tasks based on confidence and workload, (2) bidirectional feedback enabling agents to exchange structured critiques for iterative improvement, and (3) parallel agent evaluation where multiple agents compete on high-ambiguity subtasks with evaluator-driven selection of the best result.
Result: The framework demonstrates substantial improvements in factual coverage, coherence, and efficiency compared to static and partially adaptive baselines when implemented in a modular architecture.
Conclusion: The study highlights the benefits of incorporating both adaptiveness and structured competition in multi-agent LLM systems, showing that dynamic coordination mechanisms significantly enhance collaborative task completion performance.
Abstract: Large language model (LLM) agents have shown increasing promise for collaborative task completion. However, existing multi-agent frameworks often rely on static workflows, fixed roles, and limited inter-agent communication, reducing their effectiveness in open-ended, high-complexity domains. This paper proposes a coordination framework that enables adaptiveness through three core mechanisms: dynamic task routing, bidirectional feedback, and parallel agent evaluation. The framework allows agents to reallocate tasks based on confidence and workload, exchange structured critiques to iteratively improve outputs, and crucially compete on high-ambiguity subtasks with evaluator-driven selection of the most suitable result. We instantiate these principles in a modular architecture and demonstrate substantial improvements in factual coverage, coherence, and efficiency over static and partially adaptive baselines. Our findings highlight the benefits of incorporating both adaptiveness and structured competition in multi-agent LLM systems.
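A toy version of confidence- and workload-aware routing is sketched below: each incoming task goes to the agent with the best confidence-per-load score, and the assignment updates that agent's load. The scoring rule and data structures are illustrative assumptions, not the paper's mechanism.

```python
# Toy sketch of confidence/workload-based task routing among LLM agents.
# The scoring rule is a placeholder for intuition, not the paper's mechanism.
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    confidence: dict                      # task_type -> self-reported confidence in [0, 1]
    load: int = 0                         # number of tasks currently assigned
    tasks: list = field(default_factory=list)

def route(task_type: str, agents: list[Agent]) -> Agent:
    """Assign the task to the agent with the best confidence-per-load score."""
    best = max(agents, key=lambda a: a.confidence.get(task_type, 0.0) / (1 + a.load))
    best.load += 1
    best.tasks.append(task_type)
    return best

agents = [
    Agent("extractor", {"extract_facts": 0.9, "summarize": 0.4}),
    Agent("writer", {"extract_facts": 0.3, "summarize": 0.8}),
]
for t in ["extract_facts", "summarize", "extract_facts"]:
    print(t, "->", route(t, agents).name)
```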
[362] Resilient Multi-Agent Negotiation for Medical Supply Chains:Integrating LLMs and Blockchain for Transparent Coordination
Mariam ALMutairi, Hyungmin Kim
Main category: cs.MA
TL;DR: This paper proposes a hybrid framework combining blockchain technology with LLM-powered multi-agent systems to improve medical supply chain resilience during health emergencies like COVID-19, enabling autonomous agents to negotiate resource allocation while ensuring transparency and accountability through smart contracts.
Details
Motivation: Global health emergencies like COVID-19 exposed critical weaknesses in traditional medical supply chains, including inefficiencies in resource allocation, lack of transparency, and poor adaptability to dynamic disruptions, creating an urgent need for more resilient and accountable supply chain systems.
Method: The framework integrates blockchain technology with a decentralized LLM-powered multi-agent negotiation system, where autonomous agents representing manufacturers, distributors, and healthcare institutions engage in structured negotiations facilitated by LLMs. The system features an off-chain agent layer for adaptive reasoning and an on-chain blockchain layer for immutable enforcement via smart contracts, connected through a formal cross-layer communication protocol.
Result: Simulation experiments emulating pandemic scenarios demonstrated improvements in negotiation efficiency, fairness of allocation, supply chain responsiveness, and auditability compared to traditional approaches, validating the system’s enhanced performance during crisis situations.
Conclusion: The research successfully demonstrates that synergizing blockchain trust guarantees with LLM-driven agent intelligence provides a robust and scalable solution for critical supply chain coordination under uncertainty, offering an innovative approach to address supply chain vulnerabilities during global health emergencies.
Abstract: Global health emergencies, such as the COVID-19 pandemic, have exposed critical weaknesses in traditional medical supply chains, including inefficiencies in resource allocation, lack of transparency, and poor adaptability to dynamic disruptions. This paper presents a novel hybrid framework that integrates blockchain technology with a decentralized, large language model (LLM) powered multi-agent negotiation system to enhance the resilience and accountability of medical supply chains during crises. In this system, autonomous agents-representing manufacturers, distributors, and healthcare institutions-engage in structured, context-aware negotiation and decision-making processes facilitated by LLMs, enabling rapid and ethical allocation of scarce medical resources. The off-chain agent layer supports adaptive reasoning and local decision-making, while the on-chain blockchain layer ensures immutable, transparent, and auditable enforcement of decisions via smart contracts. The framework also incorporates a formal cross-layer communication protocol to bridge decentralized negotiation with institutional enforcement. A simulation environment emulating pandemic scenarios evaluates the system’s performance, demonstrating improvements in negotiation efficiency, fairness of allocation, supply chain responsiveness, and auditability. This research contributes an innovative approach that synergizes blockchain trust guarantees with the adaptive intelligence of LLM-driven agents, providing a robust and scalable solution for critical supply chain coordination under uncertainty.
[363] Fair Compromises in Participatory Budgeting: a Multi-Agent Deep Reinforcement Learning Approach
Hugh Adams, Srijoni Majumdar, Evangelos Pournaras
Main category: cs.MA
TL;DR: This paper proposes a multi-agent deep reinforcement learning approach to support decision-making in participatory budgeting, using a branching neural network to help voters develop better voting strategies and achieve fairer compromises through smaller-cost projects.
Details
Motivation: Participatory budgeting faces the challenge of "choice overload" where voters struggle to make decisions among numerous projects. There's a need for decision support systems that can help voters make better choices while ensuring fair distribution of public funds and enabling policymakers to design more equitable elections.
Method: The paper develops a multi-agent deep reinforcement learning approach with a novel branching neural network architecture to overcome scalability challenges in decentralized environments. The method optimizes voter actions to increase representation of voter preferences in winning project sets, providing decision support for both voters and policymakers.
Result: Experimental evaluation using real-world participatory budgeting data shows that fair compromises can be achieved, with an identified pattern that fairness is more achievable through projects with smaller costs. The approach successfully helps voters increase the winning proportion of their preferred projects.
Conclusion: The multi-agent reinforcement learning approach with branching neural networks provides an effective and ethically aligned solution for participatory budgeting decision support. The key finding that smaller-cost projects enable fairer compromises offers practical insights for both voters and election designers to improve participatory budgeting outcomes.
Abstract: Participatory budgeting is a method of collectively understanding and addressing spending priorities where citizens vote on how a budget is spent, it is regularly run to improve the fairness of the distribution of public funds. Participatory budgeting requires voters to make decisions on projects which can lead to ``choice overload". A multi-agent reinforcement learning approach to decision support can make decision making easier for voters by identifying voting strategies that increase the winning proportion of their vote. This novel approach can also support policymakers by highlighting aspects of election design that enable fair compromise on projects. This paper presents a novel, ethically aligned approach to decision support using multi-agent deep reinforcement learning modelling. This paper introduces a novel use of a branching neural network architecture to overcome scalability challenges of multi-agent reinforcement learning in a decentralized way. Fair compromises are found through optimising voter actions towards greater representation of voter preferences in the winning set. Experimental evaluation with real-world participatory budgeting data reveals a pattern in fair compromise: that it is achievable through projects with smaller cost.
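The branching architecture can be pictured as a shared trunk feeding one small head per project, so per-project action values grow linearly with the number of projects instead of requiring one output per joint action; the PyTorch sketch below uses placeholder sizes and is not the paper's exact network.

```python
# Minimal PyTorch sketch of a branching network: a shared trunk feeds one small
# head per project, so per-project action values scale linearly with the number
# of projects. Layer sizes are placeholders, not the paper's architecture.
import torch
import torch.nn as nn

class BranchingQNet(nn.Module):
    def __init__(self, obs_dim: int, n_projects: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        # One branch per project, each scoring the two actions {skip, vote-for}.
        self.branches = nn.ModuleList([nn.Linear(hidden, 2) for _ in range(n_projects)])

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h = self.trunk(obs)
        return torch.stack([branch(h) for branch in self.branches], dim=1)  # (batch, n_projects, 2)

net = BranchingQNet(obs_dim=32, n_projects=20)
q_values = net(torch.randn(4, 32))
votes = q_values.argmax(dim=-1)  # per-project greedy vote for each agent in the batch
```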
cs.MM
[364] A Highly Clean Recipe Dataset with Ingredient States Annotation for State Probing Task
Mashiro Toyooka, Kiyoharu Aizawa, Yoko Yamakata
Main category: cs.MM
TL;DR: This paper introduces state probing for cooking recipes to evaluate how well Large Language Models (LLMs) can track ingredient state changes during cooking procedures, using a new Japanese recipe dataset with detailed ingredient state annotations.
Details
Motivation: LLMs are trained on procedural texts but don't observe real-world phenomena, making it challenging for them to track intermediate ingredient states in cooking recipes where these states are often omitted from the text.
Method: The authors construct a Japanese recipe dataset with clear ingredient state change annotations from well-structured recipe texts, then design three novel tasks to evaluate LLMs’ ability to track ingredient state transitions and identify ingredients at intermediate cooking steps.
Result: Experiments with widely used LLMs (Llama3.1-70B and Qwen2.5-72B) demonstrate that learning ingredient state knowledge improves their understanding of cooking processes, achieving performance comparable to commercial LLMs.
Conclusion: State probing is an effective method for evaluating and improving LLMs’ understanding of cooking procedures, and incorporating ingredient state knowledge enhances model performance in recipe comprehension tasks.
Abstract: Large Language Models (LLMs) are trained on a vast amount of procedural texts, but they do not directly observe real-world phenomena. In the context of cooking recipes, this poses a challenge, as intermediate states of ingredients are often omitted, making it difficult for models to track ingredient states and understand recipes accurately. In this paper, we apply state probing, a method for evaluating a language model’s understanding of the world, to the domain of cooking. We propose a new task and dataset for evaluating how well LLMs can recognize intermediate ingredient states during cooking procedures. We first construct a new Japanese recipe dataset with clear and accurate annotations of ingredient state changes, collected from well-structured and controlled recipe texts. Using this dataset, we design three novel tasks to evaluate whether LLMs can track ingredient state transitions and identify ingredients present at intermediate steps. Our experiments with widely used LLMs, such as Llama3.1-70B and Qwen2.5-72B, show that learning ingredient state knowledge improves their understanding of cooking processes, achieving performance comparable to commercial LLMs.
[365] QuMAB: Query-based Multi-annotator Behavior Pattern Learning
Liyun Zhang, Zheng Lian, Hong Liu, Takanori Takebe, Yuta Nakashima
Main category: cs.MM
TL;DR: The paper proposes QuMATL, a paradigm shift from aggregating annotations to modeling individual annotator behavior patterns, treating disagreements as valuable information rather than noise to improve annotation efficiency and understanding.
Details
Motivation: Traditional multi-annotator learning aggregates diverse annotations to find a single ground truth, treating disagreements as noise. However, subjective tasks often lack absolute ground truth, and sparse annotation coverage makes aggregation statistically unreliable, creating fundamental challenges in the field.
Method: QuMATL (Query-based Multi-Annotator Behavior Pattern Learning) uses lightweight queries to model individual annotator behavior patterns while capturing inter-annotator correlations as implicit regularization. This prevents overfitting to sparse individual data while maintaining individualization and improving generalization, with visualization of annotator focus regions for explainable analysis.
Result: The approach can reconstruct unlabeled data to reduce annotation cost, enhance aggregation reliability, and explain annotator decision behavior. Two large-scale datasets are contributed: STREET (4,300 labels/annotator) and AMER (average 3,118 labels/annotator), with AMER being the first multimodal multi-annotator dataset.
Conclusion: The paradigm shift from sample-wise aggregation to annotator-wise behavior modeling successfully treats annotator disagreements as valuable information, leading to improved annotation efficiency, better understanding of annotator behavior, and more reliable multi-annotator learning systems.
Abstract: Multi-annotator learning traditionally aggregates diverse annotations to approximate a single ground truth, treating disagreements as noise. However, this paradigm faces fundamental challenges: subjective tasks often lack absolute ground truth, and sparse annotation coverage makes aggregation statistically unreliable. We introduce a paradigm shift from sample-wise aggregation to annotator-wise behavior modeling. By treating annotator disagreements as valuable information rather than noise, modeling annotator-specific behavior patterns can reconstruct unlabeled data to reduce annotation cost, enhance aggregation reliability, and explain annotator decision behavior. To this end, we propose QuMATL (Query-based Multi-Annotator Behavior Pattern Learning), which uses light-weight queries to model individual annotators while capturing inter-annotator correlations as implicit regularization, preventing overfitting to sparse individual data while maintaining individualization and improving generalization, with a visualization of annotator focus regions offering an explainable analysis of behavior understanding. We contribute two large-scale datasets with dense per-annotator labels: STREET (4,300 labels/annotator) and AMER (average 3,118 labels/annotator), the first multimodal multi-annotator dataset.
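The query-based idea can be sketched as one learnable query vector per annotator attending over a sample's features, with a shared backbone and head providing the implicit regularization across annotators; the dimensions and attention layer below are assumptions for illustration, not the paper's architecture.

```python
# Hedged PyTorch sketch of query-based per-annotator modeling: one learnable query
# per annotator attends over sample features; the shared attention and head let
# annotators regularize each other. Dimensions and layers are assumptions.
import torch
import torch.nn as nn

class AnnotatorQueryModel(nn.Module):
    def __init__(self, n_annotators: int, feat_dim: int = 256, n_classes: int = 5):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_annotators, feat_dim) * 0.02)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(feat_dim, n_classes)   # shared across annotators

    def forward(self, sample_feats: torch.Tensor) -> torch.Tensor:
        """sample_feats: (batch, n_tokens, feat_dim) -> logits (batch, n_annotators, n_classes)."""
        b = sample_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)          # (batch, n_annotators, feat_dim)
        attended, _ = self.attn(q, sample_feats, sample_feats)   # each query reads the sample
        return self.head(attended)

model = AnnotatorQueryModel(n_annotators=10)
logits = model(torch.randn(2, 49, 256))   # per-annotator predictions for each sample
```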
[366] Fact-Checking with Contextual Narratives: Leveraging Retrieval-Augmented LLMs for Social Media Analysis
Arka Ujjal Dey, Muhammad Junaid Awan, Georgia Channing, Christian Schroeder de Witt, John Collomosse
Main category: cs.MM
TL;DR: CRAVE is a novel fact-checking framework that combines retrieval-augmented LLMs with clustering techniques to automatically verify social media claims by retrieving multimodal evidence, organizing it into coherent narratives, and providing explained verdicts.
Details
Motivation: Social media fact-checking faces challenges with diverse and contradictory sources of information. There is a need for automated systems that can handle multimodal evidence (text and images) from various sources and provide explainable fact-checking decisions to support human fact-checkers.
Method: CRAVE integrates retrieval-augmented Large Language Models with clustering techniques. The framework automatically retrieves multimodal evidence from diverse sources, clusters this evidence into coherent narratives, and uses an LLM-based judge to evaluate the evidence and deliver fact-checking verdicts with explanatory summaries. The system incorporates agent-based refinement to ensure consistency and diversity in evidence representation.
Result: Comprehensive experiments show that CRAVE demonstrates strong performance across multiple metrics: high retrieval precision in gathering relevant evidence, good clustering quality in organizing evidence into coherent groups, and accurate judgment in fact-checking decisions. The system successfully handles both text and image modalities.
Conclusion: CRAVE represents a robust decision-support tool for fact-checkers, effectively combining multimodal evidence retrieval, clustering, and LLM-based evaluation to provide accurate and explainable fact-checking verdicts for social media content.
Abstract: We propose CRAVE (Cluster-based Retrieval Augmented Verification with Explanation); a novel framework that integrates retrieval-augmented Large Language Models (LLMs) with clustering techniques to address fact-checking challenges on social media. CRAVE automatically retrieves multimodal evidence from diverse, often contradictory, sources. Evidence is clustered into coherent narratives, and evaluated via an LLM-based judge to deliver fact-checking verdicts explained by evidence summaries. By synthesizing evidence from both text and image modalities and incorporating agent-based refinement, CRAVE ensures consistency and diversity in evidence representation. Comprehensive experiments demonstrate CRAVE’s efficacy in retrieval precision, clustering quality, and judgment accuracy, showcasing its potential as a robust decision-support tool for fact-checkers.
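The narrative-clustering step can be illustrated in miniature with TF-IDF embeddings and k-means over retrieved text snippets; CRAVE's multimodal evidence retrieval, agent-based refinement, and LLM judging are not reproduced here.

```python
# Minimal illustration of clustering retrieved text evidence into narratives with
# TF-IDF + k-means; the evidence snippets below are invented placeholders.
from collections import defaultdict
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

evidence = [
    "Officials confirm the bridge closure was planned maintenance.",
    "Viral post claims the bridge collapsed overnight.",
    "City engineers report scheduled repairs on the bridge this week.",
    "Eyewitness video allegedly shows the bridge falling.",
]

X = TfidfVectorizer(stop_words="english").fit_transform(evidence)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

narratives = defaultdict(list)
for text, label in zip(evidence, labels):
    narratives[label].append(text)
for label, texts in narratives.items():
    print(f"narrative {label}: {len(texts)} items")
```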
eess.AS
[367] Does Language Matter for Early Detection of Parkinson’s Disease from Speech?
Peter Plantinga, Briac Cordelle, Dominique Louër, Mirco Ravanelli, Denise Klein
Main category: eess.AS
TL;DR: This study investigates the role of language in detecting Parkinson’s disease from speech samples by comparing different pretrained models and data types, finding that text-only models can match vocal-feature models and that multilingual approaches outperform monolingual ones.
Details
Motivation: There is considerable disagreement in the literature about how to best collect and analyze speech data for Parkinson's disease detection and monitoring, with traditional methods using sustained vowel phonation while recent research explores more cognitively demanding tasks.
Method: The researchers tested pretrained models with varying data types and pretraining objectives, comparing text-only models, vocal-feature models, multilingual vs monolingual Whisper models, and AudioSet pretrained models on both sustained vowel phonation (SVP) and spontaneous speech tasks.
Result: Key findings include: (1) text-only models achieve performance comparable to vocal-feature models, (2) multilingual Whisper outperforms self-supervised models while monolingual Whisper performs worse, and (3) AudioSet pretraining enhances performance on SVP tasks but not on spontaneous speech.
Conclusion: The findings highlight the critical role of language content (rather than just acoustic features) for early detection of Parkinson’s disease, suggesting that linguistic analysis may be as important as traditional vocal biomarkers in PD detection systems.
Abstract: Using speech samples as a biomarker is a promising avenue for detecting and monitoring the progression of Parkinson’s disease (PD), but there is considerable disagreement in the literature about how best to collect and analyze such data. Early research in detecting PD from speech used a sustained vowel phonation (SVP) task, while some recent research has explored recordings of more cognitively demanding tasks. To assess the role of language in PD detection, we tested pretrained models with varying data types and pretraining objectives and found that (1) text-only models match the performance of vocal-feature models, (2) multilingual Whisper outperforms self-supervised models whereas monolingual Whisper does worse, and (3) AudioSet pretraining improves performance on SVP but not spontaneous speech. These findings together highlight the critical role of language for the early detection of Parkinson’s disease.
[368] Towards Robust Speech Recognition for Jamaican Patois Music Transcription
Jordan Madden, Matthew Stone, Dimitri Johnson, Daniel Geddez
Main category: eess.AS
TL;DR: Researchers created a 40+ hour manually transcribed Jamaican Patois music dataset to improve speech recognition systems, which currently perform poorly on Patois audio, and used it to fine-tune ASR models and develop scaling laws for Whisper models.
Details
Motivation: Current speech recognition systems perform poorly on Jamaican Patois music, producing inaccurate captions that limit accessibility and hinder downstream applications for this widely spoken language.
Method: A data-centric approach involving curation of a dataset of more than 40 hours of manually transcribed Jamaican Patois music, followed by fine-tuning state-of-the-art automatic speech recognition (ASR) models using this dataset.
Result: Successfully fine-tuned ASR models on Jamaican Patois music and developed scaling laws for the performance of Whisper models on Jamaican Patois audio.
Conclusion: This work aims to have a positive impact on the accessibility of Jamaican Patois music and contribute to the future development of Jamaican Patois language modeling.
Abstract: Although Jamaican Patois is a widely spoken language, current speech recognition systems perform poorly on Patois music, producing inaccurate captions that limit accessibility and hinder downstream applications. In this work, we take a data-centric approach to this problem by curating more than 40 hours of manually transcribed Patois music. We use this dataset to fine-tune state-of-the-art automatic speech recognition (ASR) models, and use the results to develop scaling laws for the performance of Whisper models on Jamaican Patois audio. We hope that this work will have a positive impact on the accessibility of Jamaican Patois music and the future of Jamaican Patois language modeling.
[369] Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems
Nima Yazdani, Ali Ansari, Aruj Mahajan, Amirhossein Afsharrad, Seyed Shahabeddin Mousavi
Main category: eess.AS
TL;DR: This paper presents a large-scale empirical study comparing different combinations of speech-to-text, large language models, and text-to-speech components in voice-based conversational AI systems using data from over 300,000 AI-conducted job interviews.
Details
Motivation: Systematic evaluation of different component combinations in cascaded voice AI architectures (STT x LLM x TTS) in production settings remains understudied, despite their increasing use in conversational AI systems.Method: The researchers developed an automated evaluation framework using LLM-as-a-Judge to assess conversational quality, technical accuracy, and skill assessment capabilities across four production configurations using real-world data from AI-conducted job interviews.
Result: Google STT paired with GPT-4.1 significantly outperformed other configurations in conversational and technical quality metrics. However, objective quality metrics showed weak correlation with user satisfaction scores, indicating that user experience depends on factors beyond technical performance.
Conclusion: The study provides practical guidance for selecting components in multimodal conversational AI systems and contributes a validated evaluation methodology for voice-based interactions, while revealing the complexity of user satisfaction in voice AI systems.
Abstract: Voice-based conversational AI systems increasingly rely on cascaded architectures combining speech-to-text (STT), large language models (LLMs), and text-to-speech (TTS) components. However, systematic evaluation of different component combinations in production settings remains understudied. We present a large-scale empirical comparison of STT x LLM x TTS stacks using data from over 300,000 AI-conducted job interviews. We develop an automated evaluation framework using LLM-as-a-Judge to assess conversational quality, technical accuracy, and skill assessment capabilities. Our analysis of four production configurations reveals that Google STT paired with GPT-4.1 significantly outperforms alternatives in both conversational and technical quality metrics. Surprisingly, we find that objective quality metrics correlate weakly with user satisfaction scores, suggesting that user experience in voice-based AI systems depends on factors beyond technical performance. Our findings provide practical guidance for selecting components in multimodal conversational AI systems and contribute a validated evaluation methodology for voice-based interactions.
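A rough, hedged sketch of an LLM-as-a-Judge call of the kind the framework describes; the rubric text, scoring scale, and model name are assumptions, and the paper's actual prompts and configurations are not shown here.

```python
# Illustrative LLM-as-a-Judge sketch; rubric, scale, and model name are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

RUBRIC = (
    "You are evaluating an AI-conducted job interview transcript. "
    "Return JSON with integer scores 1-5 for: conversational_quality, "
    "technical_accuracy, skill_assessment."
)

def judge(transcript: str, model: str = "gpt-4o-mini") -> dict:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": transcript},
        ],
        temperature=0,
    )
    # Assumes the model returns bare JSON; a production judge would validate this.
    return json.loads(resp.choices[0].message.content)

# scores = judge("Interviewer: ... Candidate: ...")
```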
[370] From Black Box to Biomarker: Sparse Autoencoders for Interpreting Speech Models of Parkinson’s Disease
Peter Plantinga, Jen-Kai Chen, Roozbeh Sattari, Mirco Ravanelli, Denise Klein
Main category: eess.AS
TL;DR: This paper uses sparse autoencoders (SAEs) to make speech-based Parkinson’s disease detection models more interpretable by uncovering meaningful internal representations that correspond to known speech deficits in PD patients.
Details
Motivation: Deep learning systems for PD detection from speech can find subtle signals but are black-box models that hinder clinical adoption due to lack of interpretability. There's a need to understand what these models learn to make them clinically useful.Method: The authors apply sparse autoencoders (SAEs) with a novel mask-based activation mechanism adapted for small biomedical datasets to extract interpretable dictionary representations from a speech-based PD detection system’s internal representations.
Result: The SAE dictionary entries showed strong associations with characteristic PD speech deficits including reduced spectral flux and increased spectral flatness in low-energy regions. Spectral flux was found to correlate with putamen volumetric measurements from MRI scans.
Conclusion: SAEs can successfully reveal clinically relevant and interpretable biomarkers from black-box speech-based PD detection models, demonstrating potential for disease monitoring and diagnosis while bridging the gap between model performance and clinical interpretability.
Abstract: Speech holds promise as a cost-effective and non-invasive biomarker for neurological conditions such as Parkinson’s disease (PD). While deep learning systems trained on raw audio can find subtle signals not available from hand-crafted features, their black-box nature hinders clinical adoption. To address this, we apply sparse autoencoders (SAEs) to uncover interpretable internal representations from a speech-based PD detection system. We introduce a novel mask-based activation for adapting SAEs to small biomedical datasets, creating sparse disentangled dictionary representations. These dictionary entries are found to have strong associations with characteristic articulatory deficits in PD speech, such as reduced spectral flux and increased spectral flatness in the low-energy regions highlighted by the model attention. We further show that the spectral flux is related to volumetric measurements of the putamen from MRI scans, demonstrating the potential of SAEs to reveal clinically relevant biomarkers for disease monitoring and diagnosis.
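A minimal sparse autoencoder sketch in PyTorch; since the paper's mask-based activation is not spelled out in the summary, sparsity is approximated below with a per-sample top-k mask over dictionary activations, which should be read as one common SAE variant rather than the authors' exact mechanism.

```python
# Minimal top-k sparse autoencoder over pooled speech-model embeddings (placeholder data).
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model: int, d_dict: int, k: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)
        self.k = k

    def forward(self, x):
        a = torch.relu(self.enc(x))                         # dictionary activations
        topk = torch.topk(a, self.k, dim=-1)
        mask = torch.zeros_like(a).scatter_(-1, topk.indices, 1.0)
        a_sparse = a * mask                                 # keep only top-k entries
        return self.dec(a_sparse), a_sparse

x = torch.randn(32, 768)                                    # 32 utterance-level embeddings
sae = TopKSAE(d_model=768, d_dict=4096, k=32)
recon, codes = sae(x)
loss = nn.functional.mse_loss(recon, x) + 1e-3 * codes.abs().mean()   # recon + sparsity
loss.backward()
```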
[371] Segmentation-free Goodness of Pronunciation
Xinwei Cao, Zijian Fan, Torbjørn Svendsen, Giampiero Salvi
Main category: eess.AS
TL;DR: This paper proposes improved goodness of pronunciation (GOP) methods for mispronunciation detection that work with modern CTC-based acoustic models, achieving state-of-the-art results on phoneme-level pronunciation assessment without requiring pre-segmentation of speech.
Details
Motivation: Traditional GOP methods for mispronunciation detection require pre-segmentation of speech into phonetic units, which limits accuracy and prevents the use of modern CTC-based acoustic models. There is a need for alignment-free methods that can leverage these advanced models for better pronunciation assessment in computer-aided language learning systems.Method: The authors propose two new GOP variants: (1) GOP-SA (self-alignment GOP) that enables CTC-trained ASR models for MDD, and (2) GOP-AF (alignment-free GOP) that considers all possible alignments of target phonemes. They provide theoretical foundations, numerical solutions, and proper normalization techniques to handle acoustic models with different temporal characteristics.
Result: Extensive experiments on CMU Kids and Speechocean762 datasets demonstrate the effectiveness of the proposed methods. The feature vectors derived from GOP-AF achieve state-of-the-art results on phoneme-level pronunciation assessment, outperforming recent studies on the Speechocean762 dataset.
Conclusion: The proposed GOP-SA and GOP-AF methods successfully enable the use of modern CTC-based acoustic models for mispronunciation detection without requiring speech pre-segmentation. The alignment-free approach (GOP-AF) particularly shows superior performance and represents a significant advancement in phoneme-level pronunciation assessment for language learning applications.
Abstract: Mispronunciation detection and diagnosis (MDD) is a significant part in modern computer aided language learning (CALL) systems. Within MDD, phoneme-level pronunciation assessment is key to helping L2 learners improve their pronunciation. However, most systems are based on a form of goodness of pronunciation (GOP) which requires pre-segmentation of speech into phonetic units. This limits the accuracy of these methods and the possibility to use modern CTC-based acoustic models for their evaluation. In this study, we first propose self-alignment GOP (GOP-SA) that enables the use of CTC-trained ASR models for MDD. Next, we define a more general alignment-free method that takes all possible alignments of the target phoneme into account (GOP-AF). We give a theoretical account of our definition of GOP-AF, an implementation that solves potential numerical issues as well as a proper normalization which makes the method applicable with acoustic models with different peakiness over time. We provide extensive experimental results on the CMU Kids and Speechocean762 datasets comparing the different definitions of our methods, estimating the dependency of GOP-AF on the peakiness of the acoustic models and on the amount of context around the target phoneme. Finally, we compare our methods with recent studies over the Speechocean762 data showing that the feature vectors derived from the proposed method achieve state-of-the-art results on phoneme-level pronunciation assessment.
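A rough sketch of an alignment-free, GOP-style score obtained from a CTC acoustic model, approximating the score as the length-normalized negative CTC loss of the target phoneme sequence; the paper's GOP-AF definition, normalization for model peakiness, and context handling are more involved, and all shapes and ids below are placeholders.

```python
# GOP-style score from a CTC model: higher = closer to the canonical pronunciation.
import torch
import torch.nn.functional as F

def ctc_gop_score(log_probs, target_phonemes, blank=0):
    """
    log_probs: (T, 1, C) frame-wise log-posteriors from a CTC model for one utterance.
    target_phonemes: (L,) tensor of target phoneme ids (no blanks).
    """
    T = log_probs.size(0)
    L = target_phonemes.size(0)
    nll = F.ctc_loss(
        log_probs,
        target_phonemes.unsqueeze(0),
        input_lengths=torch.tensor([T]),
        target_lengths=torch.tensor([L]),
        blank=blank,
        reduction="sum",
    )
    return (-nll / L).item()

# Placeholder example: 50 frames, 42 output symbols (blank=0), 4-phoneme target.
log_probs = torch.randn(50, 1, 42).log_softmax(-1)
print("GOP-style score:", ctc_gop_score(log_probs, torch.tensor([7, 13, 22, 5])))
```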
[372] Enhancing Lung Disease Diagnosis via Semi-Supervised Machine Learning
Xiaoran Xu, In-Ho Ra, Ravi Sankar
Main category: eess.AS
TL;DR: This study applies semi-supervised learning methods (MixMatch, Co-Refinement, Co-Refurbishing) to MFCC+CNN models for lung sound detection, achieving 92.9% accuracy with 3.8% improvement over baseline, addressing the challenge of limited labeled data in lung disease diagnosis.
Details
Motivation: Traditional lung disease diagnostic methods (for lung cancer and COPD) are costly, time-consuming, and invasive. There's a need for non-invasive detection methods that can work effectively with limited manually annotated data due to individual differences and insufficient labeled datasets.Method: The researchers used a combination of MFCC (Mel-Frequency Cepstral Coefficients) feature extraction with CNN (Convolutional Neural Network) as the baseline model, then enhanced it with semi-supervised learning modules including MixMatch, Co-Refinement, and Co-Refurbishing to reduce dependence on manual annotations.
Result: The enhanced MFCC+CNN model with semi-supervised learning modules achieved 92.9% accuracy, representing a 3.8% improvement over the baseline model without these modules.
Conclusion: Semi-supervised learning methods can significantly improve lung sound detection performance while reducing reliance on labeled data. This approach successfully addresses key challenges in lung disease detection including individual patient differences and limited annotated datasets, offering a promising direction for non-invasive lung disease diagnosis.
Abstract: Lung diseases, including lung cancer and COPD, are significant health concerns globally. Traditional diagnostic methods can be costly, time-consuming, and invasive. This study investigates the use of semi-supervised learning methods for lung sound signal detection using a model combination of MFCC+CNN. By introducing semi-supervised learning modules such as MixMatch, Co-Refinement, and Co-Refurbishing, we aim to enhance the detection performance while reducing dependence on manual annotations. With the add-on semi-supervised modules, the accuracy rate of the MFCC+CNN model is 92.9%, an increase of 3.8% over the baseline model. The research contributes to the field of lung disease sound detection by addressing challenges such as individual differences and insufficient labeled data.
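A minimal sketch of the MFCC+CNN baseline (the MixMatch / Co-Refinement / Co-Refurbishing modules are not shown); the audio path, sample rate, and network sizes are placeholders.

```python
# MFCC extraction (librosa) feeding a small CNN classifier (PyTorch); placeholder sizes.
import librosa
import torch
import torch.nn as nn

def mfcc_features(path: str, sr: int = 16000, n_mfcc: int = 40, frames: int = 256):
    y, sr = librosa.load(path, sr=sr)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)        # (n_mfcc, T)
    m = librosa.util.fix_length(m, size=frames, axis=1)        # pad/trim along time
    return torch.from_numpy(m).float().unsqueeze(0)            # (1, n_mfcc, frames)

class LungSoundCNN(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):                                      # x: (B, 1, n_mfcc, frames)
        return self.head(self.features(x).flatten(1))

model = LungSoundCNN()
logits = model(torch.randn(8, 1, 40, 256))                     # placeholder batch
```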
[373] Technical report: Impact of Duration Prediction on Speaker-specific TTS for Indian Languages
Isha Pandey, Pranav Gaikwad, Amruta Parulekar, Ganesh Ramakrishnan
Main category: eess.AS
TL;DR: This paper investigates duration prediction strategies in speech generation for low-resource Indian languages, comparing different approaches using a Continuous Normalizing Flow model to understand trade-offs between intelligibility and speaker characteristics preservation.
Details
Motivation: High-quality speech generation for low-resource Indian languages faces challenges due to limited data and diverse linguistic structures. While some recent approaches omit explicit duration modeling, the authors want to explore its impact in linguistically rich and data-scarce Indian language contexts to understand its continued value.Method: The authors train a non-autoregressive Continuous Normalizing Flow (CNF) based speech model using publicly available Indian language data and evaluate multiple duration prediction strategies for zero-shot, speaker-specific generation through comparative analysis on speech-infilling tasks.
Result: The comparative analysis reveals nuanced trade-offs: infilling-based predictors improve intelligibility in some languages, while speaker-prompted predictors better preserve speaker characteristics in others. The findings show language-specific preferences for different duration prediction strategies.
Conclusion: The study demonstrates the continued value of interpretable components like duration prediction in adapting advanced generative architectures to low-resource, multilingual settings. The findings inform the design and selection of duration strategies tailored to specific languages and tasks, highlighting the importance of considering linguistic diversity in speech generation models.
Abstract: High-quality speech generation for low-resource languages, such as many Indian languages, remains a significant challenge due to limited data and diverse linguistic structures. Duration prediction is a critical component in many speech generation pipelines, playing a key role in modeling prosody and speech rhythm. While some recent generative approaches choose to omit explicit duration modeling, often at the cost of longer training times. We retain and explore this module to better understand its impact in the linguistically rich and data-scarce landscape of India. We train a non-autoregressive Continuous Normalizing Flow (CNF) based speech model using publicly available Indian language data and evaluate multiple duration prediction strategies for zero-shot, speaker-specific generation. Our comparative analysis on speech-infilling tasks reveals nuanced trade-offs: infilling based predictors improve intelligibility in some languages, while speaker-prompted predictors better preserve speaker characteristics in others. These findings inform the design and selection of duration strategies tailored to specific languages and tasks, underscoring the continued value of interpretable components like duration prediction in adapting advanced generative architectures to low-resource, multilingual settings.
[374] Clustering-based hard negative sampling for supervised contrastive speaker verification
Piotr Masztalski, Michał Romaniuk, Jakub Żak, Mateusz Matuszewski, Konrad Kowalczyk
Main category: eess.AS
TL;DR: The paper proposes CHNS, a clustering-based hard negative sampling method for supervised contrastive speaker verification that clusters similar speaker embeddings and optimizes batch composition to improve performance by up to 18% relative EER on VoxCeleb dataset.
Details
Motivation: Traditional classification-based speaker verification approaches are being challenged by contrastive learning methods, but effective utilization of hard negative pairs (similar but different-class samples) remains a key challenge for improving verification model performance in contrastive learning frameworks.Method: CHNS (Clustering-based Hard Negative Sampling) clusters embeddings of similar speakers and adjusts batch composition to achieve an optimal ratio of hard and easy negatives during contrastive loss calculation, specifically designed for supervised contrastive speaker representation learning.
Result: CHNS outperforms baseline supervised contrastive approaches (with and without loss-based hard negative sampling) and state-of-the-art classification-based methods by up to 18% relative improvement in EER and minDCF metrics on the VoxCeleb dataset using two lightweight model architectures.
Conclusion: The clustering-based hard negative sampling approach effectively improves speaker verification performance by strategically selecting challenging negative samples, demonstrating that proper negative sampling strategies can significantly enhance contrastive learning methods for speaker verification tasks.
Abstract: In speaker verification, contrastive learning is gaining popularity as an alternative to the traditionally used classification-based approaches. Contrastive methods can benefit from an effective use of hard negative pairs, which are different-class samples particularly challenging for a verification model due to their similarity. In this paper, we propose CHNS - a clustering-based hard negative sampling method, dedicated for supervised contrastive speaker representation learning. Our approach clusters embeddings of similar speakers, and adjusts batch composition to obtain an optimal ratio of hard and easy negatives during contrastive loss calculation. Experimental evaluation shows that CHNS outperforms a baseline supervised contrastive approach with and without loss-based hard negative sampling, as well as a state-of-the-art classification-based approach to speaker verification by as much as 18 % relative EER and minDCF on the VoxCeleb dataset using two lightweight model architectures.
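A hedged sketch of clustering-based hard-negative batch composition in the spirit of CHNS: cluster speaker embeddings, then fill each batch with a mix of same-cluster (hard) and other-cluster (easy) speakers; the cluster count and hard/easy ratio below are placeholders, not the paper's settings.

```python
# Cluster speaker-level embeddings, then compose batches with a hard/easy negative mix.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_speakers, dim = 200, 192
spk_embeddings = rng.normal(size=(n_speakers, dim))    # e.g. mean embedding per speaker

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(spk_embeddings)
labels = kmeans.labels_

def sample_batch(anchor: int, batch_size: int = 32, hard_ratio: float = 0.5):
    """Return speaker ids: part from the anchor's cluster (hard), rest from other clusters (easy)."""
    same = np.where(labels == labels[anchor])[0]
    other = np.where(labels != labels[anchor])[0]
    n_hard = min(int(batch_size * hard_ratio), len(same))
    hard = rng.choice(same, size=n_hard, replace=False)
    easy = rng.choice(other, size=batch_size - n_hard, replace=False)
    return np.concatenate([hard, easy])

batch_speakers = sample_batch(anchor=3)
```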
[375] SLASH: Self-Supervised Speech Pitch Estimation Leveraging DSP-derived Absolute Pitch
Ryo Terashima, Yuma Shirahata, Masaya Kawamura
Main category: eess.AS
TL;DR: SLASH is a pitch estimation method that combines self-supervised learning (SSL) with digital signal processing (DSP) to improve speech pitch estimation by incorporating both relative and absolute pitch information, achieving better performance than baseline methods.
Details
Motivation: Conventional SSL-based pitch estimation methods primarily rely on relative pitch differences from pitch shifting, which limits their performance. There's a need to incorporate absolute pitch values and better handle aperiodic components in speech to enhance pitch estimation accuracy.Method: The method introduces: 1) a prior pitch distribution derived from DSP, 2) optimization of absolute pitch through gradient descent with loss between target and differentiable DSP-derived spectrograms, 3) a novel spectrogram generation method that skips waveform generation for stability, and 4) differentiable DSP for predicting aperiodic speech components.
Result: SLASH outperformed both baseline DSP and SSL-based pitch estimation methods, demonstrating the effectiveness of integrating SSL and DSP approaches for speech pitch estimation.
Conclusion: The effective integration of self-supervised learning with digital signal processing techniques enables superior pitch estimation performance by combining the strengths of both relative pitch learning and absolute pitch constraints, while properly handling aperiodic speech components.
Abstract: We present SLASH, a pitch estimation method of speech signals based on self-supervised learning (SSL). To enhance the performance of conventional SSL-based approaches that primarily depend on the relative pitch difference derived from pitch shifting, our method incorporates absolute pitch values by 1) introducing a prior pitch distribution derived from digital signal processing (DSP), and 2) optimizing absolute pitch through gradient descent with a loss between the target and differentiable DSP-derived spectrograms. To stabilize the optimization, a novel spectrogram generation method is used that skips complicated waveform generation. In addition, the aperiodic components in speech are accurately predicted through differentiable DSP, enhancing the method’s applicability to speech signal processing. Experimental results showed that the proposed method outperformed both baseline DSP and SSL-based pitch estimation methods, attributed to the effective integration of SSL and DSP.
[376] ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting
Yu Zhang, Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Tao Jin, Zhou Zhao
Main category: eess.AS
TL;DR: This paper introduces the first multimodal immersive spatial drama generation system that creates continuous multi-speaker binaural speech with dramatic prosody for AR/VR applications, along with the first dataset MRSDrama and model ISDrama for this task.
Details
Motivation: Current drama generation lacks spatial audio capabilities and multimodal integration for immersive experiences in AR/VR applications. The high cost of data collection and the need to simultaneously model spatial information and dramatic prosody from multimodal inputs creates significant challenges that haven't been addressed before.Method: The authors propose ISDrama model with two key components: 1) Multimodal Pose Encoder using contrastive learning that accounts for Doppler effects from moving speakers to extract unified pose information, and 2) Immersive Drama Transformer, a flow-based mamba-transformer with Drama-MOE for expert selection and enhanced prosody/pose control, plus context-consistent classifier-free guidance for coherent drama generation.
Result: ISDrama outperforms baseline models on both objective and subjective metrics for multimodal immersive spatial drama generation. The authors also constructed MRSDrama, the first multimodal recorded spatial drama dataset containing binaural drama audios, scripts, videos, geometric poses, and textual prompts.
Conclusion: This work successfully addresses the novel challenge of multimodal immersive spatial drama generation by creating the first dataset and model for this task, demonstrating superior performance and opening new possibilities for AR/VR applications with spatially-aware dramatic audio content.
Abstract: Multimodal immersive spatial drama generation focuses on creating continuous multi-speaker binaural speech with dramatic prosody based on multimodal prompts, with potential applications in AR, VR, and others. This task requires simultaneous modeling of spatial information and dramatic prosody based on multimodal inputs, with high data collection costs. To the best of our knowledge, our work is the first attempt to address these challenges. We construct MRSDrama, the first multimodal recorded spatial drama dataset, containing binaural drama audios, scripts, videos, geometric poses, and textual prompts. Then, we propose ISDrama, the first immersive spatial drama generation model through multimodal prompting. ISDrama comprises these primary components: 1) Multimodal Pose Encoder, based on contrastive learning, considering the Doppler effect caused by moving speakers to extract unified pose information from multimodal prompts. 2) Immersive Drama Transformer, a flow-based mamba-transformer model that generates high-quality drama, incorporating Drama-MOE to select proper experts for enhanced prosody and pose control. We also design a context-consistent classifier-free guidance strategy to coherently generate complete drama. Experimental results show that ISDrama outperforms baseline models on objective and subjective metrics. The demos are available at https://aaronz345.github.io/ISDramaDemo. We provide the dataset and the evaluation code at https://huggingface.co/datasets/AaronZ345/MRSDrama and https://github.com/AaronZ345/ISDrama.
[377] Accent Normalization Using Self-Supervised Discrete Tokens with Non-Parallel Data
Qibing Bai, Sho Inoue, Shuai Wang, Zhongjie Jiang, Yannan Wang, Haizhou Li
Main category: eess.AS
TL;DR: A novel accent normalization system that converts foreign-accented speech to native-like speech using self-supervised discrete tokens and non-parallel training data, while preserving speaker identity and timbre.
Details
Motivation: Current accent normalization methods face challenges in preserving speaker identity while effectively reducing accentedness. The need for a system that can work with non-parallel training data and maintain naturalness across multiple English accents drives this research.Method: The proposed pipeline uses self-supervised discrete tokens extracted from source speech, converts them through a dedicated model, and synthesizes output using flow matching. The system operates on token-level representations rather than frame-to-frame processing and includes two duration preservation methods for specific applications like dubbing.
Result: The method demonstrates superior performance compared to frame-to-frame baselines across three key metrics: naturalness, accentedness reduction, and timbre preservation. Token-level phonetic analysis validates the effectiveness of the token-based approach across multiple English accents.
Conclusion: The token-based approach with flow matching synthesis successfully achieves accent normalization while preserving speaker identity. The system’s effectiveness is validated through both objective phonetic analysis and superior performance metrics, making it suitable for practical applications including dubbing scenarios.
Abstract: Accent normalization converts foreign-accented speech into native-like speech while preserving speaker identity. We propose a novel pipeline using self-supervised discrete tokens and non-parallel training data. The system extracts tokens from source speech, converts them through a dedicated model, and synthesizes the output using flow matching. Our method demonstrates superior performance over a frame-to-frame baseline in naturalness, accentedness reduction, and timbre preservation across multiple English accents. Through token-level phonetic analysis, we validate the effectiveness of our token-based approach. We also develop two duration preservation methods, suitable for applications such as dubbing.
[378] Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis
Yifan Yang, Shujie Liu, Jinyu Li, Yuxuan Hu, Haibin Wu, Hui Wang, Jianwei Yu, Lingwei Meng, Haiyang Sun, Yanqing Liu, Yan Lu, Kai Yu, Xie Chen
Main category: eess.AS
TL;DR: This paper introduces PALLE, a two-stage zero-shot text-to-speech system that combines pseudo-autoregressive (PAR) and non-autoregressive (NAR) modeling to achieve faster inference while maintaining high speech quality and controllability.
Details
Motivation: Existing zero-shot TTS systems face a trade-off: autoregressive models are slow and lack duration control, while non-autoregressive models lack temporal modeling and require complex designs. The paper aims to unify the benefits of both approaches.Method: The authors propose a pseudo-autoregressive (PAR) codec language modeling approach that generates dynamic-length spans at fixed time steps. PALLE uses a two-stage system: (1) PAR progressively generates speech tokens along time dimension, predicting all positions in parallel but retaining only the left-most span, (2) NAR refinement iteratively improves low-confidence tokens using global contextual information.
Result: PALLE outperforms state-of-the-art systems (F5-TTS, E2-TTS, MaskGCT) on LibriSpeech test-clean set in speech quality, speaker similarity, and intelligibility metrics, while achieving up to 10x faster inference speed despite being trained on smaller LibriTTS dataset compared to competitors’ large-scale training data.
Conclusion: The PAR approach successfully unifies autoregressive and non-autoregressive modeling benefits, enabling PALLE to achieve superior performance in both quality and speed for zero-shot text-to-speech synthesis, demonstrating the effectiveness of combining explicit temporal modeling with parallel generation.
Abstract: Recent zero-shot text-to-speech (TTS) systems face a common dilemma: autoregressive (AR) models suffer from slow generation and lack duration controllability, while non-autoregressive (NAR) models lack temporal modeling and typically require complex designs. In this paper, we introduce a novel pseudo-autoregressive (PAR) codec language modeling approach that unifies AR and NAR modeling. Combining explicit temporal modeling from AR with parallel generation from NAR, PAR generates dynamic-length spans at fixed time steps. Building on PAR, we propose PALLE, a two-stage TTS system that leverages PAR for initial generation followed by NAR refinement. In the first stage, PAR progressively generates speech tokens along the time dimension, with each step predicting all positions in parallel but only retaining the left-most span. In the second stage, low-confidence tokens are iteratively refined in parallel, leveraging the global contextual information. Experiments demonstrate that PALLE, trained on LibriTTS, outperforms state-of-the-art systems trained on large-scale data, including F5-TTS, E2-TTS, and MaskGCT, on the LibriSpeech test-clean set in terms of speech quality, speaker similarity, and intelligibility, while achieving up to ten times faster inference speed. Audio samples are available at https://microsoft.com/research/project/vall-e-x/palle.
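A toy, hedged sketch of the pseudo-autoregressive first stage: at each step all remaining positions are predicted in parallel, but only the left-most span is committed; the model below is a stand-in, not PALLE.

```python
# Toy PAR decoding loop: parallel prediction, commit only the left-most span per step.
import torch
import torch.nn as nn

vocab, total_len, span = 1024, 60, 10
model = nn.Sequential(nn.Embedding(vocab + 1, 256), nn.Linear(256, vocab))   # placeholder
MASK = vocab                                                                  # mask token id

tokens = torch.full((1, total_len), MASK, dtype=torch.long)
for start in range(0, total_len, span):
    logits = model(tokens)                     # (1, total_len, vocab): parallel prediction
    pred = logits.argmax(-1)
    tokens[:, start:start + span] = pred[:, start:start + span]   # keep left-most span only

# `tokens` would then be refined by the NAR second stage using global context.
```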
[379] Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion
Yu Zhang, Baotong Tian, Zhiyao Duan
Main category: eess.AS
TL;DR: Conan is a real-time zero-shot voice conversion model that converts voice while preserving content and adapting to unseen speakers using three components: Stream Content Extractor, Adaptive Style Encoder, and Causal Shuffle Vocoder.
Details
Motivation: Current voice conversion models face challenges in real-time scenarios: they struggle to preserve semantic fidelity under real-time constraints, fail to deliver natural-sounding conversions, and cannot adapt effectively to unseen speaker characteristics for zero-shot applications.Method: The paper proposes Conan, a chunkwise online zero-shot voice conversion system with three core components: 1) Stream Content Extractor using Emformer for low-latency streaming content encoding, 2) Adaptive Style Encoder for extracting fine-grained stylistic features from reference speech, and 3) Causal Shuffle Vocoder implementing fully causal HiFiGAN with pixel-shuffle mechanism.
Result: Experimental evaluations show that Conan outperforms baseline models in both subjective and objective metrics for zero-shot online voice conversion tasks.
Conclusion: Conan successfully addresses the key challenges in real-time voice conversion by preserving source content while matching reference voice timbre and styles, demonstrating superior performance compared to existing baseline models.
Abstract: Zero-shot online voice conversion (VC) holds significant promise for real-time communications and entertainment. However, current VC models struggle to preserve semantic fidelity under real-time constraints, deliver natural-sounding conversions, and adapt effectively to unseen speaker characteristics. To address these challenges, we introduce Conan, a chunkwise online zero-shot voice conversion model that preserves the content of the source while matching the voice timbre and styles of reference speech. Conan comprises three core components: 1) a Stream Content Extractor that leverages Emformer for low-latency streaming content encoding; 2) an Adaptive Style Encoder that extracts fine-grained stylistic features from reference speech for enhanced style adaptation; 3) a Causal Shuffle Vocoder that implements a fully causal HiFiGAN using a pixel-shuffle mechanism. Experimental evaluations demonstrate that Conan outperforms baseline models in subjective and objective metrics. Audio samples can be found at https://aaronz345.github.io/ConanDemo.
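As a hedged illustration of one ingredient, here is a 1D pixel-shuffle upsampling block of the kind a causal HiFiGAN-style vocoder can use; channel sizes and the upsample factor are placeholders, not Conan's configuration.

```python
# 1D pixel-shuffle upsampling with a left-padded (causal) convolution.
import torch
import torch.nn as nn

class PixelShuffle1d(nn.Module):
    def __init__(self, upscale: int):
        super().__init__()
        self.r = upscale

    def forward(self, x):                       # x: (B, C*r, T)
        b, cr, t = x.shape
        c = cr // self.r
        x = x.view(b, c, self.r, t)             # split channels into (C, r)
        x = x.permute(0, 1, 3, 2).reshape(b, c, t * self.r)   # interleave along time
        return x                                # (B, C, T*r)

class CausalUpBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, upscale: int, kernel: int = 5):
        super().__init__()
        self.pad = nn.ConstantPad1d((kernel - 1, 0), 0.0)     # left-only padding => causal
        self.conv = nn.Conv1d(in_ch, out_ch * upscale, kernel)
        self.shuffle = PixelShuffle1d(upscale)

    def forward(self, x):
        return self.shuffle(self.conv(self.pad(x)))

x = torch.randn(2, 128, 50)                     # (batch, channels, frames)
y = CausalUpBlock(128, 64, upscale=4)(x)        # -> (2, 64, 200)
```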
eess.IV
[380] A Hybrid CNN-VSSM model for Multi-View, Multi-Task Mammography Analysis: Robust Diagnosis with Attention-Based Fusion
Yalda Zafari, Roaa Elalfy, Mohamed Mabrok, Somaya Al-Maadeed, Tamer Khattab, Essam A. Rashed
Main category: eess.IV
TL;DR: A novel multi-view, multitask deep learning framework combining CNN and Visual State Space Models (VSSMs) processes all four mammography views to jointly predict diagnostic labels and BI-RADS scores, achieving superior performance with AUC of 0.9967 for binary classification while demonstrating the potential and limitations of multitask learning in mammography analysis.
Details
Motivation: Existing AI approaches for mammography screening are limited by focusing on single view inputs or single-task outputs, which reduces their clinical utility. Early and accurate interpretation of screening mammograms remains challenging due to subtle imaging findings and diagnostic ambiguity, necessitating a more comprehensive approach that can handle multiple views and tasks simultaneously.Method: The authors propose a hybrid CNN-VSSM backbone that combines convolutional encoders for local feature extraction with Visual State Space Models for capturing global contextual dependencies. The framework processes all four standard mammography views and includes a gated attention-based fusion module that dynamically weights information across views and handles missing data. The system jointly predicts diagnostic labels and BI-RADS scores for each breast in a multitask learning setting.
Result: The hybrid models consistently outperformed baseline CNN and VSSM architectures across all tasks. For binary BI-RADS 1 vs. 5 classification, the shared hybrid model achieved AUC of 0.9967 and F1 score of 0.9830. In ternary classification, it attained F1 score of 0.7790, while the five-class BI-RADS task reached F1 score of 0.4904. Performance decreased with increasing task complexity, highlighting both capabilities and limitations.
Conclusion: The proposed hybrid framework effectively combines the strengths of CNNs and VSSMs for mammography analysis, demonstrating superior performance over baseline models. The results underscore both the potential of multitask learning for improving diagnostic performance and its limitations in more complex classification tasks, providing a foundation for clinically meaningful mammography analysis while identifying areas for future improvement.
Abstract: Early and accurate interpretation of screening mammograms is essential for effective breast cancer detection, yet it remains a complex challenge due to subtle imaging findings and diagnostic ambiguity. Many existing AI approaches fall short by focusing on single view inputs or single-task outputs, limiting their clinical utility. To address these limitations, we propose a novel multi-view, multitask hybrid deep learning framework that processes all four standard mammography views and jointly predicts diagnostic labels and BI-RADS scores for each breast. Our architecture integrates a hybrid CNN VSSM backbone, combining convolutional encoders for rich local feature extraction with Visual State Space Models (VSSMs) to capture global contextual dependencies. To improve robustness and interpretability, we incorporate a gated attention-based fusion module that dynamically weights information across views, effectively handling cases with missing data. We conduct extensive experiments across diagnostic tasks of varying complexity, benchmarking our proposed hybrid models against baseline CNN architectures and VSSM models in both single task and multi task learning settings. Across all tasks, the hybrid models consistently outperform the baselines. In the binary BI-RADS 1 vs. 5 classification task, the shared hybrid model achieves an AUC of 0.9967 and an F1 score of 0.9830. For the more challenging ternary classification, it attains an F1 score of 0.7790, while in the five-class BI-RADS task, the best F1 score reaches 0.4904. These results highlight the effectiveness of the proposed hybrid framework and underscore both the potential and limitations of multitask learning for improving diagnostic performance and enabling clinically meaningful mammography analysis.
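A hedged sketch of gated attention-based fusion over the four view embeddings (L-CC, L-MLO, R-CC, R-MLO), with a mask for missing views; the gating form and dimensions are assumptions rather than the paper's exact module.

```python
# Gated attention fusion across four per-view embeddings, masking absent views.
import torch
import torch.nn as nn

class GatedViewFusion(nn.Module):
    def __init__(self, d: int = 512):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, 1))
        self.gate = nn.Sequential(nn.Linear(d, d), nn.Sigmoid())

    def forward(self, views, mask):             # views: (B, 4, d), mask: (B, 4) with 1=present
        scores = self.attn(views).squeeze(-1)   # (B, 4) unnormalized attention
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)   # (B, 4, 1)
        gated = views * self.gate(views)        # element-wise gating per view
        return (weights * gated).sum(dim=1)     # (B, d) fused representation

views = torch.randn(8, 4, 512)
mask = torch.tensor([[1, 1, 1, 1]] * 7 + [[1, 1, 0, 1]])   # one case with a missing view
fused = GatedViewFusion()(views, mask)
```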
[381] Harmonization in Magnetic Resonance Imaging: A Survey of Acquisition, Image-level, and Feature-level Methods
Qinqin Yang, Firoozeh Shomal-Zadeh, Ali Gholipour
Main category: eess.IV
TL;DR: This review comprehensively examines medical image harmonization methods, particularly for MRI, to address batch effects and site-related variability that impair data comparability and model generalizability across different scanners and imaging protocols.
Details
Motivation: Medical imaging data collected across different scanners, protocols, or sites exhibit substantial heterogeneity ("batch effects" or "site effects") that obscure biological signals, reduce reproducibility and statistical power, and severely impair the generalizability of learning-based models across datasets.Method: The paper systematically categorizes harmonization approaches into: (1) prospective acquisition and reconstruction strategies, (2) retrospective image-level and feature-level methods, and (3) traveling-subject-based techniques. The review focuses on representative methods with particular emphasis on deep learning-based approaches across the full imaging pipeline.
Result: The review provides a comprehensive overview of key concepts, methodological advances, publicly available datasets, and current challenges in medical image harmonization, with systematic coverage of the complete imaging pipeline and various harmonization strategies.
Conclusion: The paper identifies major remaining challenges in the field and outlines promising avenues for future research in medical image harmonization, emphasizing the need to eliminate site-related biases while preserving meaningful biological information to improve data comparability and consistency.
Abstract: Modern medical imaging technologies have greatly advanced neuroscience research and clinical diagnostics. However, imaging data collected across different scanners, acquisition protocols, or imaging sites often exhibit substantial heterogeneity, known as “batch effects” or “site effects”. These non-biological sources of variability can obscure true biological signals, reduce reproducibility and statistical power, and severely impair the generalizability of learning-based models across datasets. Image harmonization aims to eliminate or mitigate such site-related biases while preserving meaningful biological information, thereby improving data comparability and consistency. This review provides a comprehensive overview of key concepts, methodological advances, publicly available datasets, current challenges, and future directions in the field of medical image harmonization, with a focus on magnetic resonance imaging (MRI). We systematically cover the full imaging pipeline, and categorize harmonization approaches into prospective acquisition and reconstruction strategies, retrospective image-level and feature-level methods, and traveling-subject-based techniques. Rather than providing an exhaustive survey, we focus on representative methods, with particular emphasis on deep learning-based approaches. Finally, we summarize the major challenges that remain and outline promising avenues for future research.
[382] MyGO: Make your Goals Obvious, Avoiding Semantic Confusion in Prostate Cancer Lesion Region Segmentation
Zhengcheng Lin, Zuobin Ying, Zhenyu Li, Zhenyu Liu, Jian Lu, Weiping Ding
Main category: eess.IV
TL;DR: This paper proposes a Pixel Anchor Module with self-attention-based Top_k selection strategy to improve prostate cancer lesion segmentation by addressing semantic confusion between lesion and non-lesion areas, achieving state-of-the-art performance on the PI-CAI dataset.
Details
Motivation: Existing medical image segmentation methods struggle with accurate prostate cancer lesion detection due to high semantic homogeneity between lesion and non-lesion areas, leading to semantic confusion that hinders accurate diagnosis and treatment planning.Method: The authors developed a novel Pixel Anchor Module that discovers sparse feature anchors to capture global contextual information, combined with a self-attention-based Top_k selection strategy for anchor refinement and focal loss function to address class imbalance issues.
Result: The method achieved state-of-the-art performance on the PI-CAI dataset with 69.73% IoU and 74.32% Dice scores, significantly improving prostate cancer lesion detection accuracy compared to existing approaches.
Conclusion: The proposed Pixel Anchor Module with Top_k selection strategy effectively enhances nonlinear representation capacity and segmentation accuracy for prostate cancer lesions, demonstrating superior performance in addressing semantic confusion challenges in medical image segmentation.
Abstract: Early diagnosis and accurate identification of lesion location and progression in prostate cancer (PCa) are critical for assisting clinicians in formulating effective treatment strategies. However, due to the high semantic homogeneity between lesion and non-lesion areas, existing medical image segmentation methods often struggle to accurately comprehend lesion semantics, resulting in the problem of semantic confusion. To address this challenge, we propose a novel Pixel Anchor Module, which guides the model to discover a sparse set of feature anchors that serve to capture and interpret global contextual information. This mechanism enhances the model’s nonlinear representation capacity and improves segmentation accuracy within lesion regions. Moreover, we design a self-attention-based Top_k selection strategy to further refine the identification of these feature anchors, and incorporate a focal loss function to mitigate class imbalance, thereby facilitating more precise semantic interpretation across diverse regions. Our method achieves state-of-the-art performance on the PI-CAI dataset, demonstrating 69.73% IoU and 74.32% Dice scores, and significantly improving prostate cancer lesion detection.
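A hedged sketch of sparse anchor selection: score pixels with a small attention head, keep the Top_k as anchors, and let every pixel attend to them for global context; dimensions and k are placeholders, not the paper's values.

```python
# Top-k pixel-anchor selection followed by pixel-to-anchor attention (placeholder sizes).
import torch
import torch.nn as nn

class PixelAnchorAttention(nn.Module):
    def __init__(self, d: int = 64, k: int = 32):
        super().__init__()
        self.score = nn.Linear(d, 1)
        self.q, self.kk, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.k = k

    def forward(self, feats):                          # feats: (B, N, d), N = H*W pixels
        s = self.score(feats).squeeze(-1)              # (B, N) anchor scores
        idx = s.topk(self.k, dim=-1).indices           # (B, k) anchor indices
        anchors = torch.gather(feats, 1, idx.unsqueeze(-1).expand(-1, -1, feats.size(-1)))
        attn = torch.softmax(
            self.q(feats) @ self.kk(anchors).transpose(1, 2) / feats.size(-1) ** 0.5, dim=-1
        )                                              # (B, N, k) pixel-to-anchor attention
        return feats + attn @ self.v(anchors)          # residual global-context update

feats = torch.randn(2, 64 * 64, 64)                    # flattened feature map
out = PixelAnchorAttention()(feats)
```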
[383] A Versatile Pathology Co-pilot via Reasoning Enhanced Multimodal Large Language Model
Zhe Xu, Ziyi Liu, Junlin Hou, Jiabo Ma, Cheng Jin, Yihui Wang, Zhixuan Chen, Zhengyu Zhang, Zhengrui Guo, Fengtao Zhou, Yingxue Xu, Xi Wang, Ronald Cheong Kin Chan, Li Liang, Hao Chen
Main category: eess.IV
TL;DR: SmartPath-R1 is a versatile multimodal large language model for computational pathology that can handle both region-of-interest (ROI) and whole-slide-image (WSI) level tasks with enhanced reasoning capabilities, using scale-dependent supervised fine-tuning and task-aware reinforcement fine-tuning to overcome the limitations of existing chain-of-thought dependent approaches.
Details
Motivation: Current multimodal large language models in pathology have constrained reasoning capabilities due to reliance on expensive chain-of-thought annotations and are limited to simple visual question answering at ROI level, failing to address the full spectrum of diagnostic needs including classification, detection, segmentation, and WSI analysis required in clinical practice.Method: The authors developed SmartPath-R1 using a framework that combines scale-dependent supervised fine-tuning and task-aware reinforcement fine-tuning to leverage intrinsic MLLM knowledge without requiring chain-of-thought supervision. The model integrates multiscale and multitask analysis through a mixture-of-experts mechanism for dynamic processing across diverse pathology tasks.
Result: SmartPath-R1 was trained and evaluated on a large-scale dataset comprising 2.3M ROI samples and 188K WSI samples. Extensive experiments across 72 tasks validated the effectiveness and superiority of the proposed approach, demonstrating the model’s capability to simultaneously handle both ROI-level and WSI-level tasks with robust pathological reasoning.
Conclusion: This work represents a significant advancement toward developing versatile, reasoning-enhanced AI systems for precision pathology by successfully creating a multimodal large language model that can address the comprehensive diagnostic needs in computational pathology while overcoming the limitations of existing approaches.
Abstract: Multimodal large language models (MLLMs) have emerged as powerful tools for computational pathology, offering unprecedented opportunities to integrate pathological images with language context for comprehensive diagnostic analysis. These models hold particular promise for automating complex tasks that traditionally require expert interpretation of pathologists. However, current MLLM approaches in pathology demonstrate significantly constrained reasoning capabilities, primarily due to their reliance on expensive chain-of-thought annotations. Additionally, existing methods remain limited to simplex application of visual question answering (VQA) at region-of-interest (ROI) level, failing to address the full spectrum of diagnostic needs such as ROI classification, detection, segmentation, whole-slide-image (WSI) classification and VQA in clinical practice. In this study, we present SmartPath-R1, a versatile MLLM capable of simultaneously addressing both ROI-level and WSI-level tasks while demonstrating robust pathological reasoning capability. Our framework combines scale-dependent supervised fine-tuning and task-aware reinforcement fine-tuning, which circumvents the requirement for chain-of-thought supervision by leveraging the intrinsic knowledge within MLLM. Furthermore, SmartPath-R1 integrates multiscale and multitask analysis through a mixture-of-experts mechanism, enabling dynamic processing for diverse tasks. We curate a large-scale dataset comprising 2.3M ROI samples and 188K WSI samples for training and evaluation. Extensive experiments across 72 tasks validate the effectiveness and superiority of the proposed approach. This work represents a significant step toward developing versatile, reasoning-enhanced AI systems for precision pathology.
[384] Efficient and Robust Semantic Image Communication via Stable Cascade
Bilal Khalid, Pedro Freire, Sergei K. Turitsyn, Jaroslaw E. Prilepsky
Main category: eess.IV
TL;DR: A novel Semantic Image Communication framework inspired by Stable Cascade that uses extremely compact latent embeddings (0.29% of original size) to achieve superior reconstruction quality and 3-16x faster inference compared to existing diffusion-based approaches.
Details
Motivation: Existing Diffusion Model-based Semantic Image Communication systems suffer from slow inference speed and generation randomness, which significantly limit their reliability and practical deployment in real-world applications.Method: The paper proposes a SIC framework inspired by Stable Cascade architecture that uses extremely compact latent image embeddings as conditioning for the diffusion process, dramatically reducing transmission overhead while maintaining reconstruction quality.
Result: The method compresses transmitted embeddings to just 0.29% of original image size, outperforms three benchmark approaches (GESCO, Img2Img-SC, JPEG2000+LDPC) in reconstruction quality under noisy channels, and achieves 3x faster reconstruction for 512x512 images and 16x faster for 1024x1024 images compared to Img2Img-SC.
Conclusion: The proposed framework successfully addresses the key limitations of diffusion-based SIC systems by achieving both superior compression efficiency and computational performance while maintaining high reconstruction quality under challenging channel conditions.
Abstract: Diffusion Model (DM) based Semantic Image Communication (SIC) systems face significant challenges, such as slow inference speed and generation randomness, that limit their reliability and practicality. To overcome these issues, we propose a novel SIC framework inspired by Stable Cascade, where extremely compact latent image embeddings are used as conditioning to the diffusion process. Our approach drastically reduces the data transmission overhead, compressing the transmitted embedding to just 0.29% of the original image size. It outperforms three benchmark approaches - the diffusion SIC model conditioned on segmentation maps (GESCO), the recent Stable Diffusion (SD)-based SIC framework (Img2Img-SC), and the conventional JPEG2000 + LDPC coding - by achieving superior reconstruction quality under noisy channel conditions, as validated across multiple metrics. Notably, it also delivers significant computational efficiency, enabling over 3x faster reconstruction for 512 x 512 images and more than 16x faster for 1024 x 1024 images as compared to the approach adopted in Img2Img-SC.
[385] Mammo-Mamba: A Hybrid State-Space and Transformer Architecture with Sequential Mixture of Experts for Multi-View Mammography
Farnoush Bayatmakou, Reza Taleei, Nicole Simone, Arash Mohammadi
Main category: eess.IV
TL;DR: Mammo-Mamba is a novel AI framework for breast cancer detection in mammograms that combines Selective State-Space Models with transformer attention and expert-driven features, achieving superior performance while being more computationally efficient than traditional Transformer-based models.
Details
Motivation: Breast cancer remains a leading cause of cancer-related mortality among women. While current state-of-the-art multi-view mammogram classification models use Transformer architectures, their computational complexity scales quadratically with image patches, creating a need for more efficient alternatives for accurate early detection.Method: The authors propose Mammo-Mamba, which integrates Selective State-Space Models (SSMs), transformer-based attention, and expert-driven feature refinement. The framework extends MambaVision backbone by introducing Sequential Mixture of Experts (SeqMoE) mechanism through customized SecMamba blocks that enable content-adaptive feature refinement and dynamic expert gating in deeper network stages.
Result: Mammo-Mamba achieved superior classification performance across all key metrics on the CBIS-DDSM benchmark dataset while maintaining computational efficiency compared to traditional Transformer models.
Conclusion: The proposed Mammo-Mamba framework successfully addresses the computational limitations of Transformer-based mammogram classification models while achieving better performance, offering a more efficient solution for AI-powered breast cancer detection in multi-view mammograms.
Abstract: Breast cancer (BC) remains one of the leading causes of cancer-related mortality among women, despite recent advances in Computer-Aided Diagnosis (CAD) systems. Accurate and efficient interpretation of multi-view mammograms is essential for early detection, driving a surge of interest in Artificial Intelligence (AI)-powered CAD models. While state-of-the-art multi-view mammogram classification models are largely based on Transformer architectures, their computational complexity scales quadratically with the number of image patches, highlighting the need for more efficient alternatives. To address this challenge, we propose Mammo-Mamba, a novel framework that integrates Selective State-Space Models (SSMs), transformer-based attention, and expert-driven feature refinement into a unified architecture. Mammo-Mamba extends the MambaVision backbone by introducing the Sequential Mixture of Experts (SeqMoE) mechanism through its customized SecMamba block. The SecMamba is a modified MambaVision block that enhances representation learning in high-resolution mammographic images by enabling content-adaptive feature refinement. These blocks are integrated into the deeper stages of MambaVision, allowing the model to progressively adjust feature emphasis through dynamic expert gating, effectively mitigating the limitations of traditional Transformer models. Evaluated on the CBIS-DDSM benchmark dataset, Mammo-Mamba achieves superior classification performance across all key metrics while maintaining computational efficiency.
[386] MCM: Mamba-based Cardiac Motion Tracking using Sequential Images in MRI
Jiahui Yin, Xinxing Cheng, Jinming Duan, Yan Pang, Declan O’Regan, Hadrien Reynaud, Qingjie Meng
Main category: eess.IV
TL;DR: A novel Mamba-based cardiac motion tracking network (MCM) that uses sequential cardiac images to achieve smooth and temporally consistent myocardial motion tracking, outperforming existing methods that rely on single image pairs.
Details
Motivation: Existing cardiac motion tracking methods learn from single image pairs (reference + random target frame), which overlooks the continuous nature of cardiac motion and produces inconsistent, non-smooth motion estimations. There's a need for methods that incorporate temporal continuity for better cardiac function assessment.Method: The authors propose MCM (Mamba-based Cardiac Motion tracking network) with: (1) bi-directional Mamba block with bi-directional scanning mechanism for plausible deformation field estimation, (2) motion decoder that integrates information from frames adjacent to target frame for temporal coherence, and (3) leveraging Mamba’s structured state-space formulation to learn continuous myocardial dynamics without increased computational complexity.
Result: The proposed method quantitatively and qualitatively outperforms both conventional and state-of-the-art learning-based cardiac motion tracking methods on two public datasets, achieving smooth and temporally consistent motion tracking.
Conclusion: The MCM network successfully addresses the limitations of existing methods by incorporating sequential cardiac images and bi-directional processing, demonstrating superior performance in cardiac motion tracking while maintaining computational efficiency through Mamba’s structured formulation.
Abstract: Myocardial motion tracking is important for assessing cardiac function and diagnosing cardiovascular diseases, for which cine cardiac magnetic resonance (CMR) has been established as the gold standard imaging modality. Many existing methods learn motion from single image pairs consisting of a reference frame and a randomly selected target frame from the cardiac cycle. However, these methods overlook the continuous nature of cardiac motion and often yield inconsistent and non-smooth motion estimations. In this work, we propose a novel Mamba-based cardiac motion tracking network (MCM) that explicitly incorporates target image sequence from the cardiac cycle to achieve smooth and temporally consistent motion tracking. By developing a bi-directional Mamba block equipped with a bi-directional scanning mechanism, our method facilitates the estimation of plausible deformation fields. With our proposed motion decoder that integrates motion information from frames adjacent to the target frame, our method further enhances temporal coherence. Moreover, by taking advantage of Mamba’s structured state-space formulation, the proposed method learns the continuous dynamics of the myocardium from sequential images without increasing computational complexity. We evaluate the proposed method on two public datasets. The experimental results demonstrate that the proposed method quantitatively and qualitatively outperforms both conventional and state-of-the-art learning-based cardiac motion tracking methods. The code is available at https://github.com/yjh-0104/MCM.
[387] Coordinate-based Speed of Sound Recovery for Aberration-Corrected Photoacoustic Computed Tomography
Tianao Li, Manxiu Cui, Cheng Ma, Emma Alexander
Main category: eess.IV
TL;DR: This paper presents a self-supervised learning method for joint reconstruction of speed of sound (SOS) and high-quality images in photoacoustic computed tomography (PACT), achieving 35x faster processing than current state-of-the-art while removing SOS aberrations more accurately.
Details
Motivation: Conventional PACT images suffer from wavefront distortion due to heterogeneous speed of sound in tissue. While accounting for SOS effects can improve image quality, direct SOS measurement is burdensome and existing joint reconstruction methods are computationally expensive. Traditional supervised learning is inaccessible due to limited data availability in this domain.Method: The authors develop an efficient self-supervised joint reconstruction method that parametrizes the SOS using either a pixel grid or neural field (NF) and updates it directly by backpropagating gradients through a differentiable imaging forward model. This approach solves the semi-blind inverse problem without requiring labeled training data.
Result: The proposed method removes SOS aberrations more accurately and achieves 35x faster processing compared to the current state-of-the-art. The effectiveness is demonstrated quantitatively in simulation studies and qualitatively on both experimentally-collected and in vivo data.
Conclusion: The self-supervised approach successfully addresses the computational bottleneck and accuracy limitations of existing joint reconstruction methods in PACT, providing an efficient solution for recovering both SOS and high-quality images simultaneously in ring array PACT systems.
Abstract: Photoacoustic computed tomography (PACT) is a non-invasive imaging modality, similar to ultrasound, with wide-ranging medical applications. Conventional PACT images are degraded by wavefront distortion caused by the heterogeneous speed of sound (SOS) in tissue. Accounting for these effects can improve image quality and provide medically useful information, but measuring the SOS directly is burdensome and the existing joint reconstruction method is computationally expensive. Traditional supervised learning techniques are currently inaccessible in this data-starved domain. In this work, we introduce an efficient, self-supervised joint reconstruction method that recovers SOS and high-quality images for ring array PACT systems. To solve this semi-blind inverse problem, we parametrize the SOS using either a pixel grid or a neural field (NF) and update it directly by backpropagating the gradients through a differentiable imaging forward model. Our method removes SOS aberrations more accurately and 35x faster than the current SOTA. We demonstrate the success of our method quantitatively in simulation and qualitatively on experimentally-collected and in vivo data. Our code and synthetic numerical phantoms are available on our project page: https://lukeli0425.github.io/Coord-SoS-PACT/.
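A hedged sketch of the self-supervised loop: parametrize the SOS map as a pixel grid and update it by backpropagating through a differentiable forward model; the forward_model below is a pure placeholder standing in for the PACT physics the paper differentiates through.

```python
# Optimize a pixel-grid SOS map by gradient descent through a (placeholder) forward model.
import torch

grid = 128
sos = torch.full((grid, grid), 1500.0, requires_grad=True)   # m/s initial guess
sensor_data = torch.randn(256, 1024)                         # placeholder measurements

def forward_model(sos_map):
    # Stand-in differentiable mapping SOS -> predicted sensor data (not real PACT physics).
    return sos_map.mean() * torch.ones_like(sensor_data) * 1e-3

opt = torch.optim.Adam([sos], lr=1.0)
for step in range(200):
    opt.zero_grad()
    pred = forward_model(sos)
    loss = torch.nn.functional.mse_loss(pred, sensor_data)
    loss.backward()                                          # gradients flow into the SOS grid
    opt.step()
```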
[388] Vascular Segmentation of Functional Ultrasound Images using Deep Learning
Hana Sebia, Thomas Guyet, Mickaël Pereira, Marco Valdebenito, Hugues Berry, Benjamin Vidal
Main category: eess.IV
TL;DR: This paper presents the first deep learning-based segmentation tool for functional ultrasound (fUS) images that can differentiate vascular compartments (arterioles vs venules) and enable dynamic cerebral blood volume quantification, achieving 90% accuracy using UNet architectures trained on rat brain data.
Details
Motivation: Functional ultrasound (fUS) is a promising non-invasive imaging modality for measuring cerebral blood volume changes, but distinguishing arterioles from venules is challenging due to opposing blood flow directions within pixels. Current enhancement methods like ultrasound localization microscopy (ULM) are invasive and lack dynamic quantification capabilities, creating a need for automated segmentation tools.
Method: The authors developed a deep learning-based segmentation pipeline using various UNet architectures trained on fUS images of rat brains. They used ULM automatic annotation as ground truth and evaluated the models’ ability to differentiate signals from different vascular compartments using only 100 temporal frames from fUS stacks.
Result: The best performing model achieved 90% accuracy, 71% F1 score, and 0.59 IoU for vascular compartment segmentation. Models trained on resting-state data successfully generalized to visual stimulation scenarios. The pipeline demonstrated high linear correlation coefficients between predicted and actual compartment signals in both cortical and deeper brain regions.
Conclusion: This work provides a non-invasive, cost-effective alternative to ULM for fUS image analysis, enabling better interpretation of fUS data and improved understanding of vessel function. The robust segmentation performance and good generalization across different brain states demonstrate the clinical potential of this deep learning approach for functional brain imaging.
Abstract: Segmentation of medical images is a fundamental task with numerous applications. While MRI, CT, and PET modalities have significantly benefited from deep learning segmentation techniques, more recent modalities, like functional ultrasound (fUS), have seen limited progress. fUS is a non-invasive imaging method that measures changes in cerebral blood volume (CBV) with high spatio-temporal resolution. However, distinguishing arterioles from venules in fUS is challenging due to opposing blood flow directions within the same pixel. Ultrasound localization microscopy (ULM) can enhance resolution by tracking microbubble contrast agents but is invasive and lacks dynamic CBV quantification. In this paper, we introduce the first deep learning-based segmentation tool for fUS images, capable of differentiating signals from different vascular compartments, based on ULM automatic annotation and enabling dynamic CBV quantification. We evaluate various UNet architectures on fUS images of rat brains, achieving competitive segmentation performance, with 90% accuracy, a 71% F1 score, and an IoU of 0.59, using only 100 temporal frames from a fUS stack. These results are comparable to those from tubular structure segmentation in other imaging modalities. Additionally, models trained on resting-state data generalize well to images captured during visual stimulation, highlighting robustness. This work offers a non-invasive, cost-effective alternative to ULM, enhancing fUS data interpretation and improving understanding of vessel function. Our pipeline shows high linear correlation coefficients between signals from predicted and actual compartments in both cortical and deeper regions, showcasing its ability to accurately capture blood flow dynamics.
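As a rough illustration of the setup, the sketch below feeds a 100-frame fUS stack as input channels to a tiny encoder-decoder (a stand-in for the UNet variants evaluated in the paper) and trains it against a ULM-derived annotation. The 3-class labelling, image size, and architecture are simplifying assumptions.

```python
import torch
import torch.nn as nn

# Tiny encoder-decoder stand-in for the UNet variants (no skip connections,
# for brevity). Input: 100 temporal fUS frames as channels; output: 3 classes
# (assumed: background / arteriole / venule).
class TinySegNet(nn.Module):
    def __init__(self, in_ch=100, n_classes=3):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
                                 nn.MaxPool2d(2),
                                 nn.Conv2d(64, 128, 3, padding=1), nn.ReLU())
        self.dec = nn.Sequential(nn.Upsample(scale_factor=2, mode="bilinear"),
                                 nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(64, n_classes, 1))

    def forward(self, x):
        return self.dec(self.enc(x))

x = torch.randn(1, 100, 128, 128)                 # one fUS stack: 100 frames (toy size)
labels = torch.randint(0, 3, (1, 128, 128))       # ULM-derived annotation (placeholder)
logits = TinySegNet()(x)
loss = nn.CrossEntropyLoss()(logits, labels)      # standard per-pixel training objective
```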
[389] LanPaint: Training-Free Diffusion Inpainting with Asymptotically Exact and Fast Conditional Sampling
Candi Zheng, Yuan Lan, Yang Wang
Main category: eess.IV
TL;DR: LanPaint is a training-free method for partial conditional sampling in diffusion models that uses Langevin dynamics to achieve fast, accurate inpainting without requiring gradients or backpropagation.
Details
Motivation: Existing diffusion models lack efficient training-free methods for partial conditional sampling like inpainting. Current approaches suffer from inaccurate distributional matching, require expensive gradient computations, and are incompatible with fast ODE-based samplers.
Method: LanPaint leverages carefully designed Langevin dynamics to enable asymptotically exact partial conditional sampling for ODE-based and rectified flow diffusion models, providing backpropagation-free Monte Carlo sampling.
Result: The method achieves superior performance with precise partial conditioning and visually coherent inpainting across diverse tasks, demonstrating both accuracy and efficiency compared to existing approaches.
Conclusion: LanPaint successfully addresses the limitations of existing partial conditional sampling methods by providing a training-free, gradient-free solution that maintains compatibility with fast ODE samplers while achieving accurate distributional matching in inpainting tasks.
Abstract: Diffusion models excel at joint pixel sampling for image generation but lack efficient training-free methods for partial conditional sampling (e.g., inpainting with known pixels). Prior work typically formulates this as an intractable inverse problem, relying on coarse variational approximations, heuristic losses requiring expensive backpropagation, or slow stochastic sampling. These limitations preclude: (1) accurate distributional matching in inpainting results, (2) efficient inference modes without gradient, (3) compatibility with fast ODE-based samplers. To address these limitations, we propose \textbf{LanPaint}: a training-free, asymptotically exact partial conditional sampling method for ODE-based and rectified flow diffusion models. By leveraging carefully designed Langevin dynamics, LanPaint enables fast, backpropagation-free Monte Carlo sampling. Experiments demonstrate that our approach achieves superior performance with precise partial conditioning and visually coherent inpainting across diverse tasks.
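The general idea of Langevin-corrected inpainting can be sketched as follows: at a given sampling step, run a few Langevin updates on the unknown pixels using the model's score while repeatedly re-imposing the known pixels. This is a generic illustration under assumed `score_fn` and `mask` conventions, not LanPaint's exact dynamics or step schedule.

```python
import torch

# Illustrative Langevin correction on the unknown region at one sampling step,
# assuming `score_fn(x, t)` returns the model's score estimate and `mask` marks
# observed pixels (1 = known).
def langevin_inpaint_step(x, known, mask, score_fn, t, step_size=1e-3, n_iters=5):
    for _ in range(n_iters):
        noise = torch.randn_like(x)
        x = x + 0.5 * step_size * score_fn(x, t) + (step_size ** 0.5) * noise
        x = mask * known + (1 - mask) * x         # re-impose the observed pixels
    return x

# Toy usage with a Gaussian score (the true score of N(0, I)).
x = torch.randn(1, 3, 32, 32)
known = torch.zeros_like(x)
mask = torch.zeros_like(x)
mask[..., :16] = 1.0                              # left half of the image observed
x = langevin_inpaint_step(x, known, mask, lambda z, t: -z, t=0.5)
```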
[390] Beyond Single-Channel: Multichannel Signal Imaging for PPG-to-ECG Reconstruction with Vision Transformers
Xiaoyan Li, Shixin Xu, Faisal Habib, Arvind Gupta, Huaxiong Huang
Main category: eess.IV
TL;DR: This paper proposes a Vision Transformer-based method for reconstructing ECG signals from PPG using a four-channel signal image representation, achieving significant improvements over existing approaches with up to 29% reduction in PRD and 15% reduction in RMSE.
Details
Motivation: Current ECG reconstruction from PPG faces challenges in accurately capturing fine-grained waveform features. Existing single-channel PPG approaches with conventional methods have limitations in preserving temporal and physiological variations, leading to suboptimal reconstruction quality.
Method: The authors propose a Vision Transformer (ViT)-based approach using a four-channel signal image representation that includes: (1) original PPG, (2) first-order difference, (3) second-order difference, and (4) area under the curve. The ViT’s self-attention mechanism captures both inter-beat and intra-beat dependencies for robust ECG reconstruction.
Result: The method achieves up to 29% reduction in PRD (Percent Root-mean-square Difference) and 15% reduction in RMSE compared to existing 1D convolution-based approaches. The approach also shows improvements in other evaluation metrics and introduces new clinically relevant metrics including QRS area error, PR interval error, RT interval error, and RT amplitude difference error.
Conclusion: The integration of four-channel signal image representation with ViT’s self-attention mechanism enables more effective PPG feature extraction and improved modeling of beat-to-beat variations. This demonstrates PPG’s potential as a viable alternative for heart activity monitoring and opens new avenues for cyclic signal analysis and prediction.
Abstract: Reconstructing ECG from PPG is a promising yet challenging task. While recent advancements in generative models have significantly improved ECG reconstruction, accurately capturing fine-grained waveform features remains a key challenge. To address this, we propose a novel PPG-to-ECG reconstruction method that leverages a Vision Transformer (ViT) as the core network. Unlike conventional approaches that rely on single-channel PPG, our method employs a four-channel signal image representation, incorporating the original PPG, its first-order difference, second-order difference, and area under the curve. This multi-channel design enriches feature extraction by preserving both temporal and physiological variations within the PPG. By leveraging the self-attention mechanism in ViT, our approach effectively captures both inter-beat and intra-beat dependencies, leading to more robust and accurate ECG reconstruction. Experimental results demonstrate that our method consistently outperforms existing 1D convolution-based approaches, achieving up to 29% reduction in PRD and 15% reduction in RMSE. The proposed approach also produces improvements in other evaluation metrics, highlighting its robustness and effectiveness in reconstructing ECG signals. Furthermore, to ensure a clinically relevant evaluation, we introduce new performance metrics, including QRS area error, PR interval error, RT interval error, and RT amplitude difference error. Our findings suggest that integrating a four-channel signal image representation with the self-attention mechanism of ViT enables more effective extraction of informative PPG features and improved modeling of beat-to-beat variations for PPG-to-ECG mapping. Beyond demonstrating the potential of PPG as a viable alternative for heart activity monitoring, our approach opens new avenues for cyclic signal analysis and prediction.
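The four-channel representation itself is straightforward to construct. Below is a hedged sketch that stacks the raw PPG, its first- and second-order differences, and a running area under the curve; how the paper normalises these channels and arranges them into a 2-D image for the ViT is not specified here, so those details are assumptions.

```python
import numpy as np

def ppg_to_four_channels(ppg: np.ndarray) -> np.ndarray:
    """Stack the four channels described in the paper for one PPG segment.

    Channels: raw PPG, first-order difference, second-order difference, and a
    running area under the curve. Per-channel min-max normalisation is an
    assumption, not the paper's documented preprocessing.
    """
    d1 = np.gradient(ppg)                     # first-order difference
    d2 = np.gradient(d1)                      # second-order difference
    auc = np.cumsum(ppg - ppg.min())          # running area under the curve
    chans = np.stack([ppg, d1, d2, auc])
    mins = chans.min(axis=1, keepdims=True)
    maxs = chans.max(axis=1, keepdims=True)
    return (chans - mins) / (maxs - mins + 1e-8)

segment = np.sin(np.linspace(0, 8 * np.pi, 512))   # toy PPG segment
print(ppg_to_four_channels(segment).shape)         # (4, 512)
```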
[391] MRI-CORE: A Foundation Model for Magnetic Resonance Imaging
Haoyu Dong, Yuwen Chen, Hanxue Gu, Nicholas Konz, Yaqian Chen, Qihang Li, Maciej A. Mazurowski
Main category: eess.IV
TL;DR: Researchers developed MRI-CORE, a vision foundation model trained on 6+ million MRI slices from 110k+ volumes across 18 body locations, which significantly outperforms existing methods in data-restricted medical imaging tasks and is publicly available.
Details
Motivation: Training deep learning models for MRI analysis requires large amounts of labeled data, but obtaining precise medical annotations is expensive and data privacy concerns limit data sharing. This creates a bottleneck for developing automated diagnostic and prognostic tools using MRI and deep learning.
Method: The authors created MRI-CORE, a vision foundation model trained on a massive dataset of over 6 million MRI slices from more than 110,000 MRI volumes spanning 18 different body locations. They evaluated the model on various downstream tasks and analyzed which pre-training strategies are most effective for foundation models.
Result: MRI-CORE demonstrated notable improvements over state-of-the-art methods across 13 data-restricted segmentation tasks, as well as in image classification and zero-shot segmentation. The study also revealed insights about which foundation model strategies are most useful and established relationships between pre-training/downstream task data similarity and transfer learning performance.
Conclusion: MRI-CORE shows strong potential for enabling data-efficient development of AI models for medical imaging. The foundation model approach can significantly improve performance in data-restricted scenarios, making it easier to develop diagnostic tools even with limited labeled data. The model is made publicly available with a permissive license to benefit the research community.
Abstract: The widespread use of Magnetic Resonance Imaging (MRI) in combination with deep learning shows promise for many high-impact automated diagnostic and prognostic tools. However, training new models requires large amounts of labeled data, a challenge due to high cost of precise annotations and data privacy. To address this issue, we introduce the MRI-CORE, a vision foundation model trained using more than 6 million slices from over 110 thousand MRI volumes across 18 body locations. Our experiments show notable improvements in performance over state-of-the-art methods in 13 data-restricted segmentation tasks, as well as in image classification, and zero-shot segmentation, showing the strong potential of MRI-CORE to enable data-efficient development of artificial intelligence models. We also present data on which strategies yield most useful foundation models and a novel analysis relating similarity between pre-training and downstream task data with transfer learning performance. Our model is publicly available with a permissive license.
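A typical way to exploit such a foundation model in a data-restricted setting is to freeze the pretrained encoder and train only a small task head. The sketch below shows that pattern with a toy stand-in encoder; the actual MRI-CORE architecture, feature dimensions, and loading API are not described here and would come from the released repository.

```python
import torch
import torch.nn as nn

# Hedged sketch of data-efficient adaptation: a frozen pretrained encoder
# (standing in for MRI-CORE) plus a lightweight trainable segmentation head.
class FrozenEncoderSegmenter(nn.Module):
    def __init__(self, encoder: nn.Module, feat_ch: int, n_classes: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False               # keep foundation weights frozen
        self.head = nn.Conv2d(feat_ch, n_classes, kernel_size=1)

    def forward(self, x):
        with torch.no_grad():
            feats = self.encoder(x)               # (B, feat_ch, H', W') feature map
        return self.head(feats)

# Toy stand-in encoder; in practice the released MRI-CORE weights would be loaded.
encoder = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU())
model = FrozenEncoderSegmenter(encoder, feat_ch=32, n_classes=2)
print(model(torch.randn(1, 1, 64, 64)).shape)     # torch.Size([1, 2, 64, 64])
```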
[392] Advanced U-Net Architectures with CNN Backbones for Automated Lung Cancer Detection and Segmentation in Chest CT Images
Alireza Golkarieh, Kiana Kiashemshaki, Sajjad Rezvani Boroujeni, Nasibeh Asadi Isakan
Main category: eess.IV
TL;DR: This study develops U-Net architectures with CNN backbones (ResNet50, VGG16, Xception) for automated lung cancer detection and segmentation in chest CT images, achieving up to 99.1% accuracy in classification and 0.9495 Dice coefficient in segmentation.
Details
Motivation: Accurate diagnostic tools for lung cancer detection and segmentation in chest CT images are critically needed in clinical settings to support early diagnosis and clinical decision-making.
Method: The authors used a balanced dataset of 832 chest CT images (416 cancerous, 416 non-cancerous), preprocessed with CLAHE and resized to 128x128 pixels. They developed U-Net models with three CNN backbones (ResNet50, VGG16, Xception) for segmentation, followed by CNN-based classifiers and hybrid models combining CNN feature extraction with traditional ML classifiers (SVM, Random Forest, Gradient Boosting), all evaluated with 5-fold cross-validation.
Result: U-Net with ResNet50 achieved best performance for cancerous lungs (Dice: 0.9495, Accuracy: 0.9735). U-Net with VGG16 performed best for non-cancerous segmentation (Dice: 0.9532, Accuracy: 0.9513). CNN with U-Net-Xception achieved 99.1% accuracy, 99.74% recall, and 99.42% F1-score for classification. Hybrid CNN-SVM-Xception model achieved 96.7% accuracy and 97.88% F1-score.
Conclusion: Combining U-Net with advanced CNN backbones provides a powerful framework for both segmentation and classification of lung cancer in CT scans, consistently outperforming existing methods and supporting early diagnosis and clinical decision-making.
Abstract: This study investigates the effectiveness of U-Net architectures integrated with various convolutional neural network (CNN) backbones for automated lung cancer detection and segmentation in chest CT images, addressing the critical need for accurate diagnostic tools in clinical settings. A balanced dataset of 832 chest CT images (416 cancerous and 416 non-cancerous) was preprocessed using Contrast Limited Adaptive Histogram Equalization (CLAHE) and resized to 128x128 pixels. U-Net models were developed with three CNN backbones: ResNet50, VGG16, and Xception, to segment lung regions. After segmentation, CNN-based classifiers and hybrid models combining CNN feature extraction with traditional machine learning classifiers (Support Vector Machine, Random Forest, and Gradient Boosting) were evaluated using 5-fold cross-validation. Metrics included accuracy, precision, recall, F1-score, Dice coefficient, and ROC-AUC. U-Net with ResNet50 achieved the best performance for cancerous lungs (Dice: 0.9495, Accuracy: 0.9735), while U-Net with VGG16 performed best for non-cancerous segmentation (Dice: 0.9532, Accuracy: 0.9513). For classification, the CNN model using U-Net with Xception achieved 99.1 percent accuracy, 99.74 percent recall, and 99.42 percent F1-score. The hybrid CNN-SVM-Xception model achieved 96.7 percent accuracy and 97.88 percent F1-score. Compared to prior methods, our framework consistently outperformed existing models. In conclusion, combining U-Net with advanced CNN backbones provides a powerful method for both segmentation and classification of lung cancer in CT scans, supporting early diagnosis and clinical decision-making.
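The hybrid stage of the pipeline pairs deep features with classical classifiers. The sketch below mimics that step with placeholder features, an RBF-kernel SVM, and 5-fold cross-validation; the feature dimensionality and preprocessing are assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder deep features for the 832 segmented CT images; in the paper these
# would be pooled activations from the Xception (or other) backbone.
rng = np.random.default_rng(0)
features = rng.normal(size=(832, 512))
labels = np.repeat([0, 1], 416)                   # 416 non-cancerous, 416 cancerous

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(clf, features, labels, cv=5, scoring="accuracy")
print(scores.mean())                              # 5-fold cross-validated accuracy
```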
[393] EndoControlMag: Robust Endoscopic Vascular Motion Magnification with Periodic Reference Resetting and Hierarchical Tissue-aware Dual-Mask Control
An Wang, Rulin Zhou, Mengya Xu, Yiru Ye, Longfei Gou, Yiting Chang, Hao Chen, Chwee Ming Lim, Jiankun Wang, Hongliang Ren
Main category: eess.IV
TL;DR: EndoControlMag is a training-free framework that magnifies subtle vascular motions in endoscopic surgery videos using Lagrangian-based tracking with mask-conditioned magnification, featuring periodic reference resetting and hierarchical tissue-aware processing to enhance surgical precision.
Details
Motivation: Visualizing subtle vascular motions in endoscopic surgery is crucial for surgical precision and decision-making, but remains challenging due to the complex and dynamic nature of surgical scenes with occlusions, instrument disturbances, and tissue deformations.
Method: The framework uses two key modules: (1) Periodic Reference Resetting (PRR) that divides videos into overlapping clips with dynamically updated reference frames, and (2) Hierarchical Tissue-aware Magnification (HTM) with dual-mode mask dilation that tracks vessel cores and applies adaptive softening strategies (motion-based or distance-based exponential decay) to surrounding tissues.
Result: EndoControlMag significantly outperforms existing methods in both magnification accuracy and visual quality while maintaining robustness across challenging surgical conditions, as demonstrated through quantitative metrics, visual assessments, and expert surgeon evaluations on the EndoVMM24 dataset spanning four surgery types.
Conclusion: EndoControlMag provides an effective solution for vascular motion magnification in endoscopic surgery, successfully handling diverse surgical scenarios including occlusions, instrument disturbances, view changes, and vessel deformations through its training-free, adaptive framework.
Abstract: Visualizing subtle vascular motions in endoscopic surgery is crucial for surgical precision and decision-making, yet remains challenging due to the complex and dynamic nature of surgical scenes. To address this, we introduce EndoControlMag, a training-free, Lagrangian-based framework with mask-conditioned vascular motion magnification tailored to endoscopic environments. Our approach features two key modules: a Periodic Reference Resetting (PRR) scheme that divides videos into short overlapping clips with dynamically updated reference frames to prevent error accumulation while maintaining temporal coherence, and a Hierarchical Tissue-aware Magnification (HTM) framework with dual-mode mask dilation. HTM first tracks vessel cores using a pretrained visual tracking model to maintain accurate localization despite occlusions and view changes. It then applies one of two adaptive softening strategies to surrounding tissues: motion-based softening that modulates magnification strength proportional to observed tissue displacement, or distance-based exponential decay that simulates biomechanical force attenuation. This dual-mode approach accommodates diverse surgical scenarios-motion-based softening excels with complex tissue deformations while distance-based softening provides stability during unreliable optical flow conditions. We evaluate EndoControlMag on our EndoVMM24 dataset spanning four different surgery types and various challenging scenarios, including occlusions, instrument disturbance, view changes, and vessel deformations. Quantitative metrics, visual assessments, and expert surgeon evaluations demonstrate that EndoControlMag significantly outperforms existing methods in both magnification accuracy and visual quality while maintaining robustness across challenging surgical conditions. The code, dataset, and video results are available at https://szupc.github.io/EndoControlMag/.
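The distance-based softening strategy can be illustrated with a small sketch: magnification strength is maximal on the tracked vessel-core mask and decays exponentially with Euclidean distance from it. The gain `alpha` and decay constant `tau` below are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

# Hedged sketch of distance-based softening: per-pixel magnification gain that
# decays exponentially with distance from the vessel-core mask, approximating
# biomechanical force attenuation.
def distance_softened_magnification(core_mask: np.ndarray,
                                    alpha: float = 10.0,
                                    tau: float = 15.0) -> np.ndarray:
    """core_mask: binary (H, W) mask of vessel cores; returns a per-pixel gain map."""
    dist = distance_transform_edt(1 - core_mask)   # distance to the nearest core pixel
    return 1.0 + (alpha - 1.0) * np.exp(-dist / tau)

mask = np.zeros((256, 256), dtype=np.uint8)
mask[120:136, 120:136] = 1                          # toy vessel core
gain = distance_softened_magnification(mask)
# gain equals alpha on the core and smoothly falls back to 1 with distance
```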
[394] SFNet: A Spatial-Frequency Domain Deep Learning Network for Efficient Alzheimer’s Disease Diagnosis
Xinyue Yang, Meiliang Liu, Yunfang Xu, Xiaoxiao Yang, Zhengye Si, Zijin Li, Zhiwen Zhao
Main category: eess.IV
TL;DR: This paper proposes SFNet, the first end-to-end deep learning framework that combines spatial and frequency domain information from 3D MRI to improve Alzheimer’s disease diagnosis, achieving 95.1% accuracy on the ADNI dataset.
Details
Motivation: Current AD diagnostic models typically extract features from only one domain (spatial or frequency), limiting their ability to fully capture complex neuroimaging characteristics. While some studies combine both domains, they are mostly limited to 2D MRI, leaving the potential of dual-domain analysis in 3D MRI unexplored.
Method: The authors developed SFNet (Spatio-Frequency Network), which integrates: (1) an enhanced dense convolutional network for local spatial feature extraction, (2) a global frequency module for frequency-domain representations, and (3) a novel multi-scale attention module to refine spatial feature extraction. This creates the first end-to-end framework leveraging both spatial and frequency information in 3D MRI.
Result: SFNet achieved 95.1% accuracy in classifying cognitively normal (CN) and Alzheimer’s disease (AD) cases on the ADNI dataset, outperforming existing baseline methods while reducing computational overhead.
Conclusion: SFNet successfully demonstrates that combining spatial and frequency domain information in 3D MRI analysis significantly improves AD diagnosis performance, establishing a new state-of-the-art approach for MRI-based Alzheimer’s disease detection with both high accuracy and computational efficiency.
Abstract: Alzheimer’s disease (AD) is a progressive neurodegenerative disorder that predominantly affects the elderly population and currently has no cure. Magnetic Resonance Imaging (MRI), as a non-invasive imaging technique, is essential for the early diagnosis of AD. MRI inherently contains both spatial and frequency information, as raw signals are acquired in the frequency domain and reconstructed into spatial images via the Fourier transform. However, most existing AD diagnostic models extract features from a single domain, limiting their capacity to fully capture the complex neuroimaging characteristics of the disease. While some studies have combined spatial and frequency information, they are mostly confined to 2D MRI, leaving the potential of dual-domain analysis in 3D MRI unexplored. To overcome this limitation, we propose Spatio-Frequency Network (SFNet), the first end-to-end deep learning framework that simultaneously leverages spatial and frequency domain information to enhance 3D MRI-based AD diagnosis. SFNet integrates an enhanced dense convolutional network to extract local spatial features and a global frequency module to capture global frequency-domain representations. Additionally, a novel multi-scale attention module is proposed to further refine spatial feature extraction. Experiments on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset demonstrate that SFNet outperforms existing baselines and reduces computational overhead in classifying cognitively normal (CN) and AD, achieving an accuracy of 95.1%.
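A minimal sketch of the dual-domain idea: one branch convolves the 3D volume in the spatial domain, another convolves the log-magnitude of its 3D FFT, and the pooled features are fused for classification. SFNet's enhanced dense blocks and multi-scale attention modules are omitted, so this is illustrative only.

```python
import torch
import torch.nn as nn

# Hedged sketch of spatial-plus-frequency feature fusion for 3D MRI (CN vs AD).
class DualDomain3D(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.spatial = nn.Sequential(nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(),
                                     nn.AdaptiveAvgPool3d(1))
        self.freq = nn.Sequential(nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool3d(1))
        self.fc = nn.Linear(16, n_classes)

    def forward(self, x):                             # x: (B, 1, D, H, W)
        spec = torch.fft.fftn(x, dim=(-3, -2, -1)).abs()
        spec = torch.log1p(spec)                      # compress the FFT dynamic range
        feats = torch.cat([self.spatial(x).flatten(1),
                           self.freq(spec).flatten(1)], dim=1)
        return self.fc(feats)

logits = DualDomain3D()(torch.randn(2, 1, 32, 32, 32))
print(logits.shape)                                   # torch.Size([2, 2])
```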