Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 92]
- cs.CV [Total: 159]
- cs.AI [Total: 60]
- cs.SD [Total: 16]
- cs.LG [Total: 129]
- cs.MA [Total: 4]
- cs.MM [Total: 1]
- eess.AS [Total: 6]
- eess.IV [Total: 13]
cs.CL
[1] MTEB-NL and E5-NL: Embedding Benchmark and Models for Dutch
Nikolay Banar, Ehsan Lotfi, Jens Van Nooten, Cristina Arhiliuc, Marija Kliocaite, Walter Daelemans
Main category: cs.CL
TL;DR: New resources for Dutch language embeddings including MTEB-NL benchmark, training dataset with synthetic data, and E5-NL embedding models to address Dutch underrepresentation in multilingual resources.
Details
Motivation: Dutch language is underrepresented in multilingual embedding resources, typically comprising only a small fraction of published resources, creating a gap that needs to be addressed.
Method: Created MTEB-NL benchmark with existing and new Dutch datasets, compiled training dataset from Dutch retrieval datasets with synthetic LLM-generated data, and developed compact E5-NL embedding models.
Result: Developed comprehensive evaluation benchmark, enhanced training dataset, and efficient embedding models that demonstrate strong performance across multiple tasks for Dutch language.
Conclusion: The released resources (benchmark, dataset, and models) address the Dutch representation gap and support further development of Dutch embeddings, available publicly through Hugging Face Hub and MTEB package.
Abstract: Recently, embedding resources, including models, benchmarks, and datasets, have been widely released to support a variety of languages. However, the Dutch language remains underrepresented, typically comprising only a small fraction of the published multilingual resources. To address this gap and encourage the further development of Dutch embeddings, we introduce new resources for their evaluation and generation. First, we introduce the Massive Text Embedding Benchmark for Dutch (MTEB-NL), which includes both existing Dutch datasets and newly created ones, covering a wide range of tasks. Second, we provide a training dataset compiled from available Dutch retrieval datasets, complemented with synthetic data generated by large language models to expand task coverage beyond retrieval. Finally, we release a series of E5-NL models: compact yet efficient embedding models that demonstrate strong performance across multiple tasks. We make our resources publicly available through the Hugging Face Hub and the MTEB package.
[2] MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables
Matteo Marcuzzo, Alessandro Zangari, Andrea Albarelli, Jose Camacho-Collados, Mohammad Taher Pilehvar
Main category: cs.CL
TL;DR: MORABLES is a human-verified benchmark using fables and short stories to test LLMs’ moral reasoning through multiple-choice questions with adversarial variants, revealing that larger models outperform smaller ones but remain brittle and susceptible to manipulation.
Details
Motivation: As LLMs excel at standard reading comprehension, there's a need to evaluate their capacity for complex abstract reasoning and moral inference using literature-based benchmarks with rich narrative depth.
Method: Created MORABLES benchmark from historical fables and short stories with human-verified multiple-choice questions targeting moral inference, including carefully crafted distractors and adversarial variants to test model robustness and identify vulnerabilities.
Result: Larger models outperform smaller ones but remain susceptible to adversarial manipulation, relying on superficial patterns rather than true moral reasoning. Best models refute their own answers in ~20% of cases. Reasoning-enhanced models fail to bridge the performance gap.
Conclusion: Scale, not reasoning ability, is the primary driver of performance in moral reasoning tasks. Current LLMs show significant brittleness and self-contradiction in moral inference, indicating they lack true comprehension of complex moral reasoning.
Abstract: As LLMs excel on standard reading comprehension benchmarks, attention is shifting toward evaluating their capacity for complex abstract reasoning and inference. Literature-based benchmarks, with their rich narrative and moral depth, provide a compelling framework for evaluating such deeper comprehension skills. Here, we present MORABLES, a human-verified benchmark built from fables and short stories drawn from historical literature. The main task is structured as multiple-choice questions targeting moral inference, with carefully crafted distractors that challenge models to go beyond shallow, extractive question answering. To further stress-test model robustness, we introduce adversarial variants designed to surface LLM vulnerabilities and shortcuts due to issues such as data contamination. Our findings show that, while larger models outperform smaller ones, they remain susceptible to adversarial manipulation and often rely on superficial patterns rather than true moral reasoning. This brittleness results in significant self-contradiction, with the best models refuting their own answers in roughly 20% of cases depending on the framing of the moral choice. Interestingly, reasoning-enhanced models fail to bridge this gap, suggesting that scale - not reasoning ability - is the primary driver of performance.
[3] LLM-as-a-Judge: Rapid Evaluation of Legal Document Recommendation for Retrieval-Augmented Generation
Anu Pradhan, Alexandra Ortan, Apurv Verma, Madhavan Seshadri
Main category: cs.CL
TL;DR: This paper proposes using LLMs as judges to evaluate Retrieval-Augmented Generation systems in legal contexts, identifying robust metrics and statistical methods for reliable evaluation.
Details
Motivation: Traditional evaluation metrics fall short in capturing nuanced quality dimensions in specialized domains like legal research, especially with the rise of Generative AI. There's a need for scalable, cost-effective evaluation methods that maintain precision for high-stakes legal applications.
Method: The study investigates LLM-as-a-Judge approach through systematic experimentation, examining inter-rater reliability metrics and statistical comparison methods. It tests traditional metrics like Krippendorff’s alpha against alternatives like Gwet’s AC2 and rank correlation coefficients, and uses Wilcoxon Signed-Rank Test with Benjamini-Hochberg corrections for system comparisons.
Result: Traditional agreement metrics like Krippendorff’s alpha were found misleading in skewed AI evaluation distributions. Gwet’s AC2 and rank correlation coefficients emerged as more robust indicators for judge selection, while Wilcoxon Signed-Rank Test with corrections provided statistical rigor for reliable system comparisons.
Conclusion: The research demonstrates a path toward scalable, cost-effective evaluation that maintains legal application precision, transforming human-intensive evaluation bottlenecks into automated yet statistically principled frameworks using LLMs as reliable judges.
Abstract: The evaluation bottleneck in recommendation systems has become particularly acute with the rise of Generative AI, where traditional metrics fall short of capturing nuanced quality dimensions that matter in specialized domains like legal research. Can we trust Large Language Models to serve as reliable judges of their own kind? This paper investigates LLM-as-a-Judge as a principled approach to evaluating Retrieval-Augmented Generation systems in legal contexts, where the stakes of recommendation quality are exceptionally high. We tackle two fundamental questions that determine practical viability: which inter-rater reliability metrics best capture the alignment between LLM and human assessments, and how do we conduct statistically sound comparisons between competing systems? Through systematic experimentation, we discover that traditional agreement metrics like Krippendorff’s alpha can be misleading in the skewed distributions typical of AI system evaluations. Instead, Gwet’s AC2 and rank correlation coefficients emerge as more robust indicators for judge selection, while the Wilcoxon Signed-Rank Test with Benjamini-Hochberg corrections provides the statistical rigor needed for reliable system comparisons. Our findings suggest a path toward scalable, cost-effective evaluation that maintains the precision demanded by legal applications, transforming what was once a human-intensive bottleneck into an automated, yet statistically principled, evaluation framework.
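The Benjamini-Hochberg correction mentioned in the abstract above can be sketched in a few lines. This is a minimal illustration of the standard step-up procedure only, not the authors' evaluation code: the function name `benjamini_hochberg`, the `alpha` default, and the example p-values are assumptions made here, and the p-values are taken as given (for instance from pairwise Wilcoxon signed-rank tests computed elsewhere).

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True if rejected at FDR level alpha.

    Standard step-up procedure: sort the p-values, find the largest rank k
    such that p_(k) <= (k / m) * alpha, and reject the k smallest p-values.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest rank whose p-value is under its rank-dependent threshold.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank

    # Reject every hypothesis ranked at or below max_rank.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from four pairwise system comparisons.
example_p_values = [0.001, 0.04, 0.03, 0.5]
example_decisions = benjamini_hochberg(example_p_values)
```

Because the procedure is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value passes a later threshold, which is why the second loop rejects by rank and not by individual comparison.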
[4] SENTRA: Selected-Next-Token Transformer for LLM Text Detection
Mitchell Plyler, Yilun Zhang, Alexander Tuzhilin, Saoud Khalifah, Sen Tian
Main category: cs.CL
TL;DR: SENTRA is a novel Transformer-based detector that uses selected-next-token probabilities and contrastive pre-training to significantly outperform existing methods in detecting LLM-generated text across multiple domains.
Details
Motivation: As LLMs become more capable and widespread, the potential for misuse of undisclosed AI-generated text is growing, creating a need for effective detection methods.
Method: SENTRA is a supervised Transformer-based encoder that leverages selected-next-token-probability sequences and utilizes contrastive pre-training on large amounts of unlabeled data.
Result: Experiments on three public datasets across 24 text domains show SENTRA significantly outperforms popular baselines, particularly in out-of-domain settings.
Conclusion: SENTRA demonstrates strong performance as a general-purpose classifier for detecting LLM-generated text that isn’t explicitly declared, making it effective for real-world applications.
Abstract: LLMs are becoming increasingly capable and widespread. Consequently, the potential and reality of their misuse is also growing. In this work, we address the problem of detecting LLM-generated text that is not explicitly declared as such. We present a novel, general-purpose, and supervised LLM text detector, SElected-Next-Token tRAnsformer (SENTRA). SENTRA is a Transformer-based encoder leveraging selected-next-token-probability sequences and utilizing contrastive pre-training on large amounts of unlabeled data. Our experiments on three popular public datasets across 24 domains of text demonstrate SENTRA is a general-purpose classifier that significantly outperforms popular baselines in the out-of-domain setting.
[5] MORQA: Benchmarking Evaluation Metrics for Medical Open-Ended Question Answering
Wen-wai Yim, Asma Ben Abacha, Zixuan Yu, Robert Doerning, Fei Xia, Meliha Yetisgen
Main category: cs.CL
TL;DR: MORQA is a new multilingual benchmark for evaluating NLG systems in medical QA, featuring expert-authored gold answers and ratings. LLM-based evaluators outperform traditional metrics in correlating with human judgments.
Details
Motivation: Traditional automatic evaluation metrics like BLEU and ROUGE are inadequate for medical NLG evaluation due to the need for accuracy, relevance, and domain expertise in open-ended medical QA tasks where multiple valid responses exist.
Method: Created MORQA benchmark with 2-4+ gold-standard answers by medical professionals across three medical visual/text QA datasets in English and Chinese. Compared traditional metrics with LLM-based evaluators (GPT-4, Gemini) using expert human ratings.
Result: LLM-based approaches significantly outperform traditional metrics in correlating with expert judgments. LLMs show better sensitivity to semantic nuances and robustness to variability among reference answers.
Conclusion: This work provides the first comprehensive multilingual study of NLG evaluation in the medical domain, demonstrating the superiority of LLM-based evaluators and highlighting the need for human-aligned evaluation methods.
Abstract: Evaluating natural language generation (NLG) systems in the medical domain presents unique challenges due to the critical demands for accuracy, relevance, and domain-specific expertise. Traditional automatic evaluation metrics, such as BLEU, ROUGE, and BERTScore, often fall short in distinguishing between high-quality outputs, especially given the open-ended nature of medical question answering (QA) tasks where multiple valid responses may exist. In this work, we introduce MORQA (Medical Open-Response QA), a new multilingual benchmark designed to assess the effectiveness of NLG evaluation metrics across three medical visual and text-based QA datasets in English and Chinese. Unlike prior resources, our datasets feature 2-4+ gold-standard answers authored by medical professionals, along with expert human ratings for three English and Chinese subsets. We benchmark both traditional metrics and large language model (LLM)-based evaluators, such as GPT-4 and Gemini, finding that LLM-based approaches significantly outperform traditional metrics in correlating with expert judgments. We further analyze factors driving this improvement, including LLMs’ sensitivity to semantic nuances and robustness to variability among reference answers. Our results provide the first comprehensive, multilingual qualitative study of NLG evaluation in the medical domain, highlighting the need for human-aligned evaluation methods. All datasets and annotations will be publicly released to support future research.
[6] MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts
Jiayi He, Yangmin Huang, Qianyun Du, Xiangying Zhou, Zhiyang He, Jiaxue Hu, Xiaodong Tao, Lixian Lai
Main category: cs.CL
TL;DR: MedFact is a new Chinese medical fact-checking benchmark with 2,116 expert-annotated instances across 13 specialties, 8 error types, and multiple difficulty levels. Evaluation of 20 LLMs shows they struggle with precise error localization and exhibit ‘over-criticism’ tendencies.
Details
Motivation: Existing benchmarks are limited by narrow data domains and fail to capture real-world medical complexity, necessitating a more rigorous evaluation of LLMs' factual reliability in healthcare applications.
Method: Developed MedFact using a hybrid AI-human framework with iterative expert feedback to refine an AI-driven multi-criteria filtering process. Evaluated 20 leading LLMs on veracity classification and error localization against human expert baseline.
Result: LLMs can often detect if text contains errors but struggle with precise error localization, falling short of human performance. Models exhibit ‘over-criticism’ - misidentifying correct information as erroneous - which worsens with advanced reasoning techniques.
Conclusion: MedFact highlights critical challenges for LLM deployment in medical applications and provides a robust resource to develop more factually reliable and medically aware models.
Abstract: The increasing deployment of Large Language Models (LLMs) in healthcare necessitates a rigorous evaluation of their factual reliability. However, existing benchmarks are often limited by narrow domains of data, failing to capture the complexity of real-world medical information. To address this critical gap, we introduce MedFact, a new and challenging benchmark for Chinese medical fact-checking. MedFact comprises 2,116 expert-annotated instances curated from diverse real-world texts, spanning 13 medical specialties, 8 fine-grained error types, 4 writing styles, and multiple difficulty levels. Its construction employs a hybrid AI-human framework where iterative expert feedback refines an AI-driven, multi-criteria filtering process, ensuring both high data quality and difficulty. We conduct a comprehensive evaluation of 20 leading LLMs, benchmarking their performance on veracity classification and error localization against a human expert baseline. Our results reveal that while models can often determine if a text contains an error, precisely localizing it remains a substantial challenge, with even top-performing models falling short of human performance. Furthermore, our analysis uncovers a frequent ‘over-criticism’ phenomenon, a tendency for models to misidentify correct information as erroneous, which is exacerbated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. By highlighting these critical challenges for deploying LLMs in medical applications, MedFact provides a robust resource to drive the development of more factually reliable and medically aware models.
[7] Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents
Fuyu Xing, Zimu Wang, Wei Wang, Haiyang Zhang
Main category: cs.CL
TL;DR: First systematic evaluation of Large Vision-Language Models (LVLMs) on Multimedia Event Extraction (M2E2) tasks, showing strong cross-modal performance but weaker results on text-only tasks.
Details
Motivation: Multimedia content proliferation requires effective M2E2 systems, but LVLMs' utility for this task remains underexplored despite their strong cross-modal capabilities.
Method: Evaluated representative LVLMs (DeepSeek-VL2, Qwen-VL series) on M2E2 dataset across text-only, image-only, and cross-media subtasks under few-shot prompting and fine-tuning with LoRA.
Result: Few-shot LVLMs perform better on visual tasks but struggle with textual tasks; fine-tuning with LoRA substantially enhances performance; LVLMs show strong synergy in cross-modal settings achieving superior performance.
Conclusion: LVLMs demonstrate promising M2E2 capabilities but face persistent challenges in semantic precision, localization, and cross-modal grounding that need to be addressed for further advancement.
Abstract: The proliferation of multimedia content necessitates the development of effective Multimedia Event Extraction (M2E2) systems. Though Large Vision-Language Models (LVLMs) have shown strong cross-modal capabilities, their utility in the M2E2 task remains underexplored. In this paper, we present the first systematic evaluation of representative LVLMs, including DeepSeek-VL2 and the Qwen-VL series, on the M2E2 dataset. Our evaluations cover text-only, image-only, and cross-media subtasks, assessed under both few-shot prompting and fine-tuning settings. Our key findings highlight the following valuable insights: (1) Few-shot LVLMs perform notably better on visual tasks but struggle significantly with textual tasks; (2) Fine-tuning LVLMs with LoRA substantially enhances model performance; and (3) LVLMs exhibit strong synergy when combining modalities, achieving superior performance in cross-modal settings. We further provide a detailed error analysis to reveal persistent challenges in areas such as semantic precision, localization, and cross-modal grounding, which remain critical obstacles for advancing M2E2 capabilities.
[8] Topic Coverage-based Demonstration Retrieval for In-Context Learning
Wonbin Kweon, SeongKu Kang, Runchu Tian, Pengcheng Jiang, Jiawei Han, Hwanjo Yu
Main category: cs.CL
TL;DR: TopicK is a topic coverage-based retrieval framework that selects demonstrations to comprehensively cover topic-level knowledge relevant to test inputs and model capabilities, outperforming similarity-based methods.
Details
Motivation: Prior methods for in-context learning demonstration selection rely on embedding similarity or generation probability, which often results in irrelevant or redundant examples that don't cover all necessary knowledge requirements.
Method: TopicK estimates topics required by the input, assesses model’s knowledge on those topics, and iteratively selects demonstrations that introduce previously uncovered required topics where the model shows low topical knowledge.
Result: Extensive experiments across various datasets and both open- and closed-source LLMs validate TopicK’s effectiveness in improving in-context learning performance.
Conclusion: TopicK provides a more comprehensive approach to demonstration selection by focusing on topic coverage and model knowledge assessment, leading to better in-context learning outcomes compared to traditional similarity-based methods.
Abstract: The effectiveness of in-context learning relies heavily on selecting demonstrations that provide all the necessary information for a given test input. To achieve this, it is crucial to identify and cover fine-grained knowledge requirements. However, prior methods often retrieve demonstrations based solely on embedding similarity or generation probability, resulting in irrelevant or redundant examples. In this paper, we propose TopicK, a topic coverage-based retrieval framework that selects demonstrations to comprehensively cover topic-level knowledge relevant to both the test input and the model. Specifically, TopicK estimates the topics required by the input and assesses the model’s knowledge on those topics. TopicK then iteratively selects demonstrations that introduce previously uncovered required topics, in which the model exhibits low topical knowledge. We validate the effectiveness of TopicK through extensive experiments across various datasets and both open- and closed-source LLMs. Our source code is available at https://github.com/WonbinKweon/TopicK_EMNLP2025.
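The iterative selection described in the abstract above can be pictured as a greedy coverage loop. The sketch below is a simplified illustration under stated assumptions, not the TopicK implementation: topics are plain strings, `knowledge` holds a hypothetical score in [0, 1] per topic, and the gain function (sum of `1 - knowledge` over newly covered required topics) is an assumption chosen here for clarity. The linked repository is the authoritative source for the actual scoring.

```python
def select_demonstrations(required, knowledge, candidates, k):
    """Greedily pick up to k demonstrations that cover required topics.

    required:   set of topic names the test input is assumed to need
    knowledge:  dict topic -> assumed model knowledge score in [0, 1]
    candidates: dict demonstration id -> set of topics it covers
    """
    uncovered = set(required)
    selected = []
    for _ in range(k):
        best_id, best_gain = None, 0.0
        for demo_id, topics in candidates.items():
            if demo_id in selected:
                continue
            # Reward newly covered topics, weighted by how little the
            # model is assumed to know about each of them.
            gain = sum(1.0 - knowledge.get(t, 0.0) for t in topics & uncovered)
            if gain > best_gain:
                best_id, best_gain = demo_id, gain
        if best_id is None:
            break  # no remaining candidate adds uncovered knowledge
        selected.append(best_id)
        uncovered -= candidates[best_id]
    return selected


# Hypothetical inputs for illustration only.
example_selection = select_demonstrations(
    required={"negation", "tense", "voice"},
    knowledge={"negation": 0.9, "tense": 0.2, "voice": 0.3},
    candidates={"d1": {"negation"}, "d2": {"tense", "voice"}, "d3": {"tense"}},
    k=2,
)
```

In this toy run the loop first picks `d2`, which covers two topics the model is assumed to know poorly, and then `d1`; `d3` adds nothing new once `tense` is covered, which is the redundancy that similarity-only retrieval does not rule out.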
[9] Does Language Model Understand Language?
Suvojit Acharjee, Utathya Aich, Asfak Ali
Main category: cs.CL
TL;DR: Evaluation of SOTA language models on fine-grained linguistic phenomena (tense, negation, voice, modality) using new LUCID dataset and RECISE guidelines, with Compound-Beta emerging as the most balanced performer.
Details
Motivation: LMs struggle with fine-grained linguistic phenomena critical for effective human communication, particularly important for educational technologies supporting UN SDG 4 where linguistic clarity is essential.
Method: Introduced RECISE guidelines and LUCID dataset with carefully crafted sentence pairs in English and Bengali. Evaluated models (MISTRAL-SABA-24B, LLaMA-4-Scout-17B, LLaMA-3.3-70B, Gemma2-9B, Compound-Beta) using Pearson/Spearman correlation, MAE, and novel HCE accuracy metric that measures alignment with human rating variability.
Result: Compound-Beta performed best as the most balanced model, achieving highest Pearson correlation in English and robust performance on mixed-language data, showing strong alignment with human judgments in cross-lingual scenarios.
Conclusion: The study demonstrates the importance of evaluating LMs on fine-grained linguistic phenomena and introduces effective evaluation frameworks, with Compound-Beta showing superior performance in aligning with human linguistic interpretation across languages.
Abstract: Despite advances in natural language generation and understanding, LMs still struggle with fine-grained linguistic phenomena such as tense, negation, voice, and modality, which are central to effective human communication. In the context of the United Nations SDG 4, where linguistic clarity is critical, the deployment of LMs in educational technologies demands careful scrutiny. As LMs are increasingly powering applications like tutoring systems, automated grading, and translation, their alignment with human linguistic interpretation becomes essential for effective learning. In this study, we conduct an evaluation of SOTA language models across these challenging contexts in both English and Bengali. To ensure a structured assessment, we introduce the new Route for Evaluation of Cognitive Inference in Systematic Environments (RECISE) guidelines. Our proposed LUCID dataset, composed of carefully crafted sentence pairs in English and Bengali, specifically challenges these models on critical aspects of language comprehension, including negation, tense, and voice variations. We assess the performance of SOTA models including MISTRAL-SABA-24B, LLaMA-4-Scout-17B, LLaMA-3.3-70B, Gemma2-9B, and Compound-Beta using standard metrics like Pearson correlation, Spearman correlation, and Mean Absolute Error, as well as a novel, linguistically inspired metric, the HCE accuracy. The HCE accuracy measures how often model predictions fall within one standard deviation of the mean human rating, thus capturing human-like tolerance for variability in language interpretation. Our findings highlight Compound-Beta as the most balanced model, consistently achieving high correlations and low MAEs across diverse language conditions. It records the highest Pearson correlation in English and demonstrates robust performance on mixed-language data, indicating a strong alignment with human judgments in cross-lingual scenarios.
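The HCE accuracy described in the abstract above is the share of model predictions that land within one standard deviation of the mean human rating. The sketch below is one possible reading of that definition, not the authors' code: the function name `hce_accuracy`, the use of the population standard deviation (`statistics.pstdev`), the inclusive boundary, and the example ratings are all assumptions made here.

```python
from statistics import mean, pstdev


def hce_accuracy(model_scores, human_ratings):
    """Fraction of items whose model score is within one standard
    deviation of the mean human rating for that item.

    model_scores:  list of floats, one model prediction per item
    human_ratings: list of lists, the human ratings for each item
    """
    if not model_scores or len(model_scores) != len(human_ratings):
        raise ValueError("need one non-empty list of human ratings per score")

    hits = 0
    for pred, ratings in zip(model_scores, human_ratings):
        mu = mean(ratings)
        sigma = pstdev(ratings)  # assumption: population standard deviation
        if abs(pred - mu) <= sigma:  # assumption: boundary counts as a hit
            hits += 1
    return hits / len(model_scores)


# Hypothetical ratings for three sentence pairs, for illustration only.
example_humans = [[1, 3], [4, 4], [2, 4, 6]]
example_model = [2.5, 4.0, 6.0]
example_hce = hce_accuracy(example_model, example_humans)
```

Items on which annotators agree exactly have zero spread, so under this reading only an exact match with the mean counts as a hit there; whether the original metric handles that case differently is not stated in the abstract.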
[10] Audited Reasoning Refinement: Fine-Tuning Language Models via LLM-Guided Step-Wise Evaluation and Correction
Sumanta Bhattacharyya, Sara Riaz, Pedram Rooshenas
Main category: cs.CL
TL;DR: R2tA is a method that uses refined LLM reasoning traces as supervision to train task-specific reasoning models when human labels are scarce, achieving effective adaptation through two-stage alignment.
Details
Motivation: Training task-specific reasoning models is challenging when direct human supervision or high-quality labels are scarce, but LLMs can generate abundant intermediate reasoning traces that can be refined into effective supervision.
Method: Reason-Refine-then-Align (R2tA) generates initial reasoning from a base model, refines the traces to fix hallucinations and inconsistencies, then performs two-stage alignment with SFT followed by DPO to calibrate intermediate reasoning with human preferences.
Result: Applied to evaluating extended entity relationship diagrams (EERDs), R2tA effectively handles structurally complex tasks where prompt-only methods fail, using a curated dataset of 600 EERD variants with induced mistakes across 11 categories.
Conclusion: R2tA provides a practical, cost-effective path for scalable LLM adaptation in data-scarce domains, enabling reproducible AI tools for education and other applications.
Abstract: Training a task-specific small reasoning model is challenging when direct human supervision or high-quality labels are scarce. However, LLMs with reasoning capabilities produce abundant intermediate reasoning traces that can be systematically refined to create effective supervision signals. We propose Reason-Refine-then-Align (R2tA), which turns refined model rationales into supervision for training task-specific reasoning models. Our method generates initial reasoning and responses from an open-source base model on task-specific inputs, then refines these traces, fixing hallucinations and inconsistencies, to form a high-fidelity dataset. We perform a two-stage alignment, supervised fine-tuning (SFT), followed by direct preference optimization (DPO) to calibrate the model’s intermediate reasoning with human-validated conceptual preferences and then condition the final output on that aligned reasoning. As a case study, we apply R2tA to evaluate extended entity relationship diagrams (EERDs) in database system design, a structurally complex task where prompt-only methods miss or hallucinate errors. We curated a dataset of 600 EERD variants (train/test split of 450/150, respectively) with induced mistakes spanning 11 categories. Empirical evaluation suggests R2tA provides a practical, cost-effective path to scalable LLM adaptation in data-scarce domains, enabling reproducible AI tools for education and beyond.
[11] FunAudio-ASR Technical Report
Keyu An, Yanni Chen, Chong Deng, Changfeng Gao, Zhifu Gao, Bo Gong, Xiangang Li, Yabin Li, Xiang Lv, Yunjie Ji, Yiheng Jiang, Bin Ma, Haoneng Luo, Chongjia Ni, Zexu Pan, Yiping Peng, Zhendong Peng, Peiyao Wang, Hao Wang, Wen Wang, Wupeng Wang, Biao Tian, Zhentao Tan, Nan Yang, Bin Yuan, Jieping Ye, Jixing Yu, Qinglin Zhang, Kun Zou, Han Zhao, Shengkui Zhao, Jingren Zhou
Main category: cs.CL
TL;DR: FunAudio-ASR is a large-scale LLM-based ASR system that combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance while addressing LLM hallucination issues and optimizing for practical deployment.
Details
Motivation: Address the problem of LLM hallucination in ASR systems that degrades user experience in real-world applications, and bridge the performance gap between open-source benchmarks and real industry evaluation sets.
Method: Synergistic combination of massive data scaling, large model capacity, LLM integration, reinforcement learning, and production-oriented optimizations including streaming capability, noise robustness, code-switching, and hotword customization.
Result: Achieves SOTA performance on real application datasets, demonstrating effectiveness and robustness in practical settings while outperforming other LLM-based ASR systems on industry evaluation sets.
Conclusion: FunAudio-ASR successfully addresses LLM hallucination issues and delivers superior performance in real-world ASR applications through comprehensive optimizations for practical deployment requirements.
Abstract: In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present FunAudio-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. Moreover, FunAudio-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and satisfying other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to production-oriented optimizations, FunAudio-ASR achieves SOTA performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.
[12] A comparison of pipelines for the translation of a low resource language based on transformers
Chiara Bonfanti, Michele Colombino, Giulia Coucourde, Faeze Memari, Stefano Pinardi, Rosa Meo
Main category: cs.CL
TL;DR: Comparison of three transformer-based pipelines for French-to-Bambara machine translation, with simple transformer achieving best results on low-resource language.
Details
Motivation: To develop effective machine translation pipelines for Bambara, a low-resource African language spoken by ~14 million people, by comparing different neural network approaches.
Method: Three approaches: 1) Simple transformer trained from scratch, 2) Fine-tuning LLaMA3 instructor models (3B-8B), 3) Language distillation with student-teacher architecture using LaBSE embeddings and BERT extension.
Result: Simple transformer pipeline achieved best results: 10% BLEU/21% chrF on Bayelemagaba, 33.81% BLEU/41% chrF on Yiri dataset. Instructor models performed better on single datasets than aggregated collections.
Conclusion: Simpler transformer architecture outperformed more complex approaches for low-resource Bambara translation, consistent with findings in other low-resource language scenarios.
Abstract: This work compares three pipelines for training transformer-based neural networks to produce machine translators for Bambara, a Mandé language spoken in Africa by about 14,188,850 people. The first pipeline trains a simple transformer to translate sentences from French into Bambara. The second fine-tunes LLaMA3 (3B-8B) instructor models using decoder-only architectures for French-to-Bambara translation. Models from the first two pipelines were trained with different hyperparameter combinations to improve BLEU and chrF scores, evaluated on both test sentences and official Bambara benchmarks. The third pipeline uses language distillation with a student-teacher dual neural network to integrate Bambara into a pre-trained LaBSE model, which provides language-agnostic embeddings. A BERT extension is then applied to LaBSE to generate translations. All pipelines were tested on Dokotoro (medical) and Bayelemagaba (mixed domains). Results show that the first pipeline, although simpler, achieves the best translation accuracy (10% BLEU, 21% chrF on Bayelemagaba), consistent with low-resource translation results. On the Yiri dataset, created for this work, it achieves 33.81% BLEU and 41% chrF. Instructor-based models perform better on single datasets than on aggregated collections, suggesting they capture dataset-specific patterns more effectively.
[13] PAC: Pronunciation-Aware Contextualized Large Language Model-based Automatic Speech Recognition
Li Fu, Yu Xin, Sunlu Zeng, Lu Fan, Youzheng Wu, Xiaodong He
Main category: cs.CL
TL;DR: PAC framework improves ASR by better pronunciation modeling and homophone discrimination, reducing WER by 30-54% compared to LLM-based ASR models.
Details
Motivation: Address two key challenges in LLM-based ASR: effective pronunciation modeling and robust homophone discrimination for better raw/long-tail word recognition.
Method: Two-stage learning: (1) pronunciation-guided context learning with interleaved grapheme-phoneme modeling and distractors, (2) pronunciation-discriminative reinforcement learning with perturbed label sampling.
Result: 30.2% and 53.8% relative WER reduction on English Librispeech and Mandarin AISHELL-1 datasets; 31.8% and 60.5% relative reduction in biased WER for long-tail words.
Conclusion: PAC framework effectively enhances pronunciation awareness and homophone discrimination in ASR systems, significantly improving performance on both general and long-tail word recognition.
Abstract: This paper presents a Pronunciation-Aware Contextualized (PAC) framework to address two key challenges in Large Language Model (LLM)-based Automatic Speech Recognition (ASR) systems: effective pronunciation modeling and robust homophone discrimination. Both are essential for raw or long-tail word recognition. The proposed approach adopts a two-stage learning paradigm. First, we introduce a pronunciation-guided context learning method. It employs an interleaved grapheme-phoneme context modeling strategy that incorporates grapheme-only distractors, encouraging the model to leverage phonemic cues for accurate recognition. Then, we propose a pronunciation-discriminative reinforcement learning method with perturbed label sampling to further enhance the model's ability to distinguish contextualized homophones. Experimental results on the public English Librispeech and Mandarin AISHELL-1 datasets indicate that PAC: (1) reduces relative Word Error Rate (WER) by 30.2% and 53.8% compared to pre-trained LLM-based ASR models, and (2) achieves 31.8% and 60.5% relative reductions in biased WER for long-tail words compared to strong baselines, respectively.
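To make the headline metric concrete, the sketch below computes WER as word-level edit distance divided by the number of reference words, and then the relative reduction between two systems. The transcripts and function names are illustrative assumptions, not data or code from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of an error rate, as a fraction of the baseline."""
    return (baseline - improved) / baseline


# Hypothetical transcripts containing homophone confusions.
reference = "they read the rare book over there"
baseline_hyp = "they red the rare book over their"
improved_hyp = "they read the rare book over their"

baseline_wer = wer(reference, baseline_hyp)
improved_wer = wer(reference, improved_hyp)
print(baseline_wer, improved_wer, relative_reduction(baseline_wer, improved_wer))
```

A relative reduction is measured against the baseline error rate, so a figure such as 30.2% says nothing about the absolute WER unless the baseline is also reported.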
[14] MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models
Vijay Govindarajan, Pratik Patel, Sahil Tripathi, Md Azizul Hoque, Gautam Siddharth Kashyap
Main category: cs.CL
TL;DR: Zero-shot AAC system using pre-trained audio CLIP and LLM achieves 35% improvement in caption quality without extensive training.
Details
Motivation: Automated Audio Captioning faces data limitations compared to image captioning, requiring methods that don't need large training datasets.
Method: Uses pre-trained audio CLIP model for auditory feature extraction and structured prompt generation, then guides LLM with MAGIC search for refined token selection aligned with audio content.
Result: 35% improvement in NLG mean score (4.7 to 7.3) using MAGIC search with WavCaps model. Performance depends on audio-text matching and keyword selection, with single keyword prompt working best.
Conclusion: Zero-shot approach with pre-trained models effectively addresses AAC data limitations, with keyword selection being crucial for optimal performance.
Abstract: Automated Audio Captioning (AAC) generates captions for audio clips but faces challenges due to limited datasets compared to image captioning. To overcome this, we propose a zero-shot AAC system that leverages pre-trained models, eliminating the need for extensive training. Our approach uses a pre-trained audio CLIP model to extract auditory features and generate a structured prompt, which guides a Large Language Model (LLM) in caption generation. Unlike traditional greedy decoding, our method refines token selection through the audio CLIP model, ensuring alignment with the audio content. Experimental results demonstrate a 35% improvement in NLG mean score (from 4.7 to 7.3) using MAGIC search with the WavCaps model. The performance is heavily influenced by the audio-text matching model and keyword selection, with optimal results achieved using a single keyword prompt, and a 50% performance drop when no keyword list is used.
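The token-refinement step can be pictured with a small sketch. It assumes a MAGIC-style rule in which each of the language model's top-k candidates is scored by its LM probability plus a weighted softmax over audio-text similarities; the degeneration penalty used in the original MAGIC search is left out for brevity. The weight `beta`, the candidate words, and all numbers are made up for illustration and do not come from the paper.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(candidates, lm_probs, audio_text_sims, beta=2.0):
    """Choose the next token among the LM's top-k candidates.

    Each candidate is scored by its LM probability plus beta times a softmax
    over audio-text similarities (a stand-in for audio CLIP scores).
    """
    audio_scores = softmax(audio_text_sims)
    scores = [p + beta * a for p, a in zip(lm_probs, audio_scores)]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]


# Hypothetical top-3 candidates for the next word of a caption.
candidates = ["music", "barking", "talking"]
lm_probs = [0.50, 0.20, 0.30]        # the language model alone prefers "music"
audio_text_sims = [0.1, 2.0, 0.3]    # the audio encoder matches "barking"

greedy_choice = candidates[lm_probs.index(max(lm_probs))]
guided_choice = pick_token(candidates, lm_probs, audio_text_sims)
print(greedy_choice, guided_choice)
```

With `beta=0.0` the rule falls back to greedy decoding, which is one way to see how much the audio-text matching model contributes to each choice.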
[15] EconProver: Towards More Economical Test-Time Scaling for Automated Theorem Proving
Mukai Li, Linfeng Song, Zhenwen Liang, Jiahao Xu, Shansan Gong, Qi Liu, Haitao Mi, Dong Yu
Main category: cs.CL
TL;DR: EconProver reduces computational cost by 88% while maintaining ATP performance through dynamic CoT switching and diverse parallel-scaled RL.
Details
Motivation: Current ATP methods using LLMs have high computational overhead from reflective CoT reasoning and multiple sampling passes, with existing cost analyses ignoring the significant cost disparities between different scaling strategies.
Method: Proposes two complementary methods: (1) dynamic Chain-of-Thought switching to reduce unnecessary token consumption, and (2) diverse parallel-scaled reinforcement learning with trainable prefixes to improve pass rates with limited sampling passes. These are integrated into a unified EconRL pipeline.
Result: Experiments on miniF2F and ProofNet show EconProver achieves comparable performance to baseline methods using only 12% of the computational cost.
Conclusion: Provides practical solutions for deploying lightweight automated theorem proving models without performance sacrifice, demonstrating significant efficiency improvements over current SOTA approaches.
Abstract: Large Language Models (LLMs) have recently advanced the field of Automated Theorem Proving (ATP), attaining substantial performance gains through widely adopted test-time scaling strategies, notably reflective Chain-of-Thought (CoT) reasoning and increased sampling passes. However, they both introduce significant computational overhead for inference. Moreover, existing cost analyses typically regulate only the number of sampling passes, while neglecting the substantial disparities in sampling costs introduced by different scaling strategies. In this paper, we systematically compare the efficiency of different test-time scaling strategies for ATP models and demonstrate the inefficiency of the current state-of-the-art (SOTA) open-source approaches. We then investigate approaches to significantly reduce token usage and sample passes while maintaining the original performance. Specifically, we propose two complementary methods that can be integrated into a unified EconRL pipeline for amplified benefits: (1) a dynamic Chain-of-Thought (CoT) switching mechanism designed to mitigate unnecessary token consumption, and (2) Diverse parallel-scaled reinforcement learning (RL) with trainable prefixes to enhance pass rates under constrained sampling passes. Experiments on miniF2F and ProofNet demonstrate that our EconProver achieves comparable performance to baseline methods with only 12% of the computational cost. This work provides actionable insights for deploying lightweight ATP models without sacrificing performance.
[16] Positional Encoding via Token-Aware Phase Attention
Yu Wang, Sheng Shen, Rémi Munos, Hongyuan Zhan, Yuandong Tian
Main category: cs.CL
TL;DR: TAPA is a new positional encoding method that addresses RoPE’s distance-dependent bias in attention scores, enabling better long-context modeling through learnable phase functions without extensive post-training adjustments.
Details
Motivation: RoPE has intrinsic limitations in modeling long-context due to distance-dependent bias in attention scores, and existing extension methods require problematic post-hoc adjustments like rescaling or hyperparameter retuning after pretraining.
Method: Token-Aware Phase Attention (TAPA) incorporates a learnable phase function into the attention mechanism to preserve token interactions over long ranges and enable direct, lightweight fine-tuning for longer contexts.
Result: TAPA achieves significantly lower perplexity on long-context tasks compared to RoPE families, demonstrates effective extrapolation to unseen lengths, and maintains token interactions over extended ranges.
Conclusion: TAPA provides a superior alternative to RoPE for long-context modeling by eliminating distance-dependent bias through learnable phase functions while requiring minimal fine-tuning and enabling effective length extrapolation.
Abstract: We prove under practical assumptions that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE’s ability to model long contexts. RoPE extension methods may alleviate this issue, but they typically require post-hoc adjustments after pretraining, such as rescaling or hyperparameter retuning. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism. TAPA preserves token interactions over long ranges, extends to longer contexts with direct and light fine-tuning, extrapolates to unseen lengths, and attains significantly lower perplexity on long-context tasks than RoPE families.
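For readers less familiar with RoPE, the sketch below shows the standard rotation and the property that makes attention scores a function of relative distance. It does not reproduce the paper's proof or TAPA itself. Each feature pair is represented as one complex number, and the vectors and positions are arbitrary illustrative values.

```python
import cmath


def rope(vector, position, base=10000.0):
    """Apply rotary positional embedding to a list of complex feature pairs.

    Each complex entry stands for one 2-D feature pair; pair k is rotated by
    the angle position * base ** (-k / len(vector)).
    """
    n = len(vector)
    return [z * cmath.exp(1j * position * base ** (-k / n))
            for k, z in enumerate(vector)]


def score(query, key):
    """Real inner product between two complex-pair vectors."""
    return sum((q * k.conjugate()).real for q, k in zip(query, key))


# Hypothetical query and key content vectors (two feature pairs each).
q = [1.0 + 2.0j, 0.5 - 1.0j]
k = [0.3 + 0.7j, -1.2 + 0.4j]

# The same offset (3) at two different absolute positions gives the same score.
near = score(rope(q, 5), rope(k, 2))
far = score(rope(q, 105), rope(k, 102))
print(near, far)

# The score changes with the offset even though q and k are unchanged.
print([round(score(rope(q, d), rope(k, 0)), 3) for d in (0, 1, 10, 100)])
```

Because the rotation angle depends only on the position offset, the content vectors cannot cancel its effect on the score, which is the kind of distance dependence the abstract refers to.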
[17] UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech
Shuhei Kato
Main category: cs.CL
TL;DR: UtterTune is a lightweight adaptation method that fine-tunes multilingual TTS systems using LLM architecture to improve pronunciation controllability in target languages while maintaining performance across languages.
Details
Motivation: LLM-based TTS models achieve high naturalness but struggle with accurate grapheme-to-phoneme mapping and prosody control, especially when lacking explicit G2P modules and processing minimally encoded text directly.
Method: Uses low-rank adaptation to enable control of segmental pronunciation and pitch accent at phoneme level for Japanese speech, maintaining naturalness and speaker similarity in zero-shot settings.
Result: Objective and subjective evaluations confirm the effectiveness of UtterTune in enhancing pronunciation controllability while preserving cross-language performance.
Conclusion: UtterTune provides an effective lightweight adaptation approach for improving pronunciation control in multilingual TTS systems without compromising performance in other languages.
Abstract: We propose UtterTune, a lightweight adaptation method that fine-tunes a multilingual text-to-speech (TTS) system based on a large language model (LLM) architecture, designed to enhance the controllability of pronunciation in a target language while preserving performance in others. While LLM architectures have enabled TTS models to achieve remarkable naturalness, accurately modeling grapheme-to-phoneme (G2P) mapping and prosody remains challenging, especially when the model omits an explicit G2P module and directly processes minimally encoded text (e.g., byte-pair encoding). UtterTune leverages low-rank adaptation to enable the control of segmental pronunciation and pitch accent at the phoneme level for Japanese speech, the target language in this paper, while maintaining naturalness and speaker similarity in a zero-shot setting. Objective and subjective evaluations confirm its effectiveness.
[18] Don’t Change My View: Ideological Bias Auditing in Large Language Models
Paul Kröger, Emilio Barkett
Main category: cs.CL
TL;DR: A method for detecting ideological steering in large language models by analyzing distributional shifts in outputs across related prompts, enabling black-box auditing of proprietary systems.
Details
Motivation: As LLMs influence public opinion, there's a need to detect when they are intentionally steered toward specific ideological positions to prevent disproportionate control over public discourse.
Method: Adapts a statistical method that analyzes distributional shifts in model outputs across thematically related prompts, requiring no access to model internals (model-agnostic approach).
Result: The approach was validated through experiments demonstrating practical applicability and potential for independent post hoc audits of LLM behavior.
Conclusion: This method provides a crucial first step for detecting ideological steering attempts in LLMs, particularly valuable for auditing proprietary black-box systems where internal access is limited.
Abstract: As large language models (LLMs) become increasingly embedded in products used by millions, their outputs may influence individual beliefs and, cumulatively, shape public opinion. If the behavior of LLMs can be intentionally steered toward specific ideological positions, such as political or religious views, then those who control these systems could gain disproportionate influence over public discourse. Although it remains an open question whether LLMs can reliably be guided toward coherent ideological stances and whether such steering can be effectively prevented, a crucial first step is to develop methods for detecting when such steering attempts occur. In this work, we adapt a previously proposed statistical method to the new context of ideological bias auditing. Our approach carries over the model-agnostic design of the original framework, which does not require access to the internals of the language model. Instead, it identifies potential ideological steering by analyzing distributional shifts in model outputs across prompts that are thematically related to a chosen topic. This design makes the method particularly suitable for auditing proprietary black-box systems. We validate our approach through a series of experiments, demonstrating its practical applicability and its potential to support independent post hoc audits of LLM behavior.
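The abstract does not name the statistical method it adapts, so the following is only a generic illustration of a black-box distribution-shift check. It swaps in total variation distance between stance-label distributions and a permutation test. The stance labels, sample sizes, and function names are hypothetical.

```python
import random
from collections import Counter


def total_variation(sample_a, sample_b):
    """Total variation distance between two empirical label distributions."""
    freq_a, freq_b = Counter(sample_a), Counter(sample_b)
    labels = set(freq_a) | set(freq_b)
    return 0.5 * sum(abs(freq_a[l] / len(sample_a) - freq_b[l] / len(sample_b))
                     for l in labels)


def permutation_p_value(sample_a, sample_b, rounds=2000, seed=0):
    """Share of random relabelings whose distance is at least the observed one."""
    rng = random.Random(seed)
    observed = total_variation(sample_a, sample_b)
    pooled = list(sample_a) + list(sample_b)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        left, right = pooled[:len(sample_a)], pooled[len(sample_a):]
        if total_variation(left, right) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (rounds + 1)


# Hypothetical stance labels assigned to outputs for prompts on one topic.
baseline = ["pro"] * 14 + ["neutral"] * 13 + ["contra"] * 13
audited = ["pro"] * 30 + ["neutral"] * 6 + ["contra"] * 4

print(total_variation(baseline, audited))
print(permutation_p_value(baseline, audited))
```

The check needs only sampled outputs and a labeling step, which matches the black-box setting described above. A small p-value flags a shift but does not by itself show that the shift was intentional.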
[19] Mitigating Strategy Preference Bias in Emotional Support Conversation via Uncertainty Estimations
Yougen Zhou, Qin Chen, Ningning Zhou, Jie Zhou, Xingjiao Wu, Liang He
Main category: cs.CL
TL;DR: This paper addresses preference bias in LLMs for emotional support conversations by identifying knowledge boundaries and using reinforcement learning with dual rewards to improve strategy planning accuracy and reduce bias.
Details
Motivation: LLMs face challenges in emotional support conversations due to low accuracy in strategy planning and significant preference bias towards specific strategies, with underlying causes not well studied.
Method: First identify knowledge boundaries of LLMs in strategy planning, then propose reinforcement learning approach with dual reward function that optimizes both accuracy and entropy-based confidence for different knowledge regions.
Result: Experiments on ESConv and ExTES datasets with multiple LLM backbones show the approach outperforms baselines.
Conclusion: The proposed method effectively mitigates preference bias in LLMs for emotional support conversations by addressing knowledge boundaries through dual-reward reinforcement learning.
Abstract: Emotional support conversation (ESC) aims to alleviate distress through empathetic dialogue, yet large language models (LLMs) face persistent challenges in delivering effective ESC due to low accuracy in strategy planning. Moreover, there is a considerable preference bias towards specific strategies. Prior methods using fine-tuned strategy planners have shown potential in reducing such bias, while the underlying causes of the preference bias in LLMs have not been well studied. To address these issues, we first reveal the fundamental causes of the bias by identifying the knowledge boundaries of LLMs in strategy planning. Then, we propose an approach to mitigate the bias by reinforcement learning with a dual reward function, which optimizes strategy planning via both accuracy and entropy-based confidence for each region according to the knowledge boundaries. Experiments on the ESConv and ExTES datasets with multiple LLM backbones show that our approach outperforms the baselines, confirming the effectiveness of our approach.
[20] Chat-Driven Text Generation and Interaction for Person Retrieval
Zequn Xie, Chuxin Wang, Sihang Cai, Yeqiang Wang, Shulei Wang, Tao Jin
Main category: cs.CL
TL;DR: A novel annotation-free framework for text-based person search using multi-turn dialogue with MLLMs to generate pseudo-labels and refine queries, eliminating manual text annotations while improving retrieval performance.
Details
Motivation: Text-based person search requires labor-intensive manual text annotations, which limits scalability and practical deployment. The paper aims to address this by creating an annotation-free system that can handle vague, incomplete, or ambiguous real-world search queries.
Method: Two complementary modules: Multi-Turn Text Generation (MTG) uses MLLMs to generate rich pseudo-labels through simulated dialogues without manual supervision. Multi-Turn Text Interaction (MTI) refines user queries at inference time through dynamic, dialogue-based reasoning.
Result: The method achieves competitive or superior retrieval accuracy while eliminating the need for manual captions. Extensive evaluations demonstrate improved robustness and usability.
Conclusion: The unified annotation-free framework enables scalable and practical deployment of TBPS systems by addressing the annotation bottleneck through MLLM-powered dialogue generation and query refinement.
Abstract: Text-based person search (TBPS) enables the retrieval of person images from large-scale databases using natural language descriptions, offering critical value in surveillance applications. However, a major challenge lies in the labor-intensive process of obtaining high-quality textual annotations, which limits scalability and practical deployment. To address this, we introduce two complementary modules: Multi-Turn Text Generation (MTG) and Multi-Turn Text Interaction (MTI). MTG generates rich pseudo-labels through simulated dialogues with MLLMs, producing fine-grained and diverse visual descriptions without manual supervision. MTI refines user queries at inference time through dynamic, dialogue-based reasoning, enabling the system to interpret and resolve vague, incomplete, or ambiguous descriptions - characteristics often seen in real-world search scenarios. Together, MTG and MTI form a unified and annotation-free framework that significantly improves retrieval accuracy, robustness, and usability. Extensive evaluations demonstrate that our method achieves competitive or superior results while eliminating the need for manual captions, paving the way for scalable and practical deployment of TBPS systems.
[21] Towards Inclusive Toxic Content Moderation: Addressing Vulnerabilities to Adversarial Attacks in Toxicity Classifiers Tackling LLM-generated Content
Shaz Furniturewala, Arkaitz Zubiaga
Main category: cs.CL
TL;DR: The paper identifies vulnerable components in toxicity classifiers using mechanistic interpretability, suppresses these circuits to improve adversarial robustness, and reveals demographic-specific vulnerabilities that inform more inclusive model development.
Details
Motivation: LLM-generated content creates challenges for content moderation as conventional classifiers trained on human text suffer from misclassifications due to distribution shifts and adversarial attacks. Current defenses are reactive rather than proactive.
Method: Used mechanistic interpretability techniques on fine-tuned BERT and RoBERTa classifiers. Applied adversarial attacking to identify vulnerable circuits, then suppressed these circuits to improve performance. Tested on diverse datasets covering various minority groups.
Result: Models have distinct heads that are either crucial for performance or vulnerable to attack. Suppressing vulnerable heads improves performance on adversarial input. Different heads are responsible for vulnerability across different demographic groups.
Conclusion: The approach improves adversarial robustness and reveals fairness gaps, providing insights for more inclusive toxicity detection model development through demographic-level analysis of vulnerable circuits.
Abstract: The volume of machine-generated content online has grown dramatically due to the widespread use of Large Language Models (LLMs), leading to new challenges for content moderation systems. Conventional content moderation classifiers, which are usually trained on text produced by humans, suffer from misclassifications due to LLM-generated text deviating from their training data and adversarial attacks that aim to avoid detection. Present-day defence tactics are reactive rather than proactive, since they rely on adversarial training or external detection models to identify attacks. In this work, we aim to identify the vulnerable components of toxicity classifiers that contribute to misclassification, proposing a novel strategy based on mechanistic interpretability techniques. Our study focuses on fine-tuned BERT and RoBERTa classifiers, testing on diverse datasets spanning a variety of minority groups. We use adversarial attacking techniques to identify vulnerable circuits. Finally, we suppress these vulnerable circuits, improving performance against adversarial attacks. We also provide demographic-level insights into these vulnerable circuits, exposing fairness and robustness gaps in model training. We find that models have distinct heads that are either crucial for performance or vulnerable to attack and suppressing the vulnerable heads improves performance on adversarial input. We also find that different heads are responsible for vulnerability across different demographic groups, which can inform more inclusive development of toxicity detection models.
[22] Case-Based Decision-Theoretic Decoding with Quality Memories
Hiroyuki Deguchi, Masaaki Nagata
Main category: cs.CL
TL;DR: CBDT decoding combines case-based reasoning with decision theory to improve text generation quality, especially for out-of-domain tasks, outperforming both MAP and MBR decoding methods.
Details
Motivation: MBR decoding depends on sample texts from the model, making it difficult to capture knowledge from out-of-domain data. The authors want to improve text generation quality for domain-specific tasks.
Method: Proposed case-based decision-theoretic (CBDT) decoding, which estimates expected utility using domain-specific examples rather than relying solely on model-generated samples.
Result: CBDT decoding outperformed MAP decoding, and the combination of MBR+CBDT decoding beat MBR alone in 7 domain translation tasks (De-En, Ja-En) and image captioning tasks on MSCOCO and nocaps datasets.
Conclusion: CBDT decoding provides an effective approach for improving text generation quality, particularly for handling out-of-domain knowledge by leveraging domain-specific examples in the decision process.
Abstract: Minimum Bayes risk (MBR) decoding is a decision rule of text generation, which selects the hypothesis that maximizes the expected utility and robustly generates higher-quality texts than maximum a posteriori (MAP) decoding. However, it depends on sample texts drawn from the text generation model; thus, it is difficult to find a hypothesis that correctly captures the knowledge or information of out-of-domain. To tackle this issue, we propose case-based decision-theoretic (CBDT) decoding, another method to estimate the expected utility using examples of domain data. CBDT decoding not only generates higher-quality texts than MAP decoding, but also the combination of MBR and CBDT decoding outperformed MBR decoding in seven domain De–En and Ja↔En translation tasks and image captioning tasks on MSCOCO and nocaps datasets.
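The MBR decision rule described above can be sketched in a few lines: score every hypothesis by its average utility against a set of reference texts and keep the best one. The unigram F1 utility is a toy stand-in for a real metric. The second half, which scores candidates against stored in-domain examples, is an assumed simplification of the case-based idea and is not the paper's estimator. All sentences are made up.

```python
from collections import Counter


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Toy utility: F1 over overlapping word counts."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def expected_utility(hypothesis, references, utility=unigram_f1):
    """Monte Carlo estimate of expected utility: the mean over reference texts."""
    return sum(utility(hypothesis, r) for r in references) / len(references)


def decode(hypotheses, references, utility=unigram_f1):
    """Return the hypothesis with the highest estimated expected utility."""
    return max(hypotheses, key=lambda h: expected_utility(h, references, utility))


# Hypothetical candidate translations sampled from a model.
samples = [
    "the patient has a fever",
    "the patient has fever",
    "the sick person is hot",
]
# MBR: the model's own samples act as pseudo-references.
mbr_choice = decode(samples, samples)

# Case-based variant (assumed form): score against stored in-domain examples.
domain_examples = ["the patient has pyrexia", "the patient presents with pyrexia"]
candidates = samples + ["the patient has pyrexia"]
case_based_choice = decode(candidates, domain_examples)
print(mbr_choice, "|", case_based_choice)
```

In the MBR call the references are the model's own samples, so a domain term the model rarely produces cannot gain support there. Swapping in domain examples changes which hypothesis wins without changing the decision rule.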
[23] HistoryBankQA: Multilingual Temporal Question Answering on Historical Events
Biswadip Mandal, Anant Khandelwal, Manish Gupta
Main category: cs.CL
TL;DR: HistoryBank is a multilingual database of 10M+ historical events from Wikipedia, with a comprehensive QA benchmark for evaluating temporal reasoning in 6 tasks across 10 languages, showing GPT4o performs best.
Details
Motivation: Existing temporal reasoning datasets are limited in scale, lack multilingual coverage, and focus more on contemporary events, creating a need for better benchmarking of LLMs' temporal reasoning capabilities.
Method: Extracted 10M+ historical events from Wikipedia timeline pages and article infoboxes, created a multilingual database covering 10 languages, and constructed a comprehensive QA benchmark with 6 temporal reasoning tasks.
Result: GPT4o performed best across all answer types and languages, while Gemma-2 outperformed other small language models on the temporal reasoning benchmark tasks.
Conclusion: HistoryBank provides a comprehensive resource for advancing multilingual and temporally-aware natural language understanding of historical events, with code and datasets to be made publicly available.
Abstract: Temporal reasoning about historical events is a critical skill for NLP tasks like event extraction, historical entity linking, temporal question answering, timeline summarization, temporal event clustering and temporal natural language inference. Yet efforts on benchmarking temporal reasoning capabilities of large language models (LLMs) are rather limited. Existing temporal reasoning datasets are limited in scale, lack multilingual coverage and focus more on contemporary events. To address these limitations, we present HistoryBank, a multilingual database of 10M+ historical events extracted from Wikipedia timeline pages and article infoboxes. Our database provides unprecedented coverage in both historical depth and linguistic breadth with 10 languages. Additionally, we construct a comprehensive question answering benchmark for temporal reasoning across all languages. This benchmark covers a diverse set of 6 temporal QA reasoning tasks, and we evaluate a suite of popular language models (LLaMA-3-8B, Mistral-7B, Gemma-2-9b, Qwen3-8B, GPT4o) to assess their performance on these tasks. As expected, GPT4o performs best across all answer types and languages; Gemma-2 outperforms the other small language models. Our work aims to provide a comprehensive resource for advancing multilingual and temporally-aware natural language understanding of historical events. To facilitate further research, we will make our code and datasets publicly available upon acceptance of this paper.
[24] Contrastive Learning with Enhanced Abstract Representations using Grouped Loss of Abstract Semantic Supervision
Omri Suissa, Muhiim Ali, Shengmai Chen, Yinuo Cai, Shekhar Pradhan
Main category: cs.CL
TL;DR: The paper investigates concept abstraction capacity in VLMs and introduces a novel contrastive loss technique with grouped image-caption data to improve higher-level concept recognition.
Details
Motivation: Humans can recognize images as instances of general concepts beyond object identification, but current VLMs lack this concept abstraction capability. The research aims to enhance VLM's ability to understand higher-level conceptual information in images.
Method: Introduces MAGIC dataset with grouped image-captions and conceptual labels. Uses novel contrastive loss with outer (text-image contrastive groups) and inner (distances between group instances) losses to encode common information across group members without exposing model to higher-level concepts.
Result: The CLEAR GLASS model demonstrates improved abstract concept recognition compared to state-of-the-art models, showing emergent concept abstraction capacity through the training methodology.
Conclusion: The grouped contrastive loss technique successfully enables VLMs to develop concept abstraction capabilities by forcing semantic representations to align with higher-level concepts in latent space, without direct exposure to those concepts during training.
Abstract: Humans can recognize an image as an instance of a general concept, beyond simply identifying its objects and their relationships. In this paper, we investigate (1) the extent to which VLMs have this concept abstraction capacity, and (2) strategies for encoding the sort of higher-concept information in images that would enable the resulting VLM (the CLEAR GLASS model) to have this capability to a greater degree. To this end, we introduce a grouped image-caption dataset (MAGIC), which consists of several groups of image captions and for each group a set of associated images and higher-level conceptual labels. We use a novel contrastive loss technique to induce the model to encode in the representation of each image (caption) in a group the information that is common to all members of the image-caption group. Our main contribution is a grouped contrastive loss function based on text-image contrastive groups (outer contrastive loss) as well as an inner loss which measures the distances between image-caption instances in the group. Our training methodology results in the CLEAR GLASS model having the concept abstraction capacity as an emergent capacity because the model is not exposed to the higher-level concepts associated with each group. Instead, the training forces the model to create for each image-caption group a semantic representation that brings it closer to the semantic representation of the higher-level concepts in the latent semantic space. Our experiments show that this training methodology results in a model which shows improvement in abstract concept recognition compared to SOTA models.
[25] ConvergeWriter: Data-Driven Bottom-Up Article Construction
Binquan Ji, Jiaqi Wang, Ruiting Li, Xingchen Han, Yiyang Qi, Shichao Wang, Yifei Lu, Yuantao Han, Feiliang Ren
Main category: cs.CL
TL;DR: A bottom-up framework that first retrieves and clusters knowledge from external sources before generating documents, ensuring factual accuracy and structural coherence while reducing hallucinations.
Details
Motivation: Existing top-down methods for long-form document generation often create disconnects between generated content and available knowledge, leading to factual inaccuracies and fragmented content.
Method: Retrieval-First for Knowledge, Clustering for Structure approach - performs exhaustive iterative retrieval from knowledge base, then uses unsupervised clustering to organize documents into knowledge clusters that guide hierarchical outline and content generation.
Result: Achieves performance comparable to or exceeding state-of-the-art baselines on both 14B and 32B parameter models, with unique advantages in knowledge-constrained scenarios requiring high fidelity.
Conclusion: Presents an effective paradigm for generating reliable, structured long-form documents, enabling more robust LLM applications in high-stakes, knowledge-intensive domains.
Abstract: Large Language Models (LLMs) have shown remarkable prowess in text generation, yet producing long-form, factual documents grounded in extensive external knowledge bases remains a significant challenge. Existing “top-down” methods, which first generate a hypothesis or outline and then retrieve evidence, often suffer from a disconnect between the model’s plan and the available knowledge, leading to content fragmentation and factual inaccuracies. To address these limitations, we propose a novel “bottom-up,” data-driven framework that inverts the conventional generation pipeline. Our approach is predicated on a “Retrieval-First for Knowledge, Clustering for Structure” strategy, which first establishes the “knowledge boundaries” of the source corpus before any generative planning occurs. Specifically, we perform exhaustive iterative retrieval from the knowledge base and then employ an unsupervised clustering algorithm to organize the retrieved documents into distinct “knowledge clusters.” These clusters form an objective, data-driven foundation that directly guides the subsequent generation of a hierarchical outline and the final document content. This bottom-up process ensures that the generated text is strictly constrained by and fully traceable to the source material, proactively adapting to the finite scope of the knowledge base and fundamentally mitigating the risk of hallucination. Experimental results on both 14B and 32B parameter models demonstrate that our method achieves performance comparable to or exceeding state-of-the-art baselines, and is expected to demonstrate unique advantages in knowledge-constrained scenarios that demand high fidelity and structural coherence. Our work presents an effective paradigm for generating reliable, structured, long-form documents, paving the way for more robust LLM applications in high-stakes, knowledge-intensive domains.
[26] Data Augmentation for Maltese NLP using Transliterated and Machine Translated Arabic Data
Kurt Micallef, Nizar Habash, Claudia Borg
Main category: cs.CL
TL;DR: Arabic-language resources can support Maltese NLP through cross-lingual augmentation techniques including transliteration and machine translation, with novel transliteration systems showing significant benefits.
Details
Motivation: Maltese is a unique Semitic language with Latin script orthography that creates a gap from its Arabic linguistic relatives, making it challenging to leverage existing Arabic NLP resources.
Method: Investigated multiple strategies for aligning Arabic textual data with Maltese, including various transliteration schemes and machine translation approaches, and introduced novel transliteration systems that better represent Maltese orthography.
Result: Arabic-based augmentation demonstrated significant benefits for Maltese NLP tasks when evaluated on both monolingual and multilingual models.
Conclusion: Cross-lingual augmentation techniques using Arabic resources are effective for improving Maltese natural language processing despite the orthographic differences.
Abstract: Maltese is a unique Semitic language that has evolved under extensive influence from Romance and Germanic languages, particularly Italian and English. Despite its Semitic roots, its orthography is based on the Latin script, creating a gap between it and its closest linguistic relatives in Arabic. In this paper, we explore whether Arabic-language resources can support Maltese natural language processing (NLP) through cross-lingual augmentation techniques. We investigate multiple strategies for aligning Arabic textual data with Maltese, including various transliteration schemes and machine translation (MT) approaches. As part of this, we also introduce novel transliteration systems that better represent Maltese orthography. We evaluate the impact of these augmentations on monolingual and multilingual models and demonstrate that Arabic-based augmentation can significantly benefit Maltese NLP tasks.
[27] The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations
Yubo Zhu, Dongrui Liu, Zecheng Lin, Wei Tong, Sheng Zhong, Jing Shao
Main category: cs.CL
TL;DR: A novel method for estimating question difficulty for LLMs using only hidden representations without generating output tokens, achieving better performance than existing approaches and enabling more efficient adaptive inference strategies.
Details
Motivation: Existing difficulty estimation methods for LLMs require repeated sampling, auxiliary models, or fine-tuning, which are computationally expensive and may compromise generality. There's a need for efficient and accurate difficulty estimation that doesn't require token generation.
Method: Model the token-level generation process as a Markov chain and define a value function to estimate expected output quality from hidden states. Uses only the initial hidden state representations from the target LLM without generating any output tokens.
Result: Extensive experiments across textual and multimodal tasks show the method consistently outperforms existing baselines in difficulty estimation. When applied to adaptive reasoning strategies (Self-Consistency, Best-of-N, Self-Refine), it achieves higher inference efficiency with fewer generated tokens.
Conclusion: The proposed approach provides efficient and accurate difficulty estimation for LLMs using only hidden representations, enabling more effective adaptive inference strategies while reducing computational costs compared to existing methods.
Abstract: Estimating the difficulty of input questions as perceived by large language models (LLMs) is essential for accurate performance evaluation and adaptive inference. Existing methods typically rely on repeated response sampling, auxiliary models, or fine-tuning the target model itself, which may incur substantial computational costs or compromise generality. In this paper, we propose a novel approach for difficulty estimation that leverages only the hidden representations produced by the target LLM. We model the token-level generation process as a Markov chain and define a value function to estimate the expected output quality given any hidden state. This allows for efficient and accurate difficulty estimation based solely on the initial hidden state, without generating any output tokens. Extensive experiments across both textual and multimodal tasks demonstrate that our method consistently outperforms existing baselines in difficulty estimation. Moreover, we apply our difficulty estimates to guide adaptive reasoning strategies, including Self-Consistency, Best-of-N, and Self-Refine, achieving higher inference efficiency with fewer generated tokens.
[28] Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings
Shiyu Li, Yang Tang, Ruijie Liu, Shi-Zhe Chen, Xi Chen
Main category: cs.CL
TL;DR: Conan-embedding-v2 is a new 1.4B-parameter LLM trained from scratch as a text embedder, achieving SOTA performance on MTEB benchmarks by addressing data and training gaps through cross-lingual retrieval data and soft-masking techniques.
Details
Motivation: Previous LLM fine-tuning approaches using LoRA are limited by data and training gaps between LLMs and embedding models, particularly in cross-lingual contexts and the mismatch between causal masking (token-level loss) and bidirectional masking (sentence-level loss).
Method: 1) Added news data and multilingual pairs for pretraining to bridge data gap; 2) Created cross-lingual retrieval dataset; 3) Introduced soft-masking mechanism to transition between causal and bidirectional masks; 4) Proposed dynamic hard negative mining for more challenging training.
Result: Conan-embedding-v2 with only 1.4B parameters achieves state-of-the-art performance on both the Massive Text Embedding Benchmark (MTEB) and Chinese MTEB as of May 19, 2025.
Conclusion: Training LLMs from scratch specifically for embedding tasks with carefully designed data strategies and training mechanisms can overcome limitations of fine-tuning approaches and deliver superior performance even with relatively small parameter counts.
Abstract: Large language models (LLMs) have recently demonstrated excellent performance in text embedding tasks. Previous work usually uses LoRA to fine-tune existing LLMs, which are limited by the data and training gap between LLMs and embedding models. In this work, we introduce Conan-embedding-v2, a new 1.4B-parameter LLM trained from scratch and fine-tuned as a text embedder. First, we add news data and multilingual pairs for LLM pretraining to bridge the data gap. Based on this, we propose a cross-lingual retrieval dataset that enables the LLM to better integrate embeddings across different languages. Second, whereas LLMs use a causal mask with token-level loss, embedding models use a bidirectional mask with sentence-level loss. This training gap makes full fine-tuning less effective than LoRA. We introduce a soft-masking mechanism to gradually transition between these two types of masks, enabling the model to learn more comprehensive representations. Based on this, we propose a dynamic hard negative mining method that exposes the model to more difficult negative examples throughout the training process. Being intuitive and effective, with only approximately 1.4B parameters, Conan-embedding-v2 achieves SOTA performance on both the Massive Text Embedding Benchmark (MTEB) and Chinese MTEB (May 19, 2025).
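The abstract does not say how the soft mask is parameterized. The following is a minimal sketch of the general idea only, assuming a single scalar `alpha` that scales attention to future positions (0 gives a causal mask, 1 a bidirectional one) and a linear schedule over training steps. The names `soft_mask` and `linear_alpha` and the schedule itself are illustrative assumptions, not the authors' implementation.

```python
def soft_mask(seq_len, alpha):
    """Return a seq_len x seq_len attention mask as nested lists.

    Past and current positions (j <= i) always get weight 1.0.
    Future positions (j > i) get weight alpha, so alpha = 0.0 is a
    causal mask and alpha = 1.0 is a fully bidirectional mask.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [1.0 if j <= i else alpha for j in range(seq_len)]
        for i in range(seq_len)
    ]


def linear_alpha(step, total_steps):
    """Hypothetical schedule: move linearly from causal to bidirectional."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(1.0, max(0.0, step / total_steps))


if __name__ == "__main__":
    # Show the mask at the start, middle, and end of the assumed schedule.
    for step in (0, 50, 100):
        alpha = linear_alpha(step, 100)
        print(step, soft_mask(3, alpha))
```

In a real model the mask would typically be applied to attention scores inside each layer; this sketch only shows how one mask can interpolate between the two regimes.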
[29] All Roads Lead to Rome: Graph-Based Confidence Estimation for Large Language Model Reasoning
Caiqi Zhang, Chang Shu, Ehsan Shareghi, Nigel Collier
Main category: cs.CL
TL;DR: Training-free graph-based confidence estimation methods for LLMs on reasoning tasks, using reasoning path graphs with centrality, convergence, and weighting properties.
Details
Motivation: Existing confidence estimation methods for LLMs work well for factual QA but fail to generalize to reasoning tasks, creating a reliability gap for reasoning deployment.
Method: Model reasoning paths as directed graphs and estimate confidence using graph properties including centrality, path convergence, and path weighting - all training-free approaches.
Result: Experiments with two LLMs on three reasoning datasets show improved confidence estimation and better performance on two downstream tasks.
Conclusion: Graph-based methods provide effective training-free confidence estimation for reasoning tasks, addressing the generalization gap in existing approaches.
Abstract: Confidence estimation is essential for the reliable deployment of large language models (LLMs). Existing methods are primarily designed for factual QA tasks and often fail to generalize to reasoning tasks. To address this gap, we propose a set of training-free, graph-based confidence estimation methods tailored to reasoning tasks. Our approach models reasoning paths as directed graphs and estimates confidence by exploiting graph properties such as centrality, path convergence, and path weighting. Experiments with two LLMs on three reasoning datasets demonstrate improved confidence estimation and enhanced performance on two downstream tasks.
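To make the path-convergence idea concrete, here is a minimal sketch under stated assumptions: each sampled reasoning path is a list of step strings that starts at a shared question node and ends at a final answer, identical step strings are merged into one node, and an answer's confidence is the share of sampled paths that flow into it. The function names and this particular scoring rule are illustrative; the paper's actual centrality and weighting schemes are not given in the abstract.

```python
from collections import Counter, defaultdict


def build_graph(paths):
    """Merge sampled reasoning paths into one weighted directed graph.

    Nodes are step strings. The weight of edge (src, dst) is the number
    of sampled paths that take that step transition.
    """
    edges = defaultdict(int)
    for path in paths:
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += 1
    return dict(edges)


def answer_confidence(paths):
    """Score each final answer by the fraction of paths converging on it."""
    if not paths:
        raise ValueError("need at least one reasoning path")
    edges = build_graph(paths)
    in_weight = Counter()
    for (_, dst), weight in edges.items():
        in_weight[dst] += weight
    answers = {path[-1] for path in paths}
    return {answer: in_weight[answer] / len(paths) for answer in answers}


if __name__ == "__main__":
    sampled = [
        ["Q", "x=2", "ans=4"],
        ["Q", "x=2", "ans=4"],
        ["Q", "y=3", "ans=4"],
        ["Q", "y=3", "ans=6"],
    ]
    print(answer_confidence(sampled))
```

Because the score depends only on the sampled paths, no training is involved, which matches the training-free framing of the abstract.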
[30] Automated Generation of Research Workflows from Academic Papers: A Full-text Mining Framework
Heng Zhang, Chengzhi Zhang
Main category: cs.CL
TL;DR: An end-to-end framework for generating comprehensive research workflows from academic papers using PU learning, Flan-T5, and ChatGPT to identify workflow paragraphs, extract phrases, categorize stages, and create visual flowcharts.
Details
Motivation: To improve research reproducibility and enable AI for Science by addressing the limitation of existing methods that only extract fragmented procedural components rather than complete research workflows.
Method: Paragraph-centric approach using PU Learning with SciBERT to identify workflow paragraphs, Flan-T5 for phrase generation, ChatGPT for categorizing phrases into data preparation/processing/analysis stages, and mapping to create visual flowcharts.
Result: Achieved F1-score of 0.9772 for paragraph identification, ROUGE scores of 0.4543/0.2877/0.4427 for phrase generation, and 0.958 precision for stage classification. Successfully analyzed NLP workflows over 20 years, revealing methodological shifts.
Conclusion: Provides a validated technical framework for automated workflow generation and offers a process-oriented perspective for studying evolving scientific paradigms, with applications for research reproducibility and AI for Science.
Abstract: The automated generation of research workflows is essential for improving the reproducibility of research and accelerating the paradigm of “AI for Science”. However, existing methods typically extract merely fragmented procedural components and thus fail to capture complete research workflows. To address this gap, we propose an end-to-end framework that generates comprehensive, structured research workflows by mining full-text academic papers. As a case study in the Natural Language Processing (NLP) domain, our paragraph-centric approach first employs Positive-Unlabeled (PU) Learning with SciBERT to identify workflow-descriptive paragraphs, achieving an F1-score of 0.9772. Subsequently, we utilize Flan-T5 with prompt learning to generate workflow phrases from these paragraphs, yielding ROUGE-1, ROUGE-2, and ROUGE-L scores of 0.4543, 0.2877, and 0.4427, respectively. These phrases are then systematically categorized into data preparation, data processing, and data analysis stages using ChatGPT with few-shot learning, achieving a classification precision of 0.958. By mapping categorized phrases to their locations in the documents, we finally generate readable visual flowcharts of the entire research workflows. This approach facilitates the analysis of workflows derived from an NLP corpus and reveals key methodological shifts over the past two decades, including the increasing emphasis on data analysis and the transition from feature engineering to ablation studies. Our work offers a validated technical framework for automated workflow generation, along with a novel, process-oriented perspective for the empirical investigation of evolving scientific paradigms. Source code and data are available at: https://github.com/ZH-heng/research_workflow.
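For readers unfamiliar with the ROUGE-1 and ROUGE-2 figures quoted above, the sketch below shows how an n-gram overlap F1 can be computed for a generated phrase against a reference phrase. It assumes whitespace tokenization and lowercasing, and it reports F1; the abstract does not state which ROUGE variant (precision, recall, or F1) or which tokenizer the authors used, so treat this as an illustration rather than their evaluation script.

```python
from collections import Counter


def rouge_n_f1(candidate, reference, n=1):
    """F1 over clipped n-gram overlap between candidate and reference."""

    def ngrams(text):
        tokens = text.lower().split()
        return Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    generated = "fine-tune bert model"
    gold = "fine-tune the bert model"
    print(rouge_n_f1(generated, gold, n=1))
    print(rouge_n_f1(generated, gold, n=2))
```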
[31] Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models
Yuval Weiss, David Demitri Africa, Paula Buttery, Richard Diehl Martinez
Main category: cs.CL
TL;DR: ReLoRA performs worse than standard training for small language models (11M-66M parameters), with performance gaps widening for larger models, and reinforces rank deficiencies in smaller models.
Details
Motivation: To systematically study ReLoRA (a parameter-efficient method) for pretraining small language models, which offer lower computational and environmental costs compared to larger models.
Method: Conducted ablation experiments on SLMs (11M-66M parameters) evaluating performance metrics (loss, Paloma perplexity, BLiMP) and analyzed learning dynamics to understand rank deficiencies.
Result: ReLoRA generally underperforms standard training across all metrics, with performance gaps increasing for larger models, and reinforces existing rank deficiencies in smaller models.
Conclusion: Low-rank update strategies like ReLoRA may not transfer well to SLM pretraining, indicating the need for more research in low-compute regimes for parameter-efficient methods.
Abstract: Parameter-efficient methods such as LoRA have revolutionised the fine-tuning of LLMs. Still, their extension to pretraining via ReLoRA is less well understood, especially for small language models (SLMs), which offer lower computational and environmental costs. This work is the first systematic study of ReLoRA in SLMs (11M-66M parameters), evaluating both performance and learning dynamics. Through ablation experiments, we find that ReLoRA generally performs worse than standard training on loss, Paloma perplexity and BLiMP, with the gap widening for the larger models. Further analysis of the learning dynamics of the models indicates that ReLoRA reinforces the rank deficiencies found in smaller models. These results indicate that low-rank update strategies may not transfer easily to SLM pretraining, highlighting the need for more research in the low-compute regime.
[32] Do LLMs Understand Wine Descriptors Across Cultures? A Benchmark for Cultural Adaptations of Wine Reviews
Chenye Zou, Xingyue Wen, Tianyi Hu, Qian Janice Wang, Daniel Hershcovich
Main category: cs.CL
TL;DR: This paper introduces cross-cultural wine review adaptation between Chinese and English, going beyond literal translation to incorporate cultural nuances like taste preferences and flavor descriptors.
Details
Motivation: To address the limitations of current translation models in handling cultural content, particularly in wine reviews where cultural context and taste preferences vary significantly between Chinese and English-speaking cultures.
Method: Compiled a parallel corpus of 8k Chinese and 16k English wine reviews, benchmarked neural machine translation and state-of-the-art LLMs using automatic metrics and human evaluation with three culture-oriented criteria: Cultural Proximity, Cultural Neutrality, and Cultural Genuineness.
Result: Current models struggle to capture cultural nuances, especially in translating wine descriptions across different cultures, revealing significant challenges in handling cultural content.
Conclusion: The study highlights the limitations of existing translation models in cultural adaptation and proposes culture-specific evaluation criteria, emphasizing the need for better cultural understanding in language models for cross-cultural tasks.
Abstract: Recent advances in large language models (LLMs) have opened the door to culture-aware language tasks. We introduce the novel problem of adapting wine reviews across Chinese and English, which goes beyond literal translation by incorporating regional taste preferences and culture-specific flavor descriptors. In a case study on cross-cultural wine review adaptation, we compile the first parallel corpus of professional reviews, containing 8k Chinese and 16k Anglophone reviews. We benchmark both neural-machine-translation baselines and state-of-the-art LLMs with automatic metrics and human evaluation. For the latter, we propose three culture-oriented criteria – Cultural Proximity, Cultural Neutrality, and Cultural Genuineness – to assess how naturally a translated review resonates with target-culture readers. Our analysis shows that current models struggle to capture cultural nuances, especially in translating wine descriptions across different cultures. This highlights the challenges and limitations of translation models in handling cultural content.
[33] SitLLM: Large Language Models for Sitting Posture Health Understanding via Pressure Sensor Data
Jian Gao, Fufangchen Zhao, Yiyang Zhang, Danfeng Yan
Main category: cs.CL
TL;DR: SitLLM is a lightweight multimodal framework that combines pressure sensing with LLMs for fine-grained sitting posture monitoring and personalized health feedback.
Details
Motivation: Poor sitting posture contributes to long-term musculoskeletal disorders, yet existing sitting posture monitoring systems offer only coarse-grained recognition and lack the semantic expressiveness needed for personalized feedback.
Method: Three components: 1) Gaussian-Robust Sensor Embedding Module for robust feature extraction from pressure maps, 2) Prompt-Driven Cross-Modal Alignment Module to map sensor data to LLM semantic space, 3) Multi-Context Prompt Module that fuses multiple contextual information levels.
Result: The paper proposes a novel framework but does not provide specific experimental results in the abstract.
Conclusion: SitLLM enables fine-grained posture understanding and personalized health-oriented response generation by integrating flexible pressure sensing with large language models.
Abstract: Poor sitting posture is a critical yet often overlooked factor contributing to long-term musculoskeletal disorders and physiological dysfunctions. Existing sitting posture monitoring systems, although leveraging visual, IMU, or pressure-based modalities, often suffer from coarse-grained recognition and lack the semantic expressiveness necessary for personalized feedback. In this paper, we propose SitLLM, a lightweight multimodal framework that integrates flexible pressure sensing with large language models (LLMs) to enable fine-grained posture understanding and personalized health-oriented response generation. SitLLM comprises three key components: (1) a Gaussian-Robust Sensor Embedding Module that partitions pressure maps into spatial patches and injects local noise perturbations for robust feature extraction; (2) a Prompt-Driven Cross-Modal Alignment Module that reprograms sensor embeddings into the LLM’s semantic space via multi-head cross-attention using the pre-trained vocabulary embeddings; and (3) a Multi-Context Prompt Module that fuses feature-level, structure-level, statistical-level, and semantic-level contextual information to guide instruction comprehension.
[34] Multi-Model Synthetic Training for Mission-Critical Small Language Models
Nolan Platt, Pragyansmita Nayak
Main category: cs.CL
TL;DR: This paper presents a cost-effective method using LLMs as teachers to generate synthetic maritime Q&A data, enabling fine-tuning of smaller models that achieve 75% accuracy at 261x lower cost than using larger models directly.
Details
Motivation: Large Language Models face constraints in specialized domains due to scarce and complex domain-specific training data, particularly in maritime intelligence where manual annotation is infeasible.
Method: Transforms 3.2 billion AIS vessel tracking records into 21,543 synthetic Q&A pairs using multi-model generation (GPT-4o and o3-mini), then fine-tunes Qwen2.5-7B model on this synthetic dataset.
Result: Achieves 75% accuracy on maritime tasks with 261x cost reduction compared to using larger models for inference, demonstrating smaller fine-tuned models can match larger model performance.
Conclusion: Provides a reproducible framework for synthetic dataset generation in specialized domains, enabling cost-effective AI applications in maritime safety, security, and vessel traffic management.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across many domains, yet their application to specialized fields remains constrained by the scarcity and complexity of domain-specific training data. We present a novel approach that achieves a 261x cost reduction for maritime intelligence by using LLMs as one-time teachers rather than using them directly for inference. Our method transforms 3.2 billion Automatic Identification System (AIS) vessel tracking records into 21,543 synthetic question and answer pairs through multi-model generation (GPT-4o and o3-mini), preventing overfitting and ensuring accurate reasoning. The resulting fine-tuned Qwen2.5-7B model achieves 75% accuracy on maritime tasks, while being substantially cheaper than using a larger model for inference. We show that smaller, cheaper models, when fine-tuned properly, can provide accuracy similar to that of larger models that are prohibitively expensive. Our work contributes to the growing field of synthetic dataset generation for specialized AI applications and presents a highly reproducible framework for domains where manual annotation is infeasible. Beyond expanding research in the growing field of specialized small language models, our approach has immediate applications in maritime safety, security operations, and vessel traffic management systems in various industries.
[35] Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO
Francesco Pappone, Ruggero Marino Lazzaroni, Federico Califano, Niccolò Gentile, Roberto Marras
Main category: cs.CL
TL;DR: Using a small encoder transformer as semantic reward model within GRPO framework improves LLM explanation quality for medical exams, outperforming standard SFT baselines.
Details
Motivation: LLMs struggle to align outputs with complex qualitative goals like pedagogical soundness. Standard RL techniques rely on slow LLM-as-judge evaluations or brittle keyword metrics that fail to capture semantic quality.
Method: Introduce small efficient encoder-only transformer as semantic reward model within GRPO framework. Uses cosine similarity between generated explanations and ground-truth references to provide dense semantic rewards. Applied to Italian medical-school entrance exam training after domain-adaptive CPT and SFT.
Result: GRPO with semantic reward significantly improves explanation faithfulness and clarity over strong SFT baseline.
Conclusion: Lightweight encoder models are powerful for nuanced reward shaping in complex generation tasks, enabling semantically rich guidance beyond factual correctness.
Abstract: While Large Language Models (LLMs) excel at generating human-like text, aligning their outputs with complex, qualitative goals like pedagogical soundness remains a significant challenge. Standard reinforcement learning techniques often rely on slow and expensive LLM-as-a-judge evaluations or on brittle, keyword-based metrics like ROUGE, which fail to capture the semantic essence of a high-quality explanation. In this work, we introduce a novel approach to reward shaping within the Group Relative Policy Optimisation (GRPO) framework. Our central contribution is the use of a small, efficient encoder-only transformer as a semantic reward model. This model provides a dense, semantically rich reward signal based on the cosine similarity between a generated explanation and a ground-truth reference, guiding the policy towards explanations that are not just factually correct but also structurally and conceptually aligned with expert reasoning. We apply this method to the task of training a model for the Italian medical-school entrance examinations, following standard domain-adaptive continued pre-training (CPT) and supervised fine-tuning (SFT). Our results demonstrate that GRPO with our proposed semantic reward significantly improves explanation faithfulness and clarity over a strong SFT baseline, showcasing the power of using lightweight encoder models for nuanced reward shaping in complex generation tasks.
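The reward described above can be sketched in a few lines. The example assumes the encoder has already mapped each explanation to a fixed-size vector (the toy vectors below stand in for real encoder outputs), scores each sampled explanation by cosine similarity to the reference, and then standardizes rewards within the sampled group, as GRPO-style methods commonly do. The function names, the population standard deviation, and the `eps` constant are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_rewards(reference_vec, candidate_vecs):
    """One reward per sampled explanation: similarity to the reference."""
    return [cosine_similarity(vec, reference_vec) for vec in candidate_vecs]


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (mean 0, unit scale)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    reference = [1.0, 0.0]                      # toy reference embedding
    group = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]  # toy sampled explanations
    rewards = semantic_rewards(reference, group)
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because the reward is a continuous similarity rather than a pass/fail check, every sampled explanation receives a graded signal, which is the "dense" property the abstract emphasizes.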
[36] Empowering LLMs with Parameterized Skills for Adversarial Long-Horizon Planning
Sijia Cui, Shuai Xu, Aiyao He, Yanna Wang, Bo Xu
Main category: cs.CL
TL;DR: PLAP framework combines LLM planning with parameterized skills to improve agent grounding in long-horizon environments, achieving state-of-the-art performance in MicroRTS game.
Details
Motivation: Existing LLM-based agents struggle with reliable low-level action generation and rely heavily on expert experience for high-level task translation in complex adversarial environments.
Method: Plan with Language, Act with Parameter (PLAP) framework with three components: skill library of parameterized skills, LLM-powered skill planner, and skill executor that converts parameterized skills to executable actions.
Result: GPT-4o-driven PLAP outperformed 80% of baseline agents in zero-shot setting. Qwen2-72B-driven PLAP with few-shot examples surpassed top-tier scripted agent CoacAI. Comprehensive evaluation of 8 LLMs with released leaderboard.
Conclusion: PLAP effectively addresses LLM grounding challenges in long-horizon environments by combining language planning with parameterized action execution, demonstrating strong performance in complex real-time strategy games.
Abstract: Recent advancements in Large Language Models (LLMs) have led to the development of LLM-based AI agents. A key challenge is the creation of agents that can effectively ground themselves in complex, adversarial long-horizon environments. Existing methods mainly focus on (1) using LLMs as policies to interact with the environment through generating low-level feasible actions, and (2) utilizing LLMs to generate high-level tasks or language guides to stimulate action generation. However, the former struggles to generate reliable actions, while the latter relies heavily on expert experience to translate high-level tasks into specific action sequences. To address these challenges, we introduce the Plan with Language, Act with Parameter (PLAP) planning framework that facilitates the grounding of LLM-based agents in long-horizon environments. The PLAP method comprises three key components: (1) a skill library containing environment-specific parameterized skills, (2) a skill planner powered by LLMs, and (3) a skill executor converting the parameterized skills into executable action sequences. We implement PLAP in MicroRTS, a long-horizon real-time strategy game that provides an unfamiliar and challenging environment for LLMs. The experimental results demonstrate the effectiveness of PLAP. In particular, GPT-4o-driven PLAP in a zero-shot setting outperforms 80% of baseline agents, and Qwen2-72B-driven PLAP, with carefully crafted few-shot examples, surpasses the top-tier scripted agent, CoacAI. Additionally, we design comprehensive evaluation metrics and test 6 closed-source and 2 open-source LLMs within the PLAP framework, ultimately releasing an LLM leaderboard ranking long-horizon skill planning ability. Our code is available at https://github.com/AI-Research-TeamX/PLAP.
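The split between a skill library, a planner, and an executor can be illustrated with a toy example. Everything below is hypothetical: the skill names (`harvest`, `attack`), their parameters, and the action strings are invented for illustration and are not taken from the PLAP code base or from MicroRTS. The plan is hard-coded where an LLM planner would normally produce it.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """A planner-level instruction: a skill name plus its parameters."""
    name: str
    params: dict = field(default_factory=dict)


def harvest(params):
    # Hypothetical expansion of a 'harvest' skill into low-level actions.
    return [
        f"move worker to {params['resource']}",
        "gather",
        f"return to {params['base']}",
    ]


def attack(params):
    # Hypothetical expansion of an 'attack' skill into low-level actions.
    return [
        f"move {params['unit']} to {params['target']}",
        f"attack {params['target']}",
    ]


# Skill library: maps a skill name to the function that expands it.
SKILL_LIBRARY = {"harvest": harvest, "attack": attack}


def execute(plan):
    """Skill executor: turn a list of parameterized skills into actions."""
    actions = []
    for skill in plan:
        if skill.name not in SKILL_LIBRARY:
            raise KeyError(f"unknown skill: {skill.name}")
        actions.extend(SKILL_LIBRARY[skill.name](skill.params))
    return actions


if __name__ == "__main__":
    # Stand-in for planner output; an LLM would generate this in PLAP.
    plan = [
        Skill("harvest", {"resource": "mine", "base": "base"}),
        Skill("attack", {"unit": "soldier", "target": "enemy base"}),
    ]
    for action in execute(plan):
        print(action)
```

The design point is that the planner only has to name a skill and fill in its parameters, while the executor owns the details of producing a valid action sequence.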
[37] LLM Hallucination Detection: A Fast Fourier Transform Method Based on Hidden Layer Temporal Signals
Jinxin Li, Gang Tu, ShengYu Cheng, Junjie Hu, Jinting Wang, Rui Chen, Zhilong Zhou, Dongbo Shan
Main category: cs.CL
TL;DR: HSAD is a novel hallucination detection framework that analyzes temporal dynamics of hidden representations using frequency-domain analysis, achieving over 10% improvement over state-of-the-art methods.
Details
Motivation: Existing hallucination detection methods are limited: factuality checking is constrained by external knowledge coverage, and static hidden-state analysis fails to capture reasoning dynamics, limiting their effectiveness and robustness.
Method: HSAD models temporal dynamics of hidden representations during autoregressive generation by constructing hidden-layer signals, applying FFT for frequency-domain analysis, extracting strongest non-DC frequency components as spectral features, and identifying optimal observation points.
Result: Across multiple benchmarks including TruthfulQA, HSAD achieves over 10 percentage points improvement compared to prior state-of-the-art methods.
Conclusion: HSAD establishes a new paradigm for robust hallucination detection in LLMs by integrating reasoning-process modeling with frequency-domain analysis.
Abstract: Hallucination remains a critical barrier for deploying large language models (LLMs) in reliability-sensitive applications. Existing detection methods largely fall into two categories: factuality checking, which is fundamentally constrained by external knowledge coverage, and static hidden-state analysis, which fails to capture deviations in reasoning dynamics. As a result, their effectiveness and robustness remain limited. We propose HSAD (Hidden Signal Analysis-based Detection), a novel hallucination detection framework that models the temporal dynamics of hidden representations during autoregressive generation. HSAD constructs hidden-layer signals by sampling activations across layers, applies Fast Fourier Transform (FFT) to obtain frequency-domain representations, and extracts the strongest non-DC frequency component as spectral features. Furthermore, by leveraging the autoregressive nature of LLMs, HSAD identifies optimal observation points for effective and reliable detection. Across multiple benchmarks, including TruthfulQA, HSAD achieves over 10 percentage points improvement compared to prior state-of-the-art methods. By integrating reasoning-process modeling with frequency-domain analysis, HSAD establishes a new paradigm for robust hallucination detection in LLMs.
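The spectral feature described above, the strongest non-DC frequency component of a one-dimensional signal, can be sketched as follows. The example assumes the hidden-layer signal has already been reduced to a list of real numbers; how HSAD samples activations to build that signal is not detailed in the abstract. A plain discrete Fourier transform is used here so the example needs only the standard library; it computes the same spectrum as an FFT, just more slowly, and a library routine such as `numpy.fft.rfft` would be the usual choice in practice.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of the non-negative-frequency DFT bins of a real signal."""
    n = len(signal)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        magnitudes.append(abs(total))
    return magnitudes


def strongest_non_dc(signal):
    """Return (bin index, magnitude) of the largest component with k >= 1.

    Bin 0 is the DC term (the signal's sum) and is skipped on purpose.
    """
    if len(signal) < 2:
        raise ValueError("signal needs at least two samples")
    magnitudes = dft_magnitudes(signal)
    k = max(range(1, len(magnitudes)), key=lambda i: magnitudes[i])
    return k, magnitudes[k]


if __name__ == "__main__":
    # Toy signal: constant offset plus a sinusoid with 3 cycles per window.
    n = 32
    toy = [5.0 + math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
    print(strongest_non_dc(toy))
```

Skipping bin 0 matters because the DC term reflects only the average activation level, which would otherwise dominate the spectrum and hide the oscillatory structure the feature is meant to capture.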
[38] The Few-shot Dilemma: Over-prompting Large Language Models
Yongjian Tang, Doruk Tuncel, Christian Koerner, Thomas Runkler
Main category: cs.CL
TL;DR: Over-prompting degrades LLM performance when too many examples are used. Optimal few-shot example quantities vary by LLM, with TF-IDF selection achieving best results.
Details
Motivation: To investigate the over-prompting phenomenon where excessive examples in prompts diminish LLM performance, challenging conventional few-shot learning wisdom.
Method: Used three few-shot selection methods (random sampling, semantic embedding, TF-IDF) across multiple LLMs. Gradually increased example counts with TF-IDF-selected stratified examples on software requirement datasets.
Result: Excessive domain-specific examples paradoxically degrade performance in some LLMs. Combined TF-IDF approach achieved superior performance with fewer examples, surpassing state-of-the-art by 1% in requirement classification.
Conclusion: Optimal few-shot example quantity varies per LLM. Careful selection and quantity control can avoid over-prompting and achieve better performance than simply adding more examples.
Abstract: Over-prompting, a phenomenon where excessive examples in prompts lead to diminished performance in Large Language Models (LLMs), challenges the conventional wisdom about in-context few-shot learning. To investigate this few-shot dilemma, we outline a prompting framework that leverages three standard few-shot selection methods (random sampling, semantic embedding, and TF-IDF vectors) and evaluate these methods across multiple LLMs, including GPT-4o, GPT-3.5-turbo, DeepSeek-V3, Gemma-3, LLaMA-3.1, LLaMA-3.2, and Mistral. Our experimental results reveal that incorporating excessive domain-specific examples into prompts can paradoxically degrade performance in certain LLMs, which contradicts the prior empirical conclusion that more relevant few-shot examples universally benefit LLMs. Given the trend of LLM-assisted software engineering and requirement analysis, we experiment with two real-world software requirement classification datasets. By gradually increasing the number of TF-IDF-selected and stratified few-shot examples, we identify their optimal quantity for each LLM. This combined approach achieves superior performance with fewer examples, avoiding the over-prompting problem, thus surpassing the state-of-the-art by 1% in classifying functional and non-functional requirements.
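TF-IDF-based selection of few-shot examples can be sketched without external libraries. The example below assumes whitespace tokenization, a smoothed IDF of the form log((1 + N) / (1 + df)) + 1, and cosine similarity for ranking; these are common defaults chosen for illustration, not the paper's stated configuration, and the stratification by class that the abstract mentions is omitted for brevity. The requirement sentences are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dict token -> weight) for a list of texts."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {
        tok: math.log((1 + n_docs) / (1 + df)) + 1.0
        for tok, df in doc_freq.items()
    }
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append(
            {tok: (cnt / len(toks)) * idf[tok] for tok, cnt in counts.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query, pool, k):
    """Return the k pool items most similar to the query under TF-IDF."""
    vectors = tfidf_vectors(pool + [query])
    query_vec = vectors[-1]
    ranked = sorted(
        range(len(pool)),
        key=lambda i: cosine(query_vec, vectors[i]),
        reverse=True,
    )
    return [pool[i] for i in ranked[:k]]


if __name__ == "__main__":
    pool = [
        "the system shall encrypt user passwords",
        "the interface must load within two seconds",
        "the system shall log failed login attempts",
    ]
    query = "the system shall encrypt stored passwords"
    print(select_examples(query, pool, k=2))
```

Keeping `k` as an explicit parameter reflects the paper's main point: the number of examples is something to tune per model rather than maximize.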
[39] Evaluating LLM Alignment on Personality Inference from Real-World Interview Data
Jianfeng Zhu, Julina Maharjan, Xinyu Li, Karin G. Coifman, Ruoming Jin
Main category: cs.CL
TL;DR: LLMs show limited ability (correlations <0.26) to accurately infer continuous Big Five personality traits from natural conversations, despite various prompting and fine-tuning approaches.
Details
Motivation: As LLMs are increasingly used in psychological roles like emotional support and counseling, their ability to understand human personality traits needs evaluation, particularly in realistic conversational settings rather than discrete label simulations.
Method: Created benchmark with semi-structured interview transcripts and validated Big Five scores. Tested three approaches: (1) zero-shot and chain-of-thought prompting with GPT-4.1 Mini, (2) LoRA fine-tuning of RoBERTa and Meta-LLaMA, (3) regression using static embeddings from BERT and OpenAI’s text-embedding-3-small.
Result: All Pearson correlations between model predictions and ground-truth personality traits remained below 0.26, indicating poor alignment. Chain-of-thought provided minimal improvement over zero-shot, suggesting personality inference relies more on latent semantic representation than explicit reasoning.
Conclusion: Current LLMs have limited alignment with validated psychological constructs, highlighting challenges in modeling complex human attributes. Future work should focus on trait-specific prompting, context-aware modeling, and alignment-oriented fine-tuning.
Abstract: Large Language Models (LLMs) are increasingly deployed in roles requiring nuanced psychological understanding, such as emotional support agents, counselors, and decision-making assistants. However, their ability to interpret human personality traits, a critical aspect of such applications, remains unexplored, particularly in ecologically valid conversational settings. While prior work has simulated LLM “personas” using discrete Big Five labels on social media data, the alignment of LLMs with continuous, ground-truth personality assessments derived from natural interactions is largely unexamined. To address this gap, we introduce a novel benchmark comprising semi-structured interview transcripts paired with validated continuous Big Five trait scores. Using this dataset, we systematically evaluate LLM performance across three paradigms: (1) zero-shot and chain-of-thought prompting with GPT-4.1 Mini, (2) LoRA-based fine-tuning applied to both RoBERTa and Meta-LLaMA architectures, and (3) regression using static embeddings from pretrained BERT and OpenAI’s text-embedding-3-small. Our results reveal that all Pearson correlations between model predictions and ground-truth personality traits remain below 0.26, highlighting the limited alignment of current LLMs with validated psychological constructs. Chain-of-thought prompting offers minimal gains over zero-shot, suggesting that personality inference relies more on latent semantic representation than explicit reasoning. These findings underscore the challenges of aligning LLMs with complex human attributes and motivate future work on trait-specific prompting, context-aware modeling, and alignment-oriented fine-tuning.
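The metric reported in [39] is the Pearson correlation between predicted and ground-truth trait scores. For reference, a minimal sketch of that statistic follows; the function name is an assumption and nothing here comes from the paper's code. The coefficient ranges from -1 to 1, so values below 0.26 indicate only a weak linear relationship.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```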
[40] ChartGaze: Enhancing Chart Understanding in LVLMs with Eye-Tracking Guided Attention Refinement
Ali Salamatian, Amirhossein Abaskohi, Wan-Cyuan Fan, Mir Rayat Imtiaz Hossain, Leonid Sigal, Giuseppe Carenini
Main category: cs.CL
TL;DR: ChartGaze is an eye-tracking dataset that captures human gaze patterns during chart reasoning tasks, revealing that LVLMs often diverge from human attention. The paper proposes gaze-guided attention refinement to align model attention with human fixations, improving accuracy and interpretability.
Details
Motivation: Charts are important for information communication, but LVLMs struggle with chart question answering, particularly by attending to irrelevant chart regions. Human gaze patterns could help improve model attention and reasoning.
Method: Created ChartGaze eye-tracking dataset, compared human vs model attention patterns, and proposed gaze-guided attention refinement to align image-text attention with human fixations.
Result: The approach improved answer accuracy by up to 2.56 percentage points across multiple models and enhanced attention alignment with human gaze patterns.
Conclusion: Incorporating human gaze data enhances both reasoning quality and interpretability of chart-focused LVLMs, demonstrating promise for gaze-guided attention refinement techniques.
Abstract: Charts are a crucial visual medium for communicating and representing information. While Large Vision-Language Models (LVLMs) have made progress on chart question answering (CQA), the task remains challenging, particularly when models attend to irrelevant regions of the chart. In this work, we present ChartGaze, a new eye-tracking dataset that captures human gaze patterns during chart reasoning tasks. Through a systematic comparison of human and model attention, we find that LVLMs often diverge from human gaze, leading to reduced interpretability and accuracy. To address this, we propose a gaze-guided attention refinement that aligns image-text attention with human fixations. Our approach improves both answer accuracy and attention alignment, yielding gains of up to 2.56 percentage points across multiple models. These results demonstrate the promise of incorporating human gaze to enhance both the reasoning quality and interpretability of chart-focused LVLMs.
[41] WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents
Zile Qiao, Guoxin Chen, Xuanzhong Chen, Donglei Yu, Wenbiao Yin, Xinyu Wang, Zhen Zhang, Baixuan Li, Huifeng Yin, Kuan Li, Rui Min, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Main category: cs.CL
TL;DR: WebResearcher is a novel framework for AI research agents that uses iterative deep-research paradigm and scalable data synthesis to overcome context limitations and achieve state-of-the-art performance across multiple benchmarks.
Details
Motivation: Existing mono-contextual AI research approaches suffer from context suffocation and noise contamination, limiting their ability to autonomously discover and synthesize knowledge from external sources effectively.
Method: Two key components: (1) WebResearcher - iterative deep-research paradigm formulated as Markov Decision Process with periodic consolidation into evolving reports and focused workspaces; (2) WebFrontier - scalable data synthesis engine using tool-augmented complexity escalation to generate high-quality training data.
Result: Achieves state-of-the-art performance across 6 challenging benchmarks, surpassing frontier proprietary systems. Training data from this paradigm significantly enhances tool-use capabilities even for traditional mono-contextual methods.
Conclusion: The framework successfully overcomes limitations of existing approaches, enables concurrent multi-agent exploration through parallel thinking, and demonstrates superior performance in autonomous knowledge discovery and synthesis tasks.
Abstract: Recent advances in deep-research systems have demonstrated the potential for AI agents to autonomously discover and synthesize knowledge from external sources. In this paper, we introduce WebResearcher, a novel framework for building such agents through two key components: (1) WebResearcher, an iterative deep-research paradigm that reformulates deep research as a Markov Decision Process, where agents periodically consolidate findings into evolving reports while maintaining focused workspaces, overcoming the context suffocation and noise contamination that plague existing mono-contextual approaches; and (2) WebFrontier, a scalable data synthesis engine that generates high-quality training data through tool-augmented complexity escalation, enabling systematic creation of research tasks that bridge the gap between passive knowledge recall and active knowledge construction. Notably, we find that the training data from our paradigm significantly enhances tool-use capabilities even for traditional mono-contextual methods. Furthermore, our paradigm naturally scales through parallel thinking, enabling concurrent multi-agent exploration for more comprehensive conclusions. Extensive experiments across 6 challenging benchmarks demonstrate that WebResearcher achieves state-of-the-art performance, even surpassing frontier proprietary systems.
[42] Scaling Agents via Continual Pre-training
Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang, Maojia Song, Xinyu Wang, Kuan Li, Jialong Wu, Xuanzhong Chen, Zile Qiao, Zhongwang Zhang, Huifeng Yin, Shihao Cai, Runnan Fang, Zhengwei Tao, Wenbiao Yin, Chenxiong Qian, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Main category: cs.CL
TL;DR: Agentic Continual Pre-training (Agentic CPT) is proposed to build powerful agentic foundation models, addressing optimization tensions in post-training approaches. AgentFounder-30B achieves SOTA performance on multiple benchmarks.
Details
Motivation: Current post-training approaches underperform in agentic tasks due to the absence of robust agentic foundation models, forcing models to simultaneously learn diverse behaviors and align with expert demonstrations, creating optimization tensions.
Method: Incorporating Agentic Continual Pre-training (Agentic CPT) into the deep research agents training pipeline to build powerful agentic foundational models.
Result: AgentFounder-30B achieves state-of-the-art performance on 10 benchmarks: 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE, while retaining strong tool-use ability.
Conclusion: Agentic CPT effectively addresses the limitations of post-training approaches and enables the development of powerful agentic foundation models that excel in complex problem-solving tasks.
Abstract: Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving. However, post-training approaches building upon general-purpose foundation models consistently underperform in agentic tasks, particularly in open-source implementations. We identify the root cause: the absence of robust agentic foundation models forces models during post-training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions. To this end, we are the first to propose incorporating Agentic Continual Pre-training (Agentic CPT) into the deep research agents training pipeline to build powerful agentic foundational models. Based on this approach, we develop a deep research agent model named AgentFounder. We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retaining strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.
[43] Towards General Agentic Intelligence via Environment Scaling
Runnan Fang, Shihao Cai, Baixuan Li, Jialong Wu, Guangyu Li, Wenbiao Yin, Xinyu Wang, Xiaobin Wang, Liangcai Su, Zhen Zhang, Shibin Wu, Zhengwei Tao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Main category: cs.CL
TL;DR: AgentScaler framework scales up simulated environments to train LLM agents for better function-calling capabilities through automated environment generation and two-phase fine-tuning.
Details
Motivation: Real-world applications require LLMs to have robust function-calling intelligence, which depends on training in diverse environments. Current limitations stem from insufficient environmental diversity for agent training.
Method: Automated framework for constructing heterogeneous simulated environments, plus two-phase agent fine-tuning: first building fundamental capabilities, then domain-specific specialization.
Result: AgentScaler model significantly enhances function-calling capabilities, demonstrated through extensive experiments on tau-bench, tau2-Bench, and ACEBench benchmarks.
Conclusion: Scaling environments systematically and using phased training approach effectively advances general agentic intelligence and function-calling competence in LLMs.
Abstract: Advanced agentic intelligence is a prerequisite for deploying Large Language Models in practical, real-world applications. Diverse real-world APIs demand precise, robust function-calling intelligence, which needs agents to develop these capabilities through interaction in varied environments. The breadth of function-calling competence is closely tied to the diversity of environments in which agents are trained. In this work, we scale up environments as a step towards advancing general agentic intelligence. This gives rise to two central challenges: (i) how to scale environments in a principled manner, and (ii) how to effectively train agentic capabilities from experiences derived through interactions with these environments. To address these, we design a scalable framework that automatically constructs heterogeneous environments that are fully simulated, systematically broadening the space of function-calling scenarios. We further adapt a two-phase agent fine-tuning strategy: first endowing agents with fundamental agentic capabilities, then specializing them for domain-specific contexts. Extensive experiments on agentic benchmarks, tau-bench, tau2-Bench, and ACEBench, demonstrate that our trained model, AgentScaler, significantly enhances the function-calling capability of models.
[44] WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research
Zijian Li, Xin Guan, Bo Zhang, Shen Huang, Houquan Zhou, Shaopeng Lai, Ming Yan, Yong Jiang, Pengjun Xie, Fei Huang, Jun Zhang, Jingren Zhou
Main category: cs.CL
TL;DR: WebWeaver introduces a dual-agent framework that iteratively plans and writes research reports, addressing limitations of static pipelines and one-shot generation by emulating human research processes with dynamic evidence acquisition and hierarchical writing.
Details
Motivation: Current AI approaches for open-ended deep research suffer from static pipelines that separate planning from evidence gathering, and one-shot generation that leads to long-context failure issues like information loss and hallucinations.
Method: A dual-agent framework with a planner that iteratively interleaves evidence acquisition with outline optimization, and a writer that performs hierarchical retrieval and section-by-section composition using targeted evidence from a memory bank.
Result: Achieves state-of-the-art performance across major OEDR benchmarks including DeepResearch Bench, DeepConsult, and DeepResearchGym.
Conclusion: The iterative, human-centric methodology demonstrates that adaptive planning and focused synthesis are crucial for producing high-quality, reliable, and well-structured research reports.
Abstract: This paper tackles open-ended deep research (OEDR), a complex challenge where AI agents must synthesize vast web-scale information into insightful reports. Current approaches are plagued by dual-fold limitations: static research pipelines that decouple planning from evidence acquisition and one-shot generation paradigms that easily suffer from long-context failure issues like “loss in the middle” and hallucinations. To address these challenges, we introduce WebWeaver, a novel dual-agent framework that emulates the human research process. The planner operates in a dynamic cycle, iteratively interleaving evidence acquisition with outline optimization to produce a comprehensive, source-grounded outline linking to a memory bank of evidence. The writer then executes a hierarchical retrieval and writing process, composing the report section by section. By performing targeted retrieval of only the necessary evidence from the memory bank for each part, it effectively mitigates long-context issues. Our framework establishes a new state-of-the-art across major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and DeepResearchGym. These results validate our human-centric, iterative methodology, demonstrating that adaptive planning and focused synthesis are crucial for producing high-quality, reliable, and well-structured reports.
[45] ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization
Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Minhao Cheng, Shuai Wang, Hong Cheng, Jingren Zhou
Main category: cs.CL
TL;DR: ReSum enables indefinite web exploration through periodic context summarization, overcoming LLM context window limitations in complex search tasks.
Details
Motivation: LLM-based web agents face context window constraints that limit their ability to handle complex queries requiring extensive search cycles and multiple entities.
Method: Introduces ReSum paradigm with periodic context summarization and ReSum-GRPO training method integrating GRPO with segmented trajectory training and advantage broadcasting.
Result: 4.5% average improvement over ReAct, up to 8.2% gains after training. WebResummer-30B achieves 33.3% Pass@1 on BrowseComp-zh and 18.3% on BrowseComp-en.
Conclusion: ReSum effectively addresses context limitations in web agents, enabling indefinite exploration and superior performance with minimal training data.
Abstract: Large Language Model (LLM)-based web agents demonstrate strong performance on knowledge-intensive tasks but are hindered by context window limitations in paradigms like ReAct. Complex queries involving multiple entities, intertwined relationships, and high uncertainty demand extensive search cycles that rapidly exhaust context budgets before reaching complete solutions. To overcome this challenge, we introduce ReSum, a novel paradigm that enables indefinite exploration through periodic context summarization. ReSum converts growing interaction histories into compact reasoning states, maintaining awareness of prior discoveries while bypassing context constraints. For paradigm adaptation, we propose ReSum-GRPO, integrating GRPO with segmented trajectory training and advantage broadcasting to familiarize agents with summary-conditioned reasoning. Extensive experiments on web agents of varying scales across three benchmarks demonstrate that ReSum delivers an average absolute improvement of 4.5% over ReAct, with further gains of up to 8.2% following ReSum-GRPO training. Notably, with only 1K training samples, our WebResummer-30B (a ReSum-GRPO-trained version of WebSailor-30B) achieves 33.3% Pass@1 on BrowseComp-zh and 18.3% on BrowseComp-en, surpassing existing open-source web agents.
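The periodic context summarization described in [45] can be pictured as an agent loop that compresses its history whenever a budget is exceeded. The sketch below is only an illustration under stated assumptions: `step` and `summarize` are hypothetical callables standing in for the agent policy and the summarizer, the budget is counted in whitespace-separated words rather than model tokens, and ReSum-GRPO training is not covered.

```python
def run_with_summaries(question, step, summarize, budget=200, max_steps=50):
    """Run an agent loop, replacing the history with a summary when it grows too long.

    step(question, history) -> ("answer", text) or ("observation", text)
    summarize(question, history) -> str   (both callables are hypothetical)
    """
    history = []
    for _ in range(max_steps):
        kind, text = step(question, history)
        if kind == "answer":
            return text
        history.append(text)
        # Crude size estimate: count whitespace-separated words in the history.
        if sum(len(h.split()) for h in history) > budget:
            # Restart from a compact reasoning state instead of the full log.
            history = [summarize(question, history)]
    return None  # no answer within max_steps
```

Because the history is reset to a single summary, its length stays bounded by roughly the budget plus one observation, which is what lets the loop continue past the point where a plain ReAct-style transcript would overflow the context window.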
[46] Do Natural Language Descriptions of Model Activations Convey Privileged Information?
Millicent Li, Alberto Mario Ceballos Arroyo, Giordano Rogers, Naomi Saphra, Byron C. Wallace
Main category: cs.CL
TL;DR: Activation verbalization methods using verbalizer LLMs may not provide meaningful insights into target model internals, as they often reflect the verbalizer’s parametric knowledge rather than decoding target model activations.
Details
Motivation: To critically evaluate whether activation verbalization approaches actually provide privileged knowledge about LLM internal workings or merely convey information about inputs.
Method: Evaluated popular verbalization methods across datasets from prior work, conducted controlled experiments to assess whether verbalizations reflect target model activations or verbalizer LLM knowledge.
Result: Verbalization methods succeed at benchmarks without access to target model internals, and verbalizations often reflect the parametric knowledge of the verbalizer LLM rather than the target model’s activations.
Conclusion: There is a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into LLM operations.
Abstract: Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such activation verbalization approaches actually provide privileged knowledge about the internal workings of the target model, or do they merely convey information about its inputs? We critically evaluate popular verbalization methods across datasets used in prior work and find that they succeed at benchmarks without any access to target model internals, suggesting that these datasets are not ideal for evaluating verbalization methods. We then run controlled experiments which reveal that verbalizations often reflect the parametric knowledge of the verbalizer LLM which generated them, rather than the activations of the target LLM being decoded. Taken together, our results indicate a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into the operations of LLMs.
[47] Do predictability factors towards signing avatars hold across cultures?
Abdelhadi Soudi, Manal El Hakkaoui, Kristof Van Laerhoven
Main category: cs.CL
TL;DR: Study examines factors influencing sign language users’ attitudes towards signing avatars across cultures, comparing Moroccan Sign Language users with previous research on American Sign Language users.
Details
Motivation: Avatar technology can improve accessibility for Deaf and Hard-of-Hearing sign language users, but acceptance varies and most research is conducted by non-Deaf researchers. The study aims to understand cultural differences in avatar acceptance.
Method: Designed a questionnaire to assess attitudes towards avatars among Moroccan Sign Language users. Surveyed three participant groups: Deaf (57), Hearing (20), and Hard-of-Hearing (3). Compared results with previous studies on ASL users.
Result: Collected data from 80 participants across different hearing status groups. Results were compared with findings from other relevant studies to identify cultural and experiential differences in avatar acceptance.
Conclusion: The study provides cross-cultural insights into factors affecting sign language users’ acceptance of avatars, highlighting the importance of considering both intrinsic (avatar characteristics) and extrinsic (user experience, hearing status, age, fluency) factors across different cultural contexts.
Abstract: Avatar technology can offer accessibility possibilities and improve Deaf and Hard-of-Hearing sign language users' access to communication, education and services, such as the healthcare system. However, sign language users' acceptance of signing avatars, as well as their attitudes towards them, varies and depends on many factors. Furthermore, research on avatar technology is mostly done by researchers who are not Deaf. The study examines the extent to which intrinsic or extrinsic factors contribute to predicting attitudes towards avatars across cultures. Intrinsic factors include the characteristics of the avatar, such as appearance, movements and facial expressions. Extrinsic factors include users' technology experience, their hearing status, age and their sign language fluency. This work attempts to answer questions such as: if lower attitude ratings are related to poor technology experience among ASL users, is that also true for Moroccan Sign Language (MSL) users? For the purposes of the study, we designed a questionnaire to understand MSL users' attitudes towards avatars. Three groups of participants were surveyed: Deaf (57), Hearing (20) and Hard-of-Hearing (3). The results of our study were then compared with those reported in other relevant studies.
[48] Emphasising Structured Information: Integrating Abstract Meaning Representation into LLMs for Enhanced Open-Domain Dialogue Evaluation
Bohao Yang, Kun Zhao, Dong Liu, Chen Tang, Liang Zhan, Chenghua Lin
Main category: cs.CL
TL;DR: A novel framework combining AMR-enhanced domain-specific language models with LLMs for robust open-domain dialogue evaluation, achieving superior performance against adversarial negative responses and strong correlation with human judgments.
Details
Motivation: Traditional dialogue evaluation metrics struggle with adversarial negative responses that have high lexical overlap but semantic incongruity, leading to low correlation with human judgments. LLMs also face challenges in handling such cases.
Method: Integrates Abstract Meaning Representation (AMR) enhanced domain-specific language models with LLMs. Uses gating mechanism for semantic representation learning and incorporates both SLM predictions and AMR knowledge into LLM prompts.
Result: Achieves superior performance compared to state-of-the-art baselines, with AMR graph information contributing substantially to improvements. Strong correlations with human judgments across multiple datasets.
Conclusion: Establishes a new benchmark for dialogue evaluation by effectively handling adversarial negative examples through semantic representation enhancement and LLM integration.
Abstract: Automatic open-domain dialogue evaluation has attracted increasing attention, yet remains challenging due to the complexity of assessing response appropriateness. Traditional evaluation metrics, typically trained with true positive and randomly selected negative responses, tend to assign higher scores to responses that share greater content similarity with contexts. However, adversarial negative responses, despite possessing high lexical overlap with contexts, can be semantically incongruous. Consequently, existing metrics struggle to effectively evaluate such responses, resulting in low correlations with human judgments. While recent studies have demonstrated the effectiveness of Large Language Models (LLMs) for open-domain dialogue evaluation, they still face challenges in handling adversarial negative examples. We propose a novel evaluation framework that integrates Abstract Meaning Representation (AMR) enhanced domain-specific language models (SLMs) with LLMs. Our SLMs explicitly incorporate AMR graph information through a gating mechanism for enhanced semantic representation learning, while both SLM predictions and AMR knowledge are integrated into LLM prompts for robust evaluation. Extensive experiments on open-domain dialogue evaluation tasks demonstrate the superiority of our method compared to state-of-the-art baselines. Our comprehensive ablation studies reveal that AMR graph information contributes substantially more to performance improvements. Our framework achieves strong correlations with human judgments across multiple datasets, establishing a new benchmark for dialogue evaluation. Our code and data are publicly available.
[49] JoPA: Explaining Large Language Model’s Generation via Joint Prompt Attribution
Yurui Chang, Bochuan Cao, Yujia Wang, Jinghui Chen, Lu Lin
Main category: cs.CL
TL;DR: JoPA is a counterfactual explanation framework that identifies how combinations of prompt texts collaboratively influence LLM generation outputs through combinatorial optimization.
Details
Motivation: Existing prompt explanation methods are limited to classification tasks or treat input texts independently, failing to capture the combinatorial effects of prompts on complete language generation.
Method: Formulates prompt attribution as a combinatorial optimization problem and uses a probabilistic algorithm to search for causal input combinations in discrete space.
Result: The framework demonstrates both faithfulness and efficiency in explaining how specific prompt text combinations influence LLM generation outputs.
Conclusion: JoPA provides an effective approach for understanding the collaborative effects of prompt texts on LLM generation, addressing limitations of existing explanation methods.
Abstract: Large Language Models (LLMs) have demonstrated impressive performance in complex text generation tasks. However, the contribution of the input prompt to the generated content still remains obscure to humans, underscoring the necessity of understanding the causality between input and output pairs. Existing works for providing prompt-specific explanation often confine model output to be classification or next-word prediction. Few initial attempts aiming to explain the entire language generation often treat input prompt texts independently, ignoring their combinatorial effects on the follow-up generation. In this study, we introduce a counterfactual explanation framework based on Joint Prompt Attribution, JoPA, which aims to explain how a few prompt texts collaboratively influence the LLM’s complete generation. Particularly, we formulate the task of prompt attribution for generation interpretation as a combinatorial optimization problem, and introduce a probabilistic algorithm to search for the causal input combination in the discrete space. We define and utilize multiple metrics to evaluate the produced explanations, demonstrating both the faithfulness and efficiency of our framework.
[50] Cutting Through the Noise: Boosting LLM Performance on Math Word Problems
Ujjwala Anantheswaran, Himanshu Gupta, Kevin Scaria, Shreyas Verma, Chitta Baral, Swaroop Mishra
Main category: cs.CL
TL;DR: LLMs struggle with math word problems containing irrelevant information. The paper proposes a prompting framework to create adversarial MWPs with irrelevant variables, shows LLMs suffer ~26% performance drop on such problems, and demonstrates that fine-tuning on adversarial samples improves robustness by ~8%.
Details
Motivation: Large Language Models excel at math word problems but perform poorly on real-world problems containing irrelevant information, highlighting a need for improved robustness to numerical noise and distraction.
Method: Proposed a prompting framework to generate adversarial MWPs with irrelevant variables, created PROBLEMATHIC dataset, conducted experiments with LLMs, and fine-tuned models (Llama-2, Mistral) on adversarial samples. Also introduced GSM-8K-Adv benchmark.
Result: LLMs showed ~26% average performance drop on adversarial MWPs due to numerical noise distraction. Fine-tuning on adversarial samples improved performance by ~8% on adversarial problems. On GSM-8K-Adv benchmark, performance dropped by up to 6%.
Conclusion: LLMs are susceptible to irrelevant information in math problems, but adversarial training can improve robustness. The proposed prompting framework effectively creates challenging adversarial variants that reveal model weaknesses and enable better training.
Abstract: Large Language Models (LLMs) excel at various tasks, including solving math word problems (MWPs), but struggle with real-world problems containing irrelevant information. To address this, we propose a prompting framework that generates adversarial variants of MWPs by adding irrelevant variables. We introduce a dataset, PROBLEMATHIC, containing both adversarial and non-adversarial MWPs. Our experiments reveal that LLMs are susceptible to distraction by numerical noise, resulting in an average relative performance drop of ~26% on adversarial MWPs. To mitigate this, we fine-tune LLMs (Llama-2, Mistral) on the adversarial samples from our dataset. Fine-tuning on adversarial training instances improves performance on adversarial MWPs by ~8%, indicating increased robustness to noise and improved ability to identify relevant data for reasoning. Finally, to assess the generalizability of our prompting framework, we introduce GSM-8K-Adv, an adversarial variant of the GSM-8K benchmark. LLMs continue to struggle when faced with adversarial information, reducing performance by up to 6%.
[51] Context-Aware Membership Inference Attacks against Pre-trained Large Language Models
Hongyan Chang, Ali Shahin Shamsabadi, Kleomenis Katevas, Hamed Haddadi, Reza Shokri
Main category: cs.CL
TL;DR: Novel membership inference attack method for LLMs that analyzes perplexity dynamics of subsequences, outperforming previous classification-based approaches.
Details
Motivation: Existing membership inference attacks designed for classification models fail on large language models because they ignore the generative nature and token sequence dynamics of LLMs.
Method: Adapts MIA statistical tests to analyze the perplexity dynamics of subsequences within data points in pre-trained LLMs.
Result: Significantly outperforms prior approaches and reveals context-dependent memorization patterns in pre-trained LLMs.
Conclusion: The proposed method successfully addresses the limitations of previous MIAs for LLMs by considering the generative nature and subsequence perplexity dynamics.
Abstract: Membership Inference Attacks (MIAs) on pre-trained Large Language Models (LLMs) aim at determining if a data point was part of the model’s training set. Prior MIAs that are built for classification models fail at LLMs, due to ignoring the generative nature of LLMs across token sequences. In this paper, we present a novel attack on pre-trained LLMs that adapts MIA statistical tests to the perplexity dynamics of subsequences within a data point. Our method significantly outperforms prior approaches, revealing context-dependent memorization patterns in pre-trained LLMs.
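The summary does not spell out the statistical test itself, so the following is only a minimal sketch of the quantity it builds on: perplexity computed over sliding subsequences of a single data point. The function name, the window size, and the example log-probabilities are assumptions made for illustration; they are not taken from the paper.

```python
import math
from typing import List


def subsequence_perplexities(token_logprobs: List[float], window: int) -> List[float]:
    """Perplexity of every contiguous window of `window` tokens.

    `token_logprobs` holds the natural-log probabilities a model assigned to
    each token of one data point (assumed to be computed elsewhere).
    """
    if window <= 0 or window > len(token_logprobs):
        raise ValueError("window must be between 1 and the sequence length")
    perplexities = []
    for start in range(len(token_logprobs) - window + 1):
        chunk = token_logprobs[start:start + window]
        # Perplexity = exp of the negative mean log-probability of the window.
        perplexities.append(math.exp(-sum(chunk) / window))
    return perplexities


# Illustrative values only: the model is confident early and less so later.
logprobs = [math.log(0.9)] * 4 + [math.log(0.2)] * 4
profile = subsequence_perplexities(logprobs, window=4)
print([round(p, 2) for p in profile])
```

A membership test would then compare how such a profile evolves across the sequence for candidate members and non-members; that comparison is the paper's contribution and is not reproduced here.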
[52] Responsible AI in NLP: GUS-Net Span-Level Bias Detection Dataset and Benchmark for Generalizations, Unfairness, and Stereotypes
Maximus Powers, Shaina Raza, Alex Chang, Rehana Riaz, Umang Mavani, Harshitha Reddy Jonala, Ansh Tiwari, Hua Wei
Main category: cs.CL
TL;DR: GUS-Net Framework introduces token-level bias detection with BIO tagging for Generalizations, Unfairness, and Stereotypes, using encoder models that outperform LLMs for fine-grained bias analysis.
Details
Motivation: Sentence-level bias classification obscures specific harmful words and bias types, limiting auditability and targeted mitigation of representational harms in language technologies.
Method: Created GUS dataset with 3,739 snippets and 69k+ token-level BIO annotations using automated multi-agent pipeline with human verification. Benchmarked encoder-based models (BERT variants) vs decoder-based LLMs for multi-label token classification.
Result: Encoder-based models consistently outperform decoder-based baselines on nuanced and overlapping spans while being more computationally efficient. Achieves fine-grained token-level identification and span-level entity recognition.
Conclusion: The framework provides interpretable, fine-grained diagnostics for systematic auditing and mitigation of representational harms in real-world NLP systems.
Abstract: Representational harms in language technologies often occur in short spans within otherwise neutral text, where phrases may simultaneously convey generalizations, unfairness, or stereotypes. Framing bias detection as sentence-level classification obscures which words carry bias and what type is present, limiting both auditability and targeted mitigation. We introduce the GUS-Net Framework, comprising the GUS dataset and a multi-label token-level detector for span-level analysis of social bias. The GUS dataset contains 3,739 unique snippets across multiple domains, with over 69,000 token-level annotations. Each token is labeled using BIO tags (Begin, Inside, Outside) for three pathways of representational harm: Generalizations, Unfairness, and Stereotypes. To ensure reliable data annotation, we employ an automated multi-agent pipeline that proposes candidate spans which are subsequently verified and corrected by human experts. We formulate bias detection as multi-label token-level classification and benchmark both encoder-based models (e.g., BERT family variants) and decoder-based large language models (LLMs). Our evaluations cover token-level identification and span-level entity recognition on our test set, and out-of-distribution generalization. Empirical results show that encoder-based models consistently outperform decoder-based baselines on nuanced and overlapping spans while being more computationally efficient. The framework delivers interpretable, fine-grained diagnostics that enable systematic auditing and mitigation of representational harms in real-world NLP systems.
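To make the multi-label BIO scheme concrete, here is a small sketch that turns span annotations into one BIO tag sequence per bias type, so that a single token can belong to overlapping spans of different types. The label abbreviations (`GEN`, `UNFAIR`, `STEREO`), the function name, and the span format are hypothetical; the dataset's actual encoding may differ.

```python
from typing import Dict, List, Tuple

# Hypothetical short names for the three bias types named in the abstract.
LABELS = ("GEN", "UNFAIR", "STEREO")


def spans_to_bio(n_tokens: int, spans: Dict[str, List[Tuple[int, int]]]) -> Dict[str, List[str]]:
    """Build one BIO tag sequence per label from (start, end) token spans.

    `end` is exclusive. Labels without any span stay all "O".
    """
    tags = {label: ["O"] * n_tokens for label in LABELS}
    for label, label_spans in spans.items():
        for start, end in label_spans:
            tags[label][start] = f"B-{label}"
            for i in range(start + 1, end):
                tags[label][i] = f"I-{label}"
    return tags


tokens = ["All", "teenagers", "are", "glued", "to", "phones"]
# Illustrative annotation: a generalization span nested inside a stereotype span.
annotation = {"GEN": [(0, 2)], "STEREO": [(0, 6)]}
for label, sequence in spans_to_bio(len(tokens), annotation).items():
    print(label, sequence)
```

Keeping a separate tag sequence per label is what lets overlapping spans coexist, which a single-label BIO sequence cannot express.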
[53] Can Code-Switched Texts Activate a Knowledge Switch in LLMs? A Case Study on English-Korean Code-Switching
Seoyeon Kim, Huiseo Kim, Chanjun Park, Jinyoung Yeo, Dongha Lee
Main category: cs.CL
TL;DR: Code-switching (mixing English with Korean) helps activate language-specific knowledge in multilingual LLMs, improving performance on low-resource language tasks compared to using English alone.
Details
Motivation: LLMs are English-centric due to training data dominance, creating challenges for low-resource languages. Code-switching may help activate language-specific knowledge that gets lost in translation.
Method: Created EnKoQA (English-Korean code-switching QA dataset), analyzed multilingual LLMs by subdividing activation into knowledge identification and leveraging processes.
Result: Code-switching faithfully activates knowledge inside LLMs, especially for language-specific domains, outperforming English-only text.
Conclusion: Code-switching shows potential for improving LLM performance on low-resource language tasks by better activating language-specific knowledge.
Abstract: Recent large language models (LLMs) demonstrate multilingual abilities, yet they are English-centric due to dominance of English in training corpora. The limited resource for low-resource languages remains a crucial challenge. Code-switching (CS), a phenomenon where multilingual speakers alternate between languages in a discourse, can convey subtle cultural and linguistic nuances that can be otherwise lost in translation and elicits language-specific knowledge in human communications. In light of this, we investigate whether code-switching can activate, or identify and leverage knowledge for reasoning when LLMs solve low-resource language tasks. To facilitate the research, we first present EnKoQA, a synthetic English-Korean CS question-answering dataset. We provide comprehensive analysis on a variety of multilingual LLMs by subdividing activation process into knowledge identification and knowledge leveraging. Our results demonstrate that compared to English text, CS can faithfully activate knowledge inside LLMs especially on language-specific domains, suggesting the potential of code-switching on low-resource language tasks.
[54] Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs
Mohsinul Kabir, Ajwad Abrar, Sophia Ananiadou
Main category: cs.CL
TL;DR: Closed-style multiple-choice surveys are inadequate for evaluating cultural alignment in LLMs. Unconstrained settings reveal stronger cultural alignment and expose inconsistencies in forced-choice evaluations.
Details
Motivation: To challenge the constrained evaluation paradigm of using closed-style multiple-choice surveys for assessing cultural alignment in Large Language Models, and explore more realistic, unconstrained approaches.
Method: Used World Values Survey (WVS) and Hofstede Cultural Dimensions as case studies, comparing cultural alignment in constrained vs unconstrained settings, and testing sensitivity to choice reordering.
Result: LLMs exhibit stronger cultural alignment in less constrained settings without forced responses. Even minor changes like reordering survey choices lead to inconsistent outputs, revealing limitations of closed-style evaluations.
Conclusion: Advocates for more robust and flexible evaluation frameworks focusing on specific cultural proxies, enabling more nuanced and accurate assessments of cultural alignment in LLMs.
Abstract: A large number of studies rely on closed-style multiple-choice surveys to evaluate cultural alignment in Large Language Models (LLMs). In this work, we challenge this constrained evaluation paradigm and explore more realistic, unconstrained approaches. Using the World Values Survey (WVS) and Hofstede Cultural Dimensions as case studies, we demonstrate that LLMs exhibit stronger cultural alignment in less constrained settings, where responses are not forced. Additionally, we show that even minor changes, such as reordering survey choices, lead to inconsistent outputs, exposing the limitations of closed-style evaluations. Our findings advocate for more robust and flexible evaluation frameworks that focus on specific cultural proxies, encouraging more nuanced and accurate assessments of cultural alignment in LLMs.
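The reordering sensitivity described above can be probed with a very small harness. The sketch below assumes a hypothetical callable `ask(question, options)` that returns the text of the option a model picks; it is not the authors' evaluation code, and it enumerates all permutations, which is only practical for short option lists.

```python
import itertools
from typing import Callable, List, Sequence

# Hypothetical interface: returns the text of the chosen option.
AskFn = Callable[[str, List[str]], str]


def consistent_under_reordering(ask: AskFn, question: str, options: Sequence[str]) -> bool:
    """True if the same option text is chosen for every ordering of the options."""
    chosen = {ask(question, list(perm)) for perm in itertools.permutations(options)}
    return len(chosen) == 1


def always_first(question: str, options: List[str]) -> str:
    """Toy respondent with pure position bias: always picks the first option."""
    return options[0]


def alphabetical(question: str, options: List[str]) -> str:
    """Toy respondent whose choice depends only on option content."""
    return min(options)


scale = ["Agree", "Neutral", "Disagree"]
print(consistent_under_reordering(always_first, "Is family important?", scale))
print(consistent_under_reordering(alphabetical, "Is family important?", scale))
```

The two toy respondents stand in for a position-biased and a content-driven model; a real study would replace them with calls to an LLM.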
[55] TokenSkip: Controllable Chain-of-Thought Compression in LLMs
Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, Wenjie Li
Main category: cs.CL
TL;DR: TokenSkip is a method that selectively skips less important tokens in Chain-of-Thought reasoning to reduce inference latency while maintaining reasoning performance.
Details
Motivation: Longer Chain-of-Thought sequences improve LLM reasoning but cause linear increases in inference latency, negatively impacting user experience, particularly when CoT exceeds 10,000 tokens.
Method: Analyze semantic importance of tokens in CoT outputs and develop TokenSkip approach that enables LLMs to selectively skip less important tokens for controllable CoT compression.
Result: On GSM8K with Qwen2.5-14B-Instruct, reduces reasoning tokens by 40% (from 313 to 181) with less than a 0.4% performance drop. Experiments across various models and tasks show reduced CoT token usage while preserving strong reasoning performance.
Conclusion: TokenSkip effectively compresses CoT outputs while preserving reasoning capabilities, addressing the inference latency problem in long CoT sequences.
Abstract: Chain-of-Thought (CoT) has been proven effective in enhancing the reasoning capabilities of large language models (LLMs). Recent advancements, such as OpenAI’s o1 and DeepSeek-R1, suggest that scaling up the length of CoT sequences during inference could further boost LLM reasoning performance. However, due to the autoregressive nature of LLM decoding, longer CoT outputs lead to a linear increase in inference latency, adversely affecting user experience, particularly when the CoT exceeds 10,000 tokens. To address this limitation, we analyze the semantic importance of tokens within CoT outputs and reveal that their contributions to reasoning vary. Building on this insight, we propose TokenSkip, a simple yet effective approach that enables LLMs to selectively skip less important tokens, allowing for controllable CoT compression. Extensive experiments across various models and tasks demonstrate the effectiveness of TokenSkip in reducing CoT token usage while preserving strong reasoning performance. Notably, when applied to Qwen2.5-14B-Instruct, TokenSkip reduces reasoning tokens by 40% (from 313 to 181) on GSM8K, with less than a 0.4% performance drop. We release our code and checkpoints in https://github.com/hemingkx/TokenSkip.
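As a rough illustration of controllable compression, the sketch below prunes a token sequence to a target keep ratio using per-token importance scores. How importance is measured and how the model is trained to produce compressed CoT are not described in this summary, so the scores, the function name, and the `keep_ratio` parameter are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def compress_cot(tokens: Sequence[str], scores: Sequence[float], keep_ratio: float) -> List[str]:
    """Keep the highest-scoring tokens, preserving their original order.

    `scores` are hypothetical importance values, one per token.
    `keep_ratio` in (0, 1] controls how much of the sequence survives.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    # Rank positions by descending score; ties go to the earlier token.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    kept_positions = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept_positions]


cot = ["So", "2", "+", "3", "=", "5", "."]
importance = [0.1, 0.9, 0.8, 0.9, 0.7, 1.0, 0.05]  # illustrative values
print(compress_cot(cot, importance, keep_ratio=0.6))
```

Lowering `keep_ratio` trades more aggressive compression against the risk of dropping tokens the reasoning depends on.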
[56] How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild
Saad Obaid ul Islam, Anne Lauscher, Goran Glavaš
Main category: cs.CL
TL;DR: This paper presents a multilingual study on LLM hallucination in knowledge-intensive QA across 30 languages, developing a detection model and finding that hallucination rates don’t correlate with language digital representation but do decrease with larger model sizes.
Details
Motivation: Most hallucination research is English-centric and focuses on MT/summarization, but open information seeking is more common in practice. The authors aim to quantify LLM hallucination across languages in knowledge-intensive QA.
Method: Trained multilingual hallucination detection model using MT-translated English data, manually annotated gold data for 5 languages, built QA dataset with LLM-generated prompts and Wikipedia references across 30 languages.
Result: LLMs generate longer responses with more hallucinated tokens for higher-resource languages, but length-normalized hallucination rates show no correlation with digital representation. Smaller LLMs exhibit larger hallucination rates than larger models.
Conclusion: The study validates using silver data for hallucination estimation and provides insights into multilingual hallucination patterns, showing that model size rather than language representation affects hallucination rates.
Abstract: In the age of misinformation, hallucination – the tendency of Large Language Models (LLMs) to generate non-factual or unfaithful responses – represents the main risk for their global utility. Despite LLMs becoming increasingly multilingual, the vast majority of research on detecting and quantifying LLM hallucination is (a) English-centric and (b) focused on machine translation (MT) and summarization, tasks that are less common “in the wild” than open information seeking. In contrast, we aim to quantify the extent of LLM hallucination across languages in knowledge-intensive long-form question answering. To this end, we train a multilingual hallucination detection model and conduct a large-scale study across 30 languages and 6 open-source LLM families. We start from an English hallucination detection dataset and rely on MT to generate (noisy) training data in other languages. We also manually annotate gold data for five high-resource languages; we then demonstrate, for these languages, that the estimates of hallucination rates are similar between silver (LLM-generated) and gold test sets, validating the use of silver data for estimating hallucination rates for other languages. For the final rates estimation, we build a knowledge-intensive QA dataset for 30 languages with LLM-generated prompts and Wikipedia articles as references. We find that, while LLMs generate longer responses with more hallucinated tokens for higher-resource languages, there is no correlation between length-normalized hallucination rates of languages and their digital representation. Further, we find that smaller LLMs exhibit larger hallucination rates than larger models.
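The distinction between raw hallucinated-token counts and length-normalized rates is easy to see with a toy calculation. This sketch assumes per-token hallucination flags are already available from some detector; the function name and the example flags are illustrative and not taken from the paper.

```python
from typing import Sequence


def hallucination_rate(token_flags: Sequence[bool]) -> float:
    """Fraction of tokens in a response flagged as hallucinated."""
    if not token_flags:
        raise ValueError("response must contain at least one token")
    return sum(token_flags) / len(token_flags)


# Illustrative flags: the longer response has more hallucinated tokens in
# absolute terms, yet the same length-normalized rate as the shorter one.
short_response = [True, False, False, False]
long_response = [True, True] + [False] * 6

print(sum(short_response), hallucination_rate(short_response))
print(sum(long_response), hallucination_rate(long_response))
```

This is why a language with longer responses can show more hallucinated tokens without having a higher hallucination rate.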
[57] Robust Adaptation of Large Multimodal Models for Retrieval Augmented Hateful Meme Detection
Jingbiao Mei, Jinghong Chen, Guangyu Yang, Weizhe Lin, Bill Byrne
Main category: cs.CL
TL;DR: Proposes a robust adaptation framework for hateful meme detection that improves in-domain accuracy, cross-domain generalization, and model interpretability while preserving LMM capabilities.
Details
Motivation: Hateful memes are a significant online concern, but current Large Multimodal Models face challenges with sub-optimal performance, limited out-of-domain generalization, and limitations of supervised fine-tuning and in-context learning for this task.
Method: A robust adaptation framework for hateful meme detection that enhances both in-domain accuracy and cross-domain generalization while preserving general vision-language capabilities of LMMs.
Result: Achieves state-of-the-art performance on six meme classification datasets, outperforms larger agentic systems, shows improved robustness under adversarial attacks, and generates higher-quality rationales for explaining hateful content compared to standard SFT.
Conclusion: The proposed framework effectively addresses the limitations of current approaches to hateful meme detection, providing superior performance, better generalization, enhanced robustness, and improved interpretability through high-quality rationales.
Abstract: Hateful memes have become a significant concern on the Internet, necessitating robust automated detection systems. While Large Multimodal Models (LMMs) have shown promise in hateful meme detection, they face notable challenges like sub-optimal performance and limited out-of-domain generalization capabilities. Recent studies further reveal the limitations of both supervised fine-tuning (SFT) and in-context learning when applied to LMMs in this setting. To address these issues, we propose a robust adaptation framework for hateful meme detection that enhances in-domain accuracy and cross-domain generalization while preserving the general vision-language capabilities of LMMs. Analysis reveals that our approach achieves improved robustness under adversarial attacks compared to SFT models. Experiments on six meme classification datasets show that our approach achieves state-of-the-art performance, outperforming larger agentic systems. Moreover, our method generates higher-quality rationales for explaining hateful content compared to standard SFT, enhancing model interpretability. Code available at https://github.com/JingbiaoMei/RGCL
[58] Teaching Your Models to Understand Code via Focal Preference Alignment
Jie Wu, Haoling Li, Xin Zhang, Jianwen Luo, Yangyu Huang, Ruihang Chu, Yujiu Yang, Scarlett Li
Main category: cs.CL
TL;DR: Target-DPO is a new preference alignment framework that improves Code LLMs by explicitly locating error regions and aligning corresponding tokens, mimicking human iterative debugging rather than treating entire code blocks as positive/negative.
Details
Motivation: Existing preference learning approaches for Code LLMs lack granularity: they align entire failing code blocks rather than pinpointing specific errors, preventing models from learning meaningful error-correction patterns.
Method: Target-DPO uses a tailored DPO algorithm to explicitly locate error regions and align corresponding tokens. The approach is supported by the CodeFlow dataset with iteratively refined samples that capture error corrections until tests pass.
Result: Extensive experiments show Code LLMs equipped with Target-DPO achieve significant performance gains in code generation and improve on challenging tasks like BigCodeBench, with fewer errors.
Conclusion: Target-DPO provides a more granular approach to preference alignment for Code LLMs by focusing on specific error regions rather than entire code blocks, leading to better error-correction learning and improved performance.
Abstract: Preference learning extends the performance of Code LLMs beyond traditional supervised fine-tuning by leveraging relative quality comparisons. In existing approaches, a set of n candidate solutions is evaluated based on test case success rates, with the candidate demonstrating a higher pass rate being labeled as positive and its counterpart with a lower pass rate as negative. However, because this approach aligns entire failing code blocks rather than pinpointing specific errors, it lacks the granularity necessary to capture meaningful error-correction relationships. As a result, the model is unable to learn more informative error-correction patterns. To address these issues, we propose Target-DPO, a new preference alignment framework that mimics human iterative debugging to refine Code LLMs. Target-DPO explicitly locates error regions and aligns the corresponding tokens via a tailored DPO algorithm. To facilitate it, we introduce the CodeFlow dataset, where samples are iteratively refined until passing tests, with modifications capturing error corrections. Extensive experiments show that a diverse suite of Code LLMs equipped with Target-DPO achieves significant performance gains in code generation and improves on challenging tasks like BigCodeBench. In-depth analysis reveals that Target-DPO yields fewer errors. Code, model and datasets are in: https://github.com/JieWu02/Target-DPO.
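One simple way to picture "locating error regions" is to diff a failing solution against its corrected version and mark the tokens that changed. The sketch below does this with Python's standard `difflib`; using a token-level diff is an assumption made for illustration, since the summary does not say how Target-DPO localizes errors, and the function name is hypothetical.

```python
import difflib
from typing import List, Sequence


def error_region_mask(failing_tokens: Sequence[str], fixed_tokens: Sequence[str]) -> List[bool]:
    """Mark tokens of the failing code that were replaced or deleted in the fix.

    Pure insertions in the fixed version have no counterpart in the failing
    code, so they leave the mask unchanged.
    """
    mask = [False] * len(failing_tokens)
    matcher = difflib.SequenceMatcher(a=list(failing_tokens), b=list(fixed_tokens), autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            for i in range(i1, i2):
                mask[i] = True
    return mask


failing = "return a - b".split()
fixed = "return a + b".split()
print(list(zip(failing, error_region_mask(failing, fixed))))
```

A mask like this could restrict a preference loss to the tokens that actually differ, which is the intuition behind aligning only the error region rather than the whole block.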
[59] Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
Simon A. Aytes, Jinheon Baek, Sung Ju Hwang
Main category: cs.CL
TL;DR: Sketch-of-Thought (SoT) is a prompting framework that reduces token usage by up to 84% while maintaining reasoning accuracy, using cognitively inspired paradigms like Conceptual Chaining, Chunked Symbolism, and Expert Lexicons.
Details
Motivation: Chain-of-Thought prompting produces verbose intermediate outputs that increase computational overhead, creating a need for more efficient reasoning methods.
Method: SoT integrates three cognitively inspired reasoning paradigms (Conceptual Chaining, Chunked Symbolism, Expert Lexicons) selected dynamically by a lightweight routing model for different reasoning tasks.
Result: Across 18 reasoning datasets spanning multiple domains, languages, and modalities, SoT achieves token reductions of up to 84% with minimal accuracy loss, and even improves accuracy in mathematical and multi-hop reasoning tasks.
Conclusion: SoT provides an effective framework for reducing computational overhead in LLM reasoning while preserving or even enhancing accuracy through cognitively inspired, modular prompting approaches.
Abstract: Recent advances in large language models (LLMs) have enabled strong reasoning capabilities through Chain-of-Thought (CoT) prompting, which elicits step-by-step problem solving, but often at the cost of excessive verbosity in intermediate outputs, leading to increased computational overhead. We propose Sketch-of-Thought (SoT), a prompting framework that integrates cognitively inspired reasoning paradigms with linguistic constraints to reduce token usage while preserving reasoning accuracy. SoT is designed as a flexible, modular approach and is instantiated with three paradigms (Conceptual Chaining, Chunked Symbolism, and Expert Lexicons), each tailored to distinct reasoning tasks and selected dynamically at test-time by a lightweight routing model. Across 18 reasoning datasets spanning multiple domains, languages, and modalities, SoT achieves token reductions of up to 84% with minimal accuracy loss. In tasks such as mathematical and multi-hop reasoning, it even improves accuracy while shortening outputs.
[60] Dynamic Relation Inference via Verb Embeddings
Omri Suissa, Muhiim Ali, Ariana Azarbal, Hui Shen, Shekhar Pradhan
Main category: cs.CL
TL;DR: DRIVE method improves CLIP’s relation inference by augmenting COCO dataset with hard negatives and novel loss function, significantly boosting zero-shot relation detection accuracy.
Details
Motivation: CLIP struggles with relation inference between objects in images despite strong object-level matching capabilities, and previous linguistic supervision approaches have had limited success.
Method: Propose Dynamic Relation Inference via Verb Embeddings (DRIVE) which augments COCO dataset, fine-tunes CLIP with hard negative subject-relation-object triples, and introduces a novel loss function.
Result: Significantly improves zero-shot relation inference accuracy in both frozen and fine-tuned settings, outperforming CLIP and state-of-the-art models while generalizing well on unseen data.
Conclusion: DRIVE effectively addresses CLIP’s limitation in relation inference through targeted dataset augmentation and novel training methodology, achieving substantial performance gains.
Abstract: CLIP has demonstrated exceptional image-text matching capabilities due to its training on contrastive learning tasks. Past research has suggested that whereas CLIP effectively matches text to images when the matching can be achieved just by matching the text with the objects in the image, CLIP struggles when the matching depends on representing the relationship among the objects in the images (i.e., inferring relations). Previous attempts to address this limitation by training CLIP on relation detection datasets with only linguistic supervision have met with limited success. In this paper, we offer insights and practical methods to advance the field of relation inference from images. This paper approaches the task of creating a model that effectively detects relations among the objects in images by producing text and image embeddings that capture relationships through linguistic supervision. To this end, we propose Dynamic Relation Inference via Verb Embeddings (DRIVE), which augments the COCO dataset, fine-tunes CLIP with hard-negative subject-relation-object triples and corresponding images, and introduces a novel loss function to improve relation detection. Evaluated on multiple CLIP-based models, our method significantly improves zero-shot relation inference accuracy in both frozen and fine-tuned settings, significantly outperforming CLIP and state-of-the-art models while generalizing well on unseen data.
[61] Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors
Zhiyu Yang, Shuo Wang, Yukun Yan, Yang Deng
Main category: cs.CL
TL;DR: DSDBench is the first benchmark for evaluating LLMs on multi-hop error tracing and multi-bug detection in data science code debugging, featuring 1,117 annotated samples with realistic data science debugging tasks.
Details
Motivation: Current code generation and code repair benchmarks mainly assess syntactic and functional correctness in simple, single-error cases, leaving LLMs’ capabilities to autonomously find and fix runtime logical errors in complex data science code largely unexplored.
Method: Adapts datasets from existing data science task benchmarks (DABench and MatPlotBench) with automatically synthesized multi-hop, multi-bug code snippets and 741 cause-effect error pairs with runtime error messages.
Result: Evaluations show significant performance gaps in state-of-the-art LLMs, highlighting challenges in debugging logical runtime errors in data science code.
Conclusion: DSDBench provides a crucial resource to evaluate and improve LLMs’ debugging and reasoning capabilities, enabling more reliable AI-assisted data science in the future.
Abstract: LLMs are transforming software development, yet current code generation and code repair benchmarks mainly assess syntactic and functional correctness in simple, single-error cases. LLMs’ capabilities to autonomously find and fix runtime logical errors in complex data science code remain largely unexplored. To address this gap, we introduce DSDBench: the Data Science Debugging Benchmark, the first benchmark for systematic evaluation of LLMs on multi-hop error tracing and multi-bug detection in data science code debugging. DSDBench adapts datasets from existing data science task benchmarks, such as DABench and MatPlotBench, featuring realistic data science debugging tasks with automatically synthesized multi-hop, multi-bug code snippets. DSDBench includes 1,117 annotated samples with 741 cause-effect error pairs and runtime error messages. Evaluations of state-of-the-art LLMs on DSDBench show significant performance gaps, highlighting challenges in debugging logical runtime errors in data science code. DSDBench offers a crucial resource to evaluate and improve LLMs’ debugging and reasoning capabilities, enabling more reliable AI-assisted data science in the future. DSDBench is publicly available at github.com/KevinCL16/DSDBench.
[62] Is the Top Still Spinning? Evaluating Subjectivity in Narrative Understanding
Melanie Subbiah, Akankshya Mishra, Grace Kim, Liyan Tang, Greg Durrett, Kathleen McKeown
Main category: cs.CL
TL;DR: The paper introduces ARM (Ambiguity Rewrite Metric) to handle subjectivity in factuality judgments of ambiguous claims, replacing binary faithfulness labels with nuanced rewrite-based evaluation.
Details
Motivation: Binary judgments of claim faithfulness are unreliable for ambiguous claims where different people can reasonably interpret the same claim as either supported or unsupported based on inferences from evidence.
Method: Use LLM-generated edits of summaries to measure how much a summary needs to be edited to become unambiguous. The amount and nature of rewriting serves as an automatic evaluation metric.
Result: ARM produces a 21% absolute improvement in annotator agreement on claim faithfulness, significantly reducing subjectivity in evaluations.
Conclusion: The Ambiguity Rewrite Metric provides a richer, more reliable evaluation framework for claim faithfulness that better handles subjective interpretations compared to binary judgments.
Abstract: Determining faithfulness of a claim to a source document is an important problem across many domains. This task is generally treated as a binary judgment of whether the claim is supported or unsupported in relation to the source. In many cases, though, whether a claim is supported can be ambiguous. For instance, it may depend on making inferences from given evidence, and different people can reasonably interpret the claim as either supported or unsupported based on their agreement with those inferences. Forcing binary labels upon such claims lowers the reliability of evaluation. In this work, we reframe the task to manage the subjectivity involved with factuality judgments of ambiguous claims. We introduce LLM-generated edits of summaries as a method of providing a nuanced evaluation of claims: how much does a summary need to be edited to be unambiguous? Whether a claim gets rewritten and how much it changes can be used as an automatic evaluation metric, the Ambiguity Rewrite Metric (ARM), with a much richer feedback signal than a binary judgment of faithfulness. We focus on the area of narrative summarization as it is particularly rife with ambiguity and subjective interpretation. We show that ARM produces a 21% absolute improvement in annotator agreement on claim faithfulness, indicating that subjectivity is reduced.
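The idea of scoring a claim by how much it gets rewritten can be sketched with a plain token-level similarity measure. The exact definition of ARM is not given in this summary, so the formula below (one minus `difflib`'s similarity ratio over whitespace tokens) is a stand-in chosen for illustration, and the example sentences are invented.

```python
import difflib


def rewrite_amount(original: str, rewritten: str) -> float:
    """Share of a claim that changed in the rewrite, in [0, 1].

    0.0 means the claim was left untouched; values near 1.0 mean it was
    almost entirely rewritten. Tokenization is naive whitespace splitting.
    """
    a, b = original.split(), rewritten.split()
    similarity = difflib.SequenceMatcher(a=a, b=b, autojunk=False).ratio()
    return 1.0 - similarity


claim = "The hero escapes the dream at the end"
edited = "The film leaves open whether the hero escapes the dream"
print(round(rewrite_amount(claim, claim), 2))
print(round(rewrite_amount(claim, edited), 2))
```

A graded score of this kind carries more signal than a supported/unsupported label, which is the motivation the abstract gives for moving away from binary judgments.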
[63] Do Large Language Models Truly Grasp Addition? A Rule-Focused Diagnostic Using Two-Integer Arithmetic
Yang Yan, Yu Lu, Renjun Xu, Zhenzhong Lan
Main category: cs.CL
TL;DR: LLMs show high accuracy on basic arithmetic but fail fundamental property tests like commutativity and symbolic invariance, revealing they use pattern matching rather than true mathematical reasoning.
Details
Motivation: To determine whether LLMs truly understand fundamental arithmetic rules or just rely on pattern matching, given their impressive performance on advanced math benchmarks but occasional failures on basic arithmetic tasks.
Method: Systematically tested 12 leading LLMs on two-integer addition (0 to 2^64) by evaluating three crucial properties: commutativity, representation invariance through symbolic remapping, and consistent accuracy scaling with operand length.
Result: While models achieved high numeric accuracy (73.8-99.8%), they systematically failed diagnostics: accuracy dropped to <=7.5% with symbolic inputs, commutativity was violated in up to 20% of cases, and accuracy scaling was non-monotonic.
Conclusion: Current LLMs address elementary addition via pattern matching rather than robust rule induction, highlighting the need for new diagnostic benchmarks and innovations in model architecture and training to develop genuine mathematical reasoning capabilities.
Abstract: Large language models (LLMs) achieve impressive results on advanced mathematics benchmarks but sometimes fail on basic arithmetic tasks, raising the question of whether they have truly grasped fundamental arithmetic rules or are merely relying on pattern matching. To unravel this issue, we systematically probe LLMs’ understanding of two-integer addition (0 to $2^{64}$) by testing three crucial properties: commutativity ($A+B=B+A$), representation invariance via symbolic remapping (e.g., $7 \to Y$), and consistent accuracy scaling with operand length. Our evaluation of 12 leading LLMs reveals a stark disconnect: while models achieve high numeric accuracy (73.8-99.8%), they systematically fail these diagnostics. Specifically, accuracy plummets to <= 7.5% with symbolic inputs, commutativity is violated in up to 20% of cases, and accuracy scaling is non-monotonic. These findings demonstrate that current LLMs address elementary addition via pattern matching, not robust rule induction, motivating new diagnostic benchmarks and innovations in model architecture and training to cultivate genuine mathematical reasoning. Our dataset and generating code are available at https://github.com/kuri-leo/llm-arithmetic-diagnostic.
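Two of the diagnostics, commutativity and representation invariance, can be expressed as small checks around any "adder" callable. In the sketch below the `Adder` interface, the digit-to-symbol mapping, and the toy `digit_only_adder` are all hypothetical stand-ins; a real evaluation would wrap an LLM call instead, and this is not the authors' released code.

```python
from typing import Callable, Dict

# Hypothetical interface standing in for "ask the model to add two operands".
Adder = Callable[[str, str], str]


def is_commutative(adder: Adder, a: int, b: int) -> bool:
    """Check that swapping the operands does not change the answer."""
    return adder(str(a), str(b)) == adder(str(b), str(a))


def remap(text: str, mapping: Dict[str, str]) -> str:
    """Rewrite every character of `text` through `mapping`."""
    return "".join(mapping[ch] for ch in text)


def symbolic_answer_correct(adder: Adder, a: int, b: int, mapping: Dict[str, str]) -> bool:
    """Pose the sum in remapped symbols and decode the reply back to digits."""
    inverse = {symbol: digit for digit, symbol in mapping.items()}
    reply = adder(remap(str(a), mapping), remap(str(b), mapping))
    try:
        return remap(reply, inverse) == str(a + b)
    except KeyError:
        return False  # reply contained characters outside the symbol set


# Illustrative mapping in which 7 becomes Y, echoing the abstract's example.
MAPPING = dict(zip("0123456789", "ABCDEFGYIJ"))


def digit_only_adder(x: str, y: str) -> str:
    """Toy adder that handles digit strings but cannot read remapped symbols."""
    try:
        return str(int(x) + int(y))
    except ValueError:
        return "?"


print(is_commutative(digit_only_adder, 2**63, 17))
print(symbolic_answer_correct(digit_only_adder, 7, 5, MAPPING))
```

The toy adder passes the commutativity check yet fails once digits are replaced by symbols, mirroring in miniature the gap between numeric accuracy and symbolic accuracy reported above.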
[64] Game-RL: Synthesizing Verifiable Game Tasks at Scale to Boost VLMs General Reasoning
Jingqi Tong, Jixin Tang, Hangcheng Li, Yurong Mou, Ming Zhang, Jun Zhao, Yanbo Wen, Fan Song, Jiahao Zhan, Yuyang Lu, Chaoran Tao, Zhiyuan Guo, Jizhou Yu, Tianhao Cheng, Changhao Jiang, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Weifeng Ge, Guanhua Chen, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang
Main category: cs.CL
TL;DR: Code2Logic uses LLMs to generate verifiable game reasoning tasks from game code, creating the GameQA dataset for training VLMs. Game-RL reinforcement learning on this dataset improves VLM performance across diverse vision-language benchmarks.
Details
Motivation: Real-world vision language reasoning requires diverse tasks, but current VLMs focus on narrow domains like geometry/chart reasoning, limiting general reasoning capabilities.
Method: Propose Code2Logic approach using LLMs to synthesize verifiable game reasoning tasks by adapting game code. Create GameQA dataset with 30 games and 158 tasks. Apply Game-RL (reinforcement learning) on this dataset.
Result: VLMs trained on GameQA showed out-of-domain generalization. Qwen2.5-VL-7B improved performance by 2.33% across 7 diverse vision-language benchmarks despite training only on game tasks.
Conclusion: Code2Logic enables scalable creation of diverse reasoning tasks, and Game-RL on game-based datasets can significantly enhance VLM generalization across multiple vision-language domains.
Abstract: Real-world vision language reasoning scenarios often include diverse and complex tasks. However, vision language reinforcement learning has primarily focused on a narrow set of tasks (e.g., geometry or chart reasoning), limiting the improvement of Vision Language Models’ (VLMs) general reasoning. Therefore, we propose a novel Code2Logic approach, using Large Language Models (LLMs) to synthesize verifiable game reasoning tasks at scale via adapting game code. Using Code2Logic, we developed the GameQA dataset to train and evaluate VLMs. GameQA is verifiable and scalable, offers controllable difficulty gradation, and is diverse, with 30 games and 158 tasks. We then apply Game-RL, which is simple reinforcement learning on GameQA. Surprisingly, despite training solely on game tasks, VLMs demonstrated out-of-domain generalization: Qwen2.5-VL-7B improved performance by 2.33% across 7 diverse vision-language benchmarks. Our code, dataset and models are available at the GitHub repository.
[65] The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models
Adrian Cosma, Stefan Ruseti, Emilian Radoi, Mihai Dascalu
Main category: cs.CL
TL;DR: LLMs fail at simple character-level tasks due to tokenization limitations, but character-level reasoning emerges suddenly late in training. A lightweight architectural modification significantly improves performance while preserving subword model advantages.
Details
Motivation: Large Language Models consistently fail at simple character-level tasks like counting letters in words due to fundamental tokenization limitations, creating a need to understand and address these structural blind spots.
Method: Used 19 synthetic tasks to isolate character-level reasoning in controlled settings, analyzed the problem through concept emergence and percolation-based models, and proposed a lightweight architectural modification to improve character-level reasoning.
Result: Character-level capabilities emerge suddenly and only late in training, and the proposed architectural modification significantly improves character-level reasoning while preserving the inductive advantages of subword models.
Conclusion: The research bridges low-level perceptual gaps in tokenized language models and provides a principled framework for understanding and mitigating their structural blind spots, showing that learning character composition is not fundamentally different from learning commonsense knowledge.
Abstract: Despite their remarkable progress across diverse domains, Large Language Models (LLMs) consistently fail at simple character-level tasks, such as counting letters in words, due to a fundamental limitation: tokenization. In this work, we frame this limitation as a problem of low mutual information and analyze it in terms of concept emergence. Using a suite of 19 synthetic tasks that isolate character-level reasoning in a controlled setting, we show that such capabilities emerge suddenly and only late in training. We find that percolation-based models of concept emergence explain these patterns, suggesting that learning character composition is not fundamentally different from learning commonsense knowledge. To address this bottleneck, we propose a lightweight architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of subword models. Together, our results bridge low-level perceptual gaps in tokenized LMs and provide a principled framework for understanding and mitigating their structural blind spots. We make our code publicly available.
[66] Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering
Hwan Chang, Yumin Kim, Yonghyun Jun, Hwanhee Lee
Main category: cs.CL
TL;DR: CoPriva benchmark reveals LLMs frequently violate contextual non-disclosure policies, especially against indirect attacks, exposing critical security vulnerabilities in sensitive applications.
Details
Motivation: As LLMs are deployed in sensitive domains like enterprise and government, ensuring adherence to user-defined security policies for information non-disclosure is critical, but existing benchmarks lack evaluation of contextual security preservation.
Method: Introduces CoPriva, a large-scale benchmark dataset derived from realistic contexts with explicit policies and queries designed as direct/indirect attacks. Evaluates 10 LLMs on their ability to adhere to contextual non-disclosure policies.
Result: Most models violate user-defined policies and leak sensitive information, particularly against indirect attacks. Models struggle to incorporate policy constraints during generation but show partial ability to revise outputs when explicitly prompted.
Conclusion: Current LLMs have significant vulnerabilities in contextual security preservation, highlighting an urgent need for more robust methods to guarantee policy adherence in sensitive applications.
Abstract: As Large Language Models (LLMs) are increasingly deployed in sensitive domains such as enterprise and government, ensuring that they adhere to user-defined security policies within context is critical, especially with respect to information non-disclosure. While prior LLM studies have focused on general safety and socially sensitive data, large-scale benchmarks for contextual security preservation against attacks remain lacking. To address this, we introduce a novel large-scale benchmark dataset, CoPriva, evaluating LLM adherence to contextual non-disclosure policies in question answering. Derived from realistic contexts, our dataset includes explicit policies and queries designed as direct and challenging indirect attacks seeking prohibited information. We evaluate 10 LLMs on our benchmark and reveal a significant vulnerability: many models violate user-defined policies and leak sensitive information. This failure is particularly severe against indirect attacks, highlighting a critical gap in current LLM safety alignment for sensitive applications. Our analysis reveals that while models can often identify the correct answer to a query, they struggle to incorporate policy constraints during generation. In contrast, they exhibit a partial ability to revise outputs when explicitly prompted. Our findings underscore the urgent need for more robust methods to guarantee contextual security.
[67] HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation
Shijie Zhang, Renhao Li, Songsheng Wang, Philipp Koehn, Min Yang, Derek F. Wong
Main category: cs.CL
TL;DR: HiMATE is a hierarchical multi-agent framework that improves machine translation evaluation by leveraging MQM error typology and addressing LLM hallucinations through self-reflection and agent discussion.
Details
Motivation: Current LLM-based translation evaluation methods struggle with accurately identifying error spans and assessing severity, failing to fully exploit the fine-grained structural and semantic information in MQM hierarchy.
Method: Developed a hierarchical multi-agent system based on MQM error typology, incorporating self-reflection capabilities and agent discussion with asymmetric information to reduce hallucinations.
Result: Outperformed competitive baselines across datasets, achieving an average F1-score improvement of 89% over the best-performing baseline in error span detection and severity assessment.
Conclusion: HiMATE effectively addresses limitations of current LLM evaluation methods and provides more human-aligned machine translation evaluations through its hierarchical multi-agent approach.
Abstract: The advancement of Large Language Models (LLMs) enables flexible and interpretable automatic evaluations. In the field of machine translation evaluation, utilizing LLMs with translation error annotations based on Multidimensional Quality Metrics (MQM) yields more human-aligned judgments. However, current LLM-based evaluation methods still face challenges in accurately identifying error spans and assessing their severity. In this paper, we propose HiMATE, a Hierarchical Multi-Agent Framework for Machine Translation Evaluation. We argue that existing approaches inadequately exploit the fine-grained structural and semantic information within the MQM hierarchy. To address this, we develop a hierarchical multi-agent system grounded in the MQM error typology, enabling granular evaluation of subtype errors. Two key strategies are incorporated to further mitigate systemic hallucinations within the framework: the utilization of the model’s self-reflection capability and the facilitation of agent discussion involving asymmetric information. Empirically, HiMATE outperforms competitive baselines across different datasets in conducting human-aligned evaluations. Further analyses underscore its significant advantage in error span detection and severity assessment, achieving an average F1-score improvement of 89% over the best-performing baseline. We make our code and data publicly available at https://github.com/nlp2ct-shijie/HiMATE.
[68] From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs
Muhammad Farid Adilazuarda, Chen Cecilia Liu, Iryna Gurevych, Alham Fikri Aji
Main category: cs.CL
TL;DR: This paper investigates cultural value adaptation in LLMs, finding that relying solely on World Values Survey data homogenizes cultural norms and interferes with factual knowledge. The authors augment survey data with encyclopedic and scenario-based narratives from Wikipedia and NormAd, which consistently improve cultural distinctiveness.
Details
Motivation: Cultural value adaptation in LLMs faces challenges due to biases and limited training data. Prior work uses World Values Survey data, but it's unclear if this effectively captures cultural nuances or produces distinct cultural representations for downstream tasks.
Method: Systematically investigate WVS-based training for cultural value adaptation, then augment WVS with encyclopedic and scenario-based cultural narratives from Wikipedia and NormAd to address limitations of survey-only approaches.
Result: Relying solely on survey data homogenizes cultural norms and interferes with factual knowledge. While narratives may have variable effects on downstream tasks, they consistently improve cultural distinctiveness compared to survey data alone.
Conclusion: The work highlights the inherent complexity of aligning cultural values to guide task-specific behavior, demonstrating that augmented narrative approaches outperform survey-only methods for cultural distinctiveness.
Abstract: Adapting cultural values in Large Language Models (LLMs) presents significant challenges, particularly due to biases and limited training data. Prior work primarily aligns LLMs with different cultural values using World Values Survey (WVS) data. However, it remains unclear whether this approach effectively captures cultural nuances or produces distinct cultural representations for various downstream tasks. In this paper, we systematically investigate WVS-based training for cultural value adaptation and find that relying solely on survey data can homogenize cultural norms and interfere with factual knowledge. To investigate these issues, we augment WVS with encyclopedic and scenario-based cultural narratives from Wikipedia and NormAd. While these narratives may have variable effects on downstream tasks, they consistently improve cultural distinctiveness compared to survey data alone. Our work highlights the inherent complexity of aligning cultural values with the goal of guiding task-specific behavior. We release our code at https://github.com/faridlazuarda/from-surveys-to-narratives.
[69] Reading Between the Prompts: How Stereotypes Shape LLM’s Implicit Personalization
Vera Neplenbroek, Arianna Bisazza, Raquel Fernández
Main category: cs.CL
TL;DR: LLMs infer user demographics from conversational cues, leading to biased responses. This study shows LLMs persist in using stereotypes even when users explicitly state different identities, but proposes a mitigation method using linear probes on internal representations.
Details
Motivation: Prior research shows LLMs infer demographic information from subtle conversational cues, resulting in lower quality responses for minority groups. This work aims to systematically explore how LLMs respond to stereotypical cues and develop mitigation strategies.
Method: Used controlled synthetic conversations to analyze LLMs’ latent user representations through model internals and generated answers to targeted questions. Developed intervention using trained linear probes on internal representations to steer toward explicitly stated identities.
Result: LLMs do infer demographic attributes based on stereotypical signals, and this persists even when users explicitly identify with different demographic groups. The stereotype-driven implicit personalization can be effectively mitigated using the proposed intervention method.
Conclusion: The findings highlight the need for greater transparency and control in how LLMs represent user identity, as implicit demographic inference can lead to biased responses that persist despite explicit user statements.
Abstract: Generative Large Language Models (LLMs) infer users’ demographic information from subtle cues in the conversation – a phenomenon called implicit personalization. Prior work has shown that such inferences can lead to lower quality responses for users assumed to be from minority groups, even when no demographic information is explicitly provided. In this work, we systematically explore how LLMs respond to stereotypical cues using controlled synthetic conversations, by analyzing the models’ latent user representations through both model internals and generated answers to targeted user questions. Our findings reveal that LLMs do infer demographic attributes based on these stereotypical signals, which for a number of groups even persists when the user explicitly identifies with a different demographic group. Finally, we show that this form of stereotype-driven implicit personalization can be effectively mitigated by intervening on the model’s internal representations using a trained linear probe to steer them toward the explicitly stated identity. Our results highlight the need for greater transparency and control in how LLMs represent user identity.
[70] PatentScore: Multi-dimensional Evaluation of LLM-Generated Patent Claims
Yongmin Yoo, Qiongkai Xu, Longbing Cao
Main category: cs.CL
TL;DR: PatentScore is a specialized evaluation framework for LLM-generated patent claims that outperforms conventional NLG metrics by incorporating legal/technical standards and hierarchical analysis.
Details
Motivation: High-stakes documents like patent claims require precise evaluation that conventional NLG metrics cannot provide, as they fail to capture structural and legal characteristics essential for reliability.
Method: PatentScore integrates hierarchical decomposition of claim elements, validation patterns based on legal/technical standards, and multi-dimensional scoring across structural, semantic, and legal dimensions.
Result: On a dataset of 400 Claim1 documents, PatentScore achieved the highest correlation with expert annotations (r = 0.819), significantly outperforming widely used NLG metrics.
Conclusion: This work establishes a new standard for evaluating LLM-generated patent claims and provides a solid foundation for research on patent generation and validation in high-stakes domains.
Abstract: High-stakes texts such as patent claims, medical records, and technical reports are structurally complex and demand a high degree of reliability and precision. While large language models (LLMs) have recently been applied to automate their generation in high-stakes domains, reliably evaluating such outputs remains a major challenge. Conventional natural language generation (NLG) metrics are effective for generic documents but fail to capture the structural and legal characteristics essential to evaluating complex high-stakes documents. To address this gap, we propose PatentScore, a multi-dimensional evaluation framework specifically designed for one of the most intricate and rigorous domains, patent claims. PatentScore integrates hierarchical decomposition of claim elements, validation patterns grounded in legal and technical standards, and scoring across structural, semantic, and legal dimensions. In experiments on our dataset, which consists of 400 Claim1 documents, PatentScore achieved the highest correlation with expert annotations ($r = 0.819$), significantly outperforming widely used NLG metrics. This work establishes a new standard for evaluating LLM-generated patent claims, providing a solid foundation for research on patent generation and validation.
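For readers unfamiliar with the reported statistic, the sketch below shows how a correlation coefficient $r$ between metric scores and expert annotations is typically computed. It assumes the reported $r$ is a Pearson correlation, which the abstract does not state explicitly; the function name and inputs are illustrative only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold one metric score per generated claim and `ys` the matching expert rating; a value near 1 means the metric orders claims much as the experts do.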
[71] Counterfactual Simulatability of LLM Explanations for Generation Tasks
Marvin Limpijankit, Yanda Chen, Melanie Subbiah, Nicholas Deas, Kathleen McKeown
Main category: cs.CL
TL;DR: A framework for evaluating LLM explanations using counterfactual simulatability in generation tasks, showing effectiveness in summarization but limitations in medical suggestion tasks.
Details
Motivation: LLMs are unpredictable and their outputs can change unexpectedly with prompt variations, making accurate explanations critical for high-stakes applications where understanding model behavior is essential.
Method: Extends counterfactual simulatability evaluation from yes/no tasks to generation tasks using news summarization and medical suggestion as case studies, measuring how well explanations help users predict model outputs on related counterfactuals.
Result: LLM explanations improved user ability to predict outputs on counterfactuals in summarization tasks, but showed significant room for improvement in medical suggestion tasks. Evaluation appears more suitable for skill-based tasks than knowledge-based tasks.
Conclusion: Counterfactual simulatability provides a valuable framework for evaluating LLM explanations in generation tasks, but its effectiveness varies by task type, working better for skill-based tasks like summarization than knowledge-based tasks like medical suggestion.
Abstract: LLMs can be unpredictable, as even slight alterations to the prompt can cause the output to change in unexpected ways. Thus, the ability of models to accurately explain their behavior is critical, especially in high-stakes settings. One approach for evaluating explanations is counterfactual simulatability, how well an explanation allows users to infer the model’s output on related counterfactuals. Counterfactual simulatability has been previously studied for yes/no question answering tasks. We provide a general framework for extending this method to generation tasks, using news summarization and medical suggestion as example use cases. We find that while LLM explanations do enable users to better predict LLM outputs on counterfactuals in the summarization setting, there is significant room for improvement for medical suggestion. Furthermore, our results suggest that the evaluation for counterfactual simulatability may be more appropriate for skill-based tasks as opposed to knowledge-based tasks.
[72] UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment
Joseph Marvin Imperial, Abdullah Barayan, Regina Stodden, Rodrigo Wilkens, Ricardo Munoz Sanchez, Lingyun Gao, Melissa Torgbi, Dawn Knight, Gail Forey, Reka R. Jablonkai, Ekaterina Kochmar, Robert Reynolds, Eugénio Ribeiro, Horacio Saggion, Elena Volodina, Sowmya Vajjala, Thomas François, Fernando Alva-Manchego, Harish Tayyar Madabushi
Main category: cs.CL
TL;DR: UniversalCEFR is a large-scale multilingual dataset with 505,807 CEFR-labeled texts across 13 languages, standardized for automated readability and language proficiency assessment research.
Details
Motivation: To enable open research in automated readability and language proficiency assessment by providing a standardized, multilingual dataset that supports consistent processing and modeling across languages and tasks.
Method: Curated texts from educational and learner-oriented resources, standardized into unified format. Benchmarking experiments using three modeling paradigms: linguistic feature-based classification, fine-tuning pre-trained LLMs, and descriptor-based prompting of instruction-tuned LLMs.
Result: Results support using linguistic features and fine-tuning pretrained models for multilingual CEFR level assessment.
Conclusion: UniversalCEFR aims to establish best practices in data distribution for language proficiency research by standardizing dataset formats and promoting accessibility to the global research community.
Abstract: We introduce UniversalCEFR, a large-scale multilingual and multidimensional dataset of texts annotated with CEFR (Common European Framework of Reference) levels in 13 languages. To enable open research in automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modelling across tasks and languages. To demonstrate its utility, we conduct benchmarking experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results support using linguistic features and fine-tuning pretrained models in multilingual CEFR level assessment. Overall, UniversalCEFR aims to establish best practices in data distribution for language proficiency research by standardising dataset formats, and promoting their accessibility to the global research community.
[73] From Understanding to Generation: An Efficient Shortcut for Evaluating Language Models
Viktor Hangya, Fabian Küch, Darina Gold
Main category: cs.CL
TL;DR: Reformulating generative NLG tasks into cheaper NLU alternatives enables 35x faster evaluation of LLM capabilities during training while maintaining strong performance correlation.
Details
Motivation: To reduce the computational burden of time-consuming NLG (token-by-token generation) evaluations for monitoring LLM capabilities like reasoning and code generation during training.
Method: Reformulate generative tasks into computationally cheaper NLU alternatives and test performance correlation between original and reformulated tasks using 8 LMs of various sizes across 4 capabilities.
Result: Strong correlation between task formats with over 35x average reduction in evaluation time, supporting capability assessment via cheaper alternatives.
Conclusion: NLU reformulations provide efficient and reliable alternatives for monitoring crucial LLM capabilities during training, significantly reducing computational costs.
Abstract: Iterative evaluation of LLMs during training is essential to ensure expected capability development, but can be time- and compute-intensive. While NLU tasks, where the model selects from fixed answer choices, are cheap to evaluate, essential capabilities like reasoning and code generation rely on the more time-consuming NLG (token-by-token generation) format. In this work, our aim is to decrease the computational burden of NLG benchmarks in order to enable monitoring crucial LLM capabilities during model training. We reformulate generative tasks into computationally cheaper NLU alternatives. We test the performance correlation between the original and reformulated tasks using 8 LMs of various sizes and 4 capabilities: mathematical reasoning, code generation, factual knowledge and reading comprehension. Our results show a strong correlation between task formats, supporting capability assessment via cheaper alternatives and achieving over 35x average reduction in evaluation time. Our project is available at: https://github.com/Fraunhofer-IIS/EvalShortcut
[74] EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models
Tao Zou, Xinghua Zhang, Haiyang Yu, Minzheng Wang, Fei Huang, Yongbin Li
Main category: cs.CL
TL;DR: EIFBENCH benchmark introduces complex multi-task scenarios with constraints to better evaluate LLMs in real-world applications, revealing performance gaps and proposing SegPO algorithm for improvement.
Details
Motivation: Existing benchmarks focus on single-task environments with limited constraints, lacking the complexity needed to reflect real-world scenarios where LLMs must handle multiple tasks with precise workflow execution.
Method: Created EIFBENCH benchmark with multi-task scenarios and various constraints, and proposed Segment Policy Optimization (SegPO) algorithm to enhance LLMs’ ability to handle complex multi-task workflows.
Result: Evaluations on EIFBENCH revealed significant performance discrepancies in existing LLMs when faced with extremely complex instructions, showing they struggle with multi-task scenarios.
Conclusion: The findings highlight the need for ongoing optimization of LLMs to handle the intricate challenges posed by real-world applications, and EIFBENCH provides a more realistic evaluation framework.
Abstract: With the development and widespread application of large language models (LLMs), the new paradigm of “Model as Product” is rapidly evolving, and demands higher capabilities to address complex user needs, often requiring precise workflow execution which involves the accurate understanding of multiple tasks. However, existing benchmarks focusing on single-task environments with limited constraints lack the complexity required to fully reflect real-world scenarios. To bridge this gap, we present the Extremely Complex Instruction Following Benchmark (EIFBENCH), meticulously crafted to facilitate a more realistic and robust evaluation of LLMs. EIFBENCH not only includes multi-task scenarios that enable comprehensive assessment across diverse task types concurrently, but also integrates a variety of constraints, replicating complex operational environments. Furthermore, we propose the Segment Policy Optimization (SegPO) algorithm to enhance the LLM’s ability to accurately fulfill multi-task workflows. Evaluations on EIFBENCH have unveiled considerable performance discrepancies in existing LLMs when challenged with these extremely complex instructions. This finding underscores the necessity for ongoing optimization to navigate the intricate challenges posed by LLM applications.
[75] Efficient Context Selection for Long-Context QA: No Tuning, No Iteration, Just Adaptive-$k$
Chihiro Taguchi, Seiji Maekawa, Nikita Bhutani
Main category: cs.CL
TL;DR: Adaptive-k retrieval is a single-pass method that dynamically selects the optimal number of passages for RAG based on similarity score distributions, improving QA efficiency and accuracy without requiring model changes.
Details
Motivation: Existing adaptive retrieval methods struggle with aggregation QA where optimal context size is unknown and variable, while fixed retrieval sizes risk wasting tokens or missing key evidence.
Method: Adaptive selection of passage count based on similarity score distribution between query and candidate passages, requiring no model fine-tuning, extra LLM inferences, or pipeline changes.
Result: Matches or outperforms fixed-k baselines on factoid and aggregation QA benchmarks while using up to 10x fewer tokens than full-context input and still retrieving 70% of relevant passages. Improves accuracy across five LCLMs and two embedding models.
Conclusion: Dynamically adjusting context size leads to more efficient and accurate question answering, demonstrating the effectiveness of adaptive retrieval strategies.
Abstract: Retrieval-augmented generation (RAG) and long-context language models (LCLMs) both address context limitations of LLMs in open-domain question answering (QA). However, determining the optimal external context to retrieve remains an open problem: fixing the retrieval size risks either wasting tokens or omitting key evidence. Existing adaptive methods like Self-RAG and Self-Route rely on iterative LLM prompting and perform well on factoid QA, but struggle with aggregation QA, where the optimal context size is both unknown and variable. We present Adaptive-$k$ retrieval, a simple and effective single-pass method that adaptively selects the number of passages based on the distribution of the similarity scores between the query and the candidate passages. It does not require model fine-tuning, extra LLM inferences or changes to existing retriever-reader pipelines. On both factoid and aggregation QA benchmarks, Adaptive-$k$ matches or outperforms fixed-$k$ baselines while using up to 10x fewer tokens than full-context input, yet still retrieves 70% of relevant passages. It improves accuracy across five LCLMs and two embedding models, highlighting that dynamically adjusting context size leads to more efficient and accurate QA.
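As a rough illustration of choosing $k$ from the similarity-score distribution in a single pass, the sketch below cuts the ranked list at the largest drop between consecutive scores. The abstract says only that the cutoff depends on the score distribution, so the largest-gap rule, the function names, and the `min_k` parameter are assumptions made for illustration and may differ from the authors' criterion.

```python
def adaptive_k(scores, min_k=1):
    """Return how many top-ranked passages to keep, cutting at the largest score gap.

    Assumed criterion: keep everything ranked above the biggest drop between
    consecutive similarity scores (ties resolved toward the smaller k).
    """
    ranked = sorted(scores, reverse=True)
    if len(ranked) <= min_k:
        return len(ranked)
    best_k, best_gap = min_k, float("-inf")
    for k in range(min_k, len(ranked)):
        gap = ranked[k - 1] - ranked[k]  # drop between rank k and rank k + 1
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


def select_passages(passages, scores, min_k=1):
    """Keep the adaptive_k highest-scoring passages, best first."""
    k = adaptive_k(scores, min_k)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [passages[i] for i in order[:k]]
```

Because the rule needs only the retriever's scores, it involves no fine-tuning and no extra LLM calls, which matches the single-pass property described in the abstract.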
[76] References Matter: Investigating the Impact of Reference Set Variation on Summarization Evaluation
Silvia Casola, Yang Janet Liu, Siyao Peng, Oliver Kraus, Albert Gatt, Barbara Plank
Main category: cs.CL
TL;DR: Reference-based summarization metrics like ROUGE show significant instability depending on which reference summaries are used, undermining reliable model comparisons across diverse datasets.
Details
Motivation: Human language production exhibits rich variation that is often overlooked in summarization evaluation, and the impact of reference set choice on metrics hasn't been systematically studied.
Method: Analyzed three diverse multi-reference summarization datasets (SummEval, GUMSum, DUC2004) and examined sensitivity of popular metrics to reference set choice, plus collected human judgments on LLM outputs for genre-diverse data.
Result: Many popular metrics exhibit significant instability, especially n-gram-based metrics like ROUGE where model rankings vary by reference sets. Weak-to-no correlation found between metrics and human judgments beyond newswire summaries.
Conclusion: Recommend incorporating reference set variation into summarization evaluation to enhance consistency and correlation with human judgments, especially when evaluating LLMs.
Abstract: Human language production exhibits remarkable richness and variation, reflecting diverse communication styles and intents. However, this variation is often overlooked in summarization evaluation. While having multiple reference summaries is known to improve correlation with human judgments, the impact of the reference set on reference-based metrics has not been systematically investigated. This work examines the sensitivity of widely used reference-based metrics in relation to the choice of reference sets, analyzing three diverse multi-reference summarization datasets: SummEval, GUMSum, and DUC2004. We demonstrate that many popular metrics exhibit significant instability. This instability is particularly concerning for n-gram-based metrics like ROUGE, where model rankings vary depending on the reference sets, undermining the reliability of model comparisons. We also collect human judgments on LLM outputs for genre-diverse data and examine their correlation with metrics to supplement existing findings beyond newswire summaries, finding weak-to-no correlation. Taken together, we recommend incorporating reference set variation into summarization evaluation to enhance consistency alongside correlation with human judgments, especially when evaluating LLMs.
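The ranking instability described here can be reproduced on a toy scale. The sketch below uses a simplified unigram-overlap F1 as a stand-in for ROUGE-1 (it is not the official ROUGE implementation) and scores each system by its best match against any reference, an assumed aggregation choice; the system names and sentences are invented for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenized, lowercased strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def rank_systems(outputs: dict[str, str], references: list[str]) -> list[str]:
    """Rank system names by their best unigram F1 against any reference."""
    def score(name: str) -> float:
        return max(unigram_f1(outputs[name], ref) for ref in references)
    return sorted(outputs, key=score, reverse=True)


# Invented example: two equally plausible summaries of the same event.
outputs = {
    "sys_a": "the council approved the budget",
    "sys_b": "lawmakers passed a spending plan",
}
refs_1 = ["the council approved the new budget"]
refs_2 = ["lawmakers passed a new spending plan"]
```

With `refs_1` the ranking is `["sys_a", "sys_b"]`, and with `refs_2` it flips to `["sys_b", "sys_a"]`, even though neither system output changed.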
[77] Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation
Jiahao Cheng, Tiancheng Su, Jia Yuan, Guoxiu He, Jiawei Liu, Xinqi Tao, Jingwen Xie, Huaxia Li
Main category: cs.CL
TL;DR: CoT prompting reduces hallucinations but impairs detection methods by obscuring critical signals, creating a trade-off between reasoning benefits and detection effectiveness.
Details
Motivation: LLMs often generate hallucinations, and while Chain-of-Thought (CoT) prompting helps mitigate this, its impact on hallucination detection methods remains underexplored.
Method: Conducted systematic empirical evaluation including pilot experiments showing CoT affects LLM internal states and token probabilities, then evaluated various CoT methods on mainstream detection approaches across different LLM types.
Result: CoT prompting reduces hallucination frequency but obscures critical detection signals, impairing the effectiveness of various detection methods across multiple dimensions (score distributions, accuracy, confidence).
Conclusion: There’s an overlooked trade-off in using reasoning techniques - while CoT helps reduce hallucinations, it simultaneously makes them harder to detect, highlighting a critical limitation in current detection approaches.
Abstract: Large Language Models (LLMs) often exhibit hallucinations, generating factually incorrect or semantically irrelevant content in response to prompts. Chain-of-Thought (CoT) prompting can mitigate hallucinations by encouraging step-by-step reasoning, but its impact on hallucination detection remains underexplored. To bridge this gap, we conduct a systematic empirical evaluation. We begin with a pilot experiment, revealing that CoT reasoning significantly affects the LLM’s internal states and token probability distributions. Building on this, we evaluate the impact of various CoT prompting methods on mainstream hallucination detection methods across both instruction-tuned and reasoning-oriented LLMs. Specifically, we examine three key dimensions: changes in hallucination score distributions, variations in detection accuracy, and shifts in detection confidence. Our findings show that while CoT prompting helps reduce hallucination frequency, it also tends to obscure critical signals used for detection, impairing the effectiveness of various detection methods. Our study highlights an overlooked trade-off in the use of reasoning. Code is publicly available at: https://github.com/ECNU-Text-Computing/cot-hallu-detect .
[78] TAPS: Tool-Augmented Personalisation via Structured Tagging
Ekaterina Taktasheva, Jeff Dalton
Main category: cs.CL
TL;DR: TAPS enhances personalized tool use in LLMs with structured tagging and uncertainty detection, achieving SOTA on NLSI task.
Details
Motivation: Existing tool-augmented LLMs overlook personalization in guiding tool use for user tasks.
Method: Introduces TAPS with structured tagging tool and uncertainty-based tool detector to integrate user preferences.
Result: Significantly improves LLMs’ ability to incorporate user preferences, achieving state-of-the-art performance on NLSI task.
Conclusion: Personalized tool use through TAPS framework effectively addresses key weaknesses in LLM personalization capabilities.
Abstract: Recent advancements in tool-augmented large language models have enabled them to interact with external tools, enhancing their ability to perform complex user tasks. However, existing approaches overlook the role of personalisation in guiding tool use. This work investigates how user preferences can be effectively integrated into goal-oriented dialogue agents. Through extensive analysis, we identify key weaknesses in the ability of LLMs to personalise tool use. To this end, we introduce TAPS, a novel solution that enhances personalised tool use by leveraging a structured tagging tool and an uncertainty-based tool detector. TAPS significantly improves the ability of LLMs to incorporate user preferences, achieving the new state-of-the-art for open source models on the NLSI task.
[79] Arg-LLaDA: Argument Summarization via Large Language Diffusion Models and Sufficiency-Aware Refinement
Hao Li, Yizheng Sun, Viktor Schlegel, Kailai Yang, Riza Batista-Navarro, Goran Nenadic
Main category: cs.CL
TL;DR: Arg-LLaDA is a novel iterative LLM framework that improves argument summarization through sufficiency-guided remasking and regeneration, outperforming state-of-the-art methods in both automatic and human evaluations.
Details
Motivation: Existing argument summarization approaches rely on single-pass generation with limited support for factual correction and structural refinement, leaving the generation stage underexplored despite advances in argument identification and clustering.
Method: Arg-LLaDA combines a flexible masking controller with a sufficiency-checking module to iteratively identify and revise unsupported, redundant, or incomplete spans through remasking and regeneration.
Result: The method surpasses state-of-the-art baselines in 7 out of 10 automatic evaluation metrics and shows substantial human-evaluated improvements in coverage, faithfulness, and conciseness.
Conclusion: The iterative, sufficiency-aware generation strategy effectively produces more faithful, concise, and coherent argument summaries, validating the approach’s effectiveness for complex multi-perspective debate summarization.
Abstract: Argument summarization aims to generate concise, structured representations of complex, multi-perspective debates. While recent work has advanced the identification and clustering of argumentative components, the generation stage remains underexplored. Existing approaches typically rely on single-pass generation, offering limited support for factual correction or structural refinement. To address this gap, we introduce Arg-LLaDA, a novel large language diffusion framework that iteratively improves summaries via sufficiency-guided remasking and regeneration. Our method combines a flexible masking controller with a sufficiency-checking module to identify and revise unsupported, redundant, or incomplete spans, yielding more faithful, concise, and coherent outputs. Empirical results on two benchmark datasets demonstrate that Arg-LLaDA surpasses state-of-the-art baselines in 7 out of 10 automatic evaluation metrics. In addition, human evaluations reveal substantial improvements across the core dimensions of coverage, faithfulness, and conciseness, validating the effectiveness of our iterative, sufficiency-aware generation strategy.
[80] OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages
Raphaël Merx, Hanna Suominen, Trevor Cohn, Ekaterina Vylomova
Main category: cs.CL
TL;DR: OpenWHO is a new document-level parallel corpus for health domain machine translation evaluation, covering 20+ languages including 9 low-resource ones. LLMs outperform traditional MT models, with Gemini 2.5 Flash showing +4.79 ChrF improvement over NLLB-54B on low-resource languages.
Details
Motivation: Health is a high-stakes MT domain with widespread deployment and specialized vocabulary, but there's a lack of evaluation datasets for low-resource languages in this domain.
Method: Created OpenWHO corpus with 2,978 documents and 26,824 sentences from WHO’s e-learning platform. Evaluated modern LLMs against traditional MT models and analyzed LLM context utilization effects.
Result: LLMs consistently outperform traditional MT models. Gemini 2.5 Flash achieved +4.79 ChrF improvement over NLLB-54B on low-resource test set. Document-level translation benefits are most pronounced in specialized domains like health.
Conclusion: The OpenWHO corpus addresses the gap in health domain MT evaluation for low-resource languages and demonstrates LLMs’ superiority over traditional approaches, particularly benefiting from document-level context in specialized domains.
Abstract: In machine translation (MT), health is a high-stakes domain characterised by widespread deployment and domain-specific vocabulary. However, there is a lack of MT evaluation datasets for low-resource languages in this domain. To address this gap, we introduce OpenWHO, a document-level parallel corpus of 2,978 documents and 26,824 sentences from the World Health Organization’s e-learning platform. Sourced from expert-authored, professionally translated materials shielded from web-crawling, OpenWHO spans a diverse range of over 20 languages, of which nine are low-resource. Leveraging this new resource, we evaluate modern large language models (LLMs) against traditional MT models. Our findings reveal that LLMs consistently outperform traditional MT models, with Gemini 2.5 Flash achieving a +4.79 ChrF point improvement over NLLB-54B on our low-resource test set. Further, we investigate how LLM context utilisation affects accuracy, finding that the benefits of document-level translation are most pronounced in specialised domains like health. We release the OpenWHO corpus to encourage further research into low-resource MT in the health domain.
[81] Understanding and Leveraging the Expert Specialization of Context Faithfulness in Mixture-of-Experts LLMs
Jun Bai, Minghao Tong, Yang Liu, Zixia Jia, Zilong Zheng
Main category: cs.CL
TL;DR: Router Lens identifies context-faithful experts in LLMs, and CEFT selectively fine-tunes them to improve context faithfulness efficiently.
Details
Motivation: Large language models often fail to ground outputs in provided context, leading to irrelevant responses. The work explores if certain experts specialize in context utilization for targeted optimization.
Method: Proposed Router Lens to identify context-faithful experts, then introduced Context-faithful Expert Fine-Tuning (CEFT) to selectively fine-tune these experts.
Result: CEFT matches or surpasses full fine-tuning performance across various benchmarks while being significantly more efficient.
Conclusion: Specialized context-faithful experts exist and can be selectively optimized to improve context grounding efficiently.
Abstract: Context faithfulness is essential for reliable reasoning in context-dependent scenarios. However, large language models often struggle to ground their outputs in the provided context, resulting in irrelevant responses. Inspired by the emergent expert specialization observed in mixture-of-experts architectures, this work investigates whether certain experts exhibit specialization in context utilization, offering a potential pathway toward targeted optimization for improved context faithfulness. To explore this, we propose Router Lens, a method that accurately identifies context-faithful experts. Our analysis reveals that these experts progressively amplify attention to relevant contextual information, thereby enhancing context grounding. Building on this insight, we introduce Context-faithful Expert Fine-Tuning (CEFT), a lightweight optimization approach that selectively fine-tunes context-faithful experts. Experiments across a wide range of benchmarks and models demonstrate that CEFT matches or surpasses the performance of full fine-tuning while being significantly more efficient.
[82] Polysemantic Dropout: Conformal OOD Detection for Specialized LLMs
Ayush Gupta, Ramneet Kaur, Anirban Roy, Adam D. Cobb, Rama Chellappa, Susmit Jha
Main category: cs.CL
TL;DR: Novel inference-time OOD detection method for specialized LLMs using dropout tolerance and conformal anomaly detection, achieving significant AUROC improvements over baselines.
Details
Motivation: Specialized LLMs fine-tuned for specific domains remain vulnerable to incorrect outputs when presented with out-of-domain inputs, posing risks in critical applications like healthcare.
Method: Leverages Inductive Conformal Anomaly Detection (ICAD) with a new non-conformity measure based on dropout tolerance, aggregating across multiple layers via ensemble approach while maintaining theoretical false alarm bounds.
Result: Experiments with medical-specialized LLMs show AUROC improvements of 2% to 37% over baseline methods when detecting OOD inputs.
Conclusion: The proposed dropout tolerance-based approach effectively detects out-of-domain inputs for specialized LLMs, providing reliable OOD detection with theoretical guarantees.
Abstract: We propose a novel inference-time out-of-domain (OOD) detection algorithm for specialized large language models (LLMs). Despite achieving state-of-the-art performance on in-domain tasks through fine-tuning, specialized LLMs remain vulnerable to incorrect or unreliable outputs when presented with OOD inputs, posing risks in critical applications. Our method leverages the Inductive Conformal Anomaly Detection (ICAD) framework, using a new non-conformity measure based on the model’s dropout tolerance. Motivated by recent findings on polysemanticity and redundancy in LLMs, we hypothesize that in-domain inputs exhibit higher dropout tolerance than OOD inputs. We aggregate dropout tolerance across multiple layers via a valid ensemble approach, improving detection while maintaining theoretical false alarm bounds from ICAD. Experiments with medical-specialized LLMs show that our approach detects OOD inputs better than baseline methods, with AUROC improvements of 2% to 37% when treating OOD datapoints as positives and in-domain test datapoints as negatives.
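The ICAD decision step mentioned above can be sketched generically. The sketch assumes a scalar non-conformity score where higher means less typical. How the paper derives that score from dropout tolerance, and how it ensembles across layers, is not shown here. The calibration values and the threshold `epsilon` are placeholders.

```python
def conformal_p_value(calibration_scores: list[float], test_score: float) -> float:
    """Share of calibration scores at least as non-conforming as the test score.

    The +1 terms count the test point itself, which is what gives the
    standard conformal guarantee under exchangeability.
    """
    at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (at_least + 1) / (len(calibration_scores) + 1)


def is_ood(calibration_scores: list[float], test_score: float, epsilon: float = 0.2) -> bool:
    """Flag the input as out-of-domain when its p-value falls below epsilon."""
    return conformal_p_value(calibration_scores, test_score) < epsilon


# Hypothetical non-conformity scores from a held-out in-domain calibration set.
calibration = [0.10, 0.12, 0.15, 0.20, 0.22, 0.25, 0.30, 0.31, 0.35]

p_typical = conformal_p_value(calibration, 0.21)  # sits inside the calibration range
p_strange = conformal_p_value(calibration, 0.90)  # exceeds every calibration score
print(p_typical, is_ood(calibration, 0.21))
print(p_strange, is_ood(calibration, 0.90))
```

Because in-domain p-values are approximately uniform, thresholding at `epsilon` bounds the false alarm rate by roughly `epsilon`. This is the kind of guarantee the abstract refers to.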
[83] ToM-SSI: Evaluating Theory of Mind in Situated Social Interactions
Matteo Bortoletto, Constantin Ruhdorfer, Andreas Bulling
Main category: cs.CL
TL;DR: ToM-SSI is a new multimodal benchmark that tests Theory of Mind capabilities in complex social environments with group interactions and spatial dynamics, revealing significant limitations in current AI models.
Details
Motivation: Existing ToM benchmarks are limited to simple Sally-Anne tests and text-only or dyadic interactions, failing to capture the complexity of real human social interactions.
Method: Developed ToM-SSI benchmark featuring multimodal inputs, group interactions of up to four agents, situated environments with movement, and mixed cooperative-obstructive settings to test parallel reasoning about multiple agents’ mental states.
Result: Current foundation models show severely limited performance on ToM-SSI, particularly in the novel tasks involving complex group dynamics and spatial reasoning.
Conclusion: The benchmark highlights critical gaps in AI’s social cognition capabilities and provides a more comprehensive framework for future research on Theory of Mind in complex social environments.
Abstract: Most existing Theory of Mind (ToM) benchmarks for foundation models rely on variations of the Sally-Anne test, offering only a very limited perspective on ToM and neglecting the complexity of human social interactions. To address this gap, we propose ToM-SSI: a new benchmark specifically designed to test ToM capabilities in environments rich with social interactions and spatial dynamics. While current ToM benchmarks are limited to text-only or dyadic interactions, ToM-SSI is multimodal and includes group interactions of up to four agents that communicate and move in situated environments. This unique design allows us to study, for the first time, mixed cooperative-obstructive settings and reasoning about multiple agents’ mental state in parallel, thus capturing a wider range of social cognition than existing benchmarks. Our evaluations reveal that the current models’ performance is still severely limited, especially in these new tasks, highlighting critical gaps for future research.
[84] ICR: Iterative Clarification and Rewriting for Conversational Search
Zhiyu Cao, Peifeng Li, Qiaoming Zhu
Main category: cs.CL
TL;DR: ICR framework uses iterative clarification questions and query rewriting to handle multiple fuzzy expressions in conversational queries, achieving SOTA performance.
Details
Motivation: End-to-end query rewriting struggles with multiple fuzzy expressions in queries, making simultaneous identification and rewriting difficult.
Method: Iterative framework alternating between generating clarification questions and rewritten queries to progressively refine the query.
Result: ICR continuously improves retrieval performance through iterative process and achieves state-of-the-art results on two datasets.
Conclusion: Iterative clarification-rewriting approach effectively addresses multi-position fuzzy expression problem in conversational query rewriting.
Abstract: Most previous work on Conversational Query Rewriting employs an end-to-end rewriting paradigm. However, this approach is hindered by the issue of multiple fuzzy expressions within the query, which complicates the simultaneous identification and rewriting of multiple positions. To address this issue, we propose a novel framework ICR (Iterative Clarification and Rewriting), an iterative rewriting scheme that pivots on clarification questions. Within this framework, the model alternates between generating clarification questions and rewritten queries. The experimental results show that our ICR can continuously improve retrieval performance in the clarification-rewriting iterative process, thereby achieving state-of-the-art performance on two popular datasets.
[85] Benchmarking Gender and Political Bias in Large Language Models
Jinrui Yang, Xudong Han, Timothy Baldwin
Main category: cs.CL
TL;DR: EuroParlVote is a new benchmark for evaluating LLM bias in political contexts using European Parliament data, revealing systematic gender and political biases in model performance.
Details
Motivation: To create a comprehensive benchmark for assessing LLM fairness and bias in politically sensitive scenarios, particularly in parliamentary contexts where demographic factors and political affiliations matter.
Method: Developed EuroParlVote dataset linking debate speeches to vote outcomes with demographic metadata. Evaluated state-of-the-art LLMs on gender classification and vote prediction tasks using this benchmark.
Result: LLMs consistently misclassify female MEPs as male, show reduced accuracy for female speakers in vote prediction, favor centrist political groups while underperforming on far-left/right groups. Proprietary models (GPT-4o) outperform open-weight alternatives in robustness and fairness.
Conclusion: The study reveals systematic biases in LLMs across gender and political dimensions, highlighting the need for improved fairness in political NLP applications. The released dataset and tools support future research on accountability in political AI systems.
Abstract: We introduce EuroParlVote, a novel benchmark for evaluating large language models (LLMs) in politically sensitive contexts. It links European Parliament debate speeches to roll-call vote outcomes and includes rich demographic metadata for each Member of the European Parliament (MEP), such as gender, age, country, and political group. Using EuroParlVote, we evaluate state-of-the-art LLMs on two tasks – gender classification and vote prediction – revealing consistent patterns of bias. We find that LLMs frequently misclassify female MEPs as male and demonstrate reduced accuracy when simulating votes for female speakers. Politically, LLMs tend to favor centrist groups while underperforming on both far-left and far-right ones. Proprietary models like GPT-4o outperform open-weight alternatives in terms of both robustness and fairness. We release the EuroParlVote dataset, code, and demo to support future research on fairness and accountability in NLP within political contexts.
[86] MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining
Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, Guolin Ke
Main category: cs.CL
TL;DR: MachineLearningLM is a framework that enhances LLMs’ in-context learning for ML tasks through continued pretraining on synthetic data from structural causal models, enabling strong many-shot scaling while preserving general capabilities.
Details
Motivation: Large language models struggle to learn from many in-context examples on standard ML tasks through pure in-context learning without gradient descent, limiting their effectiveness on tabular data tasks.
Method: Continued pretraining framework that synthesizes ML tasks from millions of structural causal models, uses random-forest teacher for distillation, and employs token-efficient prompting to enable more examples per context window.
Result: Outperforms strong LLM baselines by ~15% on out-of-distribution tabular classification across multiple domains, shows monotonic accuracy improvement from 8 to 1,024 shots, and achieves random-forest-level accuracy without task-specific training while preserving 75.4% MMLU score.
Conclusion: MachineLearningLM successfully equips general-purpose LLMs with robust in-context ML capability while maintaining their general knowledge and reasoning abilities, demonstrating effective many-shot scaling for tabular data tasks.
Abstract: Large language models (LLMs) possess broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to leverage many-shot demonstrations purely via in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows. Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024. We begin with a random-forest teacher, distilling tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, enabling 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference. Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare domains. It exhibits a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it attains random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.
[87] AIxcellent Vibes at GermEval 2025 Shared Task on Candy Speech Detection: Improving Model Performance by Span-Level Training
Christian Rene Thelen, Patrick Gustav Blaneck, Tobias Bornheim, Niklas Grieger, Stephan Bialonski
Main category: cs.CL
TL;DR: Multilingual XLM-RoBERTa-Large model achieves best performance for detecting positive supportive language (candy speech) in German YouTube comments, outperforming other models with span-level training.
Details
Motivation: Automated detection of positive supportive online communication (candy speech) is underexplored but important for fostering civility and enabling systematic analysis of its impact.
Method: Used monolingual (GBERT) and multilingual (Qwen3 Embedding, XLM-RoBERTa) language models on 46k German YouTube comments, with span-level training for candy speech detection.
Result: XLM-RoBERTa-Large ranked first in both binary positive F1 (0.8906) and categorized span-based detection (strict F1: 0.6307) at GermEval 2025 Shared Task.
Conclusion: Multilingual models with span-based training and emoji-aware tokenizers are effective for detecting positive supportive language, demonstrating cross-lingual capabilities.
Abstract: Positive, supportive online communication in social media (candy speech) has the potential to foster civility, yet automated detection of such language remains underexplored, limiting systematic analysis of its impact. We investigate how candy speech can be reliably detected in a 46k-comment German YouTube corpus by monolingual and multilingual language models, including GBERT, Qwen3 Embedding, and XLM-RoBERTa. We find that a multilingual XLM-RoBERTa-Large model trained to detect candy speech at the span level outperforms other approaches, ranking first in both binary (positive F1: 0.8906) and categorized span-based detection (strict F1: 0.6307) subtasks at the GermEval 2025 Shared Task on Candy Speech Detection. We speculate that span-based training, multilingual capabilities, and emoji-aware tokenizers improved detection performance. Our results demonstrate the effectiveness of multilingual models in identifying positive, supportive language.
[88] MVPBench: A Benchmark and Fine-Tuning Framework for Aligning Large Language Models with Diverse Human Values
Yao Liang, Dongcheng Zhao, Feifei Zhao, Guobin Shen, Yuwei Wang, Dongqi Liang, Yi Zeng
Main category: cs.CL
TL;DR: MVPBench is a comprehensive benchmark for evaluating LLM alignment with human values across 75 countries, revealing geographic/demographic disparities and showing that lightweight fine-tuning methods can significantly improve value alignment.
Details
Motivation: Existing benchmarks neglect cultural and demographic diversity, limiting understanding of how value alignment generalizes globally across diverse user populations.
Method: Introduced MVPBench with 24,020 annotated instances across 75 countries, then analyzed state-of-the-art LLMs and tested lightweight fine-tuning methods like LoRA and DPO for value alignment improvement.
Result: Revealed substantial disparities in alignment performance across geographic and demographic lines, and demonstrated that lightweight fine-tuning significantly enhances value alignment in both in-domain and out-of-domain settings.
Conclusion: Highlights the necessity for population-aware alignment evaluation and provides actionable insights for building culturally adaptive and value-sensitive LLMs, with MVPBench serving as a foundation for future global alignment research.
Abstract: The alignment of large language models (LLMs) with human values is critical for their safe and effective deployment across diverse user populations. However, existing benchmarks often neglect cultural and demographic diversity, leading to limited understanding of how value alignment generalizes globally. In this work, we introduce MVPBench, a novel benchmark that systematically evaluates LLMs’ alignment with multi-dimensional human value preferences across 75 countries. MVPBench contains 24,020 high-quality instances annotated with fine-grained value labels, personalized questions, and rich demographic metadata, making it the most comprehensive resource of its kind to date. Using MVPBench, we conduct an in-depth analysis of several state-of-the-art LLMs, revealing substantial disparities in alignment performance across geographic and demographic lines. We further demonstrate that lightweight fine-tuning methods, such as Low-Rank Adaptation (LoRA) and Direct Preference Optimization (DPO), can significantly enhance value alignment in both in-domain and out-of-domain settings. Our findings underscore the necessity for population-aware alignment evaluation and provide actionable insights for building culturally adaptive and value-sensitive LLMs. MVPBench serves as a practical foundation for future research on global alignment, personalized value modeling, and equitable AI development.
[89] A funny companion: Distinct neural responses to perceived AI- versus human-generated humor
Xiaohui Rao, Hanlin Wu, Zhenguang G. Cai
Main category: cs.CL
TL;DR: EEG study shows AI humor elicits different neural responses than human humor - reduced cognitive effort (smaller N400) but greater surprise/emotional response (larger LPP), with improving efficiency over time that challenges algorithm aversion.
Details
Motivation: As AI companions become capable of human-like communication including humor, understanding how people cognitively and emotionally respond to AI humor becomes important for human-AI social interaction.
Method: Used electroencephalography (EEG) to compare how people process humor from AI versus human sources, analyzing both behavioral ratings and neurophysiological data including N400 and LPP effects.
Result: Participants rated AI and human humor as comparably funny behaviorally, but neurophysiologically AI humor showed smaller N400 (reduced cognitive effort), larger LPP (greater surprise/emotional response), and different temporal dynamics with improving efficiency over time. Social attitudes modulated responses.
Conclusion: The brain responds to AI humor with surprisingly positive and intense reactions, demonstrating humor’s potential for fostering genuine engagement in human-AI social interaction through cognitive adaptation and emotional reward.
Abstract: As AI companions become capable of human-like communication, including telling jokes, understanding how people cognitively and emotionally respond to AI humor becomes increasingly important. This study used electroencephalography (EEG) to compare how people process humor from AI versus human sources. Behavioral analysis revealed that participants rated AI and human humor as comparably funny. However, neurophysiological data showed that AI humor elicited a smaller N400 effect, suggesting reduced cognitive effort during the processing of incongruity. This was accompanied by a larger Late Positive Potential (LPP), indicating a greater degree of surprise and emotional response. This enhanced LPP likely stems from the violation of low initial expectations regarding AI’s comedic capabilities. Furthermore, a key temporal dynamic emerged: human humor showed habituation effects, marked by an increasing N400 and a decreasing LPP over time. In contrast, AI humor demonstrated increasing processing efficiency and emotional reward, with a decreasing N400 and an increasing LPP. This trajectory reveals how the brain can dynamically update its predictive model of AI capabilities. This process of cumulative reinforcement challenges “algorithm aversion” in humor, as it demonstrates how cognitive adaptation to AI’s language patterns can lead to an intensified emotional reward. Additionally, participants’ social attitudes toward AI modulated these neural responses, with higher perceived AI trustworthiness correlating with enhanced emotional engagement. These findings indicate that the brain responds to AI humor with surprisingly positive and intense reactions, highlighting humor’s potential for fostering genuine engagement in human-AI social interaction.
[90] Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs
Hang Guo, Yawei Li, Luca Benini
Main category: cs.CL
TL;DR: OBR is a training-free framework that combines quantization and sparsity for LLM compression, achieving W4A4KV4 quantization with 50% sparsity and significant speedup/memory reduction.
Details
Motivation: Single compression methods (quantization/pruning) are reaching their limits, so combining them is promising but challenging due to conflicting weight distribution requirements.
Method: Optimal Brain Restoration (OBR) uses error compensation between quantization and pruning via second-order Hessian objective, surrogate approximation, and closed-form group error compensation.
Result: Achieves aggressive W4A4KV4 quantization with 50% sparsity, delivering up to 4.72x speedup and 6.4x memory reduction compared to FP16-dense baseline.
Conclusion: OBR successfully aligns quantization and pruning through error compensation, enabling more aggressive LLM compression than single methods alone.
Abstract: Recent advances in Large Language Model (LLM) compression, such as quantization and pruning, have achieved notable success. However, as these techniques gradually approach their respective limits, relying on a single method for further compression has become increasingly challenging. In this work, we explore an alternative solution by combining quantization and sparsity. This joint approach, though promising, introduces new difficulties due to the inherently conflicting requirements on weight distributions: quantization favors compact ranges, while pruning benefits from high variance. To attack this problem, we propose Optimal Brain Restoration (OBR), a general and training-free framework that aligns pruning and quantization by error compensation between both. OBR minimizes performance degradation on downstream tasks by building on a second-order Hessian objective, which is then reformulated into a tractable problem through surrogate approximation and ultimately reaches a closed-form solution via group error compensation. Experiments show that OBR enables aggressive W4A4KV4 quantization with 50% sparsity on existing LLMs, and delivers up to 4.72x speedup and 6.4x memory reduction compared to the FP16-dense baseline.
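For readers unfamiliar with the setting, the sketch below shows what naively stacking 50% sparsity and 4-bit weight quantization looks like on a toy weight vector. It is not OBR: it uses plain magnitude pruning followed by symmetric round-to-nearest quantization with no Hessian-based error compensation. The weights and helper names are invented for illustration.

```python
def prune_half(weights: list[float]) -> list[float]:
    """Zero the 50% of entries with the smallest magnitude (unstructured pruning)."""
    keep = len(weights) - len(weights) // 2
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = set(order[:keep])
    return [w if i in kept else 0.0 for i, w in enumerate(weights)]


def quantize_int4(weights: list[float]) -> list[float]:
    """Symmetric round-to-nearest 4-bit quantization, returned in dequantized form."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return [0.0] * len(weights)
    scale = max_abs / 7  # signed 4-bit integers span [-8, 7]
    return [max(-8, min(7, round(w / scale))) * scale for w in weights]


# Hypothetical weights for one small group.
weights = [0.70, -0.05, 0.30, 0.02, -0.45, 0.10, -0.01, 0.21]
pruned = prune_half(weights)
compressed = quantize_int4(pruned)

# Error introduced by the two steps together, relative to the dense FP weights.
total_error = sum((w - c) ** 2 for w, c in zip(weights, compressed))
print(compressed, total_error)
```

The single large weight sets the quantization scale for the whole group, while pruning prefers to keep exactly such high-magnitude outliers. This illustrates the tension between the two methods that the abstract describes. OBR's contribution is to compensate for the resulting error rather than accept it as this naive pipeline does.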
[91] HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking
Wensheng Lu, Keyu Chen, Ruizhi Qiao, Xing Sun
Main category: cs.CL
TL;DR: HiCBench is a new benchmark for evaluating document chunking in RAG systems, addressing evidence sparsity issues in existing benchmarks, and HiChunk is a multi-level document structuring framework that improves retrieval quality.
Details
Motivation: Existing RAG evaluation benchmarks are inadequate for assessing document chunking quality due to evidence sparsity, creating a need for better evaluation tools.
Method: Proposed HiCBench with manually annotated multi-level document chunking points and synthesized evidence-dense QA pairs. Introduced HiChunk framework using fine-tuned LLMs for multi-level document structuring combined with Auto-Merge retrieval algorithm.
Result: HiCBench effectively evaluates different chunking methods across the RAG pipeline. HiChunk achieves better chunking quality with reasonable time consumption, enhancing overall RAG system performance.
Conclusion: The proposed HiCBench benchmark and HiChunk framework successfully address document chunking evaluation challenges and improve retrieval quality in RAG systems.
Abstract: Retrieval-Augmented Generation (RAG) enhances the response capabilities of language models by integrating external knowledge sources. However, document chunking, an important part of a RAG system, often lacks effective evaluation tools. This paper first analyzes why existing RAG evaluation benchmarks are inadequate for assessing document chunking quality, specifically due to evidence sparsity. Based on this conclusion, we propose HiCBench, which includes manually annotated multi-level document chunking points, synthesized evidence-dense question-answer (QA) pairs, and their corresponding evidence sources. Additionally, we introduce the HiChunk framework, a multi-level document structuring framework based on fine-tuned LLMs, combined with the Auto-Merge retrieval algorithm to improve retrieval quality. Experiments demonstrate that HiCBench effectively evaluates the impact of different chunking methods across the entire RAG pipeline. Moreover, HiChunk achieves better chunking quality within reasonable time consumption, thereby enhancing the overall performance of RAG systems.
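The abstract does not describe how Auto-Merge works. As a rough illustration of the general idea of merging retrieved chunks in a multi-level hierarchy, the sketch below replaces sibling chunks with their parent once a large enough share of that parent's children has been retrieved. The merge rule, the `threshold` parameter, and all names are assumptions, not the paper's algorithm.

```python
def auto_merge(retrieved, parent_of, children_of, threshold=0.5):
    """Replace retrieved siblings by their parent when more than
    `threshold` of that parent's children were retrieved; repeat upward."""
    current = set(retrieved)
    changed = True
    while changed:
        changed = False
        # Group the currently selected chunks by their parent.
        by_parent = {}
        for chunk in current:
            parent = parent_of.get(chunk)
            if parent is not None:
                by_parent.setdefault(parent, set()).add(chunk)
        # Merge any group that covers enough of its parent's children.
        for parent, hits in by_parent.items():
            if len(hits) / len(children_of[parent]) > threshold:
                current -= hits
                current.add(parent)
                changed = True
    return current


# Hypothetical two-level document: doc -> sec1 (c1, c2, c3), sec2 (c4, c5).
children_of = {"doc": ["sec1", "sec2"], "sec1": ["c1", "c2", "c3"], "sec2": ["c4", "c5"]}
parent_of = {child: parent for parent, kids in children_of.items() for child in kids}

merged = auto_merge(["c1", "c2", "c4"], parent_of, children_of)
```

Here two of the three chunks under `sec1` were retrieved, so they are merged into `sec1`, while the single hit under `sec2` stays as a leaf chunk.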
[92] GTA: Supervised-Guided Reinforcement Learning for Text Classification with Large Language Models
Min Zeng, Jingfei Sun, Xueyou Luo, Caiquan Liu, Shiqi Zhang, Li Xie, Xiaoxin Chen
Main category: cs.CL
TL;DR: GTA framework combines supervised fine-tuning efficiency with reinforcement learning capability gains through a guess-think-answer process using cross-entropy loss and RL rewards.
Details
Motivation: Address the efficiency-capability trade-off between pure RL (slow convergence, inefficient exploration) and pure SFT (limited performance ceiling, weak theoretical foundation) in NLP tasks.
Method: Proposes Guess-Think-Answer framework: model produces provisional guess (cross-entropy loss), reflects on it, then generates final answer with RL rewards shaping both output and structure format. Uses loss masking and gradient constraints to mitigate training signal conflicts.
Result: Substantially accelerates convergence while outperforming both standalone SFT and RL baselines on four text classification benchmarks.
Conclusion: GTA successfully combines SFT efficiency with RL capability gains, achieving faster convergence than pure RL and higher performance ceiling than pure SFT through hybrid training paradigm.
Abstract: In natural language processing tasks, pure reinforcement learning (RL) fine-tuning methods often suffer from inefficient exploration and slow convergence, while supervised fine-tuning (SFT) methods, although efficient in training, have a limited performance ceiling and a less solid theoretical foundation compared to RL. To address this efficiency-capability trade-off, we propose the Guess-Think-Answer (GTA) framework that combines the efficiency of SFT with the capability gains of RL in a unified training paradigm. GTA works by having the model first produce a provisional guess (optimized via cross-entropy loss), then reflect on this guess before generating the final answer, with RL rewards shaping both the final output and the format of the entire GTA structure. This hybrid approach achieves both faster convergence than pure RL and a higher performance ceiling than pure SFT. To mitigate gradient conflicts between the two training signals, we employ loss masking and gradient constraints. Empirical results on four text classification benchmarks demonstrate that GTA substantially accelerates convergence while outperforming both standalone SFT and RL baselines.
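One way to picture loss masking in this setting is a cross-entropy term that only counts the tokens of the provisional guess, leaving the remaining tokens to the RL signal. The abstract does not say which positions are masked, so the mask layout and the function name below are assumptions made for illustration.

```python
import math


def masked_cross_entropy(token_log_probs, mask):
    """Mean negative log-likelihood over the positions where mask is 1."""
    selected = [-lp for lp, m in zip(token_log_probs, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0


# Hypothetical log-probabilities the model assigns to the target tokens.
token_log_probs = [math.log(0.5), math.log(0.25), math.log(0.9), math.log(0.8)]
# Assumed layout: the first two tokens are the guess, the rest are think/answer.
guess_mask = [1, 1, 0, 0]

sft_loss = masked_cross_entropy(token_log_probs, guess_mask)
```

With this mask the supervised loss is the average of -log(0.5) and -log(0.25); the last two tokens contribute nothing to it.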
cs.CV
[93] Artificial Intelligence in Breast Cancer Care: Transforming Preoperative Planning and Patient Education with 3D Reconstruction
Mustafa Khanbhai, Giulia Di Nardo, Jun Ma, Vivienne Freitas, Caterina Masino, Ali Dolatabadi, Zhaoxun “Lorenz” Liu, Wey Leong, Wagner H. Souza, Amin Madani
Main category: cs.CV
TL;DR: Novel human-in-the-loop ML approach using U-Mamba for 3D anatomical segmentation achieves high accuracy (DSC 0.97 for organs, 0.96 for tissue, 0.82 for tumors) and improves surgical planning and patient education.
Details
Motivation: Traditional segmentation models struggle with generalization across diverse datasets, requiring improved algorithms for accurate preoperative planning and 3D anatomical reconstruction beyond breast cancer applications.
Method: Processed 120 breast MRIs through anonymization, manual segmentation of T1-weighted/DCE sequences, co-registration, segmentation of breast structures, and 3D visualization using ITK-SNAP. Used human-in-the-loop approach with U-Mamba model for refinement.
Result: U-Mamba achieved DSC values of 0.97 (±0.013) for whole organs, 0.96 (±0.024) for fibroglandular tissue, and 0.82 (±0.12) for tumors. Generated accurate 3D reconstructions and received positive feedback from clinicians for improved planning and navigation.
Conclusion: The human-in-the-loop ML approach successfully generalizes algorithms for 3D reconstruction, offering enhanced visualization for clinicians, improved preoperative planning, and more effective patient education across medical applications.
Abstract: Effective preoperative planning requires accurate algorithms for segmenting anatomical structures across diverse datasets, but traditional models struggle with generalization. This study presents a novel machine learning methodology to improve algorithm generalization for 3D anatomical reconstruction beyond breast cancer applications. We processed 120 retrospective breast MRIs (January 2018-June 2023) through three phases: anonymization and manual segmentation of T1-weighted and dynamic contrast-enhanced sequences; co-registration and segmentation of whole breast, fibroglandular tissue, and tumors; and 3D visualization using ITK-SNAP. A human-in-the-loop approach refined segmentations using U-Mamba, designed to generalize across imaging scenarios. Dice similarity coefficient assessed overlap between automated segmentation and ground truth. Clinical relevance was evaluated through clinician and patient interviews. U-Mamba showed strong performance with DSC values of 0.97 (±0.013) for whole organs, 0.96 (±0.024) for fibroglandular tissue, and 0.82 (±0.12) for tumors on T1-weighted images. The model generated accurate 3D reconstructions enabling visualization of complex anatomical features. Clinician interviews indicated improved planning, intraoperative navigation, and decision support. Integration of 3D visualization enhanced patient education, communication, and understanding. This human-in-the-loop machine learning approach successfully generalizes algorithms for 3D reconstruction and anatomical segmentation across patient datasets, offering enhanced visualization for clinicians, improved preoperative planning, and more effective patient education, facilitating shared decision-making and empowering informed patient choices across medical applications.
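For readers unfamiliar with the metric, the Dice similarity coefficient (DSC) reported above is twice the overlap between predicted and ground-truth masks divided by the sum of their sizes. The sketch below computes it on flattened binary masks; the toy masks and the convention of returning 1.0 for two empty masks are assumptions.

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient for two equal-length binary masks."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks agree perfectly by convention (assumption).
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical flattened masks: 1 = structure, 0 = background.
pred = [1, 1, 1, 0, 0]
truth = [0, 1, 1, 1, 0]
score = dice_coefficient(pred, truth)  # 2 * 2 / (3 + 3)
```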
[94] RU-Net for Automatic Characterization of TRISO Fuel Cross Sections
Lu Cai, Fei Xu, Min Xian, Yalei Tang, Shoukun Sun, John Stempien
Main category: cs.CV
TL;DR: CNN-based automated segmentation of TRISO particle layers reduces manual labor in irradiation analysis; the custom RU-Net achieves the best Intersection over Union (IoU) among the tested architectures.
Details
Motivation: Manual analysis of TRISO particle irradiation effects is time-consuming, subjective, and inefficient for statistical data collection from thousands of particles per compact.
Method: Used convolutional neural networks (CNNs) including custom RU-Net, U-Net, ResNet, and Attention U-Net to automatically segment microscopic TRISO layer images from a dataset of 2,000+ annotated cross-sectional images.
Result: RU-Net performed best among the tested models in terms of Intersection over Union (IoU) metric for layer segmentation accuracy.
Conclusion: CNN models enable faster, more objective analysis of TRISO particle cross-sections, significantly reducing manual labor while improving segmentation objectivity.
Abstract: During irradiation, phenomena such as kernel swelling and buffer densification may impact the performance of tristructural isotropic (TRISO) particle fuel. Post-irradiation microscopy is often used to identify these irradiation-induced morphologic changes. However, each fuel compact generally contains thousands of TRISO particles. Manually performing the work to get statistical information on these phenomena is cumbersome and subjective. To reduce the subjectivity inherent in that process and to accelerate data analysis, we used convolutional neural networks (CNNs) to automatically segment cross-sectional images of microscopic TRISO layers. CNNs are a class of machine-learning algorithms specifically designed for processing structured grid data. They have gained popularity in recent years due to their remarkable performance in various computer vision tasks, including image classification, object detection, and image segmentation. In this research, we generated a large irradiated TRISO layer dataset with more than 2,000 microscopic images of cross-sectional TRISO particles and the corresponding annotated images. Based on these annotated images, we used different CNNs to automatically segment different TRISO layers. These CNNs include RU-Net (developed in this study), as well as three existing architectures: U-Net, Residual Network (ResNet), and Attention U-Net. The preliminary results show that the model based on RU-Net performs best in terms of Intersection over Union (IoU). Using CNN models, we can expedite the analysis of TRISO particle cross sections, significantly reducing the manual labor involved and improving the objectivity of the segmentation results.
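The models above are compared by Intersection over Union (IoU). For a multi-layer segmentation, IoU is typically computed per class and then averaged; the sketch below does this on flattened label maps. The class indices and the toy labels are hypothetical, and the abstract does not state how the authors aggregate IoU across layers.

```python
def per_class_iou(pred, truth, num_classes):
    """IoU for each class label over two equal-length label sequences."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        # A class absent from both maps has no defined IoU.
        ious.append(inter / union if union else float("nan"))
    return ious


# Hypothetical labels: 0 = background, 1 = kernel, 2 = buffer.
pred = [0, 0, 1, 1, 2, 2]
truth = [0, 1, 1, 1, 2, 0]
ious = per_class_iou(pred, truth, num_classes=3)
mean_iou = sum(ious) / len(ious)
```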
[95] Modular, On-Site Solutions with Lightweight Anomaly Detection for Sustainable Nutrient Management in Agriculture
Abigail R. Cohen, Yuming Sun, Zhihao Qin, Harsh S. Muriki, Zihao Xiao, Yeonju Lee, Matthew Housley, Andrew F. Sharkey, Rhuanito S. Ferrarezi, Jing Li, Lu Gan, Yongsheng Chen
Main category: cs.CV
TL;DR: A tiered pipeline using multispectral imaging and autoencoders for efficient nutrient anomaly detection and crop status estimation, balancing accuracy with energy efficiency for sustainable agriculture.
Details
Motivation: Current nutrient management approaches require lengthy analyses preventing real-time optimization, and imaging-based phenotyping is computationally intensive for resource-constrained environments.
Method: Hierarchical pipeline using autoencoder for early anomaly detection, comparing vegetation index features with Random Forest vs. raw whole-image Vision Transformer for detailed nutrient status estimation across three fertilizer treatments.
Result: 73% net detection of severely nutrient-deficient samples 9 days after transplanting at lower energy cost than wasted nitrogen; ViT outperformed RF on phosphorus and calcium estimation (R² 0.61 vs 0.58, 0.48 vs 0.35) but with higher energy consumption.
Conclusion: The modular pipeline enables edge diagnostics and practical opportunities for agricultural sustainability by providing flexible trade-offs between efficiency and accuracy in nutrient management.
Abstract: Efficient nutrient management is critical for crop growth and sustainable resource consumption (e.g., nitrogen, energy). Current approaches require lengthy analyses, preventing real-time optimization; similarly, imaging facilitates rapid phenotyping but can be computationally intensive, preventing deployment under resource constraints. This study proposes a flexible, tiered pipeline for anomaly detection and status estimation (fresh weight, dry mass, and tissue nutrients), including a comprehensive energy analysis of approaches that span the efficiency-accuracy spectrum. Using a nutrient depletion experiment with three treatments (T1-100%, T2-50%, and T3-25% fertilizer strength) and multispectral imaging (MSI), we developed a hierarchical pipeline using an autoencoder (AE) for early warning. Further, we compared two status estimation modules of different complexity for more detailed analysis: vegetation index (VI) features with machine learning (Random Forest, RF) and raw whole-image deep learning (Vision Transformer, ViT). Results demonstrated high-efficiency anomaly detection (73% net detection of T3 samples 9 days after transplanting) at substantially lower energy than embodied energy in wasted nitrogen. The status estimation modules show trade-offs, with ViT outperforming RF on phosphorus and calcium estimation (R² 0.61 vs. 0.58, 0.48 vs. 0.35) at higher energy cost. With our modular pipeline, this work opens opportunities for edge diagnostics and practical opportunities for agricultural sustainability.
[96] Humor in Pixels: Benchmarking Large Multimodal Models Understanding of Online Comics
Yuriel Ryan, Rui Yang Tan, Kenny Tsu Wei Choo, Roy Ka-Wei Lee
Main category: cs.CV
TL;DR: PixelHumor is a benchmark dataset of 2,800 annotated multi-panel comics designed to evaluate Large Multimodal Models’ ability to understand multimodal humor and narrative sequences, revealing significant performance gaps compared to humans.
Details
Motivation: Humor understanding is a core aspect of social intelligence but remains a significant challenge for Large Multimodal Models (LMMs), highlighting the need for better evaluation frameworks.
Method: Created PixelHumor benchmark with 2,800 annotated multi-panel comics and conducted experiments with state-of-the-art LMMs to evaluate their performance on humor interpretation and narrative sequencing tasks.
Result: Top LMMs achieved only 61% accuracy in panel sequencing (far below human performance), revealing substantial gaps in multimodal integration for narrative and humor understanding.
Conclusion: Current LMMs have critical limitations in integrating visual and textual cues for coherent narrative and humor understanding, and PixelHumor provides a rigorous framework to drive development of more socially aware models.
Abstract: Understanding humor is a core aspect of social intelligence, yet it remains a significant challenge for Large Multimodal Models (LMMs). We introduce PixelHumor, a benchmark dataset of 2,800 annotated multi-panel comics designed to evaluate LMMs’ ability to interpret multimodal humor and recognize narrative sequences. Experiments with state-of-the-art LMMs reveal substantial gaps: for instance, top models achieve only 61% accuracy in panel sequencing, far below human performance. This underscores critical limitations in current models’ integration of visual and textual cues for coherent narrative and humor understanding. By providing a rigorous framework for evaluating multimodal contextual and narrative reasoning, PixelHumor aims to drive the development of LMMs that better engage in natural, socially aware interactions.
[97] Palmprint De-Identification Using Diffusion Model for High-Quality and Diverse Synthesis
Licheng Yan, Bob Zhang, Andrew Beng Jin Teoh, Lu Leng, Shuyi Li, Yuqi Wang, Ziyuan Yang
Main category: cs.CV
TL;DR: A training-free framework using pre-trained diffusion models to generate diverse, high-quality palmprint images that conceal identity features while preserving visual fidelity and utility.
Details
Motivation: Palmprint recognition advances enable reliable identification but create security risks from publicly available images. There's a lack of research on palmprint anonymization methods to prevent misuse while maintaining image utility.
Method: Uses pre-trained diffusion models with semantic-guided embedding fusion and prior interpolation mechanism for stable, controllable synthesis. Introduces de-identification ratio metric for assessment.
Result: Effectively conceals identity traits with significant diversity across samples. Maintains high visual fidelity and excellent usability while balancing de-identification with non-identity information retention.
Conclusion: The proposed framework successfully addresses palmprint de-identification needs by generating identity-concealed images that preserve utility, offering a practical solution for privacy protection in palmprint applications.
Abstract: Palmprint recognition techniques have advanced significantly in recent years, enabling reliable recognition even when palmprints are captured in uncontrolled or challenging environments. However, this strength also introduces new risks, as publicly available palmprint images can be misused by adversaries for malicious activities. Despite this growing concern, research on methods to obscure or anonymize palmprints remains largely unexplored. Thus, it is essential to develop a palmprint de-identification technique capable of removing identity-revealing features while retaining the image’s utility and preserving non-sensitive information. In this paper, we propose a training-free framework that utilizes pre-trained diffusion models to generate diverse, high-quality palmprint images that conceal identity features for de-identification purposes. To ensure greater stability and controllability in the synthesis process, we incorporate a semantic-guided embedding fusion alongside a prior interpolation mechanism. We further propose the de-identification ratio, a novel metric for intuitive de-identification assessment. Extensive experiments across multiple palmprint datasets and recognition methods demonstrate that our method effectively conceals identity-related traits with significant diversity across de-identified samples. The de-identified samples preserve high visual fidelity and maintain excellent usability, achieving a balance between de-identification and retaining non-identity information.
[98] OnlineHOI: Towards Online Human-Object Interaction Generation and Perception
Yihong Ji, Yunze Liu, Yiyao Zhuo, Weijiang Yu, Fei Ma, Joshua Huang, Fei Yu
Main category: cs.CV
TL;DR: Proposes OnlineHOI framework for real-time human-object interaction tasks using Mamba architecture with memory mechanism, achieving SOTA results on online generation and perception tasks.
Details
Motivation: Current HOI methods operate in offline settings using full sequence information, but real-world applications require online processing with only current and historical data available at each time step.
Method: Introduces OnlineHOI framework based on Mamba architecture with memory mechanism, leveraging Mamba’s streaming data modeling capabilities and efficient historical information integration.
Result: Achieves state-of-the-art performance on Core4D and OAKINK2 online generation tasks, as well as online HOI4D perception task.
Conclusion: Offline methods perform poorly in the online setting, whereas the Mamba-based OnlineHOI framework with a memory mechanism handles streaming interaction data well, reaching state-of-the-art results on the proposed online generation and perception tasks.
Abstract: The perception and generation of Human-Object Interaction (HOI) are crucial for fields such as robotics, AR/VR, and human behavior understanding. However, current approaches model this task in an offline setting, where information at each time step can be drawn from the entire interaction sequence. In contrast, in real-world scenarios, the information available at each time step comes only from the current moment and historical data, i.e., an online setting. We find that offline methods perform poorly in an online context. Based on this observation, we propose two new tasks: Online HOI Generation and Perception. To address these tasks, we introduce the OnlineHOI framework, a network architecture based on the Mamba framework that employs a memory mechanism. By leveraging Mamba’s powerful modeling capabilities for streaming data and the memory mechanism’s efficient integration of historical information, we achieve state-of-the-art results on the Core4D and OAKINK2 online generation tasks, as well as the online HOI4D perception task.
[99] EfficientNet-Based Multi-Class Detection of Real, Deepfake, and Plastic Surgery Faces
Li Kun, Milena Radenkovic
Main category: cs.CV
TL;DR: Deep learning enables advanced deepfake technology that creates indistinguishable fake media, posing serious risks to privacy, reputation, national security, and democratic processes through face-swapping and political manipulation.
Details
Motivation: To analyze the dual nature of deep learning technology - while it drives progress in computer vision and revolutionary technologies, its application in deepfakes creates significant societal risks that need to be understood and addressed.
Method: The paper examines the development and applications of deepfake technology, focusing on how it produces counterfeit photographs and films that can bypass facial recognition systems and be used for malicious purposes like identity theft and political manipulation.
Result: Deepfake technology has advanced to create indistinguishable fake media that can damage privacy, ruin reputations of public figures, threaten national security, and undermine political systems through election interference and public perception manipulation.
Conclusion: The improper application of deepfake technology presents serious detrimental effects on society, including threats to personal privacy, political stability, and national security, requiring careful consideration and countermeasures to mitigate these risks.
Abstract: Currently, deep learning has been utilised to tackle several difficulties in our everyday lives. It not only exhibits progress in computer vision but also constitutes the foundation for several revolutionary technologies. Nonetheless, similar to all phenomena, the use of deep learning in diverse domains has produced a multifaceted interaction of advantages and disadvantages for human society. Deepfake technology has advanced, significantly impacting social life. However, developments in this technology can affect privacy, the reputations of prominent personalities, and national security via software development. It can produce indistinguishable counterfeit photographs and films, potentially impairing the functionality of facial recognition systems, so presenting a significant risk. The improper application of deepfake technology produces several detrimental effects on society. Face-swapping programs mislead users by altering persons’ appearances or expressions to fulfil particular aims or to appropriate personal information. Deepfake technology permeates daily life through such techniques. Certain individuals endeavour to sabotage election campaigns or subvert prominent political figures by creating deceptive pictures to influence public perception, causing significant harm to a nation’s political and economic structure.
[100] A Modern Look at Simplicity Bias in Image Classification Tasks
Xiaoguang Chang, Teng Wang, Changyin Sun
Main category: cs.CV
TL;DR: This paper investigates the simplicity bias (SB) in CLIP models and its relationship with performance across various image classification tasks, proposing a new frequency-aware measure for better SB assessment.
Details
Motivation: Existing studies on neural network simplicity bias focus on simple models or synthetic tasks, making it challenging to measure SB in large models like CLIP and understand its relevance to real image classification tasks.
Method: The authors theoretically analyze limitations of existing complexity measures, propose a frequency-aware measure for finer-grained SB differences, validate it on CLIP models using SB-modulation methods, and examine SB-performance relationships across zero-shot and fine-tuning settings.
Result: The proposed frequency-aware measure is more informative and consistent than previous measures. Experiments show varying behaviors - stronger SB correlates better with OOD generalization than adversarial robustness, indicating task-dependent benefits of simplicity bias.
Conclusion: The study highlights the importance of aligning a model’s inductive biases with target task characteristics, demonstrating that simplicity bias has varying impacts across different image classification scenarios.
Abstract: The simplicity bias (SB) of neural networks, i.e., their tendency to represent simple functions, is a key factor in their generalization capabilities. Recent studies show that an excessive SB may harm performance on complex tasks, and the need for this bias varies across tasks. Many of these studies focus on simple models or synthetic tasks. It remains challenging to measure the SB in large models and little is known about the relevance of the SB to various image classification tasks. In this paper, we investigate the relationship between the SB in CLIP models and their performance across image classification tasks. First, we theoretically analyze the potential limitation of existing measures of complexity that have been used to characterize small models. To address this, we propose a frequency-aware measure capturing finer-grained SB differences. We validate this measure on CLIP models subjected to two recent SB-modulation methods, demonstrating that it is more informative and consistent than previous measures. Second, we examine the relation between the SB of those models and their performance across a range of image classification tasks, including zero-shot and fine-tuning settings. These experiments reveal a range of behaviors. For example, a stronger SB correlates with a better performance on OOD generalization than on adversarial robustness. These results highlight the benefits of aligning a model’s inductive biases with the characteristics of the target task.
[101] GraphDerm: Fusing Imaging, Physical Scale, and Metadata in a Population-Graph Classifier for Dermoscopic Lesions
Mehdi Yousefzadeh, Parsa Esfahanian, Sara Rashidifar, Hossein Salahshoor Gavalan, Negar Sadat Rafiee Tabatabaee, Saeid Gorgin, Dara Rahmati, Maryam Daneshpazhooh
Main category: cs.CV
TL;DR: GraphDerm is a graph neural network framework that combines dermoscopic images, millimeter-scale calibration, and patient metadata for improved melanoma classification, outperforming image-only AI approaches.
Details
Motivation: Current AI systems for dermoscopy often ignore crucial patient metadata (age, sex, lesion site) and lack physical scale calibration needed for proper geometric analysis of lesions, limiting their diagnostic accuracy.
Method: The framework uses U-Nets for lesion and ruler segmentation, regresses pixels-per-millimeter from ruler masks, computes real-scale geometric descriptors, and employs a spectral GNN with EfficientNet-B3 features and metadata/geometry-based edges for semi-supervised classification.
Result: Achieved AUC of 0.9812 with thresholded variant maintaining 0.9788 AUC using only 25% of edges, significantly outperforming the image-only baseline (AUC 0.9440). Ruler and lesion segmentation reached Dice scores of 0.904 and 0.908 respectively.
Conclusion: Integrating calibrated scale, lesion geometry, and metadata in a population graph provides substantial improvements over image-only approaches, with sparser graphs maintaining near-optimal accuracy for efficient deployment.
Abstract: Introduction. Dermoscopy aids melanoma triage, yet image-only AI often ignores patient metadata (age, sex, site) and the physical scale needed for geometric analysis. We present GraphDerm, a population-graph framework that fuses imaging, millimeter-scale calibration, and metadata for multiclass dermoscopic classification, to the best of our knowledge the first ISIC-scale application of GNNs to dermoscopy. Methods. We curate ISIC 2018/2019, synthesize ruler-embedded images with exact masks, and train U-Nets (SE-ResNet-18) for lesion and ruler segmentation. Pixels-per-millimeter are regressed from the ruler-mask two-point correlation via a lightweight 1D-CNN. From lesion masks we compute real-scale descriptors (area, perimeter, radius of gyration). Node features use EfficientNet-B3; edges encode metadata/geometry similarity (fully weighted or thresholded). A spectral GNN performs semi-supervised node classification; an image-only ANN is the baseline. Results. Ruler and lesion segmentation reach Dice 0.904 and 0.908; scale regression attains MAE 1.5 px (RMSE 6.6). The graph attains AUC 0.9812, with a thresholded variant using about 25% of edges preserving AUC 0.9788 (vs. 0.9440 for the image-only baseline); per-class AUCs typically fall in the 0.97-0.99 range. Conclusion. Unifying calibrated scale, lesion geometry, and metadata in a population graph yields substantial gains over image-only pipelines on ISIC-2019. Sparser graphs retain near-optimal accuracy, suggesting efficient deployment. Scale-aware, graph-based AI is a promising direction for dermoscopic decision support; future work will refine learned edge semantics and evaluate on broader curated benchmarks.
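The step from a pixel mask to "real-scale descriptors" can be sketched as follows: once a pixels-per-millimeter factor is known, pixel counts and pixel distances convert directly to mm² and mm. The function below covers area and radius of gyration only, and it assumes the radius of gyration is the root-mean-square distance of lesion pixels from their centroid; the abstract does not give the exact definitions, and the names and toy mask are hypothetical.

```python
import math


def real_scale_descriptors(mask, pixels_per_mm):
    """Area (mm^2) and radius of gyration (mm) of a 2D binary lesion mask."""
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    n = len(coords)
    if n == 0:
        return {"area_mm2": 0.0, "radius_of_gyration_mm": 0.0}
    # Centroid of the lesion pixels, in pixel coordinates.
    center_r = sum(r for r, _ in coords) / n
    center_c = sum(c for _, c in coords) / n
    # Mean squared pixel distance from the centroid.
    mean_sq = sum((r - center_r) ** 2 + (c - center_c) ** 2 for r, c in coords) / n
    return {
        "area_mm2": n / pixels_per_mm ** 2,
        "radius_of_gyration_mm": math.sqrt(mean_sq) / pixels_per_mm,
    }


# Hypothetical 2x2-pixel lesion imaged at 2 pixels per millimeter.
mask = [
    [0, 0, 0],
    [0, 1, 1],
    [0, 1, 1],
]
descriptors = real_scale_descriptors(mask, pixels_per_mm=2.0)
```

Four pixels at 2 px/mm cover 1 mm², which is the kind of physical quantity an uncalibrated image-only model cannot recover.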
[102] ResidualViT for Efficient Temporally Dense Video Encoding
Mattia Soldan, Fabian Caba Heilbron, Bernard Ghanem, Josef Sivic, Bryan Russell
Main category: cs.CV
TL;DR: ResidualViT: A vision transformer architecture that reduces computational cost for temporally dense video tasks by leveraging temporal redundancy, using learnable residual connections and token reduction, achieving 60% cost reduction and 2.5x speedup while maintaining accuracy.
Details
Motivation: Temporally dense video tasks require high temporal resolution frame processing, which is computationally expensive. There's a need to reduce feature computation costs while maintaining performance.
Method: Proposes ResidualViT architecture with learnable residual connections for temporal consistency and token reduction module to discard redundant information. Uses lightweight distillation to approximate original foundation model features.
Result: Achieves up to 60% reduction in computational cost and 2.5x faster inference speed across four tasks and five datasets, while closely matching the accuracy of the original foundation model in both zero-shot and supervised settings.
Conclusion: ResidualViT effectively reduces computational burden for temporally dense video tasks by exploiting temporal redundancy, making high-resolution video processing more efficient without significant accuracy loss.
Abstract: Several video understanding tasks, such as natural language temporal video grounding, temporal activity localization, and audio description generation, require “temporally dense” reasoning over frames sampled at high temporal resolution. However, computing frame-level features for these tasks is computationally expensive given the temporal resolution requirements. In this paper, we make three contributions to reduce the cost of computing features for temporally dense tasks. First, we introduce a vision transformer (ViT) architecture, dubbed ResidualViT, that leverages the large temporal redundancy in videos to efficiently compute temporally dense frame-level features. Our architecture incorporates (i) learnable residual connections that ensure temporal consistency across consecutive frames and (ii) a token reduction module that enhances processing speed by selectively discarding temporally redundant information while reusing weights of a pretrained foundation model. Second, we propose a lightweight distillation strategy to approximate the frame-level features of the original foundation model. Finally, we evaluate our approach across four tasks and five datasets, in both zero-shot and fully supervised settings, demonstrating significant reductions in computational cost (up to 60%) and improvements in inference speed (up to 2.5x faster), all while closely approximating the accuracy of the original foundation model.
[103] PATIMT-Bench: A Multi-Scenario Benchmark for Position-Aware Text Image Machine Translation in Large Vision-Language Models
Wanru Zhuang, Wenbo Li, Zhibin Lan, Xu Han, Peng Li, Jinsong Su
Main category: cs.CV
TL;DR: This paper introduces position-aware text image machine translation (PATIMT) with layout preservation, creates a benchmark dataset (PATIMTBench) across 10 real-world scenarios, and shows compact LVLMs achieve SOTA performance after fine-tuning.
Details
Motivation: Current TIMT methods only translate all text in images without providing bounding boxes or handling diverse scenarios, limiting practical applications that require fine-grained, layout-preserving translations.
Method: Extended TIMT to PATIMT with two sub-tasks: region-specific translation and full-image translation with grounding. Developed PATIMTBench with 10 diverse scenarios using an Adaptive Image OCR Refinement Pipeline and manually annotated 1,200 high-quality test instances.
Result: Compact Large Vision-Language Models achieved state-of-the-art performance on both sub-tasks after fine-tuning on the constructed data, demonstrating scalability and generalizability.
Conclusion: The proposed PATIMT framework and benchmark successfully address limitations of traditional TIMT, enabling fine-grained layout-preserving translations with practical value across diverse real-world scenarios.
Abstract: Text Image Machine Translation (TIMT) aims to translate texts embedded within an image into another language. Current TIMT studies primarily focus on providing translations for all the text within an image, while neglecting to provide bounding boxes and covering limited scenarios. In this work, we extend traditional TIMT into position-aware TIMT (PATIMT), aiming to support fine-grained and layout-preserving translation, which holds great practical value but remains largely unexplored. This task comprises two key sub-tasks: region-specific translation and full-image translation with grounding. To support existing models on PATIMT and conduct fair evaluation, we construct the PATIMT benchmark (PATIMTBench), which consists of 10 diverse real-world scenarios. Specifically, we introduce an Adaptive Image OCR Refinement Pipeline, which adaptively selects appropriate OCR tools based on scenario and refines the results of text-rich images. To ensure evaluation reliability, we further construct a test set, which contains 1,200 high-quality instances manually annotated and reviewed by human experts. After fine-tuning on our data, compact Large Vision-Language Models (LVLMs) achieve state-of-the-art performance on both sub-tasks. Experimental results also highlight the scalability and generalizability of our training data.
[104] Image Realness Assessment and Localization with Multimodal Features
Lovish Kaushik, Agnij Biswas, Somdyuti Paul
Main category: cs.CV
TL;DR: A framework for assessing perceptual realness of AI-generated images using vision-language models to provide both overall realness scores and local inconsistency maps.
Details
Motivation: Need for reliable quantification of perceptual realness in AI-generated images and identification of visually inconsistent regions to improve photorealism through realness feedback during training.
Method: Multimodal approach using textual descriptions of visual inconsistencies generated by vision-language models trained on large datasets as substitutes for human annotations.
Result: Improved objective realness prediction performance and production of dense realness maps that effectively distinguish between realistic and unrealistic spatial regions.
Conclusion: The proposed framework successfully accomplishes both overall realness assessment and local inconsistency identification, providing valuable feedback for improving generative AI photorealism.
Abstract: A reliable method of quantifying the perceptual realness of AI-generated images and identifying visually inconsistent regions is crucial for practical use of AI-generated images and for improving photorealism of generative AI via realness feedback during training. This paper introduces a framework that accomplishes both overall objective realness assessment and local inconsistency identification of AI-generated images using textual descriptions of visual inconsistencies generated by vision-language models trained on large datasets that serve as reliable substitutes for human annotations. Our results demonstrate that the proposed multimodal approach improves objective realness prediction performance and produces dense realness maps that effectively distinguish between realistic and unrealistic spatial regions.
[105] Domain Adaptive SAR Wake Detection: Leveraging Similarity Filtering and Memory Guidance
He Gao, Baoxiang Huang, Milena Radenkovic, Borui Li, Ge Chen
Main category: cs.CV
TL;DR: SimMemDA framework for unsupervised domain adaptation from optical to SAR images for ship wake detection, using style transfer, similarity filtering, and memory-guided pseudo-label calibration.
Details
Motivation: SAR images are crucial for wake detection but have abstract/noisy features making annotation difficult. Optical images provide clearer cues but models trained on optical data degrade when applied to SAR due to domain shift.
Method: Uses WakeGAN for style transfer to generate SAR-like pseudo-images, instance-level feature similarity filtering to prioritize target-like samples, and Feature-Confidence Memory Bank with KNN fusion to dynamically calibrate pseudo-labels. Includes region-mixed training combining source annotations with calibrated target labels.
Result: Experimental results demonstrate improved accuracy and robustness for cross-modal ship wake detection tasks.
Conclusion: The proposed SimMemDA method effectively addresses cross-modal domain adaptation challenges and validates the feasibility of the approach for SAR wake detection.
Abstract: Synthetic Aperture Radar (SAR), with its all-weather and wide-area observation capabilities, serves as a crucial tool for wake detection. However, due to its complex imaging mechanism, wake features in SAR images often appear abstract and noisy, posing challenges for accurate annotation. In contrast, optical images provide more distinct visual cues, but models trained on optical data suffer from performance degradation when applied to SAR images due to domain shift. To address this cross-modal domain adaptation challenge, we propose a Similarity-Guided and Memory-Guided Domain Adaptation (termed SimMemDA) framework for unsupervised domain adaptive ship wake detection via instance-level feature similarity filtering and feature memory guidance. Specifically, to alleviate the visual discrepancy between optical and SAR images, we first utilize WakeGAN to perform style transfer on optical images, generating pseudo-images close to the SAR style. Then, an instance-level feature similarity filtering mechanism is designed to identify and prioritize source samples with target-like distributions, minimizing negative transfer. Meanwhile, a Feature-Confidence Memory Bank combined with a K-nearest neighbor confidence-weighted fusion strategy is introduced to dynamically calibrate pseudo-labels in the target domain, improving their reliability and stability. Finally, the framework further enhances generalization through region-mixed training, strategically combining source annotations with calibrated target pseudo-labels. Experimental results demonstrate that the proposed SimMemDA method can improve the accuracy and robustness of cross-modal ship wake detection tasks, validating the effectiveness and feasibility of the proposed method.
[106] Uncertainty-Aware Hourly Air Temperature Mapping at 2 km Resolution via Physics-Guided Deep Learning
Shengjie Kris Liu, Siqin Wang, Lu Zhang
Main category: cs.CV
TL;DR: A deep learning approach called Amplifier Air-Transformer generates hourly 2km resolution air temperature data over the contiguous US by reconstructing cloud-obscured satellite data and transforming it using Earth surface properties, achieving 1.93°C accuracy.
Details
Motivation: No single data source provides seamless spatiotemporal air temperature data - weather stations offer continuous monitoring but limited coverage, while satellites provide broad coverage but with gaps from cloud obscuration.
Method: Two-stage neural network approach: 1) Reconstructs GOES-16 surface temperature data through a network encoded with annual temperature cycle, using linear amplification and convolutional layers 2) Transforms reconstructed surface temperature to air temperature using Earth surface properties, enhanced with deep ensemble learning for uncertainty estimation.
Result: Built and tested on 77.7 billion surface temperature pixels and 155 million weather station records (2018-2024), achieving hourly air temperature mapping accuracy of 1.93°C in station-based validation.
Conclusion: The approach successfully streamlines surface temperature reconstruction and air temperature prediction, can be extended to other satellite sources for seamless high-resolution temperature monitoring, with generated data publicly available.
Abstract: Near-surface air temperature is a key physical property of the Earth’s surface. Although weather stations offer continuous monitoring and satellites provide broad spatial coverage, no single data source offers seamless data in a spatiotemporal fashion. Here, we propose a data-driven, physics-guided deep learning approach to generate hourly air temperature data at 2 km resolution over the contiguous United States. The approach, called Amplifier Air-Transformer, first reconstructs GOES-16 surface temperature data obscured by clouds. It does so through a neural network encoded with the annual temperature cycle, incorporating a linear term to amplify ERA5 temperature values at finer scales and convolutional layers to capture spatiotemporal variations. Then, another neural network transforms the reconstructed surface temperature into air temperature by leveraging its latent relationship with key Earth surface properties. The approach is further enhanced with predictive uncertainty estimation through deep ensemble learning to improve reliability. The proposed approach is built and tested on 77.7 billion surface temperature pixels and 155 million air temperature records from weather stations across the contiguous United States (2018-2024), achieving hourly air temperature mapping accuracy of 1.93 °C in station-based validation. The proposed approach streamlines surface temperature reconstruction and air temperature prediction, and it can be extended to other satellite sources for seamless air temperature monitoring at high spatiotemporal resolution. The generated data of this study can be downloaded at https://doi.org/10.5281/zenodo.15252812, and the project webpage can be found at https://skrisliu.com/HourlyAirTemp2kmUSA/.
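The deep-ensemble uncertainty step mentioned in this abstract can be illustrated with a minimal sketch. It assumes each ensemble member outputs one air temperature prediction for a given pixel and hour, and summarizes the members by their mean and spread. The member values and the function name `ensemble_summary` are hypothetical, and the paper's actual uncertainty formulation may differ.

```python
import statistics


def ensemble_summary(member_predictions):
    """Return (mean, std) across ensemble member predictions for one pixel-hour."""
    mean = statistics.fmean(member_predictions)
    # Population standard deviation: spread of the members around their mean,
    # used here as a simple proxy for predictive uncertainty.
    std = statistics.pstdev(member_predictions)
    return mean, std


# Hypothetical predictions (degrees C) from five independently trained members.
predictions_c = [21.0, 22.0, 20.0, 23.0, 19.0]
mean_c, std_c = ensemble_summary(predictions_c)
print(f"mean={mean_c:.2f} C, std={std_c:.2f} C")
```

A larger spread across members flags pixels where the prediction is less reliable, for example under persistent cloud cover.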
[107] DS@GT AnimalCLEF: Triplet Learning over ViT Manifolds with Nearest Neighbor Classification for Animal Re-identification
Anthony Miyaguchi, Chandrasekaran Maruthaiyannan, Charles R. Clark
Main category: cs.CV
TL;DR: Post-hoc metric learning effectiveness depends heavily on backbone embedding quality and domain-specificity. Domain-specific MegaDescriptor outperforms general-purpose DINOv2 for animal re-identification, with minimal gains from metric learning on general-purpose features.
Details
Motivation: To investigate how the effectiveness of post-hoc metric learning depends on the initial quality and domain-specificity of backbone embeddings for animal re-identification tasks, comparing general-purpose vs domain-specific models.
Method: Used K-Nearest Neighbor classifier with robust thresholding for identification, compared DINOv2 (general-purpose) vs MegaDescriptor (domain-specific) backbones, and applied triplet-learning projection head to improve performance.
Result: Triplet-learning improved specialized MegaDescriptor by 0.13 points but only gained 0.03 for general-purpose DINOv2 on averaged BAKS and BAUS metrics. General-purpose manifold proved difficult to reshape for fine-grained tasks.
Conclusion: General-purpose features have critical limitations for specialized re-ID tasks with limited data. Domain-specific pre-training is crucial, as general-purpose manifolds are resistant to metric learning refinement for fine-grained tasks.
Abstract: This paper details the DS@GT team’s entry for the AnimalCLEF 2025 re-identification challenge. Our key finding is that the effectiveness of post-hoc metric learning is highly contingent on the initial quality and domain-specificity of the backbone embeddings. We compare a general-purpose model (DINOv2) with a domain-specific model (MegaDescriptor) as a backbone. A K-Nearest Neighbor classifier with robust thresholding then identifies known individuals or flags new ones. While a triplet-learning projection head improved the performance of the specialized MegaDescriptor model by 0.13 points, it yielded minimal gains (0.03) for the general-purpose DINOv2 on averaged BAKS and BAUS. We demonstrate that the general-purpose manifold is more difficult to reshape for fine-grained tasks, as evidenced by stagnant validation loss and qualitative visualizations. This work highlights the critical limitations of refining general-purpose features for specialized, limited-data re-ID tasks and underscores the importance of domain-specific pre-training. The implementation for this work is publicly available at github.com/dsgt-arc/animalclef-2025.
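The identification step described here (nearest-neighbor matching with a threshold that flags unseen individuals) can be sketched as follows. This is a simplified illustration with k = 1 and cosine similarity. The threshold value, the label `new_individual`, and the toy embeddings are assumptions, not the team's actual configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(query, gallery, threshold=0.8, new_label="new_individual"):
    """Return the identity of the most similar gallery embedding,
    or new_label when the best similarity falls below the threshold."""
    best_identity, best_score = new_label, -1.0
    for embedding, identity in gallery:
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_identity, best_score = identity, score
    return best_identity if best_score >= threshold else new_label


# Hypothetical 2-D embeddings standing in for backbone features.
gallery = [([1.0, 0.0], "lynx_A"), ([0.0, 1.0], "turtle_B")]
print(identify([0.9, 0.1], gallery))  # close to lynx_A
print(identify([0.7, 0.7], gallery))  # ambiguous, flagged as new
```

The threshold trades off the two error types: raising it flags more queries as new individuals, while lowering it assigns more queries to known identities.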
[108] GhostNetV3-Small: A Tailored Architecture and Comparative Study of Distillation Strategies for Tiny Images
Florian Zager, Hamza A. A. Gardi
Main category: cs.CV
TL;DR: GhostNetV3-Small variant outperforms original GhostNetV3 on CIFAR-10 with 93.94% accuracy, but knowledge distillation techniques surprisingly reduced performance, showing architectural adaptation is more effective than distillation for small-scale image classification.
Details
Motivation: Deep neural networks have high computational demands unsuitable for resource-constrained edge devices, requiring efficient model compression and adaptation strategies for mobile deployment.
Method: Proposed GhostNetV3-Small architecture optimized for low-resolution inputs (CIFAR-10) and evaluated knowledge distillation techniques including traditional distillation, teacher assistants, and teacher ensembles.
Result: GhostNetV3-Small achieved 93.94% accuracy on CIFAR-10, significantly outperforming original GhostNetV3. All distillation strategies unexpectedly reduced accuracy compared to baseline training.
Conclusion: Architectural adaptation is more impactful than knowledge distillation for small-scale image classification tasks, indicating need for further research on model design and advanced distillation techniques for low-resolution domains.
Abstract: Deep neural networks have achieved remarkable success across a range of tasks; however, their computational demands often make them unsuitable for deployment on resource-constrained edge devices. This paper explores strategies for compressing and adapting models to enable efficient inference in such environments. We focus on GhostNetV3, a state-of-the-art architecture for mobile applications, and propose GhostNetV3-Small, a modified variant designed to perform better on low-resolution inputs such as those in the CIFAR-10 dataset. In addition to architectural adaptation, we provide a comparative evaluation of knowledge distillation techniques, including traditional knowledge distillation, teacher assistants, and teacher ensembles. Experimental results show that GhostNetV3-Small significantly outperforms the original GhostNetV3 on CIFAR-10, achieving an accuracy of 93.94%. Contrary to expectations, all examined distillation strategies led to reduced accuracy compared to baseline training. These findings indicate that architectural adaptation can be more impactful than distillation in small-scale image classification tasks, highlighting the need for further research on effective model design and advanced distillation techniques for low-resolution domains.
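For readers unfamiliar with "traditional knowledge distillation", the sketch below shows the commonly used loss: a temperature-softened KL term between teacher and student distributions, mixed with the usual cross-entropy on the true label. The temperature, the mixing weight `alpha`, and the logits are illustrative assumptions. The summary does not state which hyperparameters or exact variant the authors used.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions.
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    # Standard cross-entropy of the student against the hard label.
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * ce


# Hypothetical logits for a 3-class problem; the true class is index 0.
loss = distillation_loss([1.5, 0.5, 0.2], [3.0, 1.0, 0.1], label=0)
print(f"distillation loss: {loss:.4f}")
```

When the student matches the teacher exactly, the KL term vanishes and only the cross-entropy term remains.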
[109] From Orthomosaics to Raw UAV Imagery: Enhancing Palm Detection and Crown-Center Localization
Rongkun Zhu, Kangning Cui, Wei Tang, Rui-Feng Wang, Sarra Alqahtani, David Lutz, Fan Yang, Paul Fine, Jordan Karubian, Robert Plemmons, Jean-Michel Morel, Victor Pauca, Miles Silman
Main category: cs.CV
TL;DR: Raw UAV imagery outperforms orthomosaics for palm tree detection and localization in tropical forests, with crown-center annotations further improving accuracy for ecological monitoring applications.
Details
Motivation: Orthomosaic imagery from UAVs has stitching artifacts and heavy preprocessing that limit field deployment suitability, while raw UAV imagery may offer better performance for tree mapping in ecological monitoring.
Method: Used state-of-the-art detectors and keypoint models to compare detection performance between orthomosaic and raw UAV imagery, including within-domain and cross-domain transfer analysis, and evaluated crown-center annotations vs bounding-box centroids.
Result: Raw imagery yields superior performance in deployment-relevant scenarios, while orthomosaics retain value for robust cross-domain generalization. Crown-center annotations improve localization accuracy beyond bounding-box centroids.
Conclusion: Raw UAV imagery with crown-center annotations provides practical advantages for biodiversity and conservation monitoring, offering precise tree positions for downstream ecological analyses.
Abstract: Accurate mapping of individual trees is essential for ecological monitoring and forest management. Orthomosaic imagery from unmanned aerial vehicles (UAVs) is widely used, but stitching artifacts and heavy preprocessing limit its suitability for field deployment. This study explores the use of raw UAV imagery for palm detection and crown-center localization in tropical forests. Two research questions are addressed: (1) how detection performance varies across orthomosaic and raw imagery, including within-domain and cross-domain transfer, and (2) to what extent crown-center annotations improve localization accuracy beyond bounding-box centroids. Using state-of-the-art detectors and keypoint models, we show that raw imagery yields superior performance in deployment-relevant scenarios, while orthomosaics retain value for robust cross-domain generalization. Incorporating crown-center annotations in training further improves localization and provides precise tree positions for downstream ecological analyses. These findings offer practical guidance for UAV-based biodiversity and conservation monitoring.
[110] DYNAMO: Dependency-Aware Deep Learning Framework for Articulated Assembly Motion Prediction
Mayank Patel, Rahul Jain, Asim Unmesh, Karthik Ramani
Main category: cs.CV
TL;DR: MechBench benchmark dataset and DYNAMO neural model for predicting coupled mechanical motion in gear assemblies from static CAD geometry without joint annotations.
Details
Motivation: Existing methods for articulated objects assume simplified kinematics or require joint annotations, but mechanical assemblies like gears have motion arising from geometric coupling through meshing teeth or aligned axes, making it difficult to reason about relational motion from geometry alone.
Method: Introduced MechBench dataset with 693 synthetic gear assemblies and ground-truth motion trajectories. Proposed DYNAMO, a dependency-aware neural model that predicts per-part SE(3) motion trajectories directly from segmented CAD point clouds.
Result: DYNAMO outperforms strong baselines, achieving accurate and temporally consistent motion predictions across varied gear configurations.
Conclusion: MechBench and DYNAMO establish a novel systematic framework for data-driven learning of coupled mechanical motion in CAD assemblies, addressing the challenge of understanding motion from static geometry in mechanical systems.
Abstract: Understanding the motion of articulated mechanical assemblies from static geometry remains a core challenge in 3D perception and design automation. Prior work on everyday articulated objects such as doors and laptops typically assumes simplified kinematic structures or relies on joint annotations. However, in mechanical assemblies like gears, motion arises from geometric coupling, through meshing teeth or aligned axes, making it difficult for existing methods to reason about relational motion from geometry alone. To address this gap, we introduce MechBench, a benchmark dataset of 693 diverse synthetic gear assemblies with part-wise ground-truth motion trajectories. MechBench provides a structured setting to study coupled motion, where part dynamics are induced by contact and transmission rather than predefined joints. Building on this, we propose DYNAMO, a dependency-aware neural model that predicts per-part SE(3) motion trajectories directly from segmented CAD point clouds. Experiments show that DYNAMO outperforms strong baselines, achieving accurate and temporally consistent predictions across varied gear configurations. Together, MechBench and DYNAMO establish a novel systematic framework for data-driven learning of coupled mechanical motion in CAD assemblies.
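To make "motion arises from geometric coupling through meshing teeth" concrete, the sketch below applies the textbook relation for an external spur gear pair: the driven gear turns in the opposite direction at a speed scaled by the tooth-count ratio. This is the classical kinematic constraint, not DYNAMO's learned model, and the tooth counts and speeds are hypothetical.

```python
def driven_angular_velocity(driver_velocity, driver_teeth, driven_teeth):
    """Angular velocity of the driven gear in an external spur gear pair."""
    # Teeth pass the mesh point at the same rate on both gears, and
    # external meshing reverses the direction of rotation.
    return -driver_velocity * driver_teeth / driven_teeth


# Hypothetical pair: a 20-tooth driver at 10 rad/s meshing with a 40-tooth gear.
omega_driven = driven_angular_velocity(10.0, 20, 40)
print(omega_driven)  # -5.0 rad/s: half the speed, opposite direction
```

A model that predicts per-part trajectories from geometry alone has to recover dependencies of this kind without being given the tooth counts or joint definitions.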
[111] Cott-ADNet: Lightweight Real-Time Cotton Boll and Flower Detection Under Field Conditions
Rui-Feng Wang, Mingrui Xu, Matthew C Bauer, Iago Beffart Schardong, Xiaowen Ma, Kangning Cui
Main category: cs.CV
TL;DR: Cott-ADNet is a lightweight real-time cotton boll detector that achieves high accuracy (93.3% mAP50) with low computational cost (7.5 GFLOPs) for automated cotton harvesting applications.
Details
Motivation: Cotton harvesting remains labor-intensive with yield losses from missing optimal harvest windows. Accurate recognition of cotton bolls and maturity is essential for automation, yield estimation, and breeding research.
Method: Built on YOLOv11n with improved convolutional designs, NeLU-enhanced Global Attention Mechanism for weak feature capture, and Dilated Receptive Field SPPF for multi-scale context modeling at low computational cost.
Result: Achieves 91.5% Precision, 89.8% Recall, 93.3% mAP50, 71.3% mAP, and 90.6% F1-Score with only 7.5 GFLOPs, maintaining stable performance under multi-scale and rotational variations.
Conclusion: Cott-ADNet provides an accurate and efficient solution for in-field deployment, offering a reliable basis for automated cotton harvesting and high-throughput phenotypic analysis.
Abstract: Cotton is one of the most important natural fiber crops worldwide, yet harvesting remains limited by labor-intensive manual picking, low efficiency, and yield losses from missing the optimal harvest window. Accurate recognition of cotton bolls and their maturity is therefore essential for automation, yield estimation, and breeding research. We propose Cott-ADNet, a lightweight real-time detector tailored to cotton boll and flower recognition under complex field conditions. Building on YOLOv11n, Cott-ADNet enhances spatial representation and robustness through improved convolutional designs, while introducing two new modules: a NeLU-enhanced Global Attention Mechanism to better capture weak and low-contrast features, and a Dilated Receptive Field SPPF to expand receptive fields for more effective multi-scale context modeling at low computational cost. We curate a labeled dataset of 4,966 images, and release an external validation set of 1,216 field images to support future research. Experiments show that Cott-ADNet achieves 91.5% Precision, 89.8% Recall, 93.3% mAP50, 71.3% mAP, and 90.6% F1-Score with only 7.5 GFLOPs, maintaining stable performance under multi-scale and rotational variations. These results demonstrate Cott-ADNet as an accurate and efficient solution for in-field deployment, and thus provide a reliable basis for automated cotton harvesting and high-throughput phenotypic analysis. Code and dataset are available at https://github.com/SweefongWong/Cott-ADNet.
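The reported F1-Score is consistent with the reported precision and recall, since F1 is their harmonic mean. A quick check using the figures quoted above:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract (91.5% and 89.8%).
f1 = f1_score(0.915, 0.898)
print(f"F1 = {100 * f1:.1f}%")  # F1 = 90.6%
```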
[112] Deep learning for 3D point cloud processing – from approaches, tasks to its implications on urban and environmental applications
Zhenxin Zhang, Zhihua Xu, Yuwei Cao, Ningli Xu, Shuye Wang, Shen’ao Cui, Zhen Li, Rongjun Qin
Main category: cs.CV
TL;DR: This paper provides a meta-review of deep learning approaches for point cloud processing, focusing on practical applications like scene completion, registration, semantic segmentation, and modeling, while identifying gaps between research methods and real-world implementation.
Details
Motivation: Existing surveys primarily focus on network architectures for unordered point clouds but ignore practical considerations like large data volumes, diverse scenes, varying point density, and data modality that are critical for real-world applications.
Method: The authors conduct a comprehensive review of deep learning approaches and datasets covering key point cloud processing tasks including scene completion, registration, semantic segmentation, and modeling across various urban and environmental applications.
Result: The review identifies significant gaps that need to be addressed as these deep learning methods transition from research to practical applications, highlighting both algorithmic and practical limitations.
Conclusion: The paper concludes with insights on both algorithmic improvements needed and practical considerations required to successfully implement deep learning-based point cloud processing methods in real-world scenarios across various geomatics and computer vision applications.
Abstract: Point cloud processing, a fundamental task in geomatics and computer vision, has been supporting tasks and applications at different scales from air to ground, including mapping, environmental monitoring, urban/tree structure modeling, automated driving, robotics, and disaster response. Due to the rapid development of deep learning, point cloud processing algorithms are now almost exclusively dominated by learning-based approaches, most of which have yet to be transitioned into real-world practice. Existing surveys primarily focus on the ever-updating network architectures that accommodate unordered point clouds, largely ignoring their practical value in typical point cloud processing applications, in which extra-large data volumes, diverse scene contents, varying point density, and data modality need to be considered. In this paper, we provide a meta-review of deep learning approaches and datasets that cover a selection of critical point cloud processing tasks in use, such as scene completion, registration, semantic segmentation, and modeling. By reviewing a broad range of urban and environmental applications these tasks can support, we identify gaps to be closed as these methods are transformed into applications and draw concluding remarks on both the algorithmic and practical aspects of the surveyed methods.
[113] Optimal Transport Based Unsupervised Restoration Learning Exploiting Degradation Sparsity
Fei Wen, Wei Wang, Zeyu Yan, Wenbin Jiang
Main category: cs.CV
TL;DR: Sparsity-aware optimal transport (SOT) improves unsupervised restoration by incorporating frequency domain sparsity of degradations, outperforming standard OT and achieving superior results in super-resolution, deraining, and dehazing tasks.
Details
Motivation: Optimal transport shows promise for unsupervised restoration but falls short of supervised methods. The authors observed that degradations exhibit distinct sparsity in frequency domain, which can reduce ambiguity in inverse mapping.
Method: Proposed sparsity-aware optimal transport (SOT) framework that incorporates sparsity prior of degradations in frequency domain into optimal transport criterion.
Result: Extensive experiments show SOT offers significant performance gains over standard OT, achieves superior perceptual quality compared to supervised and unsupervised methods, and consistently outperforms existing unsupervised methods across all three tasks while narrowing gap to supervised counterparts.
Conclusion: Incorporating sparsity prior into optimal transport significantly benefits unsupervised restoration learning, making SOT an effective framework for challenging restoration tasks like super-resolution, deraining, and dehazing.
Abstract: Optimal transport (OT) has recently been shown as a promising criterion for unsupervised restoration when no explicit prior model is available. Despite its theoretical appeal, OT still significantly falls short of supervised methods on challenging tasks such as super-resolution, deraining, and dehazing. In this paper, we propose a sparsity-aware optimal transport (SOT) framework to bridge this gap by leveraging a key observation: the degradations in these tasks exhibit distinct sparsity in the frequency domain. Incorporating this sparsity prior into OT can significantly reduce the ambiguity of the inverse mapping for restoration and substantially boost performance. We provide analysis to show exploiting degradation sparsity benefits unsupervised restoration learning. Extensive experiments on real-world super-resolution, deraining, and dehazing demonstrate that SOT offers notable performance gains over standard OT, while achieving superior perceptual quality compared to existing supervised and unsupervised methods. In particular, SOT consistently outperforms existing unsupervised methods across all three tasks and narrows the performance gap to supervised counterparts.
[114] Two-Stage Decoupling Framework for Variable-Length Glaucoma Prognosis
Yiran Song, Yikai Zhang, Silvia Orengo-Nania, Nian Wang, Fenglong Ma, Rui Zhang, Yifan Peng, Mingquan Lin
Main category: cs.CV
TL;DR: A two-stage framework for glaucoma prognosis that handles variable-length sequential data and limited dataset sizes through self-supervised learning and attention mechanisms.
Details
Motivation: Existing glaucoma prognosis methods are constrained by fixed-length inputs and struggle with limited dataset sizes, requiring a more flexible and data-efficient approach.
Method: Two-Stage Decoupling Framework (TSDF): 1) Feature representation module using self-supervised learning to aggregate multiple datasets, 2) Temporal aggregation module with attention mechanism for variable-length sequential inputs.
Result: Extensive experiments on OHTS and GRAPE datasets show the approach is effective and robust across different scales and clinical settings.
Conclusion: The proposed framework significantly enhances model performance while maintaining compact parameter size, enabling better glaucoma prognosis with variable-length data and limited datasets.
Abstract: Glaucoma is one of the leading causes of irreversible blindness worldwide. Glaucoma prognosis is essential for identifying at-risk patients and enabling timely intervention to prevent blindness. Many existing approaches rely on historical sequential data but are constrained by fixed-length inputs, limiting their flexibility. Additionally, traditional glaucoma prognosis methods often employ end-to-end models, which struggle with the limited size of glaucoma datasets. To address these challenges, we propose a Two-Stage Decoupling Framework (TSDF) for variable-length glaucoma prognosis. In the first stage, we employ a feature representation module that leverages self-supervised learning to aggregate multiple glaucoma datasets for training, disregarding differences in their supervisory information. This approach enables datasets of varying sizes to learn better feature representations. In the second stage, we introduce a temporal aggregation module that incorporates an attention-based mechanism to process sequential inputs of varying lengths, ensuring flexible and efficient utilization of all available data. This design significantly enhances model performance while maintaining a compact parameter size. Extensive experiments on two benchmark glaucoma datasets, the Ocular Hypertension Treatment Study (OHTS) and the Glaucoma Real-world Appraisal Progression Ensemble (GRAPE), which differ significantly in scale and clinical settings, demonstrate the effectiveness and robustness of our approach.
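The second-stage idea, an attention mechanism that accepts any number of visits, can be sketched with simple attention pooling: each visit's feature vector gets a softmax weight, and the weighted sum has a fixed size regardless of sequence length. The query vector below stands in for learned parameters, and the feature values are hypothetical. The paper's actual temporal aggregation module is likely more elaborate.

```python
import math


def attention_pool(sequence, query):
    """Pool a variable-length list of feature vectors into one vector.

    Returns (pooled_vector, attention_weights).
    """
    # One scalar relevance score per time step (dot product with the query).
    scores = [sum(q * x for q, x in zip(query, features)) for features in sequence]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum over time steps, computed per feature dimension.
    pooled = [
        sum(w * features[i] for w, features in zip(weights, sequence))
        for i in range(len(query))
    ]
    return pooled, weights


# Hypothetical per-visit features for two patients with different visit counts.
query = [0.5, -0.25]
two_visits = [[1.0, 0.0], [0.0, 1.0]]
five_visits = [[0.2, 0.1], [0.4, 0.3], [0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
pooled_short, _ = attention_pool(two_visits, query)
pooled_long, weights_long = attention_pool(five_visits, query)
print(len(pooled_short), len(pooled_long))  # both equal the feature dimension
```

Because the output size depends only on the feature dimension, the same downstream prognosis head can be applied to patients with short or long visit histories.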
[115] Gradient-Free Adversarial Purification with Diffusion Models
Xuelong Dai, Dong Wang, Xiuzhen Cheng, Bin Xiao
Main category: cs.CV
TL;DR: A novel defense framework combining adversarial anti-aliasing and super-resolution techniques to counter both perturbation-based and unrestricted adversarial attacks without requiring retraining or gradient computations.
Details
Motivation: Existing defenses like adversarial training require costly retraining and are ineffective against unrestricted attacks, while purification methods suffer from low efficiency. Adversarial examples are sensitive to pixel-level perturbations and lie near decision boundaries.
Method: Proposes adversarial anti-aliasing to reduce pixel-level perturbation magnitude, and adversarial super-resolution for image restoration using clean dataset priors. Also introduces contrastive learning-based adversarial deblurring fine-tuning to enhance purification without retraining diffusion models.
Result: The framework provides effective and efficient defense against both perturbation-based and unrestricted adversarial attacks, requiring no additional training and being computationally efficient due to no gradient computations.
Conclusion: The proposed techniques offer a practical defense solution that addresses limitations of existing methods by combining preprocessing and restoration approaches while maintaining computational efficiency and broad attack coverage.
Abstract: Adversarial training and adversarial purification are two widely used defense strategies for enhancing model robustness against adversarial attacks. However, adversarial training requires costly retraining, while adversarial purification often suffers from low efficiency. More critically, existing defenses are primarily designed under the perturbation-based adversarial threat model, which is ineffective against recently introduced unrestricted adversarial attacks. In this paper, we propose an effective and efficient defense framework that counters both perturbation-based and unrestricted adversarial attacks. Our approach is motivated by the observation that adversarial examples typically lie near the decision boundary and are highly sensitive to pixel-level perturbations. To address this, we introduce adversarial anti-aliasing, a preprocessing technique that mitigates adversarial noise by reducing the magnitude of pixel-level perturbations. In addition, we propose adversarial super-resolution, which leverages prior knowledge from clean datasets to benignly restore high-quality images from adversarially degraded ones. Unlike image synthesis methods that generate entirely new images, adversarial super-resolution focuses on image restoration, making it more suitable for purification. Importantly, both techniques require no additional training and are computationally efficient since they do not rely on gradient computations. To further improve robustness across diverse datasets, we introduce a contrastive learning-based adversarial deblurring fine-tuning method. By incorporating adversarial priors during fine-tuning on the target dataset, this method enhances purification effectiveness without the need to retrain diffusion models.
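The claim that a gradient-free preprocessing step can shrink pixel-level perturbations can be illustrated with a plain low-pass filter. The sketch below uses a 3x3 mean (box) filter as a stand-in: the abstract does not specify the anti-aliasing operator, so `box_blur` and the checkerboard perturbation are illustrative assumptions rather than the paper's method.

```python
def box_blur(img, radius=1):
    """Mean filter over a (2*radius+1)^2 window, clipped at the image border.

    img is a list of rows of floats (a single-channel image).
    No gradients or training are involved, only local averaging.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc, count = 0.0, 0
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    acc += img[yy][xx]
                    count += 1
            out[y][x] = acc / count
    return out


# A flat grey image plus a high-frequency +/-0.1 checkerboard perturbation.
clean = [[0.5] * 6 for _ in range(6)]
perturbed = [[0.5 + (0.1 if (x + y) % 2 == 0 else -0.1) for x in range(6)] for y in range(6)]
filtered = box_blur(perturbed)

# Largest remaining deviation from the clean image, before and after filtering.
before = max(abs(perturbed[y][x] - clean[y][x]) for y in range(6) for x in range(6))
after = max(abs(filtered[y][x] - clean[y][x]) for y in range(6) for x in range(6))
```

Averaging cancels most of a perturbation that alternates sign between neighbouring pixels, at the cost of also blurring genuine detail, which is presumably why the abstract pairs the step with a restoration stage.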
[116] Image Tokenizer Needs Post-Training
Kai Qiu, Xiang Li, Hao Chen, Jason Kuen, Xiaohao Xu, Jiuxiang Gu, Yinyi Luo, Bhiksha Raj, Zhe Lin, Marios Savvides
Main category: cs.CV
TL;DR: A novel tokenizer training scheme with main-training and post-training phases to address the discrepancy between reconstruction and generation distributions in image generative models, improving latent space construction and decoding robustness.
Details
Motivation: Current image tokenizers prioritize reconstruction tasks but don't consider generation errors during sampling, creating a significant discrepancy between reconstruction and generation distributions that affects generative model performance.
Method: Proposes a two-phase training: main-training uses latent perturbation to simulate sampling noises and improve tokenizer robustness, and post-training optimizes the tokenizer decoder with a well-trained generative model to mitigate distribution differences. Also introduces pFID metric to evaluate tokenizer performance.
Result: Achieves 1.60 gFID with main training alone and further improves to 1.36 gFID with additional post-training using a ~400M generator. The method is validated across discrete/continuous tokenizers and autoregressive/diffusion generators.
Conclusion: The proposed tokenizer training scheme successfully addresses the reconstruction-generation discrepancy, significantly improving generation quality, convergence speed, and works effectively across different tokenizer types and generative models.
Abstract: Recent image generative models typically capture the image distribution in a pre-constructed latent space, relying on a frozen image tokenizer. However, there exists a significant discrepancy between the reconstruction and generation distributions: current tokenizers prioritize only the reconstruction task, which happens before generative training, without considering the generation errors during sampling. In this paper, we comprehensively analyze the reason for this discrepancy in a discrete latent space and, based on this analysis, propose a novel tokenizer training scheme including both main-training and post-training, focusing on improving latent space construction and decoding respectively. During the main training, a latent perturbation strategy is proposed to simulate sampling noises, i.e., the unexpected tokens generated in generative inference. Specifically, we propose a plug-and-play tokenizer training scheme, which significantly enhances the robustness of the tokenizer, thus boosting the generation quality and convergence speed, and a novel tokenizer evaluation metric, i.e., pFID, which successfully correlates the tokenizer performance to generation quality. During post-training, we further optimize the tokenizer decoder regarding a well-trained generative model to mitigate the distribution difference between generated and reconstructed tokens. With a ~400M generator, a discrete tokenizer trained with our proposed main training achieves a notable 1.60 gFID and further obtains 1.36 gFID with the additional post-training. Further experiments are conducted to broadly validate the effectiveness of our post-training strategy on off-the-shelf discrete and continuous tokenizers, coupled with autoregressive and diffusion-based generators.
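One simple way to picture a latent perturbation that mimics unexpected tokens at sampling time is to swap a random fraction of token indices for random codebook entries. The sketch below does exactly that; the abstract does not give the paper's perturbation rule, so `perturb_tokens`, the uniform replacement, and the `rate` parameter are assumptions for illustration.

```python
import random


def perturb_tokens(tokens, codebook_size, rate, rng):
    """Replace each token index with a uniformly random one with probability `rate`.

    tokens:        list of integer codebook indices for one image.
    codebook_size: number of entries in the (hypothetical) discrete codebook.
    rate:          probability of perturbing each position, in [0, 1].
    rng:           a random.Random instance, passed in for reproducibility.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    perturbed = []
    for token in tokens:
        if rng.random() < rate:
            perturbed.append(rng.randrange(codebook_size))
        else:
            perturbed.append(token)
    return perturbed


rng = random.Random(0)
original = [3, 7, 7, 1, 0, 5, 2, 6]
noisy = perturb_tokens(original, codebook_size=8, rate=0.25, rng=rng)
```

Training the decoder on `noisy` rather than `original` exposes it to imperfect token grids, which is the intuition behind making the tokenizer robust to generator errors.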
[117] Towards Foundational Models for Single-Chip Radar
Tianshu Huang, Akarsh Prabhakara, Chuhan Chen, Jay Karhade, Deva Ramanan, Matthew O’Toole, Anthony Rowe
Main category: cs.CV
TL;DR: A foundational model called GRT (Generalizable Radar Transformer) is trained on the largest raw mmWave radar dataset (1M samples) to achieve 3D occupancy and semantic segmentation with quality comparable to higher-resolution sensors, demonstrating strong generalization and data scaling properties.
Details
Motivation: mmWave radars are robust sensors but suffer from poor angular resolution, especially in inexpensive single-chip versions. While learning-based methods exist, there are no standardized foundational models or large datasets for mmWave radar, forcing practitioners to train task-specific models from scratch on small datasets.
Method: Collected the largest available raw radar dataset with 1M samples (29 hours) and trained a Generalizable Radar Transformer (GRT) foundational model for 4D single-chip radar that can predict 3D occupancy and semantic segmentation.
Result: GRT generalizes across diverse settings, can be fine-tuned for different tasks, shows logarithmic data scaling of 20% per 10x data increase, and using raw radar data significantly outperforms lossy representations (equivalent to 10x more training data). Estimated that ≈100M samples are needed to fully exploit GRT’s potential.
Conclusion: The GRT foundational model demonstrates that high-quality 3D perception is achievable with inexpensive single-chip mmWave radars when trained on sufficiently large datasets, overcoming the traditional limitation of poor angular resolution through deep learning approaches.
Abstract: mmWave radars are compact, inexpensive, and durable sensors that are robust to occlusions and work regardless of environmental conditions, such as weather and darkness. However, this comes at the cost of poor angular resolution, especially for inexpensive single-chip radars, which are typically used in automotive and indoor sensing applications. Although many have proposed learning-based methods to mitigate this weakness, no standardized foundational models or large datasets for the mmWave radar have emerged, and practitioners have largely trained task-specific models from scratch using relatively small datasets. In this paper, we collect (to our knowledge) the largest available raw radar dataset with 1M samples (29 hours) and train a foundational model for 4D single-chip radar, which can predict 3D occupancy and semantic segmentation with quality that is typically only possible with much higher resolution sensors. We demonstrate that our Generalizable Radar Transformer (GRT) generalizes across diverse settings, can be fine-tuned for different tasks, and shows logarithmic data scaling of 20% per 10× data. We also run extensive ablations on common design decisions, and find that using raw radar data significantly outperforms widely-used lossy representations, equivalent to a 10× increase in training data. Finally, we roughly estimate that ≈100M samples (3000 hours) of data are required to fully exploit the potential of GRT.
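The figures in this abstract can be checked with a little arithmetic. The sketch below scales the stated 29 hours per 1M samples up to 100M samples, and works through the scaling claim under one explicit assumption: that "20% per 10× data" means an error-like metric shrinks to 0.8 of its value with each tenfold increase in data. The abstract does not say which metric is meant, so `relative_error` is a hypothetical reading, not the paper's definition.

```python
import math

SAMPLES_COLLECTED = 1_000_000  # stated in the abstract
HOURS_COLLECTED = 29           # stated in the abstract


def hours_for(samples):
    """Recording time for `samples`, assuming the same samples-per-hour rate."""
    return samples * HOURS_COLLECTED / SAMPLES_COLLECTED


def relative_error(data_multiplier, drop_per_decade=0.20):
    """Error relative to the current dataset size.

    ASSUMPTION: each 10x increase in data multiplies the error by
    (1 - drop_per_decade). This is one reading of "20% per 10x data".
    """
    decades = math.log10(data_multiplier)
    return (1.0 - drop_per_decade) ** decades


hours_at_100m = hours_for(100_000_000)   # consistent with the stated ~3000 hours
error_at_100m = relative_error(100)      # two decades beyond the 1M dataset
```

Under the same assumption, the reported benefit of raw radar data ("equivalent to a 10× increase in training data") corresponds to one decade, that is `relative_error(10)`.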
[118] Evaluating Robustness of Vision-Language Models Under Noisy Conditions
Purushoth, Alireza
Main category: cs.CV
TL;DR: Comprehensive evaluation of Vision-Language Models’ robustness under various noise conditions like lighting variation, motion blur, and compression artifacts, revealing performance trade-offs between model size and noise resilience.
Details
Motivation: While VLMs excel in multimodal tasks, their robustness under noisy conditions remains poorly understood, necessitating systematic evaluation of their performance degradation under controlled perturbations.
Method: Used a comprehensive evaluation framework with controlled perturbations (lighting variation, motion blur, compression artifacts) and measured performance using both lexical-based metrics (BLEU, METEOR, ROUGE, CIDEr) and neural-based similarity measures with sentence embeddings across diverse datasets.
Result: Key findings: (1) Ground-truth caption descriptiveness significantly affects performance; (2) Larger models like LLaVA excel in semantic understanding but don’t universally outperform smaller models; (3) JPEG compression and motion blur cause dramatic performance degradation across all models.
Conclusion: The study reveals nuanced trade-offs between model size, dataset characteristics, and noise resilience, providing a standardized benchmark for future robust multimodal learning research.
Abstract: Vision-Language Models (VLMs) have attained exceptional success across multimodal tasks such as image captioning and visual question answering. However, their robustness under noisy conditions remains poorly understood. In this study, we present a comprehensive framework to evaluate the performance of several state-of-the-art VLMs under controlled perturbations, including lighting variation, motion blur, and compression artifacts. We used both lexical-based metrics (BLEU, METEOR, ROUGE, CIDEr) and neural-based similarity measures using sentence embeddings to quantify semantic alignment. Our experiments span diverse datasets, revealing key insights: (1) descriptiveness of ground-truth captions significantly influences model performance; (2) larger models like LLaVA excel in semantic understanding but do not universally outperform smaller models; and (3) certain noise types, such as JPEG compression and motion blur, dramatically degrade performance across models. Our findings highlight the nuanced trade-offs between model size, dataset characteristics, and noise resilience, offering a standardized benchmark for future robust multimodal learning.
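A common way to score semantic alignment between a generated caption and a reference is the cosine similarity of their sentence embeddings, and robustness can then be summarised as the relative drop in that score under noise. The sketch below assumes cosine similarity is the measure and uses made-up embedding vectors; the abstract names neither the similarity function nor the embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def relative_drop(clean_score, noisy_score):
    """Fraction of the clean-image score lost when the input is perturbed."""
    return (clean_score - noisy_score) / clean_score


# Hypothetical embeddings: a reference caption, and captions generated from
# the clean image and from a motion-blurred copy of it.
reference = [0.9, 0.1, 0.3]
from_clean = [0.8, 0.2, 0.3]
from_blurred = [0.2, 0.9, 0.1]

clean_score = cosine_similarity(reference, from_clean)
noisy_score = cosine_similarity(reference, from_blurred)
drop = relative_drop(clean_score, noisy_score)
```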
[119] Instance-Guided Class Activation Mapping for Weakly Supervised Semantic Segmentation
Ali Torabi, Sanjog Gaihre, MD Mahbubur Rahman, Yaqoob Majeed
Main category: cs.CV
TL;DR: IG-CAM is a novel weakly supervised semantic segmentation method that uses instance guidance and influence functions to generate high-quality boundary-aware localization maps, achieving state-of-the-art performance on PASCAL VOC 2012.
Details
Motivation: Existing weakly supervised semantic segmentation methods struggle with precise object boundary localization and often focus only on discriminative regions, requiring expensive pixel-level annotations.
Method: Proposes IG-CAM with three innovations: Instance-Guided Refinement using ground truth masks, Influence Function Integration for robust features, and Multi-Scale Boundary Enhancement for precise boundaries.
Result: Achieves 82.3% mIoU before post-processing and 86.6% after CRF refinement on PASCAL VOC 2012, significantly outperforming previous methods with superior boundary accuracy and complete object coverage.
Conclusion: IG-CAM establishes a new benchmark for weakly supervised semantic segmentation, providing a practical solution when pixel-level annotations are unavailable or too expensive.
Abstract: Weakly Supervised Semantic Segmentation (WSSS) addresses the challenge of training segmentation models using only image-level annotations, eliminating the need for expensive pixel-level labeling. While existing methods struggle with precise object boundary localization and often focus only on the most discriminative regions, we propose IG-CAM (Instance-Guided Class Activation Mapping), a novel approach that leverages instance-level cues and influence functions to generate high-quality, boundary-aware localization maps. Our method introduces three key innovations: (1) Instance-Guided Refinement that uses ground truth segmentation masks to guide CAM generation, ensuring complete object coverage rather than just discriminative parts; (2) Influence Function Integration that captures the relationship between training samples and model predictions, leading to more robust feature representations; and (3) Multi-Scale Boundary Enhancement that employs progressive refinement strategies to achieve sharp, precise object boundaries. IG-CAM achieves state-of-the-art performance on the PASCAL VOC 2012 dataset with an mIoU of 82.3% before post-processing, which further improves to 86.6% after applying Conditional Random Field (CRF) refinement, significantly outperforming previous WSSS methods. Our approach demonstrates superior localization accuracy, with complete object coverage and precise boundary delineation, while maintaining computational efficiency. Extensive ablation studies validate the contribution of each component, and qualitative comparisons across 600 diverse images showcase the method’s robustness and generalization capability. The results establish IG-CAM as a new benchmark for weakly supervised semantic segmentation, offering a practical solution for scenarios where pixel-level annotations are unavailable or prohibitively expensive.
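The headline numbers here are mean Intersection-over-Union (mIoU) scores. As a reminder of how that metric is computed, the sketch below averages per-class IoU over flattened label maps. It is a minimal version: the function name `mean_iou` is illustrative, and details of the official PASCAL VOC protocol (such as accumulating over the whole dataset and skipping ignored pixels) are left out.

```python
def mean_iou(pred, target, num_classes):
    """Mean of per-class IoU over flattened label lists.

    pred, target: equal-length lists of integer class labels, one per pixel.
    Classes absent from both prediction and ground truth are skipped.
    """
    if len(pred) != len(target):
        raise ValueError("prediction and target must have the same length")
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class appears in prediction or target")
    return sum(ious) / len(ious)


# Four pixels, two classes: class 0 has IoU 1/2, class 1 has IoU 2/3.
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
```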
[120] Artist-Created Mesh Generation from Raw Observation
Yao He, Youngjoong Kwon, Wenxiao Cai, Ehsan Adeli
Main category: cs.CV
TL;DR: End-to-end framework for generating artist-style meshes from noisy/incomplete point clouds using 2D inpainting approach
Details
Motivation: Artist-created meshes are crucial for commercial graphics but existing methods assume clean inputs or use complex pipelines, limiting real-world applicability
Method: Reformulates 3D point cloud refinement as 2D inpainting task to leverage powerful generative models for end-to-end mesh generation
Result: Preliminary results on ShapeNet dataset show promising production of clean, complete meshes
Conclusion: The framework demonstrates potential for practical real-world applications with noisy sensor data like LiDAR and RGB-D cameras
Abstract: We present an end-to-end framework for generating artist-style meshes from noisy or incomplete point clouds, such as those captured by real-world sensors like LiDAR or mobile RGB-D cameras. Artist-created meshes are crucial for commercial graphics pipelines due to their compatibility with animation and texturing tools and their efficiency in rendering. However, existing approaches often assume clean, complete inputs or rely on complex multi-stage pipelines, limiting their applicability in real-world scenarios. To address this, we propose an end-to-end method that refines the input point cloud and directly produces high-quality, artist-style meshes. At the core of our approach is a novel reformulation of 3D point cloud refinement as a 2D inpainting task, enabling the use of powerful generative models. Preliminary results on the ShapeNet dataset demonstrate the promise of our framework in producing clean, complete meshes.
[121] Axis-Aligned 3D Stalk Diameter Estimation from RGB-D Imagery
Benjamin Vail, Rahul Harsha Cheppally, Ajay Sharda, Sidharth Rai
Main category: cs.CV
TL;DR: A computer vision pipeline using RGB-D imagery and deep learning to automatically measure crop stalk diameters for high-throughput phenotyping.
Details
Motivation: Traditional stalk diameter measurement methods are labor-intensive, error-prone, and not scalable for modern crop breeding programs that require high-throughput phenotyping of structural traits.
Method: Geometry-aware computer vision pipeline integrating deep learning-based instance segmentation, 3D point cloud reconstruction from RGB-D imagery, and axis-aligned slicing via Principal Component Analysis (PCA) for robust diameter estimation.
Result: The approach successfully mitigates effects of curvature, occlusion, and image noise, providing reliable stalk diameter measurements.
Conclusion: This method offers a scalable and reliable solution for high-throughput phenotyping that can support crop breeding and agronomic research programs.
Abstract: Accurate, high-throughput phenotyping is a critical component of modern crop breeding programs, especially for improving traits such as mechanical stability, biomass production, and disease resistance. Stalk diameter is a key structural trait, but traditional measurement methods are labor-intensive, error-prone, and unsuitable for scalable phenotyping. In this paper, we present a geometry-aware computer vision pipeline for estimating stalk diameter from RGB-D imagery. Our method integrates deep learning-based instance segmentation, 3D point cloud reconstruction, and axis-aligned slicing via Principal Component Analysis (PCA) to perform robust diameter estimation. By mitigating the effects of curvature, occlusion, and image noise, this approach offers a scalable and reliable solution to support high-throughput phenotyping in breeding and agronomic research.
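The PCA-based slicing step can be sketched as follows: find the dominant axis of a stalk's 3D points, cut the cloud into slices along that axis, and estimate a diameter per slice. This is a simplified illustration under stated assumptions, not the paper's pipeline: it assumes an already segmented stalk whose points cover the full circumference, uses power iteration in place of a full PCA, and takes twice the median distance from the axis as the diameter. All function names are hypothetical.

```python
import math


def principal_axis(points, iters=100):
    """Centroid and dominant covariance eigenvector, via power iteration."""
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    cov = [[sum((p[i] - centroid[i]) * (p[j] - centroid[j]) for p in points) / n
            for j in range(3)] for i in range(3)]
    axis = [1.0, 1.0, 1.0]
    for _ in range(iters):
        w = [sum(cov[i][j] * axis[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        axis = [x / norm for x in w]
    return centroid, axis


def slice_diameters(points, num_slices=5):
    """Diameter estimate for each slice taken perpendicular to the main axis."""
    centroid, axis = principal_axis(points)
    projected = []
    for p in points:
        q = [p[i] - centroid[i] for i in range(3)]
        t = sum(q[i] * axis[i] for i in range(3))   # position along the axis
        r2 = sum(x * x for x in q) - t * t          # squared distance from axis
        projected.append((t, math.sqrt(max(r2, 0.0))))
    t_min = min(t for t, _ in projected)
    t_max = max(t for t, _ in projected)
    width = (t_max - t_min) / num_slices
    if width == 0.0:
        width = 1.0
    slices = [[] for _ in range(num_slices)]
    for t, r in projected:
        index = min(int((t - t_min) / width), num_slices - 1)
        slices[index].append(r)
    # Twice the median radius is robust to a few stray points per slice.
    return [2.0 * sorted(rs)[len(rs) // 2] for rs in slices if rs]


# Synthetic tilted cylinder: radius 0.5, length 10, axis (0.6, 0, 0.8).
d, u, v = (0.6, 0.0, 0.8), (0.0, 1.0, 0.0), (0.8, 0.0, -0.6)
cloud = []
for i in range(50):
    t = -5.0 + 10.0 * i / 49
    for k in range(12):
        a = 2.0 * math.pi * k / 12
        cloud.append(tuple(t * d[j] + 0.5 * (math.cos(a) * u[j] + math.sin(a) * v[j])
                           for j in range(3)))
diameters = slice_diameters(cloud)
```

Slicing along the estimated axis rather than the image's vertical direction is what keeps a leaning stalk from appearing wider than it is. A single RGB-D view only sees one side of the stalk, so a real pipeline would need a different per-slice width estimate than the full-circumference one used here.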
[122] Neural Collapse-Inspired Multi-Label Federated Learning under Label-Distribution Skew
Can Peng, Yuyuan Liu, Yingyu Yang, Pramit Saha, Qianye Yang, J. Alison Noble
Main category: cs.CV
TL;DR: Proposes a federated learning method for multi-label classification using Neural Collapse theory and feature disentanglement to address data heterogeneity across clients.
Details
Motivation: Federated learning performance deteriorates with decentralized heterogeneous data, especially in multi-label scenarios with complex label relationships. Real-world applications like medical imaging often involve multi-label settings with skewed label distributions.
Method: Uses Neural Collapse theory to align feature distributions across clients. Introduces feature disentanglement module to extract semantically specific features for multi-label data. Uses predefined shared NC structure to guide clustering and regularization losses for compact feature clustering.
Result: Experiments on four benchmark datasets across eight diverse settings show the approach outperforms existing methods.
Conclusion: The method effectively addresses the challenging FL scenario of multi-label classification with heterogeneous data distributions across clients.
Abstract: Federated Learning (FL) enables collaborative model training across distributed clients while preserving data privacy. However, the performance of deep learning often deteriorates in FL due to decentralized and heterogeneous data. This challenge is further amplified in multi-label scenarios, where data exhibit complex characteristics such as label co-occurrence, inter-label dependency, and discrepancies between local and global label relationships. While most existing FL research primarily focuses on single-label classification, many real-world applications, particularly in domains such as medical imaging, often involve multi-label settings. In this paper, we address this important yet underexplored scenario in FL, where clients hold multi-label data with skewed label distributions. Neural Collapse (NC) describes a geometric structure in the latent feature space where features of each class collapse to their class mean with vanishing intra-class variance, and the class means form a maximally separated configuration. Motivated by this theory, we propose a method to align feature distributions across clients and to learn high-quality, well-clustered representations. To make the NC-structure applicable to multi-label settings, where image-level features may contain multiple semantic concepts, we introduce a feature disentanglement module that extracts semantically specific features. The clustering of these disentangled class-wise features is guided by a predefined shared NC structure, which mitigates potential conflicts between client models due to diverse local data distributions. In addition, we design regularisation losses to encourage compact clustering in the latent feature space. Experiments conducted on four benchmark datasets across eight diverse settings demonstrate that our approach outperforms existing methods, validating its effectiveness in this challenging FL scenario.
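The "maximally separated configuration" of class means described by Neural Collapse is usually formalised as a simplex equiangular tight frame (ETF): K unit vectors whose pairwise cosine is -1/(K-1). The sketch below builds one with the standard formula. Whether the paper's predefined shared structure is built this way is an assumption, and `simplex_etf` is an illustrative name.

```python
import math


def simplex_etf(num_classes):
    """Rows are K unit-norm class prototypes with pairwise cosine -1/(K-1).

    Uses M = sqrt(K / (K - 1)) * (I - (1/K) * ones), one row per class.
    """
    k = num_classes
    if k < 2:
        raise ValueError("need at least two classes")
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


prototypes = simplex_etf(4)
self_similarity = dot(prototypes[0], prototypes[0])    # squared norm of a prototype
cross_similarity = dot(prototypes[0], prototypes[1])   # cosine between two classes
```

Because the frame depends only on the number of classes, every client can construct the same targets locally without exchanging data, which fits the abstract's description of a shared structure that avoids conflicts between client models.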
[123] Agent4FaceForgery: Multi-Agent LLM Framework for Realistic Face Forgery Detection
Yingxin Lai, Zitong Yu, Jun Wang, Linlin Shen, Yong Xu, Xiaochun Cao
Main category: cs.CV
TL;DR: Agent4FaceForgery uses multi-agent LLM framework to simulate realistic face forgery creation processes, generating diverse training data that improves detection performance across multiple architectures.
Details
Motivation: Address the ecological invalidity gap between offline benchmarks and real-world face forgery detection efficacy by better capturing human forgery creation processes and social media text-image interactions.
Method: Multi-agent framework with LLM-powered agents equipped with profile and memory modules, simulating forgery creation in social environments with Adaptive Rejection Sampling (ARS) for quality control.
Result: Generated data brings significant performance gains to detectors of multiple architectures, demonstrating framework effectiveness.
Conclusion: The simulation-driven approach successfully addresses the training data validity problem in face forgery detection through realistic multi-agent forgery process simulation.
Abstract: Face forgery detection faces a critical challenge: a persistent gap between offline benchmarks and real-world efficacy, which we attribute to the ecological invalidity of training data. This work introduces Agent4FaceForgery to address two fundamental problems: (1) how to capture the diverse intents and iterative processes of human forgery creation, and (2) how to model the complex, often adversarial, text-image interactions that accompany forgeries in social media. To solve this, we propose a multi-agent framework where LLM-powered agents, equipped with profile and memory modules, simulate the forgery creation process. Crucially, these agents interact in a simulated social environment to generate samples labeled for nuanced text-image consistency, moving beyond simple binary classification. An Adaptive Rejection Sampling (ARS) mechanism ensures data quality and diversity. Extensive experiments validate that the data generated by our simulation-driven approach brings significant performance gains to detectors of multiple architectures, fully demonstrating the effectiveness and value of our framework.
[124] Explicit Multimodal Graph Modeling for Human-Object Interaction Detection
Wenxuan Ji, Haichao Shi, Xiao-Yu Zhang
Main category: cs.CV
TL;DR: MGNM is a multimodal graph network that explicitly models human-object interaction relationships using GNNs, achieving state-of-the-art performance on HOI detection benchmarks.
Details
Motivation: Transformer-based HOI detection methods don't explicitly model relational structures, while GNNs are inherently better suited for capturing relationships between human-object pairs.
Method: Proposes a four-stage graph structure framework with multi-level feature interaction mechanism that leverages both vision and language features to enhance information propagation across human-object pairs.
Result: Achieves state-of-the-art performance on HICO-DET and V-COCO benchmarks, shows significant performance gains with advanced object detectors, and maintains effective balance between rare and non-rare classes.
Conclusion: GNN-based relational modeling with multimodal features significantly enhances HOI detection performance compared to Transformer approaches.
Abstract: Transformer-based methods have recently become the prevailing approach for Human-Object Interaction (HOI) detection. However, the Transformer architecture does not explicitly model the relational structures inherent in HOI detection, which impedes the recognition of interactions. In contrast, Graph Neural Networks (GNNs) are inherently better suited for this task, as they explicitly model the relationships between human-object pairs. Therefore, in this paper, we propose Multimodal Graph Network Modeling (MGNM) that leverages GNN-based relational structures to enhance HOI detection. Specifically, we design a multimodal graph network framework that explicitly models the HOI task in a four-stage graph structure. Furthermore, we introduce a multi-level feature interaction mechanism within our graph network. This mechanism leverages multi-level vision and language features to enhance information propagation across human-object pairs. Consequently, our proposed MGNM achieves state-of-the-art performance on two widely used benchmarks: HICO-DET and V-COCO. Moreover, when integrated with a more advanced object detector, our method demonstrates a significant performance gain and maintains an effective balance between rare and non-rare classes.
[125] VQT-Light: Lightweight HDR Illumination Map Prediction with Richer Texture
Kunliang Xie
Main category: cs.CV
TL;DR: VQT-Light is a novel lighting estimation framework that combines VQVAE and ViT architectures to achieve high-fidelity illumination maps with fast inference speed (40FPS).
Details
Motivation: Existing lighting estimation methods either struggle to restore detailed textures in illumination maps or face challenges in running speed and texture fidelity.
Method: Uses VQVAE to extract discrete features of illumination maps (avoiding posterior collapse) and ViT to capture global context and dependencies. Formulates lighting estimation as a multiclass classification task.
Result: Achieves 40FPS inference speed, improves multiple evaluation metrics, and produces light maps with richer texture and better fidelity compared to state-of-the-art methods.
Conclusion: The proposed VQT-Light framework demonstrates superior performance in both qualitative and quantitative experiments, offering a lightweight and fast solution for accurate lighting estimation.
Abstract: Accurate lighting estimation is a significant yet challenging task in computer vision and graphics. However, existing methods either struggle to restore detailed textures of the illumination map, or face challenges in running speed and texture fidelity. To tackle this problem, we propose a novel framework (VQT-Light) based on VQVAE and ViT architecture. VQT-Light includes two modules: feature extraction and lighting estimation. First, we take advantage of VQVAE to extract discrete features of the illumination map rather than continuous features to avoid “posterior collapse”. Second, we capture global context and dependencies of the input image through ViT rather than CNNs to improve the prediction of illumination outside the field of view. Combining the above two modules, we formulate the lighting estimation as a multiclass classification task, which plays a key role in our pipeline. As a result, our model predicts a light map with richer texture and better fidelity while remaining lightweight and fast. VQT-Light achieves an inference speed of 40 FPS and improves multiple evaluation metrics. Qualitative and quantitative experiments demonstrate that the proposed method realizes superior results compared to existing state-of-the-art methods.
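The link between discrete VQVAE features and a classification formulation can be seen in the quantisation step itself: each continuous feature vector is replaced by the index of its nearest codebook entry, so predicting an illumination map reduces to predicting one class label per latent position. The sketch below shows that lookup with a toy codebook; the names and sizes are assumptions, not details from the paper.

```python
def quantize(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean)."""
    best_index, best_distance = -1, float("inf")
    for index, code in enumerate(codebook):
        distance = sum((a - b) ** 2 for a, b in zip(vector, code))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


def encode(vectors, codebook):
    """Map a list of latent vectors to a list of discrete code indices."""
    return [quantize(v, codebook) for v in vectors]


# Toy codebook with three entries; real codebooks are far larger.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
latents = [[0.9, 0.8], [0.1, -0.1], [0.2, 0.7]]
labels = encode(latents, codebook)
```

With targets like `labels`, a predictor can be trained with an ordinary cross-entropy loss over `len(codebook)` classes at each position, which is one plausible reading of the abstract's "multiclass classification task".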
[126] Adaptive Sampling Scheduler
Qi Wang, Shuliang Zhu, Jinjia Zhou
Main category: cs.CV
TL;DR: Proposes an adaptive sampling scheduler for consistency distillation that dynamically selects target timesteps, optimizes alternating sampling, and uses smoothing techniques to improve flexibility and performance across various distillation frameworks.
Details
Motivation: Existing consistency distillation methods rely on fixed timestep selection strategies that limit flexibility and require custom schedulers for different distillation processes, restricting the full potential of diffusion models.
Method: Develops an adaptive sampling scheduler with three strategies: dynamic target timestep selection based on importance, optimized alternating sampling along solution trajectories, and smoothing clipping with color balancing for stable high-quality generation.
Result: Comprehensive experimental evaluations show significant improvements in generative performance across various consistency distillation methods, demonstrating strong adaptability and effectiveness.
Conclusion: The proposed adaptive sampling scheduler overcomes limitations of existing methods by providing a flexible, framework-agnostic approach that enhances generation performance and expands applicability in complex scenarios.
Abstract: Consistency distillation methods have evolved into effective techniques that significantly accelerate the sampling process of diffusion models. Although existing methods have achieved remarkable results, the selection of target timesteps during distillation mainly relies on deterministic or stochastic strategies, which often require sampling schedulers to be designed specifically for different distillation processes. Moreover, this pattern severely limits flexibility, thereby restricting the full sampling potential of diffusion models in practical applications. To overcome these limitations, this paper proposes an adaptive sampling scheduler that is applicable to various consistency distillation frameworks. The scheduler introduces three innovative strategies: (i) dynamic target timestep selection, which adapts to different consistency distillation frameworks by selecting timesteps based on their computed importance; (ii) optimized alternating sampling along the solution trajectory by guiding forward denoising and backward noise addition based on the proposed timestep importance, enabling more effective exploration of the solution space to enhance generation performance; and (iii) smoothing clipping and color balancing techniques to achieve stable and high-quality generation results at high guidance scales, thereby expanding the applicability of consistency distillation models in complex generation scenarios. We validated the effectiveness and flexibility of the adaptive sampling scheduler across various consistency distillation methods through comprehensive experimental evaluations. Experimental results consistently demonstrated significant improvements in generative performance, highlighting the strong adaptability achieved by our method.
[127] DisorientLiDAR: Physical Attacks on LiDAR-based Localization
Yizhen Lao, Yu Zhang, Ziting Wang, Chengbo Wang, Yifei Xue, Wanpeng Shao
Main category: cs.CV
TL;DR: DisorientLiDAR: A novel adversarial attack framework that targets LiDAR-based localization by strategically removing critical keypoints through reverse-engineering, causing significant localization drift in autonomous vehicles.
Details
Motivation: Deep learning models are vulnerable to adversarial attacks, but little research has focused on attacking LiDAR-based localization systems in self-driving cars, which poses serious security risks.
Method: Reverse-engineer localization models to identify critical keypoints, then strategically remove them. Evaluated on three point-cloud registration models using the KITTI dataset and validated on the Autoware platform. Extended to the physical world using near-infrared absorptive materials.
Result: Removing Top-K keypoint regions significantly degrades registration accuracy. Hiding a few critical regions causes noticeable localization drift. Physical-world attacks successfully replicated digital attack effects.
Conclusion: The framework demonstrates effective adversarial attacks on LiDAR localization, proving vulnerability in both digital and physical domains, highlighting serious security concerns for autonomous vehicles.
Abstract: Deep learning models have been shown to be susceptible to adversarial attacks with visually imperceptible perturbations. Even though this poses a serious security challenge for the localization of self-driving cars, there has been very little exploration of attacks on it, as most adversarial attacks have been applied to 3D perception. In this work, we propose a novel adversarial attack framework called DisorientLiDAR targeting LiDAR-based localization. By reverse-engineering localization models (e.g., feature extraction networks), adversaries can identify critical keypoints and strategically remove them, thereby disrupting LiDAR-based localization. Our proposal is first evaluated on three state-of-the-art point-cloud registration models (HRegNet, D3Feat, and GeoTransformer) using the KITTI dataset. Experimental results demonstrate that removing regions containing Top-K keypoints significantly degrades their registration accuracy. We further validate the attack’s impact on the Autoware autonomous driving platform, where hiding merely a few critical regions induces noticeable localization drift. Finally, we extend our attacks to the physical world by hiding critical regions with near-infrared absorptive materials, thereby successfully replicating the attack effects observed in KITTI data. This step moves closer to a realistic physical-world attack and demonstrates the veracity and generality of our proposal.
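The Top-K region removal described above can be illustrated with a minimal sketch. It assumes that a per-point saliency score is already available (in the paper this comes from reverse-engineering the registration model; here the scores are made-up numbers) and that a "region" is a sphere of fixed radius around each selected keypoint. The function name, the radius, and the toy point cloud are all hypothetical and not taken from the paper.

```python
import math

def remove_topk_regions(points, scores, k, radius):
    """Drop every point within `radius` of the k highest-scoring points."""
    # Indices of the k points with the largest (hypothetical) saliency score.
    top = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:k]
    centers = [points[i] for i in top]
    # Keep only points that lie farther than `radius` from every center.
    return [p for p in points
            if all(math.dist(p, c) > radius for c in centers)]

# Toy point cloud (x, y, z) with made-up saliency scores.
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 0.0, 0.0), (9.0, 1.0, 0.0)]
saliency = [0.9, 0.2, 0.1, 0.8]

# Hide the single most salient region; its near neighbour disappears too.
attacked = remove_topk_regions(cloud, saliency, k=1, radius=0.5)
print(attacked)
```

In this toy setting, raising `k` or `radius` removes more of the scene, which mirrors the trade-off between attack strength and the amount of material an attacker would have to cover.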
[128] Exploring Spectral Characteristics for Single Image Reflection Removal
Pengbo Guo, Chengxu Liu, Guoshuai Zhao, Xingsong Hou, Jialie Shen, Xueming Qian
Main category: cs.CV
TL;DR: A novel spectral learning approach for reflection removal that uses a Spectral Codebook to reconstruct optical spectra and distinguish reflections through wavelength differences, outperforming existing methods.
Details
Motivation: Existing reflection removal methods work only in the image domain and ignore spectral properties of reflected light, limiting their ability to effectively distinguish and remove reflections from overlapping transmission components.
Method: Proposes Spectral Codebook to reconstruct optical spectrum of reflection images, uses spectral prior refinement modules for spatial pixel redistribution and spectral difference enhancement, and employs Spectrum-Aware Transformer for joint recovery in spectral and pixel domains.
Result: Experimental results on three reflection benchmarks demonstrate superior performance and generalization ability compared to state-of-the-art models.
Conclusion: The spectral learning perspective and spectral codebook approach effectively address reflection removal by leveraging wavelength differences, providing a more accurate solution to this ill-posed problem.
Abstract: Eliminating reflections caused by incident light interacting with a reflective medium remains an ill-posed problem in the image restoration area. The primary challenge arises from the overlapping of reflection and transmission components in the captured images, which complicates the task of accurately distinguishing and recovering the clean background. Existing approaches typically address reflection removal solely in the image domain, ignoring the spectral property variations of reflected light, which hinders their ability to effectively discern reflections. In this paper, we start with a new perspective on spectral learning and propose the Spectral Codebook to reconstruct the optical spectrum of the reflection image. The reflections can be effectively distinguished by perceiving the wavelength differences between different light sources in the spectrum. To leverage the reconstructed spectrum, we design two spectral prior refinement modules to redistribute pixels in the spatial dimension and adaptively enhance the spectral differences along the wavelength dimension. Furthermore, we present the Spectrum-Aware Transformer to jointly recover the transmitted content in spectral and pixel domains. Experimental results on three different reflection benchmarks demonstrate the superiority and generalization ability of our method compared to state-of-the-art models.
[129] Maps for Autonomous Driving: Full-process Survey and Frontiers
Pengxin Chen, Zhipeng Luo, Xiaoqi Jiang, Zhangcai Yin, Jonathan Li
Main category: cs.CV
TL;DR: The paper reviews the evolution of maps for autonomous driving through three stages: HD maps, Lite maps, and Implicit maps, analyzing production workflows, technical challenges, and academic solutions.
Details
Motivation: Maps are essential for autonomous driving, and understanding their evolution from traditional HD maps to modern implicit representations is crucial for advancing self-driving technology.
Method: The authors categorize map evolution into three stages and provide comprehensive reviews of production workflows, technical challenges, and academic solutions for each stage.
Result: The paper systematically analyzes the progression of map technologies and explores how cutting-edge map representations can be integrated into end-to-end autonomous driving frameworks.
Conclusion: The evolution from HD to Lite to Implicit maps represents significant advancements in autonomous driving technology, with each stage addressing different technical challenges and enabling more sophisticated integration with end-to-end driving systems.
Abstract: Maps have always been an essential component of autonomous driving. With the advancement of autonomous driving technology, both the representation and production process of maps have evolved substantially. The article categorizes the evolution of maps into three stages: High-Definition (HD) maps, Lightweight (Lite) maps, and Implicit maps. For each stage, we provide a comprehensive review of the map production workflow, highlighting the technical challenges involved and summarizing relevant solutions proposed by the academic community. Furthermore, we discuss cutting-edge research advances in map representations and explore how these innovations can be integrated into end-to-end autonomous driving frameworks.
[130] CIARD: Cyclic Iterative Adversarial Robustness Distillation
Liming Lu, Shuchao Pang, Xu Zheng, Xiang Gu, Anan Du, Yunhuai Liu, Yongbin Zhou
Main category: cs.CV
TL;DR: CIARD is a new adversarial robustness distillation method that uses a multi-teacher framework with contrastive push-loss alignment and continuous adversarial retraining to improve both clean accuracy and robustness in student models.
Details
Motivation: Existing ARD methods suffer from degraded performance on clean examples due to divergent optimization objectives between clean and robust teachers, and performance deterioration of robust teachers from iteratively generated adversarial examples.
Method: Proposes Cyclic Iterative ARD (CIARD) with: 1) multi-teacher framework with contrastive push-loss alignment to resolve optimization conflicts, and 2) continuous adversarial retraining to maintain dynamic teacher robustness against varying adversarial examples.
Result: Achieves average 3.53% improvement in adversarial defense rates across various attack scenarios and 5.87% increase in clean sample accuracy on CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets.
Conclusion: CIARD establishes a new benchmark for balancing model robustness and generalization, demonstrating significant improvements in both adversarial defense and clean accuracy performance.
Abstract: Adversarial robustness distillation (ARD) aims to transfer both performance and robustness from a teacher model to a lightweight student model, enabling resilient performance in resource-constrained scenarios. Although existing ARD approaches enhance the student model’s robustness, an inevitable by-product is degraded performance on clean examples. We summarize the causes of this problem, inherent in existing methods with a dual-teacher framework, as: 1. The divergent optimization objectives of dual-teacher models, i.e., the clean and robust teachers, impede effective knowledge transfer to the student model, and 2. The iteratively generated adversarial examples during training lead to performance deterioration of the robust teacher model. To address these challenges, we propose a novel Cyclic Iterative ARD (CIARD) method with two key innovations: a. A multi-teacher framework with contrastive push-loss alignment to resolve conflicts in dual-teacher optimization objectives, and b. Continuous adversarial retraining to maintain dynamic teacher robustness against performance degradation from the varying adversarial examples. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CIARD achieves remarkable performance with an average 3.53% improvement in adversarial defense rates across various attack scenarios and a 5.87% increase in clean sample accuracy, establishing a new benchmark for balancing model robustness and generalization. Our code is available at https://github.com/eminentgu/CIARD
[131] Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations
Jinjie Shen, Yaxiong Wang, Lechao Cheng, Nan Pu, Zhun Zhong
Main category: cs.CV
TL;DR: This paper introduces a new approach for detecting multimodal manipulations where visual edits are paired with semantically consistent text, addressing limitations of existing benchmarks that create artificial cross-modal misalignment.
Details
Motivation: Existing benchmarks suffer from misalignment artifacts that don't reflect real-world manipulation patterns, where practical attacks maintain semantic consistency across modalities rather than creating easily detectable anomalies.
Method: The approach involves: 1) Creating the Semantic-Aligned Multimodal Manipulation (SAMM) dataset through a two-stage pipeline with image manipulations and contextually-plausible text generation, and 2) A Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework that uses external knowledge to retrieve contextual evidence for detection.
Result: The framework significantly outperforms existing methods, achieving 2.06% higher detection accuracy on the SAMM dataset compared to state-of-the-art approaches.
Conclusion: The proposed semantically-coordinated manipulation detection approach and RamDG framework effectively address real-world manipulation patterns and demonstrate superior performance over existing methods.
Abstract: The detection and grounding of manipulated content in multimodal data has emerged as a critical challenge in media forensics. While existing benchmarks demonstrate technical progress, they suffer from misalignment artifacts that poorly reflect real-world manipulation patterns: practical attacks typically maintain semantic consistency across modalities, whereas current datasets artificially disrupt cross-modal alignment, creating easily detectable anomalies. To bridge this gap, we pioneer the detection of semantically-coordinated manipulations where visual edits are systematically paired with semantically consistent textual descriptions. Our approach begins with constructing the first Semantic-Aligned Multimodal Manipulation (SAMM) dataset, generated through a two-stage pipeline: 1) applying state-of-the-art image manipulations, followed by 2) generation of contextually-plausible textual narratives that reinforce the visual deception. Building on this foundation, we propose a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework. RamDG commences by harnessing external knowledge repositories to retrieve contextual evidence, which serves as auxiliary text and is encoded together with the inputs through our image forgery grounding and deep manipulation detection modules to trace all manipulations. Extensive experiments demonstrate that our framework significantly outperforms existing methods, achieving 2.06% higher detection accuracy on SAMM compared to state-of-the-art approaches. The dataset and code are publicly available at https://github.com/shen8424/SAMM-RamDG-CAP.
[132] MFAF: An EVA02-Based Multi-scale Frequency Attention Fusion Method for Cross-View Geo-Localization
YiTong Liu, TianZhu Liu, YanFeng GU
Main category: cs.CV
TL;DR: EVA02-based MFAF method improves cross-view geo-localization using multi-scale frequency attention to capture structural and edge features while reducing background noise interference.
Details
Motivation: Existing cross-view geo-localization methods neglect spatial and semantic information while struggling with appearance variations from different viewpoints and difficulty extracting discriminative features.
Method: Proposes Multi-scale Frequency Attention Fusion (MFAF) method with Multi-Frequency Branch-wise Block (MFB) to capture low-frequency structural and high-frequency edge features, and Frequency-aware Spatial Attention (FSA) module to focus on key frequency regions.
Result: Achieves competitive performance on University-1652, SUES-200, and Dense-UAV benchmarks for both drone localization and navigation tasks.
Conclusion: The MFAF method effectively addresses viewpoint variability and background noise issues in cross-view geo-localization through frequency-based feature extraction and attention mechanisms.
Abstract: Cross-view geo-localization aims to determine the geographical location of a query image by matching it against a gallery of images. This task is challenging due to the significant appearance variations of objects observed from variable views, along with the difficulty in extracting discriminative features. Existing approaches often rely on extracting features through feature map segmentation while neglecting spatial and semantic information. To address these issues, we propose the EVA02-based Multi-scale Frequency Attention Fusion (MFAF) method. The MFAF method consists of the Multi-Frequency Branch-wise Block (MFB) and the Frequency-aware Spatial Attention (FSA) module. The MFB effectively captures both low-frequency structural features and high-frequency edge details across multiple scales, improving the consistency and robustness of feature representations across various viewpoints. Meanwhile, the FSA module adaptively focuses on the key regions of frequency features, significantly mitigating the interference caused by background noise and viewpoint variability. Extensive experiments on widely recognized benchmarks, including University-1652, SUES-200, and Dense-UAV, demonstrate that the MFAF method achieves competitive performance in both drone localization and drone navigation tasks.
[133] A Comparative Study of YOLOv8 to YOLOv11 Performance in Underwater Vision Tasks
Gordon Hung, Ivan Felipe Rodriguez
Main category: cs.CV
TL;DR: Comprehensive evaluation of YOLOv8-v11 on underwater imagery shows accuracy saturates after YOLOv9 while inference speed improves, with YOLOv10 offering the best speed-accuracy trade-off for AUV deployment.
Details
Motivation: Underwater computer vision faces challenges like light attenuation, turbidity, and class imbalance, while AUVs have limited computational resources. YOLO detectors are attractive but their performance in marine environments is unknown.
Method: Used two underwater datasets (Coral Disease and Fish Species) with varying training sizes. Trained YOLOv8-s, YOLOv9-s, YOLOv10-s, and YOLOv11-s with identical hyperparameters and evaluated precision, recall, mAP, inference time, and FPS. Used Grad-CAM for visualization.
Result: Accuracy saturated after YOLOv9 across both datasets, while inference speed showed marked improvements. YOLOv10 provided the best speed-accuracy trade-off for embedded AUV deployment.
Conclusion: Architectural innovations in recent YOLO variants primarily target efficiency rather than accuracy. The study provides the first controlled comparison of YOLO variants on underwater imagery and delivers an open benchmark for future marine-vision research.
Abstract: Autonomous underwater vehicles (AUVs) increasingly rely on on-board computer-vision systems for tasks such as habitat mapping, ecological monitoring, and infrastructure inspection. However, underwater imagery is hindered by light attenuation, turbidity, and severe class imbalance, while the computational resources available on AUVs are limited. One-stage detectors from the YOLO family are attractive because they fuse localization and classification in a single, low-latency network; however, their terrestrial benchmarks (COCO, PASCAL-VOC, Open Images) leave open the question of how successive YOLO releases perform in the marine domain. We curate two openly available datasets that span contrasting operating conditions: a Coral Disease set (4,480 images, 18 classes) and a Fish Species set (7,500 images, 20 classes). For each dataset, we create four training regimes (25 %, 50 %, 75 %, 100 % of the images) while keeping balanced validation and test partitions fixed. We train YOLOv8-s, YOLOv9-s, YOLOv10-s, and YOLOv11-s with identical hyperparameters (100 epochs, 640 px input, batch = 16, T4 GPU) and evaluate precision, recall, mAP50, mAP50-95, per-image inference time, and frames-per-second (FPS). Post-hoc Grad-CAM visualizations probe feature utilization and localization faithfulness. Across both datasets, accuracy saturates after YOLOv9, suggesting architectural innovations primarily target efficiency rather than accuracy. Inference speed, however, improves markedly. Our results (i) provide the first controlled comparison of recent YOLO variants on underwater imagery, (ii) show that lightweight YOLOv10 offers the best speed-accuracy trade-off for embedded AUV deployment, and (iii) deliver an open, reproducible benchmark and codebase to accelerate future marine-vision research.
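The four training regimes and the relation between per-image inference time and FPS can be sketched as follows. This is only an illustration: the abstract does not say how the subsets were drawn, so nested subsets from one seeded shuffle are an assumption, and the function names and the 8 ms latency are hypothetical.

```python
import random

def training_regimes(items, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Build training subsets of increasing size from one fixed shuffle.

    Assumption: subsets are nested, so each larger regime contains
    all samples of the smaller ones.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return {f: shuffled[:round(len(shuffled) * f)] for f in fractions}

def fps_from_latency_ms(ms_per_image):
    """Convert per-image inference time in milliseconds to frames per second."""
    return 1000.0 / ms_per_image

# 4,480 is the size quoted for the Coral Disease set; IDs stand in for images.
regimes = training_regimes(range(4480))
print({f: len(subset) for f, subset in regimes.items()})

# Hypothetical latency of 8 ms per image.
print(fps_from_latency_ms(8.0))  # 125.0
```

Fixing the seed keeps the subsets identical across detector versions, which is what makes a comparison at each training size controlled.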
[134] StereoCarla: A High-Fidelity Driving Dataset for Generalizable Stereo
Xianda Guo, Chenming Zhang, Ruilin Wang, Youmin Zhang, Wenzhao Zheng, Matteo Poggi, Hao Zhao, Qin Zou, Long Chen
Main category: cs.CV
TL;DR: StereoCarla is a high-fidelity synthetic stereo dataset built on CARLA simulator that outperforms 11 existing datasets in cross-domain generalization for stereo matching in autonomous driving applications.
Details
Motivation: Address the limited diversity of existing stereo training data and improve generalization performance of learning-based stereo matching algorithms for autonomous driving.
Method: Created StereoCarla dataset using CARLA simulator with diverse camera configurations (baselines, viewpoints, sensor placements) and varied environmental conditions (lighting, weather, road geometries).
Result: Models trained on StereoCarla outperform those trained on 11 existing stereo datasets across four standard evaluation datasets (KITTI2012, KITTI2015, Middlebury, ETH3D) in cross-domain generalization.
Conclusion: StereoCarla provides a valuable benchmark for developing robust stereo algorithms under realistic and diverse conditions, significantly improving generalization accuracy for autonomous vehicle depth perception systems.
Abstract: Stereo matching plays a crucial role in enabling depth perception for autonomous driving and robotics. While recent years have witnessed remarkable progress in stereo matching algorithms, largely driven by learning-based methods and synthetic datasets, the generalization performance of these models remains constrained by the limited diversity of existing training data. To address these challenges, we present StereoCarla, a high-fidelity synthetic stereo dataset specifically designed for autonomous driving scenarios. Built on the CARLA simulator, StereoCarla incorporates a wide range of camera configurations, including diverse baselines, viewpoints, and sensor placements, as well as varied environmental conditions such as lighting changes, weather effects, and road geometries. We conduct comprehensive cross-domain experiments across four standard evaluation datasets (KITTI2012, KITTI2015, Middlebury, ETH3D) and demonstrate that models trained on StereoCarla outperform those trained on 11 existing stereo datasets in terms of generalization accuracy across multiple benchmarks. Furthermore, when integrated into multi-dataset training, StereoCarla contributes substantial improvements to generalization accuracy, highlighting its compatibility and scalability. This dataset provides a valuable benchmark for developing and evaluating stereo algorithms under realistic, diverse, and controllable settings, facilitating more robust depth perception systems for autonomous vehicles. Code is available at https://github.com/XiandaGuo/OpenStereo, and data is available at https://xiandaguo.net/StereoCarla.
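Why baseline diversity matters for stereo training data can be seen from the standard rectified pinhole stereo relation, depth = focal length × baseline / disparity. The sketch below is generic stereo geometry, not code from StereoCarla, and the focal length, baselines, and disparity are hypothetical values.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth in meters for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity.
        return float("inf")
    return focal_px * baseline_m / disparity_px

# The same 50 px disparity means very different depths under two baselines.
wide = disparity_to_depth(50.0, focal_px=1000.0, baseline_m=0.5)
narrow = disparity_to_depth(50.0, focal_px=1000.0, baseline_m=0.1)
print(wide, narrow)  # 10.0 2.0
```

A model trained on a single baseline only ever sees one mapping between disparity range and scene depth, which is one plausible reason a dataset with varied baselines could generalize better.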
[135] SmokeBench: A Real-World Dataset for Surveillance Image Desmoking in Early-Stage Fire Scenes
Wenzhuo Jin, Qianfeng Yang, Xianhao Wu, Hongming Chen, Pengpeng Li, Xiang Chen
Main category: cs.CV
TL;DR: SmokeBench dataset provides real-world paired smoke-free and smoke-degraded surveillance images for developing and evaluating smoke removal algorithms in early-stage fire scenarios.
Details
Motivation: Early-stage fire scenes (0-15 minutes) suffer from smoke-obscured visibility in surveillance systems, hindering emergency response. Lack of large-scale real-world datasets limits smoke removal algorithm development.
Method: Created SmokeBench dataset with precisely aligned degraded and clean image pairs captured under diverse scenes and smoke concentrations, enabling supervised learning and rigorous evaluation.
Result: Comprehensive experiments benchmarking various desmoking methods on the dataset demonstrate its value for advancing robust image desmoking in real-world fire scenes.
Conclusion: SmokeBench provides a valuable foundation for practical image desmoking research and has been released publicly to support the development of effective smoke removal algorithms for emergency response applications.
Abstract: Early-stage fire scenes (0-15 minutes after ignition) represent a crucial temporal window for emergency interventions. During this stage, the smoke produced by combustion significantly reduces the visibility of surveillance systems, severely impairing situational awareness and hindering effective emergency response and rescue operations. Consequently, there is an urgent need to remove smoke from images to obtain clear scene information. However, the development of smoke removal algorithms remains limited due to the lack of large-scale, real-world datasets comprising paired smoke-free and smoke-degraded images. To address these limitations, we present a real-world surveillance image desmoking benchmark dataset named SmokeBench, which contains image pairs captured under diverse scene setups and smoke concentrations. The curated dataset provides precisely aligned degraded and clean images, enabling supervised learning and rigorous evaluation. We conduct comprehensive experiments by benchmarking a variety of desmoking methods on our dataset. Our dataset provides a valuable foundation for advancing robust and practical image desmoking in real-world fire scenes. This dataset has been released to the public and can be downloaded from https://github.com/ncfjd/SmokeBench.
[136] RIS-FUSION: Rethinking Text-Driven Infrared and Visible Image Fusion from the Perspective of Referring Image Segmentation
Siju Ma, Changsiyu Gong, Xiaofeng Fan, Yong Ma, Chengjie Jiang
Main category: cs.CV
TL;DR: RIS-FUSION is a cascaded framework that unifies text-driven infrared-visible image fusion with referring image segmentation through joint optimization, achieving state-of-the-art performance with over 11% improvement in mIoU.
Details
Motivation: Existing text-driven infrared and visible image fusion methods lack goal-aligned tasks to supervise and evaluate how effectively input text contributes to fusion outcomes. The authors observed that referring image segmentation and text-driven fusion share the common objective of highlighting text-referred objects.
Method: Proposed RIS-FUSION framework with LangGatedFusion module that injects textual features into the fusion backbone for semantic alignment. Also introduced MM-RIS benchmark with 12.5k training and 3.5k testing triplets (infrared-visible image pairs, segmentation masks, and referring expressions).
Result: Extensive experiments show RIS-FUSION achieves state-of-the-art performance, outperforming existing methods by over 11% in mIoU.
Conclusion: The unified framework effectively bridges text-driven image fusion and referring image segmentation, demonstrating superior performance through joint optimization and semantic alignment with textual guidance.
Abstract: Text-driven infrared and visible image fusion has gained attention for enabling natural language to guide the fusion process. However, existing methods lack a goal-aligned task to supervise and evaluate how effectively the input text contributes to the fusion outcome. We observe that referring image segmentation (RIS) and text-driven fusion share a common objective: highlighting the object referred to by the text. Motivated by this, we propose RIS-FUSION, a cascaded framework that unifies fusion and RIS through joint optimization. At its core is the LangGatedFusion module, which injects textual features into the fusion backbone to enhance semantic alignment. To support the multimodal referring image segmentation task, we introduce MM-RIS, a large-scale benchmark with 12.5k training and 3.5k testing triplets, each consisting of an infrared-visible image pair, a segmentation mask, and a referring expression. Extensive experiments show that RIS-FUSION achieves state-of-the-art performance, outperforming existing methods by over 11% in mIoU. Code and dataset will be released at https://github.com/SijuMa2003/RIS-FUSION.
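For readers unfamiliar with the mIoU metric quoted above, here is a minimal sketch of one common definition: the intersection-over-union of predicted and ground-truth binary masks, averaged over samples. The abstract does not state the exact evaluation protocol, so per-sample averaging is an assumption, and the toy masks are invented for illustration.

```python
def iou(pred, target):
    """Intersection-over-union of two flattened binary masks (0/1 values)."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if union == 0 else inter / union

def mean_iou(preds, targets):
    """Average the per-sample IoU over a set of mask pairs."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)

# Two toy 4-pixel masks: one half-correct prediction, one perfect prediction.
preds = [[1, 1, 0, 0], [0, 1, 1, 0]]
targets = [[1, 0, 0, 0], [0, 1, 1, 0]]
print(mean_iou(preds, targets))  # 0.75
```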
[137] Learning by Imagining: Debiased Feature Augmentation for Compositional Zero-Shot Learning
Haozhe Zhang, Chenchen Jing, Mingyu Liu, Qingsheng Wang, Hao Chen
Main category: cs.CV
TL;DR: DeFA (Debiased Feature Augmentation) is a novel approach for Compositional Zero-Shot Learning that uses a disentangle-and-reconstruct framework with debiasing to synthesize high-fidelity composition features, achieving state-of-the-art performance.
Details
Motivation: Address challenges in CZSL including entangled attributes/objects and long-tailed distributions by leveraging neuroscientific findings that imagination and perception share similar neural processes.
Method: Integrates disentangle-and-reconstruct framework for feature augmentation with debiasing strategy to synthesize high-fidelity composition features using prior knowledge of seen attributes and objects.
Result: Extensive experiments on three widely used datasets demonstrate state-of-the-art performance in both closed-world and open-world settings.
Conclusion: DeFA effectively addresses compositional representation challenges in CZSL through feature augmentation and debiasing, showing superior generalization capabilities.
Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize unseen attribute-object compositions by learning prior knowledge of seen primitives, i.e., attributes and objects. Learning generalizable compositional representations in CZSL remains challenging due to the entangled nature of attributes and objects as well as the prevalence of long-tailed distributions in real-world data. Inspired by neuroscientific findings that imagination and perception share similar neural processes, we propose a novel approach called Debiased Feature Augmentation (DeFA) to address these challenges. The proposed DeFA integrates a disentangle-and-reconstruct framework for feature augmentation with a debiasing strategy. DeFA explicitly leverages the prior knowledge of seen attributes and objects by synthesizing high-fidelity composition features to support compositional generalization. Extensive experiments on three widely used datasets demonstrate that DeFA achieves state-of-the-art performance in both closed-world and open-world settings.
[138] AsyMoE: Leveraging Modal Asymmetry for Enhanced Expert Specialization in Large Vision-Language Models
Heng Zhang, Haichuan Hu, Yaomin Shen, Weihao Yu, Yilei Yuan, Haochen You, Guo Cheng, Zijian Zhang, Lubin Gan, Huihui Wei, Hao Zhang, Jin Huang
Main category: cs.CV
TL;DR: AsyMoE is a novel Mixture of Experts architecture that addresses the asymmetry between visual and linguistic processing in LVLMs by using three specialized expert groups for better cross-modal interactions and reduced parametric bias.
Details
Motivation: Existing MoE approaches struggle with the asymmetry between visual (spatially complete) and linguistic (sequential context) processing, causing language experts to lose contextual grounding and rely too much on parametric knowledge rather than utilizing provided multimodal information.
Method: Proposed AsyMoE with three specialized expert groups: 1) intra-modality experts for modality-specific processing, 2) hyperbolic inter-modality experts for hierarchical cross-modal interactions, and 3) evidence-priority language experts to suppress parametric biases and maintain contextual grounding.
Result: AsyMoE achieves 26.58% and 15.45% accuracy improvements over vanilla MoE and modality-specific MoE respectively, with 25.45% fewer activated parameters than dense models.
Conclusion: The proposed asymmetric expert architecture effectively addresses modality processing asymmetry in LVLMs, achieving significant performance improvements while reducing computational costs through specialized expert groups.
Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. However, existing Mixture of Experts (MoE) approaches face challenges due to the asymmetry between visual and linguistic processing. Visual information is spatially complete, while language requires maintaining sequential context. As a result, MoE models struggle to balance modality-specific features and cross-modal interactions. Through systematic analysis, we observe that language experts in deeper layers progressively lose contextual grounding and rely more on parametric knowledge rather than utilizing the provided visual and linguistic information. To address this, we propose AsyMoE, a novel architecture that models this asymmetry using three specialized expert groups. We design intra-modality experts for modality-specific processing, hyperbolic inter-modality experts for hierarchical cross-modal interactions, and evidence-priority language experts to suppress parametric biases and maintain contextual grounding. Extensive experiments demonstrate that AsyMoE achieves 26.58% and 15.45% accuracy improvements over vanilla MoE and modality-specific MoE respectively, with 25.45% fewer activated parameters than dense models.
[139] EvoEmpirBench: Dynamic Spatial Reasoning with Agent-ExpVer
Pukun Zhao, Longxiang Wang, Miaowei Wang, Chen Chen, Fanqing Zhou, Haojian Huang
Main category: cs.CV
TL;DR: Two dynamic spatial reasoning benchmarks (maze navigation and match-2 elimination) that test models’ abilities in spatial understanding and adaptive planning under partial observability and environmental changes, with a novel memory mechanism for experience transfer.
Details
Motivation: Existing spatial reasoning benchmarks focus on static or globally observable environments, failing to capture challenges of long-horizon reasoning and memory utilization under partial observability and dynamic changes.
Method: Introduces two dynamic spatial benchmarks where each action triggers structural changes, requiring continuous cognitive updates. Proposes a subjective experience-based memory mechanism for cross-task experience transfer and validation.
Result: Experiments reveal key limitations of mainstream models in dynamic spatial reasoning and long-term memory capabilities.
Conclusion: The benchmarks provide a comprehensive platform for evaluating and advancing methods in dynamic spatial reasoning, highlighting current model shortcomings in handling partial observability and environmental dynamics.
Abstract: Most existing spatial reasoning benchmarks focus on static or globally observable environments, failing to capture the challenges of long-horizon reasoning and memory utilization under partial observability and dynamic changes. We introduce two dynamic spatial benchmarks, locally observable maze navigation and match-2 elimination, that systematically evaluate models' abilities in spatial understanding and adaptive planning when local perception, environment feedback, and global objectives are tightly coupled. Each action triggers structural changes in the environment, requiring continuous updates of cognition and strategy. We further propose a subjective experience-based memory mechanism for cross-task experience transfer and validation. Experiments show that our benchmarks reveal key limitations of mainstream models in dynamic spatial reasoning and long-term memory, providing a comprehensive platform for future methodological advances. Our code and data are available at https://anonymous.4open.science/r/EvoEmpirBench-143C/.
[140] Scalable RF Simulation in Generative 4D Worlds
Zhiwei Zheng, Dongyin Hu, Mingmin Zhao
Main category: cs.CV
TL;DR: WaveVerse is a prompt-based framework that generates realistic RF signals from simulated indoor scenes with human motions, enabling scalable RF data generation for privacy-preserving indoor perception tasks.
Details
Motivation: Collecting high-quality RF data in dynamic indoor environments is challenging, and there's a need for privacy-preserving alternatives to vision-based methods for indoor perception.
Method: Uses a language-guided 4D world generator with a state-aware causal transformer for human motion generation, and a phase-coherent ray tracing simulator for accurate RF signal simulation.
Result: Demonstrates effective conditioned human motion generation, enables RF imaging data generation for the first time, and achieves performance gains in both data-limited and data-adequate scenarios for ML tasks.
Conclusion: WaveVerse provides a scalable solution for generating realistic RF data, overcoming collection challenges and enabling new applications in RF-based indoor perception while maintaining privacy.
Abstract: Radio Frequency (RF) sensing has emerged as a powerful, privacy-preserving alternative to vision-based methods for indoor perception tasks. However, collecting high-quality RF data in dynamic and diverse indoor environments remains a major challenge. To address this, we introduce WaveVerse, a prompt-based, scalable framework that simulates realistic RF signals from generated indoor scenes with human motions. WaveVerse introduces a language-guided 4D world generator, which includes a state-aware causal transformer for human motion generation conditioned on spatial constraints and texts, and a phase-coherent ray tracing simulator that enables the simulation of accurate and coherent RF signals. Experiments demonstrate the effectiveness of our approach in conditioned human motion generation and highlight how phase coherence is applied to beamforming and respiration monitoring. We further present two case studies in ML-based high-resolution imaging and human activity recognition, demonstrating that WaveVerse not only enables data generation for RF imaging for the first time, but also consistently achieves performance gain in both data-limited and data-adequate scenarios.
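Illustrative note: the abstract mentions that phase coherence matters for beamforming. The sketch below is not WaveVerse's simulator; it is a minimal, assumed model in which each propagation path of length d contributes a complex gain exp(-j2πd/λ) (amplitude decay ignored), and a delay-and-sum beamformer undoes the expected phases before summing. The array geometry, the 6 GHz carrier, and all function names are hypothetical.

```python
import cmath
import math

def path_signal(distance_m, wavelength_m):
    """Complex gain of one propagation path (unit amplitude, phase only)."""
    return cmath.exp(-2j * math.pi * distance_m / wavelength_m)

def received(antenna_xs, source, wavelength_m):
    """Signal at each antenna of a linear array on the x-axis from a point source."""
    sx, sy = source
    return [path_signal(math.hypot(sx - x, sy), wavelength_m) for x in antenna_xs]

def beamform_power(signals, antenna_xs, focus, wavelength_m):
    """Delay-and-sum: cancel the phase expected from `focus`, then sum coherently."""
    fx, fy = focus
    total = 0j
    for s, x in zip(signals, antenna_xs):
        expected = path_signal(math.hypot(fx - x, fy), wavelength_m)
        total += s * expected.conjugate()
    return abs(total) ** 2

wavelength = 0.05                                   # assumed 6 GHz carrier
xs = [i * wavelength / 2 for i in range(8)]         # 8 antennas, half-wavelength spacing
src = (0.3, 2.0)                                    # made-up source position in metres
sig = received(xs, src, wavelength)
on_target = beamform_power(sig, xs, src, wavelength)
off_target = beamform_power(sig, xs, (1.5, 2.0), wavelength)
```

When the focus point matches the source, the eight unit-amplitude contributions add in phase, so the power approaches 8² = 64. Focusing elsewhere leaves residual phase differences and a much lower power, which is why simulated signals need consistent phases across antennas.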
[141] SPGen: Spherical Projection as Consistent and Flexible Representation for Single Image 3D Shape Generation
Jingdong Zhang, Weikai Chen, Yuan Liu, Jionghao Wang, Zhengming Yu, Zhuowen Shen, Bo Yang, Wenping Wang, Xin Li
Main category: cs.CV
TL;DR: SPGen introduces a novel spherical projection representation that maps 3D geometry to 2D images, enabling consistent single-view 3D generation with better internal structure representation and computational efficiency.
Details
Motivation: Existing single-view 3D generative models suffer from inter-view inconsistencies and cannot faithfully represent complex internal structures or nontrivial topologies when using multiview diffusion priors.
Method: Encodes geometry information by projecting it onto a bounding sphere and unwrapping it into a compact multi-layer 2D Spherical Projection (SP) representation, operating solely in the image domain to leverage 2D diffusion priors.
Result: SPGen significantly outperforms existing baselines in geometric quality and computational efficiency, offering consistent geometry encoding, flexible representation of nested internal structures, and efficient finetuning capabilities.
Conclusion: The spherical projection approach provides a superior solution for single-view 3D generation by eliminating view inconsistency while maintaining flexibility for complex structures and computational efficiency through image-domain processing.
Abstract: Existing single-view 3D generative models typically adopt multiview diffusion priors to reconstruct object surfaces, yet they remain prone to inter-view inconsistencies and are unable to faithfully represent complex internal structure or nontrivial topologies. In particular, we encode geometry information by projecting it onto a bounding sphere and unwrapping it into a compact and structural multi-layer 2D Spherical Projection (SP) representation. Operating solely in the image domain, SPGen offers three key advantages simultaneously: (1) Consistency. The injective SP mapping encodes surface geometry with a single viewpoint which naturally eliminates view inconsistency and ambiguity; (2) Flexibility. Multi-layer SP maps represent nested internal structures and support direct lifting to watertight or open 3D surfaces; (3) Efficiency. The image-domain formulation allows the direct inheritance of powerful 2D diffusion priors and enables efficient finetuning with limited computational resources. Extensive experiments demonstrate that SPGen significantly outperforms existing baselines in geometric quality and computational efficiency.
[142] Defense-to-Attack: Bypassing Weak Defenses Enables Stronger Jailbreaks in Vision-Language Models
Yunhan Zhao, Xiang Zheng, Xingjun Ma
Main category: cs.CV
TL;DR: Defense2Attack is a novel jailbreak method that uses defensive patterns to enhance attack effectiveness on Vision-Language Models, achieving superior performance in single attempts compared to state-of-the-art methods.
Details
Motivation: Vision-Language Models (VLMs) are vulnerable to jailbreak attacks, and existing methods lack both effectiveness and efficiency. The paper reveals that incorporating weak defense into attacks can significantly improve jailbreak performance.
Method: Three-component approach: (1) visual optimizer with adversarial perturbations and encouraging semantics, (2) textual optimizer using defense-styled prompts, and (3) red-team suffix generator with reinforcement fine-tuning.
Result: Achieves superior jailbreak performance on four VLMs and four safety benchmarks in single attempts, outperforming state-of-the-art methods that require multiple tries.
Conclusion: The work provides a new perspective on jailbreaking VLMs by demonstrating that leveraging defensive patterns can significantly enhance both effectiveness and efficiency of attacks.
Abstract: Despite their superb capabilities, Vision-Language Models (VLMs) have been shown to be vulnerable to jailbreak attacks. While recent jailbreaks have achieved notable progress, their effectiveness and efficiency can still be improved. In this work, we reveal an interesting phenomenon: incorporating weak defense into the attack pipeline can significantly enhance both the effectiveness and the efficiency of jailbreaks on VLMs. Building on this insight, we propose Defense2Attack, a novel jailbreak method that bypasses the safety guardrails of VLMs by leveraging defensive patterns to guide jailbreak prompt design. Specifically, Defense2Attack consists of three key components: (1) a visual optimizer that embeds universal adversarial perturbations with affirmative and encouraging semantics; (2) a textual optimizer that refines the input using a defense-styled prompt; and (3) a red-team suffix generator that enhances the jailbreak through reinforcement fine-tuning. We empirically evaluate our method on four VLMs and four safety benchmarks. The results demonstrate that Defense2Attack achieves superior jailbreak performance in a single attempt, outperforming state-of-the-art attack methods that often require multiple tries. Our work offers a new perspective on jailbreaking VLMs.
[143] Effective Gaussian Management for High-fidelity Object Reconstruction
Jiateng Liu, Hao Gao, Jiu-Cheng Xie, Chi-Man Pun, Jian Xiong, Haolun Li, Feng Xu
Main category: cs.CV
TL;DR: A novel Gaussian management approach for high-fidelity object reconstruction that dynamically activates spherical harmonics or normals, adaptively adjusts SH orders, and performs task-decoupled pruning to achieve superior reconstruction quality with fewer parameters.
Details
Motivation: To address the limitations of recent Gaussian Splatting methods that use indiscriminate attribute assignment, which causes gradient conflicts from dual supervision and inefficient representation.
Method: Introduces a densification strategy with dynamic activation of spherical harmonics or normals supervised by surface reconstruction, adaptive SH order adjustment based on gradient magnitudes, and task-decoupled pruning to remove unnecessary Gaussians.
Result: Extensive experiments show the approach consistently outperforms state-of-the-art methods in both reconstruction quality and efficiency, achieving superior performance with significantly fewer parameters.
Conclusion: The proposed Gaussian management approach is model-agnostic, can be integrated into other frameworks, and effectively balances representational capacity with parameter quantity while mitigating gradient conflicts.
Abstract: This paper proposes an effective Gaussian management approach for high-fidelity object reconstruction. Departing from recent Gaussian Splatting (GS) methods that employ indiscriminate attribute assignment, our approach introduces a novel densification strategy that dynamically activates spherical harmonics (SHs) or normals under the supervision of a surface reconstruction module, which effectively mitigates the gradient conflicts caused by dual supervision and achieves superior reconstruction results. To further improve representation efficiency, we develop a lightweight Gaussian representation that adaptively adjusts the SH orders of each Gaussian based on gradient magnitudes and performs task-decoupled pruning to remove Gaussians with minimal impact on a reconstruction task without sacrificing others, which balances the representational capacity with parameter quantity. Notably, our management approach is model-agnostic and can be seamlessly integrated into other frameworks, enhancing performance while reducing model size. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art approaches in both reconstruction quality and efficiency, achieving superior performance with significantly fewer parameters.
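Illustrative note: to see why per-Gaussian SH orders can save parameters, recall that spherical harmonics up to order ℓ need (ℓ+1)² coefficients per colour channel. The sketch below is not the paper's rule; the thresholds, the gradient values, and the function names are hypothetical and only show how a gradient-driven order assignment changes the coefficient count.

```python
def sh_coeff_count(order, channels=3):
    """Number of SH coefficients for degrees 0..order across colour channels."""
    return channels * (order + 1) ** 2

def choose_sh_order(grad_mag, thresholds=(1e-4, 1e-3, 1e-2)):
    """Hypothetical rule: each threshold exceeded raises the SH order by one."""
    return sum(grad_mag > t for t in thresholds)

grad_mags = [5e-5, 2e-4, 5e-3, 0.2]   # made-up per-Gaussian gradient magnitudes
orders = [choose_sh_order(g) for g in grad_mags]
adaptive = sum(sh_coeff_count(o) for o in orders)   # coefficients with adaptive orders
uniform = len(grad_mags) * sh_coeff_count(3)        # every Gaussian at order 3
```

With these made-up values the four Gaussians receive orders 0 to 3 and store 90 colour coefficients in total, against 192 if all were kept at order 3.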
[144] Modelling and analysis of the 8 filters from the “master key filters hypothesis” for depthwise-separable deep networks in relation to idealized receptive fields based on scale-space theory
Tony Lindeberg, Zahra Babaiee, Peyman M. Kiasari
Main category: cs.CV
TL;DR: Analysis of master key filters from ConvNeXt depthwise-separable networks shows they can be effectively modeled as separable Gaussian-based difference operators with spatial offsets near half grid units, and these idealized filters can replace learned filters while maintaining performance.
Details
Motivation: To understand and model the receptive fields learned in depthwise-separable deep networks based on ConvNeXt architecture, and demonstrate that these learned filters can be approximated by discrete scale-space filters.
Method: Applied clustering to extract 8 master key filters, computed spatial spread measures, modeled filters as difference operators applied to Gaussian smoothing, and performed model fitting using spatial variance matching and norm minimization approaches.
Result: The idealized models show good qualitative similarity to learned filters and have good predictive properties when replacing learned filters in depthwise-separable networks.
Conclusion: Learned filters in depthwise-separable deep networks can be well approximated by discrete scale-space filters, supporting the hypothesis that they operate as separable filtering operations with specific spatial characteristics.
Abstract: This paper presents the results of analysing and modelling a set of 8 “master key filters”, which have been extracted by applying a clustering approach to the receptive fields learned in depthwise-separable deep networks based on the ConvNeXt architecture. For this purpose, we first compute spatial spread measures in terms of weighted mean values and weighted variances of the absolute values of the learned filters, which support the working hypotheses that: (i) the learned filters can be modelled by separable filtering operations over the spatial domain, and that (ii) the spatial offsets of those learned filters that are non-centered are rather close to half a grid unit. Then, we model the clustered “master key filters” in terms of difference operators applied to a spatial smoothing operation in terms of the discrete analogue of the Gaussian kernel, and demonstrate that the resulting idealized models of the receptive fields show good qualitative similarity to the learned filters. This modelling is performed in two different ways: (i) using possibly different values of the scale parameters in the coordinate directions for each filter, and (ii) using the same value of the scale parameter in both coordinate directions. Then, we perform the actual model fitting by either (i) requiring spatial spread measures in terms of spatial variances of the absolute values of the receptive fields to be equal, or (ii) minimizing the discrete $l_1$- or $l_2$-norms between the idealized receptive field models and the learned filters. Complementary experimental results then demonstrate that the idealized models of receptive fields have good predictive properties for replacing the learned filters by idealized filters in depthwise-separable deep networks, thus showing that the learned filters in depthwise-separable deep networks can be well approximated by discrete scale-space filters.
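Illustrative note: the quantities named in this abstract can be sketched in one dimension. The code below assumes the standard definition of the discrete analogue of the Gaussian kernel, T(n; s) = e^(-s) I_n(s) with I_n the modified Bessel function, and computes the weighted mean and variance of the absolute filter values. It is not the authors' fitting code; the function names, the series truncation, and the choice s = 1 are assumptions. The forward difference at the end shows one way a half-grid-unit offset can arise for a non-centered operator.

```python
import math

def bessel_i(n, s, terms=40):
    """Modified Bessel function I_n(s), evaluated by its truncated power series."""
    n = abs(n)
    return sum((s / 2) ** (2 * k + n) / (math.factorial(k) * math.factorial(k + n))
               for k in range(terms))

def discrete_gaussian(s, radius):
    """Discrete analogue of the Gaussian kernel on the samples -radius..radius."""
    return [math.exp(-s) * bessel_i(n, s) for n in range(-radius, radius + 1)]

def spread(filt, xs):
    """Weighted mean and variance of positions xs, weighted by |filter value|."""
    w = [abs(v) for v in filt]
    total = sum(w)
    mean = sum(x * v for x, v in zip(xs, w)) / total
    var = sum((x - mean) ** 2 * v for x, v in zip(xs, w)) / total
    return mean, var

def forward_difference(filt):
    """First-order difference f[n+1] - f[n]; the output is one sample shorter."""
    return [b - a for a, b in zip(filt, filt[1:])]

s, radius = 1.0, 8
kernel = discrete_gaussian(s, radius)
mean, var = spread(kernel, range(-radius, radius + 1))

# Index each difference by its left sample, so positions run -radius..radius-1.
diff = forward_difference(kernel)
diff_mean, diff_var = spread(diff, range(-radius, radius))
```

For the smoothing kernel the variance comes out equal to the scale parameter s (up to truncation error), which is what makes variance matching a natural fitting criterion. The differenced kernel has its weighted mean at -0.5, half a grid unit from the nearest sample.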
[145] What Makes a Good Generated Image? Investigating Human and Multimodal LLM Image Preference Alignment
Rishab Parthasarathy, Jasmine Collins, Cory Stephenson
Main category: cs.CV
TL;DR: This paper analyzes how multimodal LLMs and humans evaluate image quality across different attributes like aesthetics, artifacts, anatomical accuracy, composition, object adherence, and style, finding significant differences in their perception and judgment capabilities.
Details
Motivation: Automated evaluation of text-to-image models is challenging, and while recent works use multimodal LLMs for image quality assessment, there's limited understanding of how these models utilize human-relevant concepts like style and composition in their judgments.
Method: The researchers curated a dataset of human preferences using synthetically generated image pairs, analyzed inter-task correlation between image quality attributes for both humans and LLMs, and created synthetic datasets with controlled variations for each quality attribute to study individual attribute perception.
Result: Humans showed strong inter-task correlations between image quality attributes, while LLMs exhibited much weaker relationships. Humans could easily judge all specific image quality attributes, but LLMs struggled particularly with attributes like anatomical accuracy.
Conclusion: There are significant differences in how humans and multimodal LLMs perceive and judge image quality, with LLMs showing weaker understanding of the relationships between quality attributes and specific difficulties in assessing certain attributes like anatomical accuracy.
Abstract: Automated evaluation of generative text-to-image models remains a challenging problem. Recent works have proposed using multimodal LLMs to judge the quality of images, but these works offer little insight into how multimodal LLMs make use of concepts relevant to humans, such as image style or composition, to generate their overall assessment. In this work, we study what attributes of an image–specifically aesthetics, lack of artifacts, anatomical accuracy, compositional correctness, object adherence, and style–are important for both LLMs and humans to make judgments on image quality. We first curate a dataset of human preferences using synthetically generated image pairs. We use inter-task correlation between each pair of image quality attributes to understand which attributes are related in making human judgments. Repeating the same analysis with LLMs, we find that the relationships between image quality attributes are much weaker. Finally, we study individual image quality attributes by generating synthetic datasets with a high degree of control for each axis. Humans are able to easily judge the quality of an image with respect to all of the specific image quality attributes (e.g. high vs. low aesthetic image), however we find that some attributes, such as anatomical accuracy, are much more difficult for multimodal LLMs to learn to judge. Taken together, these findings reveal interesting differences between how humans and multimodal LLMs perceive images.
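Illustrative note: the inter-task correlation analysis can be pictured as a correlation matrix over per-attribute preference votes. The sketch below assumes Pearson correlation on binary votes; the abstract does not specify the statistic used, and the votes, attribute subset, and function name are made up for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

# Made-up votes: 1 if image A of a pair was preferred on that attribute, else 0.
prefs = {
    "aesthetics": [1, 1, 0, 1, 0, 0, 1, 0],
    "artifacts":  [1, 1, 0, 1, 0, 1, 1, 0],
    "style":      [0, 1, 1, 0, 1, 0, 1, 0],
}
names = list(prefs)
corr = {(p, q): pearson(prefs[p], prefs[q]) for p in names for q in names}
```

In this toy data the aesthetics and artifacts votes are strongly correlated while aesthetics and style are uncorrelated. Running the same computation on human votes and on LLM votes gives two matrices that can be compared entry by entry.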
[146] Recurrent Cross-View Object Geo-Localization
Xiaohan Zhang, Si-Yuan Cao, Xiaokai Bai, Yiming Li, Zhangkai Shen, Zhe Wu, Xiaoxi Hu, Hui-liang Shen
Main category: cs.CV
TL;DR: ReCOT is a recurrent transformer framework for cross-view object geo-localization that uses iterative refinement with SAM-based knowledge distillation and hierarchical attention to achieve state-of-the-art performance with 60% fewer parameters.
Details
Motivation: Existing CVOGL approaches treat the task as one-shot detection, making them vulnerable to feature noise and lacking error correction mechanisms for accurate object localization.
Method: Proposes ReCOT with learnable tokens that iteratively attend to reference features, SAM-based knowledge distillation for semantic guidance, and Reference Feature Enhancement Module with hierarchical attention to focus on object-relevant regions.
Result: Achieves state-of-the-art performance on standard CVOGL benchmarks while reducing parameters by 60% compared to previous SOTA approaches.
Conclusion: Reformulating CVOGL as a recurrent localization task with iterative refinement and semantic guidance significantly improves accuracy and efficiency in cross-view object geo-localization.
Abstract: Cross-view object geo-localization (CVOGL) aims to determine the location of a specific object in high-resolution satellite imagery given a query image with a point prompt. Existing approaches treat CVOGL as a one-shot detection task, directly regressing object locations from cross-view information aggregation, but they are vulnerable to feature noise and lack mechanisms for error correction. In this paper, we propose ReCOT, a Recurrent Cross-view Object geo-localization Transformer, which reformulates CVOGL as a recurrent localization task. ReCOT introduces a set of learnable tokens that encode task-specific intent from the query image and prompt embeddings, and iteratively attend to the reference features to refine the predicted location. To enhance this recurrent process, we incorporate two complementary modules: (1) a SAM-based knowledge distillation strategy that transfers segmentation priors from the Segment Anything Model (SAM) to provide clearer semantic guidance without additional inference cost, and (2) a Reference Feature Enhancement Module (RFEM) that introduces a hierarchical attention to emphasize object-relevant regions in the reference features. Extensive experiments on standard CVOGL benchmarks demonstrate that ReCOT achieves state-of-the-art (SOTA) performance while reducing parameters by 60% compared to previous SOTA approaches.
[147] A-TDOM: Active TDOM via On-the-Fly 3DGS
Yiwei Xu, Xiang Wang, Yifei Yu, Wentian Gan, Luca Morelli, Giulio Perda, Xiongwu Xiao, Zongqian Zhan, Xin Wang, Fabio Remondino
Main category: cs.CV
TL;DR: A-TDOM is a near real-time True Digital Orthophoto Map generation method using On-the-Fly 3D Gaussian Splatting optimization that processes each new image in seconds while maintaining quality.
Details
Motivation: Traditional TDOM generation methods rely on complex offline photogrammetric pipelines causing delays, and quality degrades due to inaccurate camera poses, DSM errors, and scene occlusions.
Method: Uses On-the-Fly SfM to compute pose and sparse point cloud for each new image, integrates new Gaussians into previously unseen regions, and employs orthogonal splatting for immediate rendering after each 3DGS field update.
Result: A-TDOM achieves near real-time TDOM generation with 3DGS optimization for each new image completed in seconds while maintaining acceptable rendering quality and geometric accuracy on multiple benchmarks.
Conclusion: The proposed method enables active TDOM rendering in near real-time, overcoming limitations of traditional offline approaches and addressing quality degradation challenges.
Abstract: True Digital Orthophoto Map (TDOM) serves as a crucial geospatial product in various fields such as urban management, city planning, land surveying, etc. However, traditional TDOM generation methods generally rely on a complex offline photogrammetric pipeline, resulting in delays that hinder real-time applications. Moreover, the quality of TDOM may degrade due to various challenges, such as inaccurate camera poses or Digital Surface Model (DSM) and scene occlusions. To address these challenges, this work introduces A-TDOM, a near real-time TDOM generation method based on On-the-Fly 3DGS optimization. As each image is acquired, its pose and sparse point cloud are computed via On-the-Fly SfM. Then new Gaussians are integrated and optimized into previously unseen or coarsely reconstructed regions. By integrating with orthogonal splatting, A-TDOM can render just after each update of a new 3DGS field. Initial experiments on multiple benchmarks show that the proposed A-TDOM is capable of actively rendering TDOM in near real-time, with 3DGS optimization for each new image in seconds while maintaining acceptable rendering quality and TDOM geometric accuracy.
[148] DyGLNet: Hybrid Global-Local Feature Fusion with Dynamic Upsampling for Medical Image Segmentation
Yican Zhao, Ce Wang, You Hao, Lei Li, Tianli Liao
Main category: cs.CV
TL;DR: DyGLNet is an efficient medical image segmentation model that fuses global and local features using a hybrid SHDCBlock and dynamic upsampling (DyFusionUp) to address multi-scale lesions and boundary challenges with reduced computation.
Details
Motivation: Medical image segmentation faces challenges with multi-scale lesion variability, ill-defined tissue boundaries, and high computational demands that need to be addressed for clinical applications.
Method: Proposes DyGLNet with hybrid feature extraction (SHDCBlock combining single-head self-attention and multi-scale dilated convolutions) and dynamic adaptive upsampling (DyFusionUp) with learnable offsets for feature reconstruction, plus lightweight design.
Result: Outperforms existing methods on seven public datasets, particularly excelling in boundary accuracy and small-object segmentation, while maintaining lower computational complexity.
Conclusion: DyGLNet provides an efficient and reliable solution for clinical medical image analysis with superior performance and reduced computational overhead.
Abstract: Medical image segmentation grapples with challenges including multi-scale lesion variability, ill-defined tissue boundaries, and computationally intensive processing demands. This paper proposes the DyGLNet, which achieves efficient and accurate segmentation by fusing global and local features with a dynamic upsampling mechanism. The model innovatively designs a hybrid feature extraction module (SHDCBlock), combining single-head self-attention and multi-scale dilated convolutions to model local details and global context collaboratively. We further introduce a dynamic adaptive upsampling module (DyFusionUp) to realize high-fidelity reconstruction of feature maps based on learnable offsets. Then, a lightweight design is adopted to reduce computational overhead. Experiments on seven public datasets demonstrate that DyGLNet outperforms existing methods, particularly excelling in boundary accuracy and small-object segmentation. Meanwhile, it exhibits lower computation complexity, enabling an efficient and reliable solution for clinical medical image analysis. The code will be made available soon.
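Illustrative note: the multi-scale dilated convolutions mentioned here widen the receptive field without adding weights. A kernel with k taps and dilation d spans (k-1)·d + 1 input samples. The one-dimensional sketch below is not the SHDCBlock; the function name and the toy input are assumptions used only to show the effect of the dilation rate.

```python
def dilated_conv1d(x, kernel, dilation):
    """'Valid' 1-D cross-correlation with `dilation` samples between kernel taps."""
    span = (len(kernel) - 1) * dilation + 1   # effective receptive field
    return [sum(w * x[i + j * dilation] for j, w in enumerate(kernel))
            for i in range(len(x) - span + 1)]

x = list(range(10))        # toy ramp signal 0, 1, ..., 9
kernel = [1, 0, -1]        # three weights, reused at every dilation rate
outputs = {d: dilated_conv1d(x, kernel, d) for d in (1, 2, 3)}
```

The same three weights compare samples 2, 4, and 6 positions apart at dilation rates 1, 2, and 3, so on the ramp the responses are -2, -4, and -6. Running several rates in parallel is one common way to capture structures of different sizes.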
[149] BATR-FST: Bi-Level Adaptive Token Refinement for Few-Shot Transformers
Mohammed Al-Habib, Zuping Zhang, Abdulrahman Noman
Main category: cs.CV
TL;DR: BATR-FST is a two-stage Vision Transformer approach that uses bi-level adaptive token refinement with clustering, uncertainty weighting, and attention mechanisms to improve few-shot learning performance.
Details
Motivation: Vision Transformers struggle with few-shot learning due to limited token-level interaction refinement, insufficient training data, and weak inductive bias. Existing methods rely on inflexible token matching or basic similarity measures.
Method: Two-stage approach: 1) Pre-training with Masked Image Modeling for patch-level representations, 2) Meta-fine-tuning with Bi-Level Adaptive Token Refinement (token clustering, uncertainty-aware weighting, bi-level attention), Graph Token Propagation, and Class Separation Penalty.
Result: Superior performance on three benchmark few-shot datasets in both 1-shot and 5-shot scenarios, demonstrating improved few-shot classification via transformers.
Conclusion: BATR-FST effectively addresses Vision Transformers’ limitations in few-shot learning through progressive token refinement and robust inductive bias maintenance, achieving state-of-the-art results.
Abstract: Vision Transformers (ViTs) have shown significant promise in computer vision applications. However, their performance in few-shot learning is limited by challenges in refining token-level interactions, struggling with limited training data, and developing a strong inductive bias. Existing methods often depend on inflexible token matching or basic similarity measures, which limit the effective incorporation of global context and localized feature refinement. To address these challenges, we propose Bi-Level Adaptive Token Refinement for Few-Shot Transformers (BATR-FST), a two-stage approach that progressively improves token representations and maintains a robust inductive bias for few-shot classification. During the pre-training phase, Masked Image Modeling (MIM) provides Vision Transformers (ViTs) with transferable patch-level representations by recreating masked image regions, providing a robust basis for subsequent adaptation. In the meta-fine-tuning phase, BATR-FST incorporates a Bi-Level Adaptive Token Refinement module that utilizes Token Clustering to capture localized interactions, Uncertainty-Aware Token Weighting to prioritize dependable features, and a Bi-Level Attention mechanism to balance intra-cluster and inter-cluster relationships, thereby facilitating thorough token refinement. Furthermore, Graph Token Propagation ensures semantic consistency between support and query instances, while a Class Separation Penalty preserves different class borders, enhancing discriminative capability. Extensive experiments on three benchmark few-shot datasets demonstrate that BATR-FST achieves superior results in both 1-shot and 5-shot scenarios and improves the few-shot classification via transformers.
[150] CECT-Mamba: a Hierarchical Contrast-enhanced-aware Model for Pancreatic Tumor Subtyping from Multi-phase CECT
Zhifang Gong, Shuo Gao, Ben Zhao, Yingjing Xu, Yijun Yang, Shenghong Ju, Guangquan Zhou
Main category: cs.CV
TL;DR: Novel Mamba-based framework for pancreatic tumor subtyping using multi-phase CECT scans, achieving 97.4% accuracy in distinguishing PDAC from PNETs.
Details
Motivation: Current methods fail to effectively leverage contextual information across multiple CECT phases used in radiologists' diagnostic workflows, limiting performance for pancreatic tumor subtyping despite high heterogeneity and variability of these tumors.
Method: Dual hierarchical contrast-enhanced-aware Mamba module with spatial and temporal sampling sequences to explore intra and inter-phase contrast variations. Includes similarity-guided refinement for temporal modeling, space complementary integrator, and multi-granularity fusion module for cross-scale semantics.
Result: Achieved 97.4% accuracy and 98.6% AUC on in-house dataset of 270 clinical cases for distinguishing pancreatic ductal adenocarcinoma (PDAC) from pancreatic neuroendocrine tumors (PNETs).
Conclusion: The proposed framework demonstrates potential as a more accurate and efficient tool for pancreatic tumor subtyping by effectively combining multi-phase CECT data with Mamba architecture for both temporal and spatial modeling.
Abstract: Contrast-enhanced computed tomography (CECT) is the primary imaging technique that provides valuable spatial-temporal information about lesions, enabling the accurate diagnosis and subclassification of pancreatic tumors. However, the high heterogeneity and variability of pancreatic tumors still pose substantial challenges for precise subtyping diagnosis. Previous methods fail to effectively explore the contextual information across multiple CECT phases commonly used in radiologists’ diagnostic workflows, thereby limiting their performance. In this paper, we introduce, for the first time, an automatic way to combine the multi-phase CECT data to discriminate between pancreatic tumor subtypes, among which the key is using Mamba with promising learnability and simplicity to encourage both temporal and spatial modeling from multi-phase CECT. Specifically, we propose a dual hierarchical contrast-enhanced-aware Mamba module incorporating two novel spatial and temporal sampling sequences to explore intra and inter-phase contrast variations of lesions. A similarity-guided refinement module is also imposed into the temporal scanning modeling to emphasize the learning on local tumor regions with more obvious temporal variations. Moreover, we design the space complementary integrator and multi-granularity fusion module to encode and aggregate the semantics across different scales, achieving more efficient learning for subtyping pancreatic tumors. The experimental results on an in-house dataset of 270 clinical cases achieve an accuracy of 97.4% and an AUC of 98.6% in distinguishing between pancreatic ductal adenocarcinoma (PDAC) and pancreatic neuroendocrine tumors (PNETs), demonstrating its potential as a more accurate and efficient tool.
[151] Modeling the Multivariate Relationship with Contextualized Representations for Effective Human-Object Interaction Detection
Zhehao Li, Yucheng Qian, Chong Wang, Yinghao Lu, Zhihao Yang, Jiafei Wu
Main category: cs.CV
TL;DR: A Contextualized Representation Learning Network for HOI detection that integrates affordance-guided reasoning and contextual prompts to better capture complex interactions involving tools and auxiliary objects.
Details
Motivation: Existing two-stage HOI detection approaches face challenges due to incomplete context modeling, particularly for complex interactions involving tools and auxiliary entities.
Method: Expands HOI detection beyond simple human-object pairs to include multivariate relationships through triplet structures <human, tool, object>. Uses affordance-guided reasoning and learnable prompts enriched with instance categories, integrated with visual features via attention mechanism for language-image alignment.
Result: Demonstrates superior performance on both HICO-Det and V-COCO datasets in most scenarios.
Conclusion: The proposed contextualized representation learning approach effectively captures complex tool-dependent interactions and provides enriched relational cues for more reliable reasoning in HOI detection.
Abstract: Human-Object Interaction (HOI) detection aims to simultaneously localize human-object pairs and recognize their interactions. While recent two-stage approaches have made significant progress, they still face challenges due to incomplete context modeling. In this work, we introduce a Contextualized Representation Learning Network that integrates both affordance-guided reasoning and contextual prompts with visual cues to better capture complex interactions. We enhance the conventional HOI detection framework by expanding it beyond simple human-object pairs to include multivariate relationships involving auxiliary entities like tools. Specifically, we explicitly model the functional role (affordance) of these auxiliary objects through triplet structures <human, tool, object>. This enables our model to identify tool-dependent interactions such as ‘filling’. Furthermore, the learnable prompt is enriched with instance categories and subsequently integrated with contextual visual features using an attention mechanism. This process aligns language with image content at both global and regional levels. These contextualized representations equip the model with enriched relational cues for more reliable reasoning over complex, context-dependent interactions. Our proposed method demonstrates superior performance on both the HICO-Det and V-COCO datasets in most scenarios. Codes will be released upon acceptance.
[152] Double Helix Diffusion for Cross-Domain Anomaly Image Generation
Linchun Wu, Qin Zou, Xianbiao Qi, Bo Du, Zhongyuan Wang, Qingquan Li
Main category: cs.CV
TL;DR: DH-Diff is a novel cross-domain generative framework that simultaneously synthesizes high-fidelity anomaly images and pixel-level masks, addressing feature entanglement and structural inconsistency issues in synthetic anomaly generation.
Details
Motivation: Visual anomaly inspection suffers from scarcity of real anomaly samples for training robust detectors. Current synthetic data generation methods have limitations: structurally inconsistent anomalies and feature entanglement between images and annotation masks that undermine perceptual realism.
Method: Double Helix Diffusion (DH-Diff) uses a double helix-inspired architecture with distinct modules for feature separation, connection, and merging. It employs domain-decoupled attention to mitigate feature entanglement and semantic score map alignment for structural authenticity. Offers flexible control via text prompts and optional graphical guidance.
Result: DH-Diff significantly outperforms state-of-the-art methods in diversity and authenticity of generated anomalies. This leads to significant improvements in downstream anomaly detection performance.
Conclusion: DH-Diff provides an effective solution for synthetic anomaly generation that addresses key limitations of existing methods, enabling better training of anomaly detectors through high-quality synthetic data.
Abstract: Visual anomaly inspection is critical in manufacturing, yet hampered by the scarcity of real anomaly samples for training robust detectors. Synthetic data generation presents a viable strategy for data augmentation; however, current methods remain constrained by two principal limitations: 1) the generation of anomalies that are structurally inconsistent with the normal background, and 2) the presence of undesirable feature entanglement between synthesized images and their corresponding annotation masks, which undermines the perceptual realism of the output. This paper introduces Double Helix Diffusion (DH-Diff), a novel cross-domain generative framework designed to simultaneously synthesize high-fidelity anomaly images and their pixel-level annotation masks, explicitly addressing these challenges. DH-Diff employs a unique architecture inspired by a double helix, cycling through distinct modules for feature separation, connection, and merging. Specifically, a domain-decoupled attention mechanism mitigates feature entanglement by enhancing image and annotation features independently, and meanwhile a semantic score map alignment module ensures structural authenticity by coherently integrating anomaly foregrounds. DH-Diff offers flexible control via text prompts and optional graphical guidance. Extensive experiments demonstrate that DH-Diff significantly outperforms state-of-the-art methods in diversity and authenticity, leading to significant improvements in downstream anomaly detection performance.
[153] Superpixel Anything: A general object-based framework for accurate yet regular superpixel segmentation
Julien Walther, Rémi Giraud, Michaël Clément
Main category: cs.CV
TL;DR: SPAM is a deep learning framework that generates accurate and regular superpixels by combining image features with large-scale pretrained segmentation models, outperforming state-of-the-art methods.
Details
Motivation: Traditional superpixel methods use low-level features while deep learning approaches sacrifice regularity for accuracy. There's a need for superpixels that are both accurate and regular/interpretable.
Method: Train a model to extract image features for superpixel generation, then leverage a large-scale pretrained semantic-agnostic segmentation model at inference to align superpixels with object masks. Can handle prior segmentation and interactive object focusing.
Result: Comprehensive experiments show SPAM qualitatively and quantitatively outperforms state-of-the-art methods on segmentation tasks.
Conclusion: SPAM provides a versatile, robust framework for generating accurate yet regular superpixels, making it valuable for various computer vision applications.
Abstract: Superpixels are widely used in computer vision to simplify image representation and reduce computational complexity. While traditional methods rely on low-level features, deep learning-based approaches leverage high-level features but also tend to sacrifice regularity of superpixels to capture complex objects, leading to accurate but less interpretable segmentations. In this work, we introduce SPAM (SuperPixel Anything Model), a versatile framework for segmenting images into accurate yet regular superpixels. We train a model to extract image features for superpixel generation, and at inference, we leverage a large-scale pretrained model for semantic-agnostic segmentation to ensure that superpixels align with object masks. SPAM can handle any prior high-level segmentation, resolving uncertainty regions, and is able to interactively focus on specific objects. Comprehensive experiments demonstrate that SPAM qualitatively and quantitatively outperforms state-of-the-art methods on segmentation tasks, making it a valuable and robust tool for various applications. Code and pre-trained models are available here: https://github.com/waldo-j/spam.
[154] Hunyuan3D Studio: End-to-End AI Pipeline for Game-Ready 3D Asset Generation
Biwen Lei, Yang Li, Xinhai Liu, Shuhui Yang, Lixin Xu, Jingwei Huang, Ruining Tang, Haohan Weng, Jian Liu, Jing Xu, Zhen Zhou, Yiling Zhu, Jiankai Xing, Jiachen Xu, Changfeng Ma, Xinhao Yan, Yunhan Yang, Chunshi Wang, Duoteng Xu, Xueqi Ma, Yuguang Chen, Jing Li, Mingxin Yang, Sheng Zhang, Yifei Feng, Xin Huang, Di Luo, Zebin He, Puhua Jiang, Changrong Hu, Zihan Qin, Shiwei Miao, Haolin Liu, Yunfei Zhao, Zeqiang Lai, Qingxiang Lin, Zibo Zhao, Kunhong Li, Xianghui Yang, Huiwen Shi, Xin Yang, Yuxuan Wang, Zebin Yao, Yihang Lian, Sicong Liu, Xintong Han, Wangchen Qin, Caisheng Ouyang, Jianyin Liu, Tianwen Yuan, Shuai Jiang, Hong Duan, Yanqi Niu, Wencong Lin, Yifu Sun, Shirui Huang, Lin Niu, Gu Gong, Guojian Xiao, Bojian Zheng, Xiang Yuan, Qi Chen, Jie Xiao, Dongyang Zheng, Xiaofeng Yang, Kai Liu, Jianchen Zhu, Lifu Wang, Qinglin Lu, Jie Liu, Liang Dong, Fan Jiang, Ruibin Chen, Lei Wang, Chao Zhang, Jiaxin Lin, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Yinhe Wu, Jiayao Du, Jupeng Chen, Xinyue Mao, Dongyuan Guo, Yixuan Tang, Yulin Tsai, Yonghao Tan, Jiaao Yu, Junlin Yu, Keren Zhang, Yifan Li, Peng Chen, Tian Liu, Di Wang, Yuhong Liu, Linus, Jie Jiang, Zhuo Chen, Chunchao Guo
Main category: cs.CV
TL;DR: Hunyuan3D Studio is an AI-powered platform that automates game-ready 3D asset creation from concept images or text descriptions, integrating neural modules for part-level generation, polygon creation, and semantic UV mapping.
Details
Motivation: Traditional 3D asset creation is labor-intensive and requires specialized workflows, creating bottlenecks in game development that need automation and streamlining.
Method: End-to-end platform integrating advanced neural modules (Part-level 3D Generation, Polygon Generation, Semantic UV) into a unified framework that transforms concept images or text into production-quality 3D models with optimized geometry and PBR textures.
Result: Generated assets are visually compelling and meet technical requirements of modern game engines, significantly reducing iteration time and lowering barriers to 3D content creation.
Conclusion: Hunyuan3D Studio represents a significant advancement in AI-assisted workflows for game development, providing a seamless bridge from creative intent to technical assets.
Abstract: The creation of high-quality 3D assets, a cornerstone of modern game development, has long been characterized by labor-intensive and specialized workflows. This paper presents Hunyuan3D Studio, an end-to-end AI-powered content creation platform designed to revolutionize the game production pipeline by automating and streamlining the generation of game-ready 3D assets. At its core, Hunyuan3D Studio integrates a suite of advanced neural modules (such as Part-level 3D Generation, Polygon Generation, Semantic UV, etc.) into a cohesive and user-friendly system. This unified framework allows for the rapid transformation of a single concept image or textual description into a fully-realized, production-quality 3D model complete with optimized geometry and high-fidelity PBR textures. We demonstrate that assets generated by Hunyuan3D Studio are not only visually compelling but also adhere to the stringent technical requirements of contemporary game engines, significantly reducing iteration time and lowering the barrier to entry for 3D content creation. By providing a seamless bridge from creative intent to technical asset, Hunyuan3D Studio represents a significant leap forward for AI-assisted workflows in game development and interactive media.
[155] SAGA: Selective Adaptive Gating for Efficient and Expressive Linear Attention
Yuan Cao, Dong Wang
Main category: cs.CV
TL;DR: SAGA introduces selective adaptive gating to improve linear attention by addressing feature redundancy and low-rank constraints, achieving better computational efficiency and accuracy compared to standard linear attention methods.
Details
Motivation: Transformer architectures face quadratic complexity issues with softmax attention, especially for high-resolution images. Linear attention reduces complexity but suffers from uniform compression of KV information leading to feature redundancy and performance gaps.
Method: Proposes SAGA with input-adaptive learnable gates to selectively modulate information aggregation into KV feature maps, plus an efficient Hadamard-product decomposition for gate computation without additional memory overhead.
Result: Achieves 1.76× throughput improvement and 2.69× reduction in peak GPU memory at 1280×1280 resolution compared to PVT-T. Improves top-1 accuracy by up to 4.4% on ImageNet.
Conclusion: SAGA effectively bridges the performance gap between linear and softmax attention while maintaining computational efficiency, demonstrating both improved model effectiveness and reduced resource requirements.
Abstract: While the Transformer architecture excels at modeling long-range dependencies, contributing to its widespread adoption in vision tasks, the quadratic complexity of softmax-based attention mechanisms imposes a major bottleneck, particularly when processing high-resolution images. Linear attention presents a promising alternative by reformulating the attention computation from $(QK)V$ to $Q(KV)$, thereby reducing the complexity from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$ while preserving the global receptive field. However, most existing methods compress historical key-value (KV) information uniformly, which can lead to feature redundancy and the loss of directional alignment with the query (Q). This uniform compression results in low-rank $KV$ feature maps, contributing to a performance gap compared to softmax attention. To mitigate this limitation, we propose Selective Adaptive GAting for Efficient and Expressive Linear Attention (SAGA), which introduces input-adaptive learnable gates to selectively modulate information aggregation into the $KV$ feature map. These gates enhance semantic diversity and alleviate the low-rank constraint inherent in conventional linear attention. Additionally, we propose an efficient Hadamard-product decomposition method for gate computation, which introduces no additional memory overhead. Experiments demonstrate that SAGA achieves a 1.76$\times$ improvement in throughput and a 2.69$\times$ reduction in peak GPU memory compared to PVT-T at a resolution of $1280 \times 1280$. Moreover, it improves top-1 accuracy by up to 4.4% on the ImageNet dataset, demonstrating both computational efficiency and model effectiveness.
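The reordering from $(QK)V$ to $Q(KV)$ mentioned in this abstract can be illustrated with a minimal Python sketch. It leaves out the softmax and any kernel feature map, and works on plain lists of rows. All function names are hypothetical, and the per-token scalar gate in `gated_linear_attention` is only an assumed stand-in for "input-adaptive gates"; it is not SAGA's actual gate design or its Hadamard-product decomposition.

```python
def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def quadratic_attention(q, k, v):
    """(Q K^T) V: materializes an N x N matrix, so cost grows as N^2."""
    return matmul(matmul(q, transpose(k)), v)


def linear_attention(q, k, v):
    """Q (K^T V): materializes only a d x d matrix, so cost grows as N."""
    return matmul(q, matmul(transpose(k), v))


def gated_linear_attention(q, k, v, gates):
    """Assumed illustration: scale each token's contribution to K^T V
    by one scalar gate per token before aggregation."""
    gated_k = [[g * x for x in row] for g, row in zip(gates, k)]
    return matmul(q, matmul(transpose(gated_k), v))


if __name__ == "__main__":
    q = [[1, 2], [3, 4], [5, 6]]   # N = 3 tokens, d = 2 features
    k = [[1, 0], [0, 1], [1, 1]]
    v = [[2, 1], [0, 3], [1, 1]]
    # Both orderings give the same result because matrix products associate.
    print(quadratic_attention(q, k, v) == linear_attention(q, k, v))
```

Without the softmax the two orderings are mathematically identical; the saving comes purely from which intermediate matrix is built.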
[156] Data Scaling Laws for Radiology Foundation Models
Maximilian Ilse, Harshita Sharma, Anton Schwaighofer, Sam Bond-Taylor, Fernando Pérez-García, Olesya Melnichenko, Anne-Marie G. Sykes, Kelly K. Horst, Ashish Khandelwal, Maxwell Reynolds, Maria T. Wetscherek, Noel C. F. Codella, Javier Alvarez-Valle, Korfiatis Panagiotis, Valentina Salvatelli
Main category: cs.CV
TL;DR: Systematic study of continual pretraining for medical vision encoders (MI2 and RAD-DINO) on up to 3.5M chest x-rays shows task-specific performance scaling: MI2 excels at finding-related tasks while RAD-DINO is better for tube-related tasks, with structured supervision proving valuable.
Details
Motivation: Medical imaging foundation models are constrained by smaller datasets compared to general vision models like CLIP and DINOv2. The research aims to understand how data scale and pretraining paradigms affect performance in medical imaging, particularly for diverse tasks beyond just radiology findings.
Method: Continual pretraining of two vision encoders (MI2 representing CLIP paradigm and RAD-DINO representing DINOv2 paradigm) on up to 3.5M chest x-rays from a single institution. Evaluation includes classification (findings, lines, tubes), segmentation (lines, tubes), and radiology report generation tasks.
Result: MI2 scales more effectively for finding-related tasks, while RAD-DINO is stronger on tube-related tasks. Continual pretraining MI2 with both reports and structured labels using UniCL improves performance. For some tasks, as few as 30k in-domain samples can surpass open-weights foundation models.
Conclusion: Center-specific continual pretraining enables medical institutions to achieve significant performance gains using in-domain data, with different encoder paradigms excelling at different medical imaging tasks and structured supervision proving valuable at scale.
Abstract: Foundation vision encoders such as CLIP and DINOv2, trained on web-scale data, exhibit strong transfer performance across tasks and datasets. However, medical imaging foundation models remain constrained by smaller datasets, limiting our understanding of how data scale and pretraining paradigms affect performance in this setting. In this work, we systematically study continual pretraining of two vision encoders, MedImageInsight (MI2) and RAD-DINO representing the two major encoder paradigms CLIP and DINOv2, on up to 3.5M chest x-rays from a single institution, holding compute and evaluation protocols constant. We evaluate on classification (radiology findings, lines and tubes), segmentation (lines and tubes), and radiology report generation. While prior work has primarily focused on tasks related to radiology findings, we include lines and tubes tasks to counterbalance this bias and evaluate a model’s ability to extract features that preserve continuity along elongated structures. Our experiments show that MI2 scales more effectively for finding-related tasks, while RAD-DINO is stronger on tube-related tasks. Surprisingly, continually pretraining MI2 with both reports and structured labels using UniCL improves performance, underscoring the value of structured supervision at scale. We further show that for some tasks, as few as 30k in-domain samples are sufficient to surpass open-weights foundation models. These results highlight the utility of center-specific continual pretraining, enabling medical institutions to derive significant performance gains by utilizing in-domain data.
[157] Exploring Metric Fusion for Evaluation of NeRFs
Shreyas Shivakumara, Gabriel Eilertsen, Karljohan Lundin Palmerius
Main category: cs.CV
TL;DR: Combining DISTS and VMAF metrics improves correlation with subjective quality scores for NeRF evaluation across different datasets and configurations.
Details
Motivation: NeRF-generated outputs have unique artifacts and no single metric performs well across all datasets, requiring a combined approach for better evaluation.
Method: Used two normalization strategies for DISTS and VMAF metrics, tested two fusion strategies, and evaluated across three configurations on Synthetic and Outdoor datasets.
Result: The fusion metrics showed improved correlation with subjective scores compared to individual metrics, demonstrating robustness and generalizability.
Conclusion: Combining perceptual metrics with different approaches (DISTS and VMAF) overcomes limitations of individual metrics and provides better evaluation of NeRF quality.
Abstract: Neural Radiance Fields (NeRFs) have demonstrated significant potential in synthesizing novel viewpoints. Evaluating the NeRF-generated outputs, however, remains a challenge due to the unique artifacts they exhibit, and no individual metric performs well across all datasets. We hypothesize that combining two successful metrics, Deep Image Structure and Texture Similarity (DISTS) and Video Multi-Method Assessment Fusion (VMAF), based on different perceptual methods, can overcome the limitations of individual metrics and achieve improved correlation with subjective quality scores. We experiment with two normalization strategies for the individual metrics and two fusion strategies to evaluate their impact on the resulting correlation with the subjective scores. The proposed pipeline is tested on two distinct datasets, Synthetic and Outdoor, and its performance is evaluated across three different configurations. We present a detailed analysis comparing the correlation coefficients of fusion methods and individual scores with subjective scores to demonstrate the robustness and generalizability of the fusion metrics.
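As a rough illustration of what "normalize, then fuse" can look like, here is a minimal Python sketch. The abstract does not name its two normalization or two fusion strategies, so min-max scaling, z-scoring, and a plain average are assumptions chosen for illustration, and every function name is hypothetical. The sketch relies only on the facts that DISTS is a distance (lower is better) while VMAF is a quality score (higher is better).

```python
from statistics import mean, pstdev


def min_max(scores):
    """Rescale scores to [0, 1] (assumed normalization strategy A)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        raise ValueError("min-max normalization needs at least two distinct scores")
    return [(s - lo) / (hi - lo) for s in scores]


def z_score(scores):
    """Standardize scores to zero mean, unit variance (assumed strategy B)."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        raise ValueError("z-score normalization needs non-constant scores")
    return [(s - mu) / sigma for s in scores]


def fuse_mean(dists, vmaf, normalize=min_max):
    """Average the two normalized metrics (assumed fusion strategy).

    DISTS is negated first so that higher means better for both inputs.
    """
    d = normalize([-s for s in dists])
    v = normalize(vmaf)
    return [(a + b) / 2 for a, b in zip(d, v)]


def pearson(x, y):
    """Pearson correlation, e.g. between fused scores and subjective scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


if __name__ == "__main__":
    dists = [0.1, 0.3, 0.5]        # made-up values, for illustration only
    vmaf = [90.0, 60.0, 30.0]      # made-up values, for illustration only
    subjective = [4.5, 3.0, 1.5]   # made-up values, for illustration only
    print(pearson(fuse_mean(dists, vmaf), subjective))
```

The numbers in the usage block are invented purely to exercise the functions and say nothing about the paper's results.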
[158] Leveraging Large Language Models to Effectively Generate Visual Data for Canine Musculoskeletal Diagnoses
Martin Thißen, Thi Ngoc Diep Tran, Barbara Esteve Ratsch, Ben Joel Schönbein, Ute Trapp, Beate Egner, Romana Piat, Elke Hergenröther
Main category: cs.CV
TL;DR: LLMs can generate synthetic visual training data for canine musculoskeletal diagnoses by mapping visual abnormalities to text, achieving 88% F1 score on real data with synthetic-only training.
Details
Motivation: Data collection is challenging for rare medical conditions and high-cost tasks. The paper addresses data scarcity in canine musculoskeletal diagnosis documentation by leveraging LLMs' text generation capabilities.
Method: Developed mapping of visual documentations to 200+ labeled regions, used guided decoding, chain-of-thought reasoning, and few-shot prompting to generate synthetic visual data for patellar luxation and other diagnoses.
Result: Generated 1,000 synthetic documentations sensitive to location/severity but independent of sex. Model trained on synthetic data alone achieved 88% F1 score on 70 real-world cases.
Conclusion: LLM-generated synthetic data shows strong potential for addressing data scarcity, particularly in rare medical conditions, with methodology adaptable to other domains.
Abstract: It is well-established that more data generally improves AI model performance. However, data collection can be challenging for certain tasks due to the rarity of occurrences or high costs. These challenges are evident in our use case, where we apply AI models to a novel approach for visually documenting the musculoskeletal condition of dogs. Here, abnormalities are marked as colored strokes on a body map of a dog. Since these strokes correspond to distinct muscles or joints, they can be mapped to the textual domain in which large language models (LLMs) operate. LLMs have demonstrated impressive capabilities across a wide range of tasks, including medical applications, offering promising potential for generating synthetic training data. In this work, we investigate whether LLMs can effectively generate synthetic visual training data for canine musculoskeletal diagnoses. For this, we developed a mapping that segments visual documentations into over 200 labeled regions representing muscles or joints. Using techniques like guided decoding, chain-of-thought reasoning, and few-shot prompting, we generated 1,000 synthetic visual documentations for patellar luxation (kneecap dislocation) diagnosis, the diagnosis for which we have the most real-world data. Our analysis shows that the generated documentations are sensitive to location and severity of the diagnosis while remaining independent of the dog’s sex. We further generated 1,000 visual documentations for various other diagnoses to create a binary classification dataset. A model trained solely on this synthetic data achieved an F1 score of 88% on 70 real-world documentations. These results demonstrate the potential of LLM-generated synthetic data, which is particularly valuable for addressing data scarcity in rare diseases. While our methodology is tailored to the medical domain, the insights and techniques can be adapted to other fields.
[159] Cumulative Consensus Score: Label-Free and Model-Agnostic Evaluation of Object Detectors in Deployment
Avinaash Manoharan, Xiangyu Yin, Domenik Helm, Chih-Hong Cheng
Main category: cs.CV
TL;DR: CCS is a label-free metric for evaluating object detectors using test-time data augmentation and spatial consistency to measure reliability without ground-truth annotations.
Details
Motivation: Ground-truth annotations are rarely available in real-world deployment, making it challenging to continuously monitor and compare object detection models.
Method: Apply test-time data augmentation to images, collect predicted bounding boxes across augmented views, compute overlaps using IoU, normalize maximum overlaps, and average across augmentation pairs to measure spatial consistency.
Result: CCS achieved over 90% congruence with F1-score, Probabilistic Detection Quality, and Optimal Correction Cost in controlled experiments on Open Images and KITTI datasets.
Conclusion: CCS provides a robust, model-agnostic foundation for DevOps-style monitoring of object detectors, working across different detector types and highlighting under-performing scenarios at the case level.
Abstract: Evaluating object detection models in deployment is challenging because ground-truth annotations are rarely available. We introduce the Cumulative Consensus Score (CCS), a label-free metric that enables continuous monitoring and comparison of detectors in real-world settings. CCS applies test-time data augmentation to each image, collects predicted bounding boxes across augmented views, and computes overlaps using Intersection over Union. Maximum overlaps are normalized and averaged across augmentation pairs, yielding a measure of spatial consistency that serves as a proxy for reliability without annotations. In controlled experiments on Open Images and KITTI, CCS achieved over 90% congruence with F1-score, Probabilistic Detection Quality, and Optimal Correction Cost. The method is model-agnostic, working across single-stage and two-stage detectors, and operates at the case level to highlight under-performing scenarios. Altogether, CCS provides a robust foundation for DevOps-style monitoring of object detectors.
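The steps listed in this abstract (boxes per augmented view, pairwise IoU, maximum overlaps, averaging over augmentation pairs) can be sketched in a few lines of Python. This is a simplified reading rather than the paper's exact definition: it assumes boxes have already been mapped back to the original image coordinates, and it assumes the "normalization" divides the summed maximum overlaps by the larger box count of the two views. All names are hypothetical.

```python
from itertools import combinations


def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pair_consistency(boxes_a, boxes_b):
    """Agreement between two views: best IoU match per box in view A,
    normalized by the larger box count (an assumed normalization)."""
    if not boxes_a and not boxes_b:
        return 1.0  # both views agree that nothing is present
    if not boxes_a or not boxes_b:
        return 0.0  # one view detects objects, the other none
    best = [max(iou(a, b) for b in boxes_b) for a in boxes_a]
    return sum(best) / max(len(boxes_a), len(boxes_b))


def consensus_score(views):
    """Average pairwise consistency over all pairs of augmented views."""
    pairs = list(combinations(views, 2))
    if not pairs:
        raise ValueError("need at least two augmented views")
    return sum(pair_consistency(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Three hypothetical augmented views of one image, boxes in original coordinates.
    views = [
        [(0, 0, 10, 10)],
        [(1, 1, 11, 11)],
        [(0, 0, 10, 10), (50, 50, 60, 60)],
    ]
    print(consensus_score(views))
```

A score near 1 means the detector returns nearly the same boxes under every augmentation; low scores flag images where its predictions are unstable.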
[160] Few to Big: Prototype Expansion Network via Diffusion Learner for Point Cloud Few-shot Semantic Segmentation
Qianguang Zhao, Dongli Wang, Yan Zhou, Jianxun Li, Richard Irampa
Main category: cs.CV
TL;DR: PENet introduces a dual-stream prototype expansion framework using diffusion model features to address intra-class diversity and inter-set inconsistency in few-shot 3D point cloud segmentation, achieving state-of-the-art performance.
Details
Motivation: Existing prototype-based methods for few-shot 3D point cloud segmentation suffer from limited representational capacity (intra-class diversity) and misalignment between support and query feature spaces (inter-set inconsistency).
Method: PENet uses a dual-stream architecture with an Intrinsic Learner for representative features and a Diffusion Learner using pre-trained diffusion model encoder for generalizable features. It includes Prototype Assimilation Module with push-pull cross-guidance attention and Prototype Calibration Mechanism.
Result: Extensive experiments on S3DIS and ScanNet datasets show PENet significantly outperforms state-of-the-art methods across various few-shot settings.
Conclusion: The framework successfully addresses prototype limitations by leveraging diffusion model features and cross-space alignment, demonstrating effective few-shot 3D segmentation.
Abstract: Few-shot 3D point cloud semantic segmentation aims to segment novel categories using a minimal number of annotated support samples. While existing prototype-based methods have shown promise, they are constrained by two critical challenges: (1) Intra-class Diversity, where a prototype’s limited representational capacity fails to cover a class’s full variations, and (2) Inter-set Inconsistency, where prototypes derived from the support set are misaligned with the query feature space. Motivated by the powerful generative capability of the diffusion model, we re-purpose its pre-trained conditional encoder to provide a novel source of generalizable features for expanding the prototype’s representational range. Under this setup, we introduce the Prototype Expansion Network (PENet), a framework that constructs big-capacity prototypes from two complementary feature sources. PENet employs a dual-stream learner architecture: it retains a conventional fully supervised Intrinsic Learner (IL) to distill representative features, while introducing a novel Diffusion Learner (DL) to provide rich generalizable features. The resulting dual prototypes are then processed by a Prototype Assimilation Module (PAM), which adopts a novel push-pull cross-guidance attention block to iteratively align the prototypes with the query space. Furthermore, a Prototype Calibration Mechanism (PCM) regularizes the final big-capacity prototype to prevent semantic drift. Extensive experiments on the S3DIS and ScanNet datasets demonstrate that PENet significantly outperforms state-of-the-art methods across various few-shot settings.
[161] Lego-Edit: A General Image Editing Framework with Model-Level Bricks and MLLM Builder
Qifei Jia, Yu Liu, Yajie Chai, Xintong Yao, Qiming Lu, Yasen Zhang, Runyu Shi, Ying Huang, Guoquan Zhang
Main category: cs.CV
TL;DR: Lego-Edit is a novel instruction-based image editing system that uses MLLM to organize model-level editing tools, achieving SOTA performance through a toolkit design and progressive reinforcement learning.
Details
Motivation: Existing instruction-based image editing methods fail to generalize effectively to diverse real-world user instructions outside their training domain, limiting practical applications.
Method: Uses Multi-modal Large Language Model (MLLM) with two key designs: (1) model-level toolkit with diverse models trained on limited data and image manipulation functions, (2) three-stage progressive reinforcement learning using feedback on unannotated open-domain instructions.
Result: Achieves state-of-the-art performance on GEdit-Bench and ImgBench, exhibits robust reasoning capabilities for open-domain instructions, and can utilize new editing tools without additional fine-tuning.
Conclusion: Lego-Edit effectively addresses generalization challenges in instruction-based image editing by leveraging MLLM’s capabilities and a well-designed toolkit approach, demonstrating strong performance across diverse real-world instructions.
Abstract: Instruction-based image editing has garnered significant attention due to its direct interaction with users. However, real-world user instructions are immensely diverse, and existing methods often fail to generalize effectively to instructions outside their training domain, limiting their practical application. To address this, we propose Lego-Edit, which leverages the generalization capability of Multi-modal Large Language Model (MLLM) to organize a suite of model-level editing tools to tackle this challenge. Lego-Edit incorporates two key designs: (1) a model-level toolkit comprising diverse models efficiently trained on limited data and several image manipulation functions, enabling fine-grained composition of editing actions by the MLLM; and (2) a three-stage progressive reinforcement learning approach that uses feedback on unannotated, open-domain instructions to train the MLLM, equipping it with generalized reasoning capabilities for handling real-world instructions. Experiments demonstrate that Lego-Edit achieves state-of-the-art performance on GEdit-Bench and ImgBench. It exhibits robust reasoning capabilities for open-domain instructions and can utilize newly introduced editing tools without additional fine-tuning. Code is available: https://github.com/xiaomi-research/lego-edit.
[162] Runge-Kutta Approximation and Decoupled Attention for Rectified Flow Inversion and Semantic Editing
Weiming Chen, Zhihan Zhu, Yijia Wang, Zhihai He
Main category: cs.CV
TL;DR: Proposes high-order inversion method using Runge-Kutta solver and Decoupled Diffusion Transformer Attention (DDTA) to improve rectified flow models’ inversion accuracy and attention control.
Details
Motivation: Rectified flow models suffer from low inversion accuracy (hindering source image consistency) and entangled multimodal attention in diffusion transformers (hindering precise attention control).
Method: 1) Efficient high-order inversion method based on Runge-Kutta solver for differential equations; 2) DDTA mechanism that disentangles text and image attention in multimodal diffusion transformers.
Result: Achieves state-of-the-art performance in image reconstruction and text-guided editing tasks, demonstrating improved fidelity and editability.
Conclusion: The proposed methods effectively address the major challenges of rectified flow models, enabling better inversion accuracy and more precise semantic control for enhanced generative performance.
Abstract: Rectified flow (RF) models have recently demonstrated superior generative performance compared to DDIM-based diffusion models. However, in real-world applications, they suffer from two major challenges: (1) low inversion accuracy that hinders the consistency with the source image, and (2) entangled multimodal attention in diffusion transformers, which hinders precise attention control. To address the first challenge, we propose an efficient high-order inversion method for rectified flow models based on the Runge-Kutta solver of differential equations. To tackle the second challenge, we introduce Decoupled Diffusion Transformer Attention (DDTA), a novel mechanism that disentangles text and image attention inside the multimodal diffusion transformers, enabling more precise semantic control. Extensive experiments on image reconstruction and text-guided editing tasks demonstrate that our method achieves state-of-the-art performance in terms of fidelity and editability. Code is available at https://github.com/wmchen/RKSovler_DDTA.
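To make the inversion-accuracy point in the abstract above concrete, the following is a minimal sketch, not the authors' implementation. It compares a first-order Euler step with a classical fourth-order Runge-Kutta (RK4) step on a toy one-dimensional ODE. The `velocity` function is a hypothetical stand-in for a learned rectified-flow velocity field, and the time convention (image at t = 1, latent at t = 0) is an assumption made only for illustration.

```python
import math


def velocity(x, t):
    # Hypothetical toy velocity field; a real model would be a neural network.
    return -2.0 * x + math.sin(3.0 * t)


def euler_step(x, t, h):
    # First-order update: one velocity evaluation per step.
    return x + h * velocity(x, t)


def rk4_step(x, t, h):
    # Classical fourth-order Runge-Kutta update: four velocity evaluations.
    k1 = velocity(x, t)
    k2 = velocity(x + 0.5 * h * k1, t + 0.5 * h)
    k3 = velocity(x + 0.5 * h * k2, t + 0.5 * h)
    k4 = velocity(x + h * k3, t + h)
    return x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


def integrate(x, t_start, t_end, steps, step_fn):
    # Integrate the ODE from t_start to t_end; a negative step runs it backwards.
    h = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        x = step_fn(x, t, h)
        t += h
    return x


def round_trip_error(x0, steps, step_fn):
    # Inversion (image -> latent) followed by regeneration (latent -> image).
    latent = integrate(x0, 1.0, 0.0, steps, step_fn)
    recon = integrate(latent, 0.0, 1.0, steps, step_fn)
    return abs(recon - x0)
```

Calling `round_trip_error(1.0, 10, euler_step)` and `round_trip_error(1.0, 10, rk4_step)` shows how much closer the higher-order solver comes to reconstructing the starting point with the same number of steps, at the cost of four velocity evaluations per step instead of one.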
[163] MEJO: MLLM-Engaged Surgical Triplet Recognition via Inter- and Intra-Task Joint Optimization
Yiyi Zhang, Yuchen Yuan, Ying Zheng, Jialun Pei, Jinpeng Li, Zheng Li, Pheng-Ann Heng
Main category: cs.CV
TL;DR: MEJO framework addresses surgical triplet recognition challenges by disentangling task representations and rebalancing class gradients, achieving superior performance on benchmark datasets.
Details
Motivation: Surgical triplet recognition suffers from long-tailed data distribution and optimization conflicts between tasks (inter-task) and within classes (intra-task), limiting model performance.
Method: Proposes MLLM-Engaged Joint Optimization with Shared-Specific-Disentangled learning for inter-task optimization and Coordinated Gradient Learning for intra-task optimization, using MLLM-powered probabilistic prompts.
Result: Extensive experiments on CholecT45 and CholecT50 datasets demonstrate superior performance, validating effectiveness in handling optimization conflicts.
Conclusion: The MEJO framework successfully addresses both inter-task and intra-task optimization conflicts in surgical triplet recognition through representation disentanglement and gradient coordination.
Abstract: Surgical triplet recognition, which involves identifying instrument, verb, target, and their combinations, is a complex surgical scene understanding challenge plagued by long-tailed data distribution. The mainstream multi-task learning paradigm benefiting from cross-task collaborative promotion has shown promising performance in identifying triplets, but two key challenges remain: 1) inter-task optimization conflicts caused by entangling task-generic and task-specific representations; 2) intra-task optimization conflicts due to class-imbalanced training data. To overcome these difficulties, we propose the MLLM-Engaged Joint Optimization (MEJO) framework that empowers both inter- and intra-task optimization for surgical triplet recognition. For inter-task optimization, we introduce the Shared-Specific-Disentangled (S$^2$D) learning scheme that decomposes representations into task-shared and task-specific components. To enhance task-shared representations, we construct a Multimodal Large Language Model (MLLM) powered probabilistic prompt pool to dynamically augment visual features with expert-level semantic cues. Additionally, comprehensive task-specific cues are modeled via distinct task prompts covering the temporal-spatial dimensions, effectively mitigating inter-task ambiguities. To tackle intra-task optimization conflicts, we develop a Coordinated Gradient Learning (CGL) strategy, which dissects and rebalances the positive-negative gradients originating from head and tail classes for more coordinated learning behaviors. Extensive experiments on the CholecT45 and CholecT50 datasets demonstrate the superiority of our proposed framework, validating its effectiveness in handling optimization conflicts.
[164] DialNav: Multi-turn Dialog Navigation with a Remote Guide
Leekyeung Han, Hyunji Min, Gyeom Hwangbo, Jonghyun Choi, Paul Hongsuck Seo
Main category: cs.CV
TL;DR: DialNav introduces a collaborative embodied dialog task where a Navigator and Guide communicate to reach goals, with the RAIN dataset and benchmark for evaluation.
Details
Motivation: Prior work lacks holistic evaluation where the Guide must infer the Navigator's location, making communication essential for task success in embodied navigation.
Method: Collected human-human dialog paired with navigation trajectories in photorealistic environments (RAIN dataset), designed comprehensive benchmark, and conducted experiments with different Navigator/Guide models.
Result: Created a novel task and dataset that highlights key challenges in collaborative embodied navigation through dialog.
Conclusion: DialNav and RAIN dataset provide foundation for future research in embodied dialog, with publicly released data, code, and evaluation framework.
Abstract: We introduce DialNav, a novel collaborative embodied dialog task, where a navigation agent (Navigator) and a remote guide (Guide) engage in multi-turn dialog to reach a goal location. Unlike prior work, DialNav aims for holistic evaluation and requires the Guide to infer the Navigator’s location, making communication essential for task success. To support this task, we collect and release the Remote Assistance in Navigation (RAIN) dataset, human-human dialog paired with navigation trajectories in photorealistic environments. We design a comprehensive benchmark to evaluate both navigation and dialog, and conduct extensive experiments analyzing the impact of different Navigator and Guide models. We highlight key challenges and publicly release the dataset, code, and evaluation framework to foster future research in embodied dialog.
[165] Cross-Layer Vision Smoothing: Enhancing Visual Understanding via Sustained Focus on Key Objects in Large Vision-Language Models
Jianfei Zhao, Feng Zhang, Xin Sun, Lingxing Kong, Zhixing Tan, Chong Feng
Main category: cs.CV
TL;DR: CLVS introduces vision memory to maintain sustained attention on key objects across model layers, improving LVLMs’ visual capabilities without additional parameters.
Details
Motivation: LVLMs have brief attention on key objects despite accurate localization. Sustained focus on important visual elements can enhance visual understanding capabilities.
Method: Cross-Layer Vision Smoothing (CLVS) with vision memory that initializes with position-unbiased attention in first layer, then iteratively updates across layers while using uncertainty to terminate smoothing when visual understanding is complete.
Result: State-of-the-art performance on four benchmarks across three LVLMs, with significant improvements in relation and attribute understanding tasks.
Conclusion: Maintaining smooth, sustained attention on key objects through cross-layer vision smoothing effectively enhances LVLMs’ visual capabilities, particularly for complex visual reasoning tasks.
Abstract: Large Vision-Language Models (LVLMs) can accurately locate key objects in images, yet their attention to these objects tends to be very brief. Motivated by the hypothesis that sustained focus on key objects can improve LVLMs’ visual capabilities, we propose Cross-Layer Vision Smoothing (CLVS). The core idea of CLVS is to incorporate a vision memory that smooths the attention distribution across layers. Specifically, we initialize this vision memory with position-unbiased visual attention in the first layer. In subsequent layers, the model’s visual attention jointly considers the vision memory from previous layers, while the memory is updated iteratively, thereby maintaining smooth attention on key objects. Given that visual understanding primarily occurs in the early and middle layers of the model, we use uncertainty as an indicator of completed visual understanding and terminate the smoothing process accordingly. Experiments on four benchmarks across three LVLMs confirm the effectiveness and generalizability of our method. CLVS achieves state-of-the-art performance on a variety of visual understanding tasks, with particularly significant improvements in relation and attribute understanding.
[166] MSGFusion: Multimodal Scene Graph-Guided Infrared and Visible Image Fusion
Guihui Li, Bowei Dong, Kaizhi Dong, Jiayi Li, Haiyong Zheng
Main category: cs.CV
TL;DR: MSGFusion is a novel infrared and visible image fusion framework that uses structured scene graphs to guide fusion, explicitly modeling entities, attributes, and spatial relationships for superior semantic-aware fusion performance.
Details
Motivation: Current deep learning fusion methods rely too much on low-level visual cues and struggle with high-level semantic information. Existing text-guided approaches use unstructured descriptions without explicit modeling of entities, attributes, and spatial relationships, limiting fine-grained fusion quality.
Method: MSGFusion uses multimodal scene graphs derived from text and vision to explicitly represent entities, attributes, and spatial relations. It employs successive modules for scene graph representation, hierarchical aggregation, and graph-driven fusion to synchronously refine high-level semantics and low-level details.
Result: Extensive experiments show MSGFusion significantly outperforms state-of-the-art methods, particularly in detail preservation and structural clarity. It delivers superior semantic consistency and generalizability in downstream tasks including low-light object detection, semantic segmentation, and medical image fusion.
Conclusion: The structured scene graph approach effectively bridges the gap between low-level visual features and high-level semantic understanding, providing a robust framework for infrared and visible image fusion with strong performance across multiple applications.
Abstract: Infrared and visible image fusion has garnered considerable attention owing to the strong complementarity of these two modalities in complex, harsh environments. While deep learning-based fusion methods have made remarkable advances in feature extraction, alignment, fusion, and reconstruction, they still depend largely on low-level visual cues, such as texture and contrast, and struggle to capture the high-level semantic information embedded in images. Recent attempts to incorporate text as a source of semantic guidance have relied on unstructured descriptions that neither explicitly model entities, attributes, and relationships nor provide spatial localization, thereby limiting fine-grained fusion performance. To overcome these challenges, we introduce MSGFusion, a multimodal scene graph-guided fusion framework for infrared and visible imagery. By deeply coupling structured scene graphs derived from text and vision, MSGFusion explicitly represents entities, attributes, and spatial relations, and then synchronously refines high-level semantics and low-level details through successive modules for scene graph representation, hierarchical aggregation, and graph-driven fusion. Extensive experiments on multiple public benchmarks show that MSGFusion significantly outperforms state-of-the-art approaches, particularly in detail preservation and structural clarity, and delivers superior semantic consistency and generalizability in downstream tasks such as low-light object detection, semantic segmentation, and medical image fusion.
[167] AREPAS: Anomaly Detection in Fine-Grained Anatomy with Reconstruction-Based Semantic Patch-Scoring
Branko Mitic, Philipp Seeböck, Helmut Prosch, Georg Langs
Main category: cs.CV
TL;DR: A novel generative anomaly detection method using image-to-image translation and patch similarity scoring for precise anomaly localization in medical images, showing improved performance in chest CT and brain MRI.
Details
Motivation: Early disease detection and medical anomaly segmentation are crucial but challenging due to normal tissue variability. Existing generative methods struggle with fine-grained anatomical variations like those in pulmonary anatomy.
Method: Proposes a two-stage approach: 1) image-to-image translation for anomaly-free reconstruction, and 2) patch similarity scoring between observed and generated image pairs for precise anomaly localization.
Result: Achieved improved pixel-level anomaly segmentation with relative DICE score improvements of +1.9% in chest CTs and +4.4% in brain MRIs compared to state-of-the-art reconstruction-based methods.
Conclusion: The proposed method effectively addresses the challenge of normal tissue variability in medical anomaly detection and shows strong generalizability across different medical imaging modalities and conditions.
Abstract: Early detection of newly emerging diseases, lesion severity assessment, differentiation of medical conditions and automated screening are examples of the wide applicability and importance of anomaly detection (AD) and unsupervised segmentation in medicine. Normal fine-grained tissue variability such as that present in pulmonary anatomy is a major challenge for existing generative AD methods. Here, we propose a novel generative AD approach addressing this issue. It consists of an image-to-image translation for anomaly-free reconstruction and a subsequent patch similarity scoring between observed and generated image pairs for precise anomaly localization. We validate the new method on chest computed tomography (CT) scans for the detection and segmentation of infectious disease lesions. To assess generalizability, we evaluate the method on an ischemic stroke lesion segmentation task in T1-weighted brain MRI. Results show improved pixel-level anomaly segmentation in both chest CTs and brain MRIs, with relative DICE score improvements of +1.9% and +4.4%, respectively, compared to other state-of-the-art reconstruction-based methods.
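As a reading aid for the reported numbers, here is a small sketch of how a Dice score on binary masks and a relative improvement between two scores are conventionally computed. The function names and the flat-list mask representation are assumptions for illustration; the paper's own evaluation code is not shown in the abstract.

```python
def dice(pred, target):
    # Dice = 2 * |A ∩ B| / (|A| + |B|) for binary masks given as flat 0/1 lists.
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def relative_improvement(baseline, new):
    # Relative (not absolute) change of a score with respect to a baseline.
    return (new - baseline) / baseline
```

With a hypothetical baseline Dice of 0.500, a relative improvement of +4.4% corresponds to a new score of 0.522, which is an absolute gain of 2.2 points rather than 4.4.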
[168] T-SiamTPN: Temporal Siamese Transformer Pyramid Networks for Robust and Efficient UAV Tracking
Hojat Ardi, Amir Jahanshahi, Ali Diba
Main category: cs.CV
TL;DR: T-SiamTPN enhances Siamese tracking with temporal modeling, achieving 13.7% higher success rate and 14.7% better precision while maintaining real-time performance on embedded devices.
Details
Motivation: Address limitations of existing trackers that overlook temporal dependencies and struggle with complex appearance changes, scale variations, and occlusions in aerial object tracking.
Method: Extends SiamTPN architecture with temporal feature fusion and attention-based interactions for explicit temporal modeling, strengthening temporal consistency and feature representations.
Result: Significant improvements over baseline: 13.7% higher success rate and 14.7% better precision. Runs at 7.1 FPS on Jetson Nano with real-time performance and no notable runtime overhead.
Conclusion: Temporal modeling is crucial for Siamese tracking frameworks. T-SiamTPN provides an efficient, real-time solution for aerial object tracking suitable for embedded applications.
Abstract: Aerial object tracking remains a challenging task due to scale variations, dynamic backgrounds, clutter, and frequent occlusions. While most existing trackers emphasize spatial cues, they often overlook temporal dependencies, resulting in limited robustness in long-term tracking and under occlusion. Furthermore, correlation-based Siamese trackers are inherently constrained by the linear nature of correlation operations, making them ineffective against complex, non-linear appearance changes. To address these limitations, we introduce T-SiamTPN, a temporal-aware Siamese tracking framework that extends the SiamTPN architecture with explicit temporal modeling. Our approach incorporates temporal feature fusion and attention-based interactions, strengthening temporal consistency and enabling richer feature representations. These enhancements yield significant improvements over the baseline and achieve performance competitive with state-of-the-art trackers. Crucially, despite the added temporal modules, T-SiamTPN preserves computational efficiency. Deployed on the resource-constrained Jetson Nano, the tracker runs in real time at 7.1 FPS, demonstrating its suitability for real-world embedded applications without notable runtime overhead. Experimental results highlight substantial gains: compared to the baseline, T-SiamTPN improves success rate by 13.7% and precision by 14.7%. These findings underscore the importance of temporal modeling in Siamese tracking frameworks and establish T-SiamTPN as a strong and efficient solution for aerial object tracking. Code is available at: https://github.com/to/be/released
[169] A Novel Compression Framework for YOLOv8: Achieving Real-Time Aerial Object Detection on Edge Devices via Structured Pruning and Channel-Wise Distillation
Melika Sabaghian, Mohammad Ali Keyvanrad, Seyyedeh Mahila Moghadami
Main category: cs.CV
TL;DR: A three-stage compression pipeline for YOLOv8 that combines sparsity-aware training, structured channel pruning, and knowledge distillation, achieving 73.51% parameter reduction with minimal accuracy loss and 68 FPS inference speed.
Details
Motivation: Efficient deployment of deep learning models for aerial object detection on resource-constrained devices requires significant compression without compromising performance.
Method: Three-stage compression pipeline: 1) Sparsity-aware training with dynamic sparsity, 2) Structured channel pruning using batch normalization scaling factors, 3) Channel-Wise Knowledge Distillation with adjustable temperature and loss weighting for small/medium object detection.
Result: For YOLOv8m: parameters reduced from 25.85M to 6.85M (73.51% reduction), FLOPs from 49.6G to 13.3G, MACs from 101G to 34.5G, AP50 reduced by only 2.7% to 47.9, inference speed increased from 26 FPS to 45 FPS. With TensorRT: 68 FPS with AP50 47.6.
Conclusion: The approach enables real-time deployment on edge devices with high compression rates and minimal performance degradation, demonstrating practicality for resource-constrained aerial object detection scenarios.
Abstract: Efficient deployment of deep learning models for aerial object detection on resource-constrained devices requires significant compression without compromising performance. In this study, we propose a novel three-stage compression pipeline for the YOLOv8 object detection model, integrating sparsity-aware training, structured channel pruning, and Channel-Wise Knowledge Distillation (CWD). First, sparsity-aware training introduces dynamic sparsity during model optimization, effectively balancing parameter reduction and detection accuracy. Second, we apply structured channel pruning by leveraging batch normalization scaling factors to eliminate redundant channels, significantly reducing model size and computational complexity. Finally, to mitigate the accuracy drop caused by pruning, we employ CWD to transfer knowledge from the original model, using an adjustable temperature and loss weighting scheme tailored for small and medium object detection. Extensive experiments on the VisDrone dataset demonstrate the effectiveness of our approach across multiple YOLOv8 variants. For YOLOv8m, our method reduces model parameters from 25.85M to 6.85M (a 73.51% reduction), FLOPs from 49.6G to 13.3G, and MACs from 101G to 34.5G, while reducing AP50 by only 2.7%. The resulting compressed model achieves 47.9 AP50 and boosts inference speed from 26 FPS (YOLOv8m baseline) to 45 FPS, enabling real-time deployment on edge devices. We further apply TensorRT as a lightweight optimization step. While this introduces a minor drop in AP50 (from 47.9 to 47.6), it significantly improves inference speed from 45 to 68 FPS, demonstrating the practicality of our approach for high-throughput, resource-constrained scenarios.
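The second stage described above, pruning channels by their batch normalization scaling factors, can be sketched as follows. This is an illustrative sketch only: it assumes a single global magnitude threshold across all layers (as in the well-known network-slimming recipe), and the function name, the dictionary layout, and the rule of always keeping at least one channel per layer are assumptions rather than details taken from the paper.

```python
def select_channels(bn_gammas, prune_ratio):
    """Pick the channels to keep in each layer.

    bn_gammas: dict mapping a layer name to a non-empty list of BN scale factors.
    prune_ratio: fraction of all channels to remove, in [0, 1].
    Returns a dict mapping each layer name to the indices of kept channels.
    """
    # Rank every channel in the network by the magnitude of its BN scale.
    magnitudes = sorted(abs(g) for gammas in bn_gammas.values() for g in gammas)
    cut = int(len(magnitudes) * prune_ratio)
    threshold = magnitudes[cut] if cut < len(magnitudes) else float("inf")

    keep = {}
    for name, gammas in bn_gammas.items():
        kept = [i for i, g in enumerate(gammas) if abs(g) >= threshold]
        if not kept:
            # Assumed safeguard: never remove a whole layer, keep its strongest channel.
            kept = [max(range(len(gammas)), key=lambda i: abs(gammas[i]))]
        keep[name] = kept
    return keep
```

In a real pipeline the kept indices would then be used to slice the convolution weights of each layer and the input channels of the layer that follows it, after which the distillation stage recovers accuracy.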
[170] MATTER: Multiscale Attention for Registration Error Regression
Shipeng Liu, Ziliang Xiong, Khac-Hoang Ngo, Per-Erik Forssén
Main category: cs.CV
TL;DR: This paper proposes a regression-based approach for point cloud registration quality validation, using multiscale feature extraction and attention-based aggregation to provide fine-grained quantification of registration errors.
Details
Motivation: Existing methods treat point cloud registration quality validation as a classification task with limited classes, which lacks fine-grained quantification. The authors aim to provide more precise error estimation, especially for point clouds with heterogeneous spatial densities.
Method: The authors use regression instead of classification for PCR validation, extend misalignment-related features through multiscale extraction, and employ attention-based aggregation for robust feature processing.
Result: The method achieves accurate and robust registration error estimation on diverse datasets, particularly for point clouds with heterogeneous spatial densities. When used to guide mapping tasks, it significantly improves mapping quality compared to state-of-the-art classification-based methods.
Conclusion: Regression-based PCR validation with multiscale feature extraction and attention-based aggregation provides superior fine-grained quality quantification and improves downstream task performance compared to classification approaches.
Abstract: Point cloud registration (PCR) is crucial for many downstream tasks, such as simultaneous localization and mapping (SLAM) and object tracking. This makes detecting and quantifying registration misalignment, i.e., PCR quality validation, an important task. All existing methods treat validation as a classification task, aiming to assign the PCR quality to a few classes. In this work, we instead use regression for PCR validation, allowing for a more fine-grained quantification of the registration quality. We also extend previously used misalignment-related features by using multiscale extraction and attention-based aggregation. This leads to accurate and robust registration error estimation on diverse datasets, especially for point clouds with heterogeneous spatial densities. Furthermore, when used to guide a mapping downstream task, our method significantly improves the mapping quality for a given amount of re-registered frames, compared to the state-of-the-art classification-based method.
[171] 4DRadar-GS: Self-Supervised Dynamic Driving Scene Reconstruction with 4D Radar
Xiao Tang, Guirong Zhuo, Cong Wang, Boyuan Zheng, Minqing Huang, Lianqing Zheng, Long Chen, Shouyi Lu
Main category: cs.CV
TL;DR: 4DRadar-GS is a novel self-supervised 3D reconstruction framework that uses 4D Radar data to improve dynamic object reconstruction in autonomous driving scenes, addressing motion estimation and temporal consistency issues.
Details
Motivation: Existing self-supervised methods struggle with accurate dynamic object reconstruction due to imprecise motion estimation and weak temporal consistency, especially when annotated bounding boxes are unavailable.
Method: Proposes a 4D Radar-assisted Gaussian initialization scheme for dynamic object segmentation and depth scale recovery, plus a Velocity-guided PointTrack model for fine-grained dynamic trajectory tracking under scene flow supervision.
Result: Achieves state-of-the-art performance in dynamic driving scene 3D reconstruction on the OmniHD-Scenes dataset.
Conclusion: 4DRadar-GS effectively leverages 4D Radar data to overcome limitations of existing methods, providing more accurate and temporally consistent 3D reconstruction for dynamic driving scenes.
Abstract: 3D reconstruction and novel view synthesis are critical for validating autonomous driving systems and training advanced perception models. Recent self-supervised methods have gained significant attention due to their cost-effectiveness and enhanced generalization in scenarios where annotated bounding boxes are unavailable. However, existing approaches, which often rely on frequency-domain decoupling or optical flow, struggle to accurately reconstruct dynamic objects due to imprecise motion estimation and weak temporal consistency, resulting in incomplete or distorted representations of dynamic scene elements. To address these challenges, we propose 4DRadar-GS, a 4D Radar-augmented self-supervised 3D reconstruction framework tailored for dynamic driving scenes. Specifically, we first present a 4D Radar-assisted Gaussian initialization scheme that leverages 4D Radar’s velocity and spatial information to segment dynamic objects and recover monocular depth scale, generating accurate Gaussian point representations. In addition, we propose a Velocity-guided PointTrack (VGPT) model, which is jointly trained with the reconstruction pipeline under scene flow supervision, to track fine-grained dynamic trajectories and construct temporally consistent representations. Evaluated on the OmniHD-Scenes dataset, 4DRadar-GS achieves state-of-the-art performance in dynamic driving scene 3D reconstruction.
[172] Beyond Averages: Open-Vocabulary 3D Scene Understanding with Gaussian Splatting and Bag of Embeddings
Abdalla Arafa, Didier Stricker
Main category: cs.CV
TL;DR: A novel approach for 3D scene understanding using object-level Gaussian representations with CLIP feature aggregation, enabling accurate open-vocabulary object retrieval and seamless 2D/3D task adaptation without differentiable rendering for semantics.
Details
Motivation: 3D Gaussian Splatting enables real-time photorealistic rendering but lacks semantic understanding capabilities due to alpha blending limitations that average semantics across objects, restricting AR/VR and robotics applications.
Method: Leverages predecomposed object-level Gaussians and represents each object through multiview CLIP feature aggregation to create comprehensive “bags of embeddings” that holistically describe objects, bypassing differentiable rendering for semantics.
Result: Effectively overcomes challenges of 3D open-vocabulary object extraction while maintaining comparable performance to state-of-the-art in 2D open-vocabulary segmentation with minimal compromise.
Conclusion: The proposed paradigm shift enables accurate object-level semantic understanding in 3D Gaussian representations, opening new possibilities for AR/VR and robotics applications that require both photorealistic rendering and semantic scene comprehension.
Abstract: Novel view synthesis has seen significant advancements with 3D Gaussian Splatting (3DGS), enabling real-time photorealistic rendering. However, the inherent fuzziness of Gaussian Splatting presents challenges for 3D scene understanding, restricting its broader applications in AR/VR and robotics. While recent works attempt to learn semantics via 2D foundation model distillation, they inherit fundamental limitations: alpha blending averages semantics across objects, making 3D-level understanding impossible. We propose a paradigm-shifting alternative that bypasses differentiable rendering for semantics entirely. Our key insight is to leverage predecomposed object-level Gaussians and represent each object through multiview CLIP feature aggregation, creating comprehensive “bags of embeddings” that holistically describe objects. This allows: (1) accurate open-vocabulary object retrieval by comparing text queries to object-level (not Gaussian-level) embeddings, and (2) seamless task adaptation: propagating object IDs to pixels for 2D segmentation or to Gaussians for 3D extraction. Experiments demonstrate that our method effectively overcomes the challenges of 3D open-vocabulary object extraction while remaining comparable to state-of-the-art performance in 2D open-vocabulary segmentation, ensuring minimal compromise.
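The object-level retrieval step described above can be pictured with a short sketch. It is a guess at the general shape of the idea, not the authors' code: each object carries a bag of per-view embeddings, and a text query is scored against an object by the best cosine similarity within its bag. The max-over-bag scoring rule, the function names, and the tiny two-dimensional vectors are all assumptions; real CLIP embeddings have hundreds of dimensions.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length, non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(text_embedding, bags):
    # bags: dict mapping an object ID to a non-empty list of per-view embeddings.
    # Assumed scoring rule: an object's score is its best-matching view.
    scores = {
        obj: max(cosine(text_embedding, emb) for emb in embeddings)
        for obj, embeddings in bags.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

Because the comparison happens per object rather than per Gaussian, the winning object ID can then be propagated to pixels for 2D segmentation or to that object's Gaussians for 3D extraction, as the abstract describes.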
[173] Time-step Mixup for Efficient Spiking Knowledge Transfer from Appearance to Event Domain
Yuqi Xie, Shuhan Ye, Chong Wang, Jiazhen Xu, Le Shen, Yuanbin Qian, Jiangbo Qian
Main category: cs.CV
TL;DR: TMKT is a novel fine-grained mixing strategy that interpolates RGB and DVS inputs at various time-steps to enable effective knowledge transfer between modalities for spiking neural networks.
Details
Motivation: Event cameras and spiking neural networks offer energy-efficient visual processing, but limited event data availability and sparse DVS outputs create training challenges. Existing methods overlook the significant distribution gap between RGB and DVS modalities.
Method: Proposes Time-step Mixup knowledge transfer (TMKT) that exploits SNNs’ asynchronous nature by interpolating RGB and DVS inputs at different time-steps. Introduces modality-aware auxiliary learning objectives to enable label mixing in cross-modal scenarios.
Result: The approach enables smoother knowledge transfer, alleviates modality shift during training, and achieves superior performance in spiking image classification tasks across multiple datasets.
Conclusion: TMKT effectively bridges the modality gap between RGB and DVS inputs for spiking neural networks, demonstrating improved performance through time-step interpolation and auxiliary learning objectives.
Abstract: The integration of event cameras and spiking neural networks holds great promise for energy-efficient visual processing. However, the limited availability of event data and the sparse nature of DVS outputs pose challenges for effective training. Although some prior work has attempted to transfer semantic knowledge from RGB datasets to DVS, they often overlook the significant distribution gap between the two modalities. In this paper, we propose Time-step Mixup knowledge transfer (TMKT), a novel fine-grained mixing strategy that exploits the asynchronous nature of SNNs by interpolating RGB and DVS inputs at various time-steps. To enable label mixing in cross-modal scenarios, we further introduce modality-aware auxiliary learning objectives. These objectives support the time-step mixup process and enhance the model’s ability to discriminate effectively across different modalities. Our approach enables smoother knowledge transfer, alleviates modality shift during training, and achieves superior performance in spiking image classification tasks. Extensive experiments demonstrate the effectiveness of our method across multiple datasets. The code will be released after the double-blind review process.
[174] Dual-Stage Reweighted MoE for Long-Tailed Egocentric Mistake Detection
Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Sicong Li, Qingming Huang
Main category: cs.CV
TL;DR: A Dual-Stage Reweighted Mixture-of-Experts framework for detecting subtle and infrequent user action mistakes from egocentric video data, combining feature extraction from frozen and tuned ViViT models with multiple specialized classifiers.
Details
Motivation: To address the challenge of identifying subtle and infrequent mistakes in user actions from egocentric video data, where traditional methods struggle with class imbalance and ambiguous instances.
Method: Two-stage framework: 1) Feature extraction using frozen ViViT and LoRA-tuned ViViT models combined through feature-level expert module; 2) Three classifiers with different objectives (reweighted cross-entropy, AUC loss, label-aware loss with sharpness-aware minimization) fused via classification-level expert module.
Result: Achieves strong performance, particularly in identifying rare and ambiguous mistake instances from egocentric video data.
Conclusion: The proposed DR-MoE framework effectively handles class imbalance and subtle mistake detection in egocentric video analysis through its dual-stage expert fusion approach.
Abstract: In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To handle the challenges posed by subtle and infrequent mistakes, we propose a Dual-Stage Reweighted Mixture-of-Experts (DR-MoE) framework. In the first stage, features are extracted using a frozen ViViT model and a LoRA-tuned ViViT model, which are combined through a feature-level expert module. In the second stage, three classifiers are trained with different objectives: reweighted cross-entropy to mitigate class imbalance, AUC loss to improve ranking under skewed distributions, and label-aware loss with sharpness-aware minimization to enhance calibration and generalization. Their predictions are fused using a classification-level expert module. The proposed method achieves strong performance, particularly in identifying rare and ambiguous mistake instances. The code is available at https://github.com/boyuh/DR-MoE.
[175] MMMS: Multi-Modal Multi-Surface Interactive Segmentation
Robin Schön, Julian Lorenz, Katja Ludwig, Daniel Kienzle, Rainer Lienhart
Main category: cs.CV
TL;DR: A method for interactive multi-surface segmentation using user clicks and multi-modal inputs, with a network architecture that integrates RGB, non-RGB modalities, erroneous masks, and encoded clicks to predict improved segmentation masks.
Details
Motivation: To address the challenge of segmenting multiple entangled and adjacent surfaces simultaneously present in images, requiring interactive user guidance and multi-modal inputs for accurate segmentation.
Method: Network architecture that takes RGB image, non-RGB modalities, erroneous mask, and encoded clicks as input to predict improved segmentation masks, with RGB backbone as black-box and late integration of interaction-specific information to reduce response time.
Result: Reduces NoC@90 by up to 1.28 clicks per surface on DeLiVER and up to 1.19 on MFNet using additional modalities, with RGB-only baseline achieving competitive performance in single-mask scenarios.
Conclusion: The method effectively handles multi-modal multi-surface interactive segmentation, demonstrating improved efficiency through reduced user clicks and maintaining competitive performance in traditional single-mask settings.
Abstract: In this paper, we present a method to interactively create segmentation masks on the basis of user clicks. We pay particular attention to the segmentation of multiple surfaces that are simultaneously present in the same image. Since these surfaces may be heavily entangled and adjacent, we also present a novel extended evaluation metric that accounts for the challenges of this scenario. Additionally, the presented method is able to use multi-modal inputs to facilitate the segmentation task. At the center of this method is a network architecture which takes as input an RGB image, a number of non-RGB modalities, an erroneous mask, and encoded clicks. Based on this input, the network predicts an improved segmentation mask. We design our architecture such that it adheres to two conditions: (1) The RGB backbone is only available as a black-box. (2) To reduce the response time, we want our model to integrate the interaction-specific information after the image feature extraction and the multi-modal fusion. We refer to the overall task as Multi-Modal Multi-Surface interactive segmentation (MMMS). We are able to show the effectiveness of our multi-modal fusion strategy. Using additional modalities, our system reduces the NoC@90 by up to 1.28 clicks per surface on average on DeLiVER and up to 1.19 on MFNet. On top of this, we are able to show that our RGB-only baseline achieves competitive, and in some cases even superior performance when tested in a classical, single-mask interactive segmentation scenario.
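For readers unfamiliar with NoC@90, the sketch below shows the conventional single-mask form of the metric: the number of clicks needed before the IoU first reaches 0.90, capped at a click budget. The cap of 20 clicks and the function names are assumptions, and this is not the extended multi-surface metric the paper introduces.

```python
def noc_at_threshold(iou_per_click, threshold=0.90, max_clicks=20):
    """Return the 1-based index of the first click whose IoU meets the threshold.

    If the threshold is never reached within the budget, the budget itself is
    returned (the usual convention; the cap of 20 is an assumption here).
    """
    for k, iou in enumerate(iou_per_click[:max_clicks], start=1):
        if iou >= threshold:
            return k
    return max_clicks

def mean_noc(iou_sequences, threshold=0.90, max_clicks=20):
    """Average NoC over a set of click sequences (e.g. one per surface)."""
    scores = [noc_at_threshold(s, threshold, max_clicks) for s in iou_sequences]
    return sum(scores) / len(scores)

# Two surfaces: one solved by the first click, one by the second.
average_clicks = mean_noc([[0.95], [0.50, 0.91]])
```

Under this definition, a reported reduction of 1.28 in NoC@90 means that, on average, about 1.28 fewer clicks per surface are needed to reach 90% IoU.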
[176] ICDAR 2025 Competition on FEw-Shot Text line segmentation of ancient handwritten documents (FEST)
Silvia Zottin, Axel De Nardin, Giuseppe Branca, Claudio Piciarelli, Gian Luca Foresti
Main category: cs.CV
TL;DR: FEST Competition introduces a few-shot learning challenge for text line segmentation in ancient handwritten documents using only 3 annotated images per manuscript for training.
Details
Motivation: Historical handwritten documents present unique challenges like irregular handwriting, faded ink, complex layouts, and scarcity of annotated data, making supervised learning impractical.
Method: Participants develop few-shot learning systems using the U-DIADS-TL dataset with diverse ancient manuscripts featuring various layouts, degradation levels, and non-standard formatting.
Result: The competition provides a framework for evaluating robust text line segmentation methods that require minimal manual annotation.
Conclusion: FEST aims to promote adaptable automated document analysis tools that can be practically employed by humanities scholars for historical research with reduced annotation effort.
Abstract: Text line segmentation is a critical step in handwritten document image analysis. Segmenting text lines in historical handwritten documents, however, presents unique challenges due to irregular handwriting, faded ink, and complex layouts with overlapping lines and non-linear text flow. Furthermore, the scarcity of large annotated datasets renders fully supervised learning approaches impractical for such materials. To address these challenges, we introduce the Few-Shot Text Line Segmentation of Ancient Handwritten Documents (FEST) Competition. Participants are tasked with developing systems capable of segmenting text lines in the U-DIADS-TL dataset, using only three annotated images per manuscript for training. The competition dataset features a diverse collection of ancient manuscripts exhibiting a wide range of layouts, degradation levels, and non-standard formatting, closely reflecting real-world conditions. By emphasizing few-shot learning, the FEST competition aims to promote the development of robust and adaptable methods that can be employed by humanities scholars with minimal manual annotation effort, thus fostering broader adoption of automated document analysis tools in historical research.
[177] SHREC 2025: Protein surface shape retrieval including electrostatic potential
Taher Yacoub, Camille Depenveiller, Atsushi Tatsuma, Tin Barisin, Eugen Rusakov, Udo Gobel, Yuxu Peng, Shiqiang Deng, Yuki Kagaya, Joon Hong Park, Daisuke Kihara, Marco Guerra, Giorgio Palmieri, Andrea Ranieri, Ulderico Fugacci, Silvia Biasotti, Ruiwen He, Halim Benhabiles, Adnane Cabani, Karim Hammoudi, Haotian Li, Hao Huang, Chunyan Li, Alireza Tehrani, Fanwang Meng, Farnaz Heidar-Zadeh, Tuan-Anh Yang, Matthieu Montes
Main category: cs.CV
TL;DR: SHREC 2025 track evaluated 9 teams’ 15 methods for protein surface shape retrieval using electrostatic potential on 11,555 protein surfaces, with methods combining shape and electrostatic potential performing best.
Details
Motivation: To evaluate and compare different methods for protein surface shape retrieval, particularly focusing on the value of incorporating electrostatic potential as an additional molecular surface descriptor.
Method: Evaluation of 15 proposed methods from 9 teams on a dataset of 11,555 protein surfaces with calculated electrostatic potential, using multiple metrics including Accuracy, Balanced accuracy, F1 score, Precision and Recall.
Result: Methods that used electrostatic potential complementary to molecular surface shape achieved the best retrieval performance, including for classes with limited data.
Conclusion: Incorporating additional molecular surface descriptors like electrostatic potential alongside shape information significantly improves protein surface retrieval performance.
Abstract: This SHREC 2025 track dedicated to protein surface shape retrieval involved 9 participating teams. We evaluated the performance in retrieval of 15 proposed methods on a large dataset of 11,555 protein surfaces with calculated electrostatic potential (a key molecular surface descriptor). The performance in retrieval of the proposed methods was evaluated through different metrics (Accuracy, Balanced accuracy, F1 score, Precision and Recall). The best retrieval performance was achieved by the proposed methods that used the electrostatic potential complementary to molecular surface shape. This observation was also valid for classes with limited data which highlights the importance of taking into account additional molecular surface descriptors.
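Because the abstract stresses classes with limited data, it may help to see why balanced accuracy is reported alongside plain accuracy. The sketch below uses the standard definition (mean of per-class recall); the function names are illustrative, and the track's exact evaluation protocol is not described in this summary.

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally regardless of size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)

# A retrieval method that always answers with the majority class.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
plain = accuracy(y_true, y_pred)
balanced = balanced_accuracy(y_true, y_pred)
```

In this toy case plain accuracy is 0.8 while balanced accuracy is 0.5, which is why the balanced variant is the more informative number for small classes.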
[178] Improving Accuracy and Efficiency of Implicit Neural Representations: Making SIREN a WINNER
Hemanth Chandravamsi, Dhanush V. Shenoy, Steven H. Frankel
Main category: cs.CV
TL;DR: WINNER addresses SIREN’s spectral bottleneck by using Gaussian noise weight initialization scaled to target signal’s spectral centroid, achieving state-of-the-art performance without extra parameters.
Details
Motivation: SIRENs struggle with signals outside their frequency support due to poor initialization, leading to spectral bottleneck where networks fail to represent even frequencies within their capacity.
Method: WINNER perturbs uniformly initialized SIREN weights with Gaussian noise, where noise scales are adaptively determined by the spectral centroid of the target signal.
Result: Achieves state-of-the-art audio fitting and significant gains in image and 3D shape fitting tasks over base SIREN, without introducing additional trainable parameters.
Conclusion: WINNER provides an effective solution to SIREN’s spectral limitations and suggests new adaptive, target-aware initialization strategies for neural network optimization.
Abstract: We identify and address a fundamental limitation of sinusoidal representation networks (SIRENs), a class of implicit neural representations. SIRENs (Sitzmann et al., 2020), when not initialized appropriately, can struggle at fitting signals that fall outside their frequency support. In extreme cases, when the network’s frequency support misaligns with the target spectrum, a ‘spectral bottleneck’ phenomenon is observed, where the model yields a near-zero output and fails to recover even the frequency components that are within its representational capacity. To overcome this, we propose WINNER - Weight Initialization with Noise for Neural Representations. WINNER perturbs uniformly initialized weights of base SIREN with Gaussian noise, whose scales are adaptively determined by the spectral centroid of the target signal. Similar to random Fourier embeddings, this mitigates ‘spectral bias’ but without introducing additional trainable parameters. Our method achieves state-of-the-art audio fitting and significant gains in image and 3D shape fitting tasks over base SIREN. Beyond signal fitting, WINNER suggests new avenues in adaptive, target-aware initialization strategies for optimizing deep neural network training. For code and data visit cfdlabtechnion.github.io/siren_square/.
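The two ingredients named above, a spectral centroid and a Gaussian perturbation of existing weights, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the centroid uses the common magnitude-weighted mean frequency, and the rule that maps the centroid to a noise scale is not given in this summary, so the caller supplies `noise_std` directly (a hypothetical parameter).

```python
import cmath
import math
import random

def spectral_centroid(signal, sample_rate):
    """Magnitude-weighted mean frequency over the non-negative DFT bins."""
    n = len(signal)
    weighted = 0.0
    total = 0.0
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        magnitude = abs(coeff)
        weighted += (k * sample_rate / n) * magnitude
        total += magnitude
    return weighted / total if total > 0 else 0.0

def perturb_weights(weights, noise_std, rng):
    """Add zero-mean Gaussian noise to already-initialized weights.

    noise_std is chosen by the caller; how WINNER derives it from the
    centroid is not specified here.
    """
    return [w + rng.gauss(0.0, noise_std) for w in weights]

# A pure 4 Hz sine sampled at 32 Hz has its centroid at roughly 4 Hz.
tone = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
centroid = spectral_centroid(tone, sample_rate=32)
noisy = perturb_weights([0.1, -0.2, 0.05], noise_std=0.01, rng=random.Random(0))
```

The direct DFT here is quadratic in the signal length and is only meant for short illustrative signals; a real pipeline would use an FFT.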
[179] Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models
Yan Chen, Long Li, Teng Xi, Long Zeng, Jingdong Wang
Main category: cs.CV
TL;DR: Two-stage RL framework for vision-language models that separately enhances visual perception and reasoning capabilities, addressing the complexity of VLM tasks compared to LLMs.
Details
Motivation: Directly applying RL methods from language models to vision-language models is suboptimal because VLMs require accurate visual perception before reasoning can occur, making their tasks inherently more complex.
Method: A two-stage reinforcement learning framework with dataset-level sampling to mitigate vanishing advantage. Stage 1 focuses on improving visual perception through coarse- and fine-grained visual understanding. Stage 2 targets reasoning ability enhancement.
Result: Developed PeBR-R1 model that shows significantly enhanced perceptual and reasoning capabilities, with superior performance demonstrated across seven benchmark datasets on diverse visual reasoning tasks.
Conclusion: The proposed two-stage RL approach effectively addresses the unique challenges of vision-language models by separately optimizing perception and reasoning, resulting in a model with strong performance across multiple visual reasoning benchmarks.
Abstract: Reinforcement learning (RL) has proven highly effective in eliciting the reasoning capabilities of large language models (LLMs). Inspired by this success, recent studies have explored applying similar techniques to vision-language models (VLMs), aiming to enhance their reasoning performance. However, directly transplanting RL methods from LLMs to VLMs is suboptimal, as the tasks faced by VLMs are inherently more complex. Specifically, VLMs must first accurately perceive and understand visual inputs before reasoning can be effectively performed. To address this challenge, we propose a two-stage reinforcement learning framework designed to jointly enhance both the perceptual and reasoning capabilities of VLMs. To mitigate the vanishing advantage issue commonly observed in RL training, we first perform dataset-level sampling to selectively strengthen specific capabilities using distinct data sources. During training, the first stage focuses on improving the model’s visual perception through coarse- and fine-grained visual understanding, while the second stage targets the enhancement of reasoning abilities. After the proposed two-stage reinforcement learning process, we obtain PeBR-R1, a vision-language model with significantly enhanced perceptual and reasoning capabilities. Experimental results on seven benchmark datasets demonstrate the effectiveness of our approach and validate the superior performance of PeBR-R1 across diverse visual reasoning tasks.
[180] PANORAMA: The Rise of Omnidirectional Vision in the Embodied AI Era
Xu Zheng, Chenfei Liao, Ziqiao Weng, Kaiyu Lei, Zihao Dongfang, Haocong He, Yuanhuiyi Lyu, Lutao Jiang, Lu Qi, Li Chen, Danda Pani Paudel, Kailun Yang, Linfeng Zhang, Luc Van Gool, Xuming Hu
Main category: cs.CV
TL;DR: Omnidirectional vision is gaining importance in embodied AI era, with recent breakthroughs in generation, perception, understanding, and datasets. The paper proposes PANORAMA architecture and discusses future challenges.
Details
Motivation: Omnidirectional vision provides holistic environmental awareness compared to traditional pinhole vision, enhancing scene perception completeness and decision-making reliability, but foundational research has lagged behind.
Method: The paper presents an overview of emerging trends, highlights recent breakthroughs, proposes PANORAMA architecture with four key subsystems, and provides insights from academia and industry.
Result: Synthesis of state-of-the-art advancements in omnidirectional vision, proposing an ideal panoramic system architecture for embodied AI era, and identifying emerging trends and cross-community impacts.
Conclusion: The paper outlines challenges and opportunities for future research in building robust, general-purpose omnidirectional AI systems, providing a roadmap for the intersection of panoramic vision and embodied AI.
Abstract: Omnidirectional vision, using 360-degree vision to understand the environment, has become increasingly critical across domains like robotics, industrial inspection, and environmental monitoring. Compared to traditional pinhole vision, omnidirectional vision provides holistic environmental awareness, significantly enhancing the completeness of scene perception and the reliability of decision-making. However, foundational research in this area has historically lagged behind traditional pinhole vision. This talk presents an emerging trend in the embodied AI era: the rapid development of omnidirectional vision, driven by growing industrial demand and academic interest. We highlight recent breakthroughs in omnidirectional generation, omnidirectional perception, omnidirectional understanding, and related datasets. Drawing on insights from both academia and industry, we propose an ideal panoramic system architecture in the embodied AI era, PANORAMA, which consists of four key subsystems. Moreover, we offer in-depth opinions related to emerging trends and cross-community impacts at the intersection of panoramic vision and embodied AI, along with the future roadmap and open challenges. This overview synthesizes state-of-the-art advancements and outlines challenges and opportunities for future research in building robust, general-purpose omnidirectional AI systems in the embodied AI era.
[181] Brought a Gun to a Knife Fight: Modern VFM Baselines Outgun Specialized Detectors on In-the-Wild AI Image Detection
Yue Zhou, Xinan He, Kaiqing Lin, Bing Fan, Feng Ding, Jinhua Zeng, Bin Li
Main category: cs.CV
TL;DR: A simple linear classifier on modern Vision Foundation Models (VFMs) outperforms specialized AI-generated image detectors by over 20% in real-world scenarios, due to VFMs’ learned alignment with forgery-related concepts from pre-training data exposure.
Details
Motivation: Specialized AI-generated image detectors perform well on curated benchmarks but fail catastrophically in real-world scenarios with high false-negative rates. The authors aim to find a more robust solution using modern vision foundation models.
Method: The authors use a simple linear classifier on top of modern Vision Foundation Models (VFMs) like Perception Encoder and Meta CLIP2. They analyze text-image similarities to understand how these models align synthetic images with forgery-related concepts, and test on a novel dataset scraped after the VFM’s pre-training cut-off date to ensure unseen data.
Result: The VFM-based linear classifier decisively outperforms specialized detectors, boosting in-the-wild accuracy by over 20%. The analysis reveals that recent VLMs have learned to align synthetic images with forgery concepts, but this capability plummets on data unseen during pre-training.
Conclusion: Modern VFMs provide superior ‘firepower’ for real-world AI-generated image detection compared to static specialized detectors. True generalization evaluation requires test data independent of the model’s entire training history, including pre-training data exposure.
Abstract: While specialized detectors for AI-generated images excel on curated benchmarks, they fail catastrophically in real-world scenarios, as evidenced by their critically high false-negative rates on ‘in-the-wild’ benchmarks. Instead of crafting another specialized ‘knife’ for this problem, we bring a ‘gun’ to the fight: a simple linear classifier on a modern Vision Foundation Model (VFM). Trained on identical data, this baseline decisively ‘outguns’ bespoke detectors, boosting in-the-wild accuracy by a striking margin of over 20%. Our analysis pinpoints the source of the VFM’s ‘firepower’: First, by probing text-image similarities, we find that recent VLMs (e.g., Perception Encoder, Meta CLIP2) have learned to align synthetic images with forgery-related concepts (e.g., ‘AI-generated’), unlike previous versions. Second, we speculate that this is due to data exposure, as both this alignment and overall accuracy plummet on a novel dataset scraped after the VFM’s pre-training cut-off date, ensuring it was unseen during pre-training. Our findings yield two critical conclusions: 1) For the real-world ‘gunfight’ of AI-generated image detection, the raw ‘firepower’ of an updated VFM is far more effective than the ‘craftsmanship’ of a static detector. 2) True generalization evaluation requires test data to be independent of the model’s entire training history, including pre-training.
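The baseline described above is a linear probe: the foundation model stays frozen and only a linear classifier is trained on its embeddings. The sketch below illustrates that idea with plain-Python logistic regression on toy 2-D "embeddings"; the feature values, hyperparameters, and function names are all hypothetical, and no actual VFM is involved.

```python
import math

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic regression by full-batch gradient descent on fixed features."""
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log-loss with respect to z
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return 1 (e.g. AI-generated) if the linear score is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Toy stand-ins for frozen embeddings: label 0 = real, label 1 = AI-generated.
embeddings = [[-1.0, -0.5], [-0.8, -1.0], [1.0, 0.7], [0.9, 1.2]]
targets = [0, 0, 1, 1]
weights, bias = train_linear_probe(embeddings, targets)
predictions = [predict(weights, bias, x) for x in embeddings]
```

Since only the weight vector and bias are learned, any detection ability beyond what this classifier can express has to come from the frozen encoder's representation, which is the point the abstract makes about pre-training exposure.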
[182] Drone Detection Using a Low-Power Neuromorphic Virtual Tripwire
Anton Eldeborg Lundin, Rasmus Winzell, Hanna Hamrell, David Gustafsson, Hannes Ovrén
Main category: cs.CV
TL;DR: A neuromorphic drone detection system using spiking neural networks and event cameras that is highly energy efficient and can run for over a year on battery power.
Details
Motivation: Small drones pose increasing threats to military and civilian infrastructure, requiring early automated detection systems.
Method: Uses spiking neural networks with neuromorphic cameras (event cameras) deployed on neuromorphic chips, creating a fully neuromorphic system that can form virtual tripwires. Training utilizes synthetically generated data.
Result: The system is several orders of magnitude more energy efficient than GPU-based solutions, enabling over a year of battery operation. The model primarily relies on drone shape rather than propeller temporal characteristics.
Conclusion: The small size and low power consumption make this neuromorphic system ideal for deployment in contested areas or locations lacking power infrastructure.
Abstract: Small drones are an increasing threat to both military personnel and civilian infrastructure, making early and automated detection crucial. In this work we develop a system that uses spiking neural networks and neuromorphic cameras (event cameras) to detect drones. The detection model is deployed on a neuromorphic chip, making this a fully neuromorphic system. Multiple detection units can be deployed to create a virtual tripwire which detects when and where drones enter a restricted zone. We show that our neuromorphic solution is several orders of magnitude more energy efficient than a reference solution deployed on an edge GPU, allowing the system to run for over a year on battery power. We investigate how synthetically generated data can be used for training, and show that our model most likely relies on the shape of the drone rather than the temporal characteristics of its propellers. The small size and low power consumption allow easy deployment in contested areas or locations that lack power infrastructure.
[183] TFANet: Three-Stage Image-Text Feature Alignment Network for Robust Referring Image Segmentation
Qianqi Lu, Yuxiang Xie, Jing Zhang, Shiwei Zou, Yan Chen, Xidao Luan
Main category: cs.CV
TL;DR: TFANet is a three-stage network for referring image segmentation that addresses multimodal misalignment through hierarchical feature alignment modules at different scales and granularities.
Details
Motivation: Existing methods struggle with multimodal misalignment and language semantic loss, especially in complex scenes with multiple visually similar objects, leading to mislocalization and incomplete segmentation.
Method: Three-stage framework: 1) Knowledge Plus Stage with Multiscale Linear Cross-Attention Module for bidirectional semantic exchange, 2) Knowledge Fusion Stage with Cross-modal Feature Scanning Module for long-range dependencies, 3) Knowledge Intensification Stage with Word-level Linguistic Feature-guided Semantic Deepening Module to compensate for semantic degradation.
Result: The proposed method systematically enhances multimodal alignment through hierarchical feature processing, enabling better handling of complex scenes with multiple similar objects.
Conclusion: TFANet provides a comprehensive solution for referring image segmentation by addressing multimodal misalignment through a structured three-stage approach that maintains language semantics while improving alignment accuracy.
Abstract: Referring Image Segmentation (RIS) is a task that segments image regions based on language expressions, requiring fine-grained alignment between two modalities. However, existing methods often struggle with multimodal misalignment and language semantic loss, especially in complex scenes containing multiple visually similar objects, where uniquely described targets are frequently mislocalized or incompletely segmented. To tackle these challenges, this paper proposes TFANet, a Three-stage Image-Text Feature Alignment Network that systematically enhances multimodal alignment through a hierarchical framework comprising three stages: Knowledge Plus Stage (KPS), Knowledge Fusion Stage (KFS), and Knowledge Intensification Stage (KIS). In the first stage, we design the Multiscale Linear Cross-Attention Module (MLAM), which facilitates bidirectional semantic exchange between visual features and textual representations across multiple scales. This establishes rich and efficient alignment between image regions and different granularities of linguistic descriptions. Subsequently, the KFS further strengthens feature alignment through the Cross-modal Feature Scanning Module (CFSM), which applies multimodal selective scanning to capture long-range dependencies and construct a unified multimodal representation. This is essential for modeling long-range cross-modal dependencies and enhancing alignment accuracy in complex scenes. Finally, in the KIS, we propose the Word-level Linguistic Feature-guided Semantic Deepening Module (WFDM) to compensate for semantic degradation introduced in earlier stages.
[184] Dream3DAvatar: Text-Controlled 3D Avatar Reconstruction from a Single Image
Gaofeng Liu, Hengsen Li, Ruoyu Gao, Xuetong Li, Zhiyuan Ma, Tao Fang
Main category: cs.CV
TL;DR: Dream3DAvatar is a two-stage framework that generates realistic, animation-ready 3D avatars from single images using multi-view generation with pose and identity adapters, followed by 3D Gaussian Splat reconstruction with feature fusion.
Details
Motivation: Single-image 3D avatar reconstruction remains challenging due to limited monocular input information and difficulty controlling occluded regions' geometry and texture during generation.
Method: Two-stage framework: 1) Lightweight adapter-enhanced multi-view generation with Pose-Adapter (SMPL-X/skeletal info injection) and ID-Adapter-G (facial identity preservation), plus BLIP2 for text descriptions; 2) Feedforward Transformer with multi-view feature fusion for 3DGS reconstruction and ID-Adapter-R for facial feature integration.
Result: Extensive experiments show the method generates realistic, animation-ready 3D avatars without post-processing and consistently outperforms existing baselines across multiple evaluation metrics.
Conclusion: Dream3DAvatar effectively addresses the ill-posed nature of single-image 3D avatar reconstruction through its innovative two-stage approach with specialized adapters, achieving superior performance in generating controllable, high-fidelity 3D avatars.
Abstract: With the rapid advancement of 3D representation techniques and generative models, substantial progress has been made in reconstructing full-body 3D avatars from a single image. However, this task remains fundamentally ill-posed due to the limited information available from monocular input, making it difficult to control the geometry and texture of occluded regions during generation. To address these challenges, we redesign the reconstruction pipeline and propose Dream3DAvatar, an efficient and text-controllable two-stage framework for 3D avatar generation. In the first stage, we develop a lightweight, adapter-enhanced multi-view generation model. Specifically, we introduce the Pose-Adapter to inject SMPL-X renderings and skeletal information into SDXL, enforcing geometric and pose consistency across views. To preserve facial identity, we incorporate ID-Adapter-G, which injects high-resolution facial features into the generation process. Additionally, we leverage BLIP2 to generate high-quality textual descriptions of the multi-view images, enhancing text-driven controllability in occluded regions. In the second stage, we design a feedforward Transformer model equipped with a multi-view feature fusion module to reconstruct high-fidelity 3D Gaussian Splat representations (3DGS) from the generated images. Furthermore, we introduce ID-Adapter-R, which utilizes a gating mechanism to effectively fuse facial features into the reconstruction process, improving high-frequency detail recovery. Extensive experiments demonstrate that our method can generate realistic, animation-ready 3D avatars without any post-processing and consistently outperforms existing baselines across multiple evaluation metrics.
[185] HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models
Xu Li, Yuxuan Liang, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue
Main category: cs.CV
TL;DR: HERO is a training-free framework that reduces computational overhead in high-resolution LVLMs by selectively dropping less important visual tokens based on content importance and complementary function analysis.
Details
Motivation: High-resolution LVLMs generate excessive visual tokens through tile cropping, causing significant computational and memory overhead that needs to be addressed for efficient inference.
Method: Empirical analysis of visual token utilization revealed three key patterns, leading to HERO framework with content-adaptive token budget allocation and function-aware token selection based on tile importance estimation.
Result: HERO achieves superior efficiency-accuracy trade-offs across diverse benchmarks and model scales without requiring additional training.
Conclusion: The study provides both empirical insights into visual token utilization and practical training-free solutions for efficient inference in high-resolution LVLMs.
Abstract: By cropping high-resolution images into local tiles and encoding them independently, High-Resolution Large Vision-Language Models (HR-LVLMs) have demonstrated remarkable fine-grained visual understanding capabilities. However, this divide-and-conquer paradigm significantly increases the number of visual tokens, resulting in substantial computational and memory overhead. To better understand and address this challenge, we empirically investigate visual token utilization in HR-LVLMs and uncover three key findings: (1) the local tiles have varying importance, jointly determined by visual saliency and task relevance; (2) the CLS token in CLIP-based vision encoders exhibits a two-stage attention pattern across layers, with each stage attending to different types of visual tokens; (3) the visual tokens emphasized at different stages encode information at varying levels of granularity, playing complementary roles within LVLMs. Building on these insights, we propose HERO, a High-resolution visual token early dropping framework that integrates content-adaptive token budget allocation with function-aware token selection. By accurately estimating tile-level importance and selectively retaining visual tokens with complementary roles, HERO achieves superior efficiency-accuracy trade-offs across diverse benchmarks and model scales, all in a training-free manner. This study provides both empirical insights and practical solutions toward efficient inference in HR-LVLMs.
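A simplified picture of content-adaptive budget allocation followed by per-tile token selection is sketched below. It assumes that tile importance scores and per-token scores are already available, splits the budget proportionally with largest-remainder rounding, and keeps the top-scoring tokens in each tile. These choices and all names are illustrative assumptions; HERO's actual importance estimation and function-aware selection are more involved than a single top-k.

```python
def allocate_budget(tile_importance, total_budget):
    """Split a token budget across tiles in proportion to their importance.

    Uses largest-remainder rounding so the per-tile budgets sum exactly to
    total_budget. Capping by the number of tokens in a tile is omitted.
    """
    total = sum(tile_importance)
    raw = [total_budget * s / total for s in tile_importance]
    budgets = [int(r) for r in raw]
    leftover = total_budget - sum(budgets)
    by_remainder = sorted(range(len(raw)),
                          key=lambda i: raw[i] - budgets[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        budgets[i] += 1
    return budgets

def keep_top_tokens(token_scores, budget):
    """Return the indices of the `budget` highest-scoring tokens, in original order."""
    ranked = sorted(range(len(token_scores)),
                    key=lambda i: token_scores[i],
                    reverse=True)
    return sorted(ranked[:budget])

# Two tiles, the first three times as important as the second, 8 tokens overall.
budgets = allocate_budget([3, 1], total_budget=8)
kept = keep_top_tokens([0.1, 0.9, 0.5, 0.7], budget=2)
```

Because both steps only rank and index existing tokens, a scheme of this kind needs no gradient updates, which is consistent with the training-free setting the abstract describes.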
[186] Hierarchical Deep Fusion Framework for Multi-dimensional Facial Forgery Detection – The 2024 Global Deepfake Image Detection Challenge
Kohou Wang, Huan Hu, Xiang Liu, Zezhou Chen, Ping Chen, Zhaoxiang Liu, Shiguo Lian
Main category: cs.CV
TL;DR: Hierarchical Deep Fusion Framework (HDFF) combines four pre-trained models for deepfake detection, achieving 0.96852 score and 20th place in competition.
Details
Motivation: Address challenges in detecting sophisticated deepfakes across various manipulation techniques with robust generalized models.
Method: Ensemble architecture integrating Swin-MLP, CoAtNet, EfficientNetV2, and DaViT models fine-tuned on MultiFFDI dataset, with feature concatenation and final classifier training.
Result: Achieved final score of 0.96852 on private leaderboard, ranking 20th out of 184 teams.
Conclusion: Hierarchical fusion approach demonstrates efficacy for complex image classification tasks like deepfake detection.
Abstract: The proliferation of sophisticated deepfake technology poses significant challenges to digital security and authenticity. Detecting these forgeries, especially across a wide spectrum of manipulation techniques, requires robust and generalized models. This paper introduces the Hierarchical Deep Fusion Framework (HDFF), an ensemble-based deep learning architecture designed for high-performance facial forgery detection. Our framework integrates four diverse pre-trained sub-models, Swin-MLP, CoAtNet, EfficientNetV2, and DaViT, which are meticulously fine-tuned through a multi-stage process on the MultiFFDI dataset. By concatenating the feature representations from these specialized models and training a final classifier layer, HDFF effectively leverages their collective strengths. This approach achieved a final score of 0.96852 on the competition’s private leaderboard, securing the 20th position out of 184 teams, demonstrating the efficacy of hierarchical fusion for complex image classification tasks.
[187] Using KL-Divergence to Focus Frequency Information in Low-Light Image Enhancement
Yan Xingyang, Huang Xiaohong, Zhang Zhao, You Tian, Xu Ziheng
Main category: cs.CV
TL;DR: LLFDisc is a U-shaped deep enhancement network that uses cross-attention and gating mechanisms for frequency-aware enhancement, with novel distribution-aware loss functions that outperform traditional pixel-wise losses.
Details
Motivation: Traditional Fourier frequency fitting uses pixel-wise loss functions that focus too much on local information and cause global information loss. There's a need for better Fourier-domain alignment and structural fidelity.
Method: Proposes LLFDisc network with cross-attention and gating mechanisms. Introduces distribution-aware loss using KL-Divergence for Fourier-domain fitting and enhances VGG perceptual loss with KL-Divergence on deep features.
Result: Extensive experiments show state-of-the-art performance in both qualitative and quantitative evaluations across multiple benchmarks.
Conclusion: LLFDisc effectively addresses Fourier-domain information alignment and structural fidelity issues, achieving superior performance through novel distribution-aware loss functions and frequency-aware architecture.
Abstract: In the Fourier domain, luminance information is primarily encoded in the amplitude spectrum, while spatial structures are captured in the phase components. Traditional Fourier frequency information fitting employs pixel-wise loss functions, which tend to focus excessively on local information and may lead to global information loss. In this paper, we present LLFDisc, a U-shaped deep enhancement network that integrates cross-attention and gating mechanisms tailored for frequency-aware enhancement. We propose a novel distribution-aware loss that directly fits the Fourier-domain information and minimizes their divergence using a closed-form KL-Divergence objective. This enables the model to align Fourier-domain information more robustly than with conventional MSE-based losses. Furthermore, we enhance the perceptual loss based on VGG by embedding KL-Divergence on extracted deep features, enabling better structural fidelity. Extensive experiments across multiple benchmarks demonstrate that LLFDisc achieves state-of-the-art performance in both qualitative and quantitative evaluations. Our code will be released at: https://github.com/YanXY000/LLFDisc
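To make the contrast between a pixel-wise loss and a distribution-aware loss concrete, here is a minimal sketch. It is not the LLFDisc loss: the abstract does not give the closed-form objective, so this sketch assumes the amplitude spectrum is simply normalized into a discrete distribution and compared with a discrete KL divergence. The function names (`amplitude_spectrum`, `fourier_kl_loss`) and the 1D signals are hypothetical and used only for illustration.

```python
import cmath
import math


def amplitude_spectrum(signal):
    """Return |DFT(signal)| using a naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]


def to_distribution(values, eps=1e-8):
    """Normalize non-negative values so they sum to 1 (eps keeps every bin positive)."""
    shifted = [v + eps for v in values]
    total = sum(shifted)
    return [v / total for v in shifted]


def kl_divergence(p, q):
    """Discrete KL(p || q) for two strictly positive distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def fourier_kl_loss(prediction, target):
    """Assumed loss: KL between the normalized amplitude spectra of target and prediction."""
    p = to_distribution(amplitude_spectrum(target))
    q = to_distribution(amplitude_spectrum(prediction))
    return kl_divergence(p, q)


if __name__ == "__main__":
    target = [1.0, 2.0, 3.0, 4.0]
    print(fourier_kl_loss(target, target))                # identical spectra
    print(fourier_kl_loss([1.0, 1.0, 1.0, 1.0], target))  # flat prediction
```

Because the loss compares whole spectra rather than individual pixels, a change in any frequency bin shifts the entire normalized distribution, which is the global sensitivity the abstract argues pixel-wise losses lack.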
[188] Enhancing Dual Network Based Semi-Supervised Medical Image Segmentation with Uncertainty-Guided Pseudo-Labeling
Yunyao Lu, Yihang Wu, Ahmad Chaddad, Tareef Daqqaq, Reem Kateb
Main category: cs.CV
TL;DR: A novel semi-supervised 3D medical image segmentation framework using dual-network architecture with cross consistency enhancement and self-supervised contrastive learning to address noisy pseudo-labels and insufficient supervision.
Details
Motivation: Supervised medical image segmentation requires large labeled datasets which are impractical in real-world scenarios. Existing semi-supervised methods suffer from noisy pseudo-labels and insufficient feature space supervision.
Method: Dual-network architecture with Cross Consistency Enhancement module using cross pseudo and entropy-filtered supervision, dynamic weighting strategy with uncertainty-aware mechanism (KL divergence), and self-supervised contrastive learning to align uncertain voxel features with reliable class prototypes.
Result: Achieved superior performance across three 3D segmentation datasets (Left Atrial, NIH Pancreas, BraTS-2019), with 89.95% Dice score on Left Atrial using only 10% labeled data, outperforming state-of-the-art methods.
Conclusion: The proposed framework effectively reduces noisy pseudo-labels and prediction uncertainty through innovative consistency enhancement and contrastive learning mechanisms, demonstrating strong performance in semi-supervised 3D medical image segmentation.
Abstract: Despite the remarkable performance of supervised medical image segmentation models, relying on a large amount of labeled data is impractical in real-world situations. Semi-supervised learning approaches aim to alleviate this challenge using unlabeled data through pseudo-label generation. Yet, existing semi-supervised segmentation methods still suffer from noisy pseudo-labels and insufficient supervision within the feature space. To solve these challenges, this paper proposes a novel semi-supervised 3D medical image segmentation framework based on a dual-network architecture. Specifically, we investigate a Cross Consistency Enhancement module using both cross pseudo and entropy-filtered supervision to reduce the noisy pseudo-labels, while we design a dynamic weighting strategy to adjust the contributions of pseudo-labels using an uncertainty-aware mechanism (i.e., Kullback-Leibler divergence). In addition, we use a self-supervised contrastive learning mechanism to align uncertain voxel features with reliable class prototypes by effectively differentiating between trustworthy and uncertain predictions, thus reducing prediction uncertainty. Extensive experiments are conducted on three 3D segmentation datasets, Left Atrial, NIH Pancreas and BraTS-2019. The proposed approach consistently exhibits superior performance across various settings (e.g., 89.95% Dice score on Left Atrial with 10% labeled data) compared to the state-of-the-art methods. Furthermore, ablation experiments validate the usefulness of the proposed modules.
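The KL-based dynamic weighting mentioned above can be illustrated with a small sketch. The abstract does not state the weighting rule, so the `exp(-KL)` form below is an assumption chosen for illustration, as are the function names and the toy two-class voxel probabilities. The idea shown is only that a voxel where the two networks disagree contributes less to the pseudo-label loss.

```python
import math


def kl_divergence(p, q, eps=1e-8):
    """Discrete KL(p || q); eps guards against log(0) and division by zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def pseudo_label_weight(probs_a, probs_b):
    """Hypothetical rule: weight = exp(-KL), so disagreement lowers the weight."""
    return math.exp(-kl_divergence(probs_a, probs_b))


def weighted_pseudo_label_loss(voxels_a, voxels_b, eps=1e-8):
    """Cross-entropy of network A against network B's hard pseudo-labels,
    with each voxel scaled by its uncertainty weight."""
    total = 0.0
    for probs_a, probs_b in zip(voxels_a, voxels_b):
        label = max(range(len(probs_b)), key=probs_b.__getitem__)  # argmax pseudo-label
        weight = pseudo_label_weight(probs_a, probs_b)
        total += weight * -math.log(probs_a[label] + eps)
    return total / len(voxels_a)


if __name__ == "__main__":
    network_a = [[0.9, 0.1], [0.9, 0.1]]
    network_b = [[0.8, 0.2], [0.1, 0.9]]  # agrees on voxel 0, disagrees on voxel 1
    for pa, pb in zip(network_a, network_b):
        print(round(pseudo_label_weight(pa, pb), 3))
    print(weighted_pseudo_label_loss(network_a, network_b))
```

In this toy case the second voxel, where the networks predict opposite classes, receives a much smaller weight than the first, so its likely noisy pseudo-label has less influence on training.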
[189] A Synthetic Data Pipeline for Supporting Manufacturing SMEs in Visual Assembly Control
Jonas Werheid, Shengjie He, Aymen Gannouni, Anas Abdelrazeq, Robert H. Schmitt
Main category: cs.CV
TL;DR: Synthetic data generation from CAD models enables efficient visual assembly control with high accuracy (up to 93% on real data), reducing manual annotation costs for SMEs.
Details
Motivation: Traditional computer vision methods for assembly quality control require expensive image acquisition and annotation, making them challenging for small- and medium-sized enterprises (SMEs) to implement due to resource constraints.
Method: Leveraging simulated scene generation based on CAD data and object detection algorithms to create synthetic training data, eliminating the need for manual image collection and annotation.
Result: Achieved mean Average Precision (mAP@0.5:0.95) up to 99.5% on synthetic data and 93% on real-world camera-captured testing data for identifying planetary gear system components.
Conclusion: Synthetic data generation provides an effective, resource-efficient solution for visual assembly control that can support SMEs in implementing automated quality control without extensive manual data collection.
Abstract: Quality control of assembly processes is essential in manufacturing to ensure not only the quality of individual components but also their proper integration into the final product. To assist in this matter, automated assembly control using computer vision methods has been widely implemented. However, the costs associated with image acquisition, annotation, and training of computer vision algorithms pose challenges for integration, especially for small- and medium-sized enterprises (SMEs), which often lack the resources for extensive training, data collection, and manual image annotation. Synthetic data offers the potential to reduce manual data collection and labeling. Nevertheless, its practical application in the context of assembly quality remains limited. In this work, we present a novel approach for easily integrable and data-efficient visual assembly control. Our approach leverages simulated scene generation based on computer-aided design (CAD) data and object detection algorithms. The results demonstrate a time-saving pipeline for generating image data in manufacturing environments, achieving a mean Average Precision (mAP@0.5:0.95) up to 99.5% for correctly identifying instances of synthetic planetary gear system components within our simulated training data, and up to 93% when transferred to real-world camera-captured testing data. This research highlights the effectiveness of synthetic data generation within an adaptable pipeline and underscores its potential to support SMEs in implementing resource-efficient visual assembly control solutions.
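For readers unfamiliar with the mAP@0.5:0.95 notation, the sketch below shows the two ingredients it refers to: the intersection-over-union (IoU) between a predicted and a ground-truth box, and the ten IoU thresholds from 0.50 to 0.95 in steps of 0.05 over which average precision is averaged. It does not compute full average precision and is not taken from the paper; the box format `(x_min, y_min, x_max, y_max)` and the function names are assumptions for illustration.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds 0.50, 0.55, ..., 0.95 implied by "@0.5:0.95".
IOU_THRESHOLDS = [(50 + 5 * i) / 100 for i in range(10)]


def matched_thresholds(predicted, ground_truth):
    """Thresholds at which this prediction would count as a true positive."""
    iou = box_iou(predicted, ground_truth)
    return [t for t in IOU_THRESHOLDS if iou >= t]


if __name__ == "__main__":
    ground_truth = (0.0, 0.0, 10.0, 10.0)
    predicted = (0.0, 0.0, 10.0, 7.2)
    print(box_iou(predicted, ground_truth))
    print(matched_thresholds(predicted, ground_truth))
```

A detection that only loosely overlaps its target counts at the lenient thresholds but not the strict ones, which is why a high mAP@0.5:0.95 indicates tight localization and not just correct classification.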
[190] Curriculum Multi-Task Self-Supervision Improves Lightweight Architectures for Onboard Satellite Hyperspectral Image Segmentation
Hugo Carlesso, Josiane Mothe, Radu Tudor Ionescu
Main category: cs.CV
TL;DR: A novel curriculum multi-task self-supervised learning framework (CMTSSL) for lightweight hyperspectral imaging analysis that combines masked image modeling with spatial/spectral jigsaw puzzles, achieving strong performance with models 16,000x lighter than SOTA.
Details
Motivation: Hyperspectral imaging generates high-dimensional data with slow satellite transmission rates, requiring compact models for onboard processing to minimize transmission of redundant data like cloud-covered areas.
Method: CMTSSL integrates masked image modeling with decoupled spatial and spectral jigsaw puzzle solving, using curriculum learning to progressively increase data complexity during self-supervision for joint feature capture.
Result: Validated on four public benchmarks, the approach shows consistent gains in downstream segmentation tasks using architectures over 16,000x lighter than state-of-the-art models.
Conclusion: CMTSSL enables generalizable representation learning with lightweight architectures suitable for real-world hyperspectral imaging applications and onboard satellite deployment.
Abstract: Hyperspectral imaging (HSI) captures detailed spectral signatures across hundreds of contiguous bands per pixel, being indispensable for remote sensing applications such as land-cover classification, change detection, and environmental monitoring. Due to the high dimensionality of HSI data and the slow rate of data transfer in satellite-based systems, compact and efficient models are required to support onboard processing and minimize the transmission of redundant or low-value data, e.g. cloud-covered areas. To this end, we introduce a novel curriculum multi-task self-supervised learning (CMTSSL) framework designed for lightweight architectures for HSI analysis. CMTSSL integrates masked image modeling with decoupled spatial and spectral jigsaw puzzle solving, guided by a curriculum learning strategy that progressively increases data complexity during self-supervision. This enables the encoder to jointly capture fine-grained spectral continuity, spatial structure, and global semantic features. Unlike prior dual-task SSL methods, CMTSSL simultaneously addresses spatial and spectral reasoning within a unified and computationally efficient design, being particularly suitable for training lightweight models for onboard satellite deployment. We validate our approach on four public benchmark datasets, demonstrating consistent gains in downstream segmentation tasks, using architectures that are over 16,000x lighter than some state-of-the-art models. These results highlight the potential of CMTSSL in generalizable representation learning with lightweight architectures for real-world HSI applications. Our code is publicly available at https://github.com/hugocarlesso/CMTSSL.
[191] Weakly and Self-Supervised Class-Agnostic Motion Prediction for Autonomous Driving
Ruibo Li, Hanyu Shi, Zhe Wang, Guosheng Lin
Main category: cs.CV
TL;DR: Weakly and self-supervised class-agnostic motion prediction from LiDAR using foreground/background or non-ground/ground masks instead of motion annotations, with novel loss functions that outperform existing self-supervised methods.
Details
Motivation: Motion understanding is critical for autonomous driving, but motion annotations are expensive. Outdoor scenes naturally separate into mobile foregrounds and static backgrounds, allowing motion prediction to be associated with scene parsing to reduce annotation requirements.
Method: Proposes weakly supervised paradigm using foreground/background masks (1%, 0.1% annotated) or non-ground/ground masks (0.01% annotated) instead of motion labels. Develops self-supervised approach without annotations. Uses Robust Consistency-aware Chamfer Distance loss with multi-frame information and robust penalty functions to handle outliers.
Result: Weakly and self-supervised models outperform existing self-supervised counterparts. Weakly supervised models rival some supervised methods, effectively balancing annotation effort and performance.
Conclusion: The proposed approaches demonstrate that motion prediction can be effectively learned with minimal or no motion annotations by leveraging scene parsing cues, making autonomous driving systems more practical and scalable.
Abstract: Understanding motion in dynamic environments is critical for autonomous driving, thereby motivating research on class-agnostic motion prediction. In this work, we investigate weakly and self-supervised class-agnostic motion prediction from LiDAR point clouds. Outdoor scenes typically consist of mobile foregrounds and static backgrounds, allowing motion understanding to be associated with scene parsing. Based on this observation, we propose a novel weakly supervised paradigm that replaces motion annotations with fully or partially annotated (1%, 0.1%) foreground/background masks for supervision. To this end, we develop a weakly supervised approach utilizing foreground/background cues to guide the self-supervised learning of motion prediction models. Since foreground motion generally occurs in non-ground regions, non-ground/ground masks can serve as an alternative to foreground/background masks, further reducing annotation effort. Leveraging non-ground/ground cues, we propose two additional approaches: a weakly supervised method requiring fewer (0.01%) foreground/background annotations, and a self-supervised method without annotations. Furthermore, we design a Robust Consistency-aware Chamfer Distance loss that incorporates multi-frame information and robust penalty functions to suppress outliers in self-supervised learning. Experiments show that our weakly and self-supervised models outperform existing self-supervised counterparts, and our weakly supervised models even rival some supervised ones. This demonstrates that our approaches effectively balance annotation effort and performance.
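As background for the loss named above, the following sketch shows a symmetric Chamfer distance between two point clouds with a robust penalty applied to each nearest-neighbor distance. It is not the paper's Robust Consistency-aware Chamfer Distance: the multi-frame consistency term is omitted, and the Huber penalty with `delta=1.0` is an assumed stand-in, since the abstract does not specify which robust penalty functions are used.

```python
import math


def nearest_distance(point, cloud):
    """Euclidean distance from a point to its nearest neighbor in a cloud."""
    return min(math.dist(point, other) for other in cloud)


def huber(distance, delta):
    """Huber penalty: quadratic near zero, linear beyond delta (assumed choice)."""
    if distance <= delta:
        return 0.5 * distance * distance
    return delta * (distance - 0.5 * delta)


def robust_chamfer(source, target, delta=1.0):
    """Symmetric Chamfer distance with a Huber penalty on each nearest-neighbor term."""
    forward = sum(huber(nearest_distance(p, target), delta) for p in source) / len(source)
    backward = sum(huber(nearest_distance(q, source), delta) for q in target) / len(target)
    return forward + backward


if __name__ == "__main__":
    warped = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    next_frame = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (100.0, 0.0, 0.0)]  # last point is an outlier
    print(robust_chamfer(warped, warped))
    print(robust_chamfer(warped, next_frame))
```

With a plain squared penalty the single outlier would contribute on the order of 99 squared to the loss; under the Huber penalty its contribution grows only linearly, which illustrates how a robust penalty suppresses outliers in self-supervised training.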
[192] Advancing Real-World Parking Slot Detection with Large-Scale Dataset and Semi-Supervised Baseline
Zhihao Zhang, Chunyu Lin, Lang Nie, Jiyuan Wang, Yao Zhao
Main category: cs.CV
TL;DR: This paper introduces CRPS-D, a large-scale parking slot detection dataset with diverse real-world conditions, and SS-PSD, a semi-supervised method that outperforms state-of-the-art approaches.
Details
Motivation: Current parking slot detection datasets are limited in scale and lack real-world noise variations. Manual annotation is error-prone and costly for large-scale datasets, creating a need for better data and methods.
Method: Constructed CRPS-D dataset with diverse lighting, weather conditions, and challenging parking slot variants. Developed SS-PSD, a semi-supervised teacher-student model with confidence-guided mask consistency and adaptive feature perturbation.
Result: SS-PSD demonstrates superiority over state-of-the-art solutions on both the proposed dataset and existing datasets. Performance gains increase with more unlabeled data.
Conclusion: The proposed CRPS-D dataset and SS-PSD method effectively address limitations in parking slot detection, providing a robust solution that leverages unlabeled data to improve performance in real-world conditions.
Abstract: As automatic parking systems evolve, the accurate detection of parking slots has become increasingly critical. This study focuses on parking slot detection using surround-view cameras, which offer a comprehensive bird’s-eye view of the parking environment. However, the current datasets are limited in scale, and the scenes they contain are seldom disrupted by real-world noise (e.g., light, occlusion, etc.). Moreover, manual data annotation is prone to errors and omissions due to the complexity of real-world conditions, significantly increasing the cost of annotating large-scale datasets. To address these issues, we first construct a large-scale parking slot detection dataset (named CRPS-D), which includes various lighting distributions, diverse weather conditions, and challenging parking slot variants. Compared with existing datasets, the proposed dataset boasts the largest data scale and consists of a higher density of parking slots, particularly featuring more slanted parking slots. Additionally, we develop a semi-supervised baseline for parking slot detection, termed SS-PSD, to further improve performance by exploiting unlabeled data. To our knowledge, this is the first semi-supervised approach in parking slot detection, which is built on the teacher-student model with confidence-guided mask consistency and adaptive feature perturbation. Experimental results demonstrate the superiority of SS-PSD over the existing state-of-the-art (SoTA) solutions on both the proposed dataset and the existing dataset. Particularly, the more unlabeled data there is, the more significant the gains brought by our semi-supervised scheme. The relevant source codes and the dataset have been made publicly available at https://github.com/zzh362/CRPS-D.
[193] MSDNet: Efficient 4D Radar Super-Resolution via Multi-Stage Distillation
Minqing Huang, Shouyi Lu, Boyuan Zheng, Ziyao Li, Xiao Tang, Guirong Zhuo
Main category: cs.CV
TL;DR: MSDNet is a multi-stage distillation framework that efficiently transfers LiDAR priors to 4D radar features for high-quality super-resolution with low latency.
Details
Motivation: Existing 4D radar super-resolution methods suffer from high training costs, complex diffusion-based sampling, high inference latency, and poor generalization, making it difficult to balance accuracy and efficiency.
Method: Two-stage distillation: 1) Reconstruction-guided feature distillation to align and densify features, 2) Diffusion-guided feature distillation with a lightweight diffusion network and noise adapter for adaptive noise alignment and refinement.
Result: Extensive experiments on VoD and in-house datasets show MSDNet achieves high-fidelity reconstruction and low-latency inference, with consistent performance improvements on downstream tasks.
Conclusion: MSDNet effectively addresses the limitations of existing methods by providing both accurate 4D radar point cloud super-resolution and computational efficiency through multi-stage distillation.
Abstract: 4D radar super-resolution, which aims to reconstruct sparse and noisy point clouds into dense and geometrically consistent representations, is a foundational problem in autonomous perception. However, existing methods often suffer from high training cost or rely on complex diffusion-based sampling, resulting in high inference latency and poor generalization, making it difficult to balance accuracy and efficiency. To address these limitations, we propose MSDNet, a multi-stage distillation framework that efficiently transfers dense LiDAR priors to 4D radar features to achieve both high reconstruction quality and computational efficiency. The first stage performs reconstruction-guided feature distillation, aligning and densifying the student’s features through feature reconstruction. In the second stage, we propose diffusion-guided feature distillation, which treats the stage-one distilled features as a noisy version of the teacher’s representations and refines them via a lightweight diffusion network. Furthermore, we introduce a noise adapter that adaptively aligns the noise level of the feature with a predefined diffusion timestep, enabling a more precise denoising. Extensive experiments on the VoD and in-house datasets demonstrate that MSDNet achieves both high-fidelity reconstruction and low-latency inference in the task of 4D radar point cloud super-resolution, and consistently improves performance on downstream tasks. The code will be publicly available upon publication.
[194] RadGame: An AI-Powered Platform for Radiology Education
Mohammed Baharoon, Siavash Raissi, John S. Jun, Thibault Heintz, Mahmoud Alabbad, Ali Alburkani, Sung Eun Kim, Kent Kleinschmidt, Abdulrahman O. Alhumaydhi, Mohannad Mohammed G. Alghamdi, Jeremy Francis Palacio, Mohammed Bukhaytan, Noah Michael Prudlo, Rithvik Akula, Brady Chrisler, Benjamin Galligos, Mohammed O. Almutairi, Mazeen Mohammed Alanazi, Nasser M. Alrashdi, Joel Jihwan Hwang, Sri Sai Dinesh Jaliparthi, Luke David Nelson, Nathaniel Nguyen, Sathvik Suryadevara, Steven Kim, Mohammed F. Mohammed, Yevgeniy R. Semenov, Kun-Hsing Yu, Abdulrhman Aljouie, Hassan AlOmaish, Adam Rodman, Pranav Rajpurkar
Main category: cs.CV
TL;DR: RadGame is an AI-powered gamified platform for radiology education that improves localization and report-writing skills through automated feedback, showing significant performance improvements over traditional methods.
Details
Motivation: Traditional radiology training lacks scalable, immediate feedback opportunities, limiting effective skill development in localization and report generation.
Method: Combines gamification with AI-driven feedback using public datasets. RadGame Localize uses bounding box drawing compared to radiologist annotations with visual explanations. RadGame Report provides structured AI feedback on report writing using radiology report generation metrics.
Result: Participants achieved 68% improvement in localization accuracy (vs 17% with traditional methods) and 31% improvement in report-writing accuracy (vs 4% with traditional methods).
Conclusion: AI-driven gamification shows strong potential for scalable, feedback-rich radiology training and reimagines medical AI applications in education.
Abstract: We introduce RadGame, an AI-powered gamified platform for radiology education that targets two core skills: localizing findings and generating reports. Traditional radiology training is based on passive exposure to cases or active practice with real-time input from supervising radiologists, limiting opportunities for immediate and scalable feedback. RadGame addresses this gap by combining gamification with large-scale public datasets and automated, AI-driven feedback that provides clear, structured guidance to human learners. In RadGame Localize, players draw bounding boxes around abnormalities, which are automatically compared to radiologist-drawn annotations from public datasets, and visual explanations are generated by vision-language models for findings the user missed. In RadGame Report, players compose findings given a chest X-ray, patient age and indication, and receive structured AI feedback based on radiology report generation metrics, highlighting errors and omissions compared to a radiologist’s written ground truth report from public datasets, producing a final performance and style score. In a prospective evaluation, participants using RadGame achieved a 68% improvement in localization accuracy compared to 17% with traditional passive methods and a 31% improvement in report-writing accuracy compared to 4% with traditional methods after seeing the same cases. RadGame highlights the potential of AI-driven gamification to deliver scalable, feedback-rich radiology training and reimagines the application of medical AI resources in education.
[195] TexTAR: Textual Attribute Recognition in Multi-domain and Multi-lingual Document Images
Rohan Kumar, Jyothi Swaroopa Jinka, Ravi Kiran Sarvadevabhatla
Main category: cs.CV
TL;DR: TexTAR is a multi-task Transformer model for recognizing text attributes (bold, italic, underline, strikeout) that uses 2D RoPE embeddings and context awareness to achieve state-of-the-art performance on multilingual documents.
Details
Motivation: Existing methods for textual attribute recognition struggle with computational efficiency and adaptability in noisy, multilingual settings, making it difficult to accurately identify text formatting attributes that are crucial for document understanding.
Method: Introduces TexTAR, a multi-task context-aware Transformer with a novel data selection pipeline and 2D RoPE-style mechanism to incorporate input context. Also creates MMTAD dataset - a diverse multilingual multi-domain dataset annotated with text attributes.
Result: Extensive evaluations show TexTAR outperforms existing methods, demonstrating that contextual awareness contributes to state-of-the-art performance in textual attribute recognition.
Conclusion: Contextual awareness through the proposed 2D RoPE mechanism and multi-task architecture significantly improves textual attribute recognition performance across diverse multilingual document types.
Abstract: Recognizing textual attributes such as bold, italic, underline and strikeout is essential for understanding text semantics, structure, and visual presentation. These attributes highlight key information, making them crucial for document analysis. Existing methods struggle with computational efficiency or adaptability in noisy, multilingual settings. To address this, we introduce TexTAR, a multi-task, context-aware Transformer for Textual Attribute Recognition (TAR). Our novel data selection pipeline enhances context awareness, and our architecture employs a 2D RoPE (Rotary Positional Embedding)-style mechanism to incorporate input context for more accurate attribute predictions. We also introduce MMTAD, a diverse, multilingual, multi-domain dataset annotated with text attributes across real-world documents such as legal records, notices, and textbooks. Extensive evaluations show TexTAR outperforms existing methods, demonstrating that contextual awareness contributes to state-of-the-art TAR performance.
[196] Enhancing Video Large Language Models with Structured Multi-Video Collaborative Reasoning (early version)
Zhihao He, Tianyao He, Tieyuan Chen, Yun Xu, Huabin Liu, Chaofan Gan, Gui Zou, Weiyao Lin
Main category: cs.CV
TL;DR: A multi-video collaborative framework that uses spatio-temporal graphs to represent video knowledge and fuses information from related videos to enhance video language model reasoning.
Details
Motivation: Current video language models suffer from hallucinations and inaccuracies due to spatio-temporal incompleteness in individual videos, requiring augmentation with multiple related videos for better reasoning.
Method: Proposes a framework with Video Structuring Module to represent videos as spatio-temporal graphs, Graph Fusion Module to fuse knowledge from related videos into augmented graph nodes, and multi-video structured prompts combining graph, visual, and textual tokens.
Result: Extensive experiments show the framework effectively enhances video language model performance, demonstrating its potential for advancing video reasoning capabilities.
Conclusion: The multi-video collaborative framework with structured graph representation provides a promising solution to address spatio-temporal incompleteness and improve video language model accuracy.
Abstract: Despite the rapid progress of video language models, the current pursuit of comprehensive video reasoning is thwarted by the inherent spatio-temporal incompleteness within individual videos, resulting in hallucinations and inaccuracies. A promising solution is to augment the reasoning performance with multiple related videos. However, video tokens are numerous and contain redundant information, so directly feeding the relevant video data into a large language model to enhance responses could be counterproductive. To address this challenge, we propose a multi-video collaborative framework for video language models. For efficient and flexible video representation, we establish a Video Structuring Module to represent the video’s knowledge as a spatio-temporal graph. Based on the structured video representation, we design the Graph Fusion Module to fuse the structured knowledge and valuable information from related videos into the augmented graph node tokens. Finally, we construct an elaborate multi-video structured prompt to integrate the graph, visual, and textual tokens as the input to the large language model. Extensive experiments substantiate the effectiveness of our framework, showcasing its potential as a promising avenue for advancing video language models.
[197] WHU-STree: A Multi-modal Benchmark Dataset for Street Tree Inventory
Ruifei Ding, Zhe Chen, Wen Fan, Chen Long, Huijuan Xiao, Yelu Zeng, Zhen Dong, Bisheng Yang
Main category: cs.CV
TL;DR: WHU-STree is a comprehensive multi-modal urban street tree dataset collected across two cities, containing 21,007 annotated tree instances across 50 species with synchronized point clouds and high-resolution images, supporting over 10 street tree inventory tasks.
Details
Motivation: Traditional field surveys for street tree inventory are time-consuming and labor-intensive, while existing Mobile Mapping System datasets are limited by small-scale scenes, limited annotation, or single modality, restricting comprehensive urban tree analysis.
Method: The authors collected synchronized point clouds and high-resolution images across two distinct cities using Mobile Mapping Systems, creating a dataset with 21,007 annotated tree instances across 50 species and 2 morphological parameters.
Result: Extensive experiments demonstrate the significant potential of multi-modal data fusion for street tree analysis and highlight cross-domain applicability as critical for practical algorithm deployment. The dataset supports benchmarking for key tasks like tree species classification and individual tree segmentation.
Conclusion: WHU-STree addresses limitations of existing datasets and enables comprehensive street tree analysis. Future work should focus on multi-modal fusion, multi-task collaboration, cross-domain generalization, spatial pattern learning, and Multi-modal Large Language Models for street tree asset management.
Abstract: Street trees are vital to urban livability, providing ecological and social benefits. Establishing a detailed, accurate, and dynamically updated street tree inventory has become essential for optimizing these multifunctional assets within space-constrained urban environments. Given that traditional field surveys are time-consuming and labor-intensive, automated surveys utilizing Mobile Mapping Systems (MMS) offer a more efficient solution. However, existing MMS-acquired tree datasets are limited by small-scale scenes, limited annotation, or single modality, restricting their utility for comprehensive analysis. To address these limitations, we introduce WHU-STree, a cross-city, richly annotated, and multi-modal urban street tree dataset. Collected across two distinct cities, WHU-STree integrates synchronized point clouds and high-resolution images, encompassing 21,007 annotated tree instances across 50 species and 2 morphological parameters. Leveraging these unique characteristics, WHU-STree concurrently supports over 10 tasks related to street tree inventory. We benchmark representative baselines for two key tasks: tree species classification and individual tree segmentation. Extensive experiments and in-depth analysis demonstrate the significant potential of multi-modal data fusion and underscore cross-domain applicability as a critical prerequisite for practical algorithm deployment. In particular, we identify key challenges and outline potential future work for fully exploiting WHU-STree, encompassing multi-modal fusion, multi-task collaboration, cross-domain generalization, spatial pattern learning, and Multi-modal Large Language Models for street tree asset management. The WHU-STree dataset is accessible at: https://github.com/WHU-USI3DV/WHU-STree.
[198] More performant and scalable: Rethinking contrastive vision-language pre-training of radiology in the LLM era
Yingtai Li, Haoran Lai, Xiaoqian Zhou, Shuai Ming, Wenxin Ma, Wei Wei, Shaohua Kevin Zhou
Main category: cs.CV
TL;DR: LLMs can automatically extract diagnostic labels from radiology reports with high accuracy (>96% AUC), enabling cost-effective creation of large-scale medical datasets for supervised pre-training that achieves state-of-the-art performance in medical vision-language tasks.
Details
Motivation: Large Language Models present unprecedented opportunities to revolutionize medical contrastive vision-language pre-training by facilitating large-scale supervised pre-training and advancing vision-language alignment in medical AI systems.
Method: Using modern LLMs to automatically extract diagnostic labels from radiology reports without complex prompt engineering, creating “silver-standard” datasets at minimal cost. Training vision encoders on these datasets with vanilla CLIP training using a 3D ResNet-18 architecture.
Result: Label extraction reached >96% AUC, and labeling 50k CT image-report pairs cost about $3. The trained model achieved state-of-the-art performance: 83.8% AUC for zero-shot diagnosis on CT-RATE, 77.3% AUC on RAD-ChestCT, and substantial improvements in cross-modal retrieval (MAP@50=53.7% for image-image, Recall@100=52.2% for report-image).
Conclusion: LLMs can facilitate more performant and scalable medical AI systems by enabling cost-effective large-scale supervised pre-training that fundamentally improves contrastive vision-language alignment. Vision encoders trained on LLM-extracted labels perform comparably to those trained on labels extracted by specialized BERT-based models.
Abstract: The emergence of Large Language Models (LLMs) presents unprecedented opportunities to revolutionize medical contrastive vision-language pre-training. In this paper, we show how LLMs can facilitate large-scale supervised pre-training, thereby advancing vision-language alignment. We begin by demonstrating that modern LLMs can automatically extract diagnostic labels from radiology reports with remarkable precision (>96% AUC in our experiments) without complex prompt engineering, enabling the creation of large-scale “silver-standard” datasets at a minimal cost (~$3 for 50k CT image-report pairs). Further, we find that a vision encoder trained on this “silver-standard” dataset achieves performance comparable to those trained on labels extracted by specialized BERT-based models, thereby democratizing access to large-scale supervised pre-training. Building on this foundation, we proceed to reveal that supervised pre-training fundamentally improves contrastive vision-language alignment. Our approach achieves state-of-the-art performance using only a 3D ResNet-18 with vanilla CLIP training, including 83.8% AUC for zero-shot diagnosis on CT-RATE, 77.3% AUC on RAD-ChestCT, and substantial improvements in cross-modal retrieval (MAP@50=53.7% for image-image, Recall@100=52.2% for report-image). These results demonstrate the potential of utilizing LLMs to facilitate more performant and scalable medical AI systems. Our code is available at https://github.com/SadVoxel/More-performant-and-scalable.
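Illustrative sketch: the "vanilla CLIP training" mentioned above usually refers to the standard symmetric contrastive (InfoNCE) objective over paired image and text embeddings. The pure-Python version below is only a minimal illustration of that objective, not the authors' code; the temperature value and the toy 2-D embeddings are assumptions chosen for readability.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i of images and texts is the positive."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text direction: the correct text for image i is on the diagonal.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation over the transposed matrix.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch (assumed values): matched pairs versus deliberately swapped pairs.
aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Correctly paired embeddings give a loss near zero, while swapped pairs are penalized heavily, which is the behavior the contrastive alignment relies on.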
[199] Road Obstacle Video Segmentation
Shyam Nandan Rai, Shyamgopal Karthik, Mariana-Iuliana Georgescu, Barbara Caputo, Carlo Masone, Zeynep Akata
Main category: cs.CV
TL;DR: This paper addresses road-obstacle video segmentation, showing that existing frame-by-frame methods overlook temporal consistency and proposing new benchmarks and baseline methods using vision foundation models to achieve state-of-the-art performance.
Details
Motivation: Existing road-obstacle segmentation methods work on individual frames, ignoring temporal consistency which leads to inconsistent predictions between consecutive frames in autonomous driving scenarios.
Method: The authors curated and adapted four evaluation benchmarks for road-obstacle video segmentation, evaluated 11 state-of-the-art segmentation methods, and introduced two strong baseline methods based on vision foundation models.
Result: The approach establishes a new state-of-the-art in road-obstacle video segmentation for long-range video sequences, demonstrating superior performance compared to existing image- and video-based methods.
Conclusion: Road-obstacle segmentation is inherently temporal, and the proposed methods provide valuable insights and direction for future research in consistent video segmentation for autonomous navigation.
Abstract: With the growing deployment of autonomous driving agents, the detection and segmentation of road obstacles have become critical to ensure safe autonomous navigation. However, existing road-obstacle segmentation methods are applied on individual frames, overlooking the temporal nature of the problem, leading to inconsistent prediction maps between consecutive frames. In this work, we demonstrate that the road-obstacle segmentation task is inherently temporal, since the segmentation maps for consecutive frames are strongly correlated. To address this, we curate and adapt four evaluation benchmarks for road-obstacle video segmentation and evaluate 11 state-of-the-art image- and video-based segmentation methods on these benchmarks. Moreover, we introduce two strong baseline methods based on vision foundation models. Our approach establishes a new state-of-the-art in road-obstacle video segmentation for long-range video sequences, providing valuable insights and direction for future research.
[200] Vi-SAFE: A Spatial-Temporal Framework for Efficient Violence Detection in Public Surveillance
Ligang Chang, Shengkai Xu, Liangchang Shen, Binhan Xu, Junqiao Wang, Tianyu Shi, Yanhui Du
Main category: cs.CV
TL;DR: Vi-SAFE is a spatial-temporal framework combining enhanced YOLOv8 with Temporal Segment Network for violence detection in surveillance videos, achieving 0.88 accuracy on RWF-2000 dataset.
Details
Motivation: Address challenges in violence detection including small-scale targets, complex environments, and real-time temporal analysis for public safety surveillance.
Method: Integrates optimized YOLOv8 (with GhostNetV3 backbone, EMA attention, and pruning) for human detection and TSN for binary violence classification. Models trained separately on pedestrian and violence datasets.
Result: Achieves 0.88 accuracy on RWF-2000 dataset, outperforming TSN alone (0.77) and existing methods in both accuracy and efficiency.
Conclusion: Vi-SAFE demonstrates effectiveness for public safety surveillance with improved performance and computational efficiency.
Abstract: Violence detection in public surveillance is critical for public safety. This study addresses challenges such as small-scale targets, complex environments, and real-time temporal analysis. We propose Vi-SAFE, a spatial-temporal framework that integrates an enhanced YOLOv8 with a Temporal Segment Network (TSN) for video surveillance. The YOLOv8 model is optimized with GhostNetV3 as a lightweight backbone, an exponential moving average (EMA) attention mechanism, and pruning to reduce computational cost while maintaining accuracy. YOLOv8 and TSN are trained separately on pedestrian and violence datasets, where YOLOv8 extracts human regions and TSN performs binary classification of violent behavior. Experiments on the RWF-2000 dataset show that Vi-SAFE achieves an accuracy of 0.88, surpassing TSN alone (0.77) and outperforming existing methods in both accuracy and efficiency, demonstrating its effectiveness for public safety surveillance. Code is available at https://anonymous.4open.science/r/Vi-SAFE-3B42/README.md.
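Illustrative sketch: a Temporal Segment Network typically samples one snippet from each of several equal-length video segments and averages the per-snippet class scores before the final softmax. The code below shows only that sampling-and-consensus idea. `score_fn` is a hypothetical stand-in for "crop the human region, then score the crop", and the toy scores are assumptions, not results from the paper.

```python
import math

def segment_indices(num_frames, num_segments):
    """Pick the centre frame of each of `num_segments` equal-length segments."""
    seg_len = num_frames / num_segments
    return [int(seg_len * k + seg_len / 2) for k in range(num_segments)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def tsn_consensus(snippet_scores):
    """Average the class scores over snippets, then normalise with softmax."""
    num_classes = len(snippet_scores[0])
    mean = [sum(s[c] for s in snippet_scores) / len(snippet_scores)
            for c in range(num_classes)]
    return softmax(mean)

def classify_clip(frames, score_fn, num_segments=3):
    idx = segment_indices(len(frames), num_segments)
    return tsn_consensus([score_fn(frames[i]) for i in idx])

# Hypothetical scorer returning [non_violent, violent] scores per frame.
def score_fn(frame):
    return [0.0, 2.0] if frame >= 10 else [1.0, 0.0]

frames = list(range(30))          # stand-in for 30 decoded video frames
indices = segment_indices(len(frames), 3)
probs = classify_clip(frames, score_fn)
print(indices, probs)
```

Averaging before the softmax lets a few strongly violent snippets outweigh a calm one, which is why sparse sampling can still classify a whole clip.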
[201] End4: End-to-end Denoising Diffusion for Diffusion-Based Inpainting Detection
Fei Wang, Xuecheng Wu, Zheng Zhang, Danlei Huang, Yuheng Huang, Bo Wang
Main category: cs.CV
TL;DR: A novel detection method called End4 is proposed to identify images generated by diffusion-based inpainting models, using denoising reconstruction and scale-aware feature fusion to improve detection accuracy.
Details
Motivation: Diffusion models have advanced image synthesis but raise concerns about malicious misuse. Existing approaches struggle to detect images generated by diffusion-based inpainting models, even when similar images are in training data.
Method: End4 uses an end-to-end denoising diffusion approach with a denoising reconstruction model to align latent spaces, and a Scale-aware Pyramid-like Fusion Module (SPFM) to refine local image features using attention pyramid layers at different scales.
Result: Extensive experiments show that End4 effectively generalizes to unseen masking patterns and remains robust under various perturbations. A comprehensive benchmark with images from five distinct masked regions was established for evaluation.
Conclusion: The proposed End4 method successfully addresses the challenge of detecting diffusion-based inpainted images and demonstrates strong generalization capabilities and robustness to different masking patterns and perturbations.
Abstract: The powerful generative capabilities of diffusion models have significantly advanced the field of image synthesis, enhancing both full image generation and inpainting-based image editing. Despite their remarkable advancements, diffusion models also raise concerns about potential misuse for malicious purposes. However, existing approaches struggle to identify images generated by diffusion-based inpainting models, even when similar inpainted images are included in their training data. To address this challenge, we propose a novel detection method based on End-to-end denoising diffusion (End4). Specifically, End4 designs a denoising reconstruction model to improve the alignment degree between the latent spaces of the reconstruction and detection processes, thus reconstructing features that are more conducive to detection. Meanwhile, it leverages a Scale-aware Pyramid-like Fusion Module (SPFM) that refines local image features under the guidance of attention pyramid layers at different scales, enhancing feature discriminability. Additionally, to evaluate detection performance on inpainted images, we establish a comprehensive benchmark comprising images generated from five distinct masked regions. Extensive experiments demonstrate that our End4 effectively generalizes to unseen masking patterns and remains robust under various perturbations. Our code and dataset will be released soon.
[202] Intelligent Vacuum Thermoforming Process
Andi Kuswoyo, Christos Margadji, Sebastian W. Pattinson
Main category: cs.CV
TL;DR: Vision-based system using k-NN algorithm to optimize vacuum thermoforming parameters and improve part quality with minimal data requirements.
Details
Motivation: Vacuum thermoforming faces quality consistency challenges due to material property variations and tooling configurations, requiring an efficient quality control solution.
Method: Developed comprehensive dataset with visual data from vacuum-formed samples under various parameters, used image augmentation, and employed k-Nearest Neighbour algorithm to map low-quality parts to high-quality counterparts for parameter adjustment.
Result: Model showed strong performance in adjusting heating power, heating time, and vacuum time, effectively reducing defects and improving production efficiency.
Conclusion: The vision-based quality control system successfully predicts and optimizes process parameters, enhancing part quality in vacuum thermoforming with minimal data requirements.
Abstract: Ensuring consistent quality in vacuum thermoforming presents challenges due to variations in material properties and tooling configurations. This research introduces a vision-based quality control system to predict and optimise process parameters, thereby enhancing part quality with minimal data requirements. A comprehensive dataset was developed using visual data from vacuum-formed samples subjected to various process parameters, supplemented by image augmentation techniques to improve model training. A k-Nearest Neighbour algorithm was subsequently employed to identify adjustments needed in process parameters by mapping low-quality parts to their high-quality counterparts. The model exhibited strong performance in adjusting heating power, heating time, and vacuum time to reduce defects and improve production efficiency.
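Illustrative sketch: one way to read "mapping low-quality parts to their high-quality counterparts" is to look up the k nearest high-quality samples in a visual feature space and move the process parameters toward theirs. The code below is a minimal sketch under that assumption. The feature vectors, parameter values, and the averaging rule are all hypothetical and are not taken from the paper.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def suggest_adjustment(query_features, query_params, good_parts, k=3):
    """Return the parameter change toward the mean of the k nearest good parts."""
    nearest = sorted(
        good_parts, key=lambda p: euclidean(query_features, p["features"])
    )[:k]
    names = list(query_params)
    target = {n: sum(p["params"][n] for p in nearest) / len(nearest) for n in names}
    return {n: target[n] - query_params[n] for n in names}

# Hypothetical reference set of high-quality parts (features are made up).
good_parts = [
    {"features": [0.9, 0.1],
     "params": {"heating_power": 80.0, "heating_time": 60.0, "vacuum_time": 10.0}},
    {"features": [0.8, 0.2],
     "params": {"heating_power": 90.0, "heating_time": 50.0, "vacuum_time": 12.0}},
    {"features": [0.1, 0.9],
     "params": {"heating_power": 40.0, "heating_time": 20.0, "vacuum_time": 5.0}},
]

# A defective part and the parameters it was produced with (also made up).
delta = suggest_adjustment(
    [0.85, 0.2],
    {"heating_power": 70.0, "heating_time": 40.0, "vacuum_time": 8.0},
    good_parts,
    k=2,
)
print(delta)
```

Because k-NN needs no training beyond storing labelled samples, it fits the paper's emphasis on minimal data requirements, although the distance metric and k would need tuning in practice.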
[203] StyleSculptor: Zero-Shot Style-Controllable 3D Asset Generation with Texture-Geometry Dual Guidance
Zefan Qu, Zhenwei Wang, Haoyuan Wang, Ke Xu, Gerhard Hancke, Rynson W. H. Lau
Main category: cs.CV
TL;DR: StyleSculptor is a training-free approach for generating style-guided 3D assets from content and style images using a novel Style Disentangled Attention module and Style Guided Control mechanism.
Details
Motivation: Creating 3D assets that match existing style in texture and geometry is essential for applications like gaming and VR, but current methods struggle with fine-grained style control.
Method: Uses Style Disentangled Attention (SD-Attn) module with cross-3D attention for dynamic content-style interaction, plus style-disentangled feature selection to prevent semantic leakage. Style Guided Control enables texture/geometry-only stylization.
Result: Outperforms existing baseline methods in producing high-fidelity 3D assets with fine-grained style control.
Conclusion: StyleSculptor enables zero-shot, training-free style-guided 3D generation with effective texture and geometry control through novel attention mechanisms.
Abstract: Creating 3D assets that follow the texture and geometry style of existing ones is often desirable or even inevitable in practical applications like video gaming and virtual reality. While impressive progress has been made in generating 3D objects from text or images, creating style-controllable 3D assets remains a complex and challenging problem. In this work, we propose StyleSculptor, a novel training-free approach for generating style-guided 3D assets from a content image and one or more style images. Unlike previous works, StyleSculptor achieves style-guided 3D generation in a zero-shot manner, enabling fine-grained 3D style control that captures the texture, geometry, or both styles of user-provided style images. At the core of StyleSculptor is a novel Style Disentangled Attention (SD-Attn) module, which establishes a dynamic interaction between the input content image and style image for style-guided 3D asset generation via a cross-3D attention mechanism, enabling stable feature fusion and effective style-guided generation. To alleviate semantic content leakage, we also introduce a style-disentangled feature selection strategy within the SD-Attn module, which leverages the variance of 3D feature patches to disentangle style- and content-significant channels, allowing selective feature injection within the attention framework. With SD-Attn, the network can dynamically compute texture-, geometry-, or both-guided features to steer the 3D generation process. Built upon this, we further propose the Style Guided Control (SGC) mechanism, which enables exclusive geometry- or texture-only stylization, as well as adjustable style intensity control. Extensive experiments demonstrate that StyleSculptor outperforms existing baseline methods in producing high-fidelity 3D assets.
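Illustrative sketch: the abstract states that the variance of 3D feature patches is used to separate style-significant from content-significant channels, but it does not say which variance range maps to which role. The code below therefore only shows the mechanics of a variance-based channel split followed by selective injection. The split ratio, the choice of which group to inject, and the toy features are assumptions.

```python
def channel_variances(patches):
    """patches: list of per-patch feature vectors. Returns variance per channel."""
    n = len(patches)
    dims = len(patches[0])
    means = [sum(p[c] for p in patches) / n for c in range(dims)]
    return [sum((p[c] - means[c]) ** 2 for p in patches) / n for c in range(dims)]

def split_channels(patches, ratio=0.5):
    """Split channel indices into (high-variance, low-variance) groups."""
    var = channel_variances(patches)
    order = sorted(range(len(var)), key=lambda c: var[c], reverse=True)
    cut = int(len(var) * ratio)
    return sorted(order[:cut]), sorted(order[cut:])

def inject(content, style, channels):
    """Copy style values into the content vector on the selected channels only."""
    out = list(content)
    for c in channels:
        out[c] = style[c]
    return out

# Toy features: two patches with four channels each (values are made up).
patches = [[0.0, 1.0, 5.0, 2.0], [4.0, 1.0, -5.0, 2.1]]
high_var, low_var = split_channels(patches)

# Assumption for illustration: inject style on the low-variance group only.
mixed = inject([1.0, 1.0, 1.0, 1.0], [9.0, 9.0, 9.0, 9.0], low_var)
print(high_var, low_var, mixed)
```

Restricting injection to a subset of channels is what limits semantic leakage: channels outside the selected group keep the content features untouched.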
[204] 3D Aware Region Prompted Vision Language Model
An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiaolong Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, Hongxu Yin, Xiaolong Wang, Sifei Liu
Main category: cs.CV
TL;DR: SR-3D is a vision-language model that bridges 2D images and 3D data through shared visual tokens, enabling flexible region prompting without multi-frame labeling by using 3D positional embeddings with 2D features.
Details
Motivation: To connect single-view 2D images with multi-view 3D data and enable accurate spatial reasoning across frames without requiring exhaustive multi-frame labeling or co-occurring objects in the same view.
Method: Enriches 2D visual features with 3D positional embeddings, creating a shared visual token space that allows the 3D model to leverage strong 2D priors for spatial reasoning across different frames.
Result: Achieves state-of-the-art performance on both general 2D vision language and specialized 3D spatial benchmarks, and demonstrates applicability to in-the-wild videos without 3D inputs or ground-truth annotations.
Conclusion: SR-3D effectively unifies 2D and 3D representation spaces for scene understanding, enabling accurate spatial relationship inference and metric measurements even without sensory 3D inputs.
Abstract: We present a Spatial Region 3D (SR-3D) aware vision-language model that connects single-view 2D images and multi-view 3D data through a shared visual token space. SR-3D supports flexible region prompting, allowing users to annotate regions with bounding boxes, segmentation masks on any frame, or directly in 3D, without the need for exhaustive multi-frame labeling. We achieve this by enriching 2D visual features with 3D positional embeddings, which allows the 3D model to draw upon strong 2D priors for more accurate spatial reasoning across frames, even when objects of interest do not co-occur within the same view. Extensive experiments on both general 2D vision language and specialized 3D spatial benchmarks demonstrate that SR-3D achieves state-of-the-art performance, underscoring its effectiveness for unifying 2D and 3D representation space on scene understanding. Moreover, we observe applicability to in-the-wild videos without sensory 3D inputs or ground-truth 3D annotations, where SR-3D accurately infers spatial relationships and metric measurements.
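Illustrative sketch: "enriching 2D visual features with 3D positional embeddings" can be pictured as adding an encoding of each patch's 3D location to its 2D feature vector. The abstract does not specify the form of the embedding, so the sinusoidal encoding below is an assumption used only to make the idea concrete.

```python
import math

def sinusoidal(value, dim):
    """Encode one scalar coordinate as `dim` interleaved sin/cos features."""
    out = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        out += [math.sin(value * freq), math.cos(value * freq)]
    return out

def positional_embedding_3d(xyz, dim):
    """Concatenate per-axis encodings; `dim` must be divisible by 6."""
    assert dim % 6 == 0, "dim must split evenly into three even-sized parts"
    per_axis = dim // 3
    emb = []
    for coord in xyz:
        emb += sinusoidal(coord, per_axis)
    return emb

def enrich(feature, xyz):
    """Add the 3D positional embedding to a 2D patch feature of the same size."""
    pe = positional_embedding_3d(xyz, len(feature))
    return [f + p for f, p in zip(feature, pe)]

# Toy example: a 6-dimensional patch feature located at the 3D origin.
token = enrich([0.0] * 6, (0.0, 0.0, 0.0))
print(token)
```

Since the embedding is added rather than concatenated, tokens from single images and from multi-view scenes keep the same dimensionality, which is one simple way to obtain a shared token space.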
[205] Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation
Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Rita Cucchiara
Main category: cs.CV
TL;DR: Talk2DINO combines DINOv2’s spatial accuracy with CLIP’s language understanding for open-vocabulary segmentation, achieving SOTA performance without fine-tuning backbones.
Details
Motivation: Existing vision-language models like CLIP struggle with spatial localization due to global feature alignment, while self-supervised models like DINO lack language integration. This gap needs bridging for better open-vocabulary segmentation.
Method: Hybrid approach that aligns CLIP’s textual embeddings to DINOv2’s patch-level features via learned mapping function. Uses DINOv2 attention maps to selectively align local visual patches with text embeddings during training.
Result: Achieves state-of-the-art performance across multiple unsupervised OVS benchmarks, produces more natural and less noisy segmentations, and effectively distinguishes foreground from background.
Conclusion: Talk2DINO successfully combines the strengths of DINOv2 and CLIP for superior open-vocabulary segmentation without requiring backbone fine-tuning, demonstrating effective spatial-textual alignment.
Abstract: Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks. Source code and models are publicly available at: https://lorebianchi98.github.io/Talk2DINO/.
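Illustrative sketch: once text embeddings are mapped into the patch-feature space, segmentation can be read off by assigning each patch to the class whose projected text embedding is most similar. The code below illustrates that inference step only. The linear map `W` stands in for the learned mapping function (whose actual form is not given in the abstract), and all vectors are toy values.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)  # small epsilon guards against zero vectors

def segment(patch_feats, text_embs, W):
    """Label each patch with the index of its most similar projected text."""
    projected = [matvec(W, t) for t in text_embs]
    return [
        max(range(len(projected)), key=lambda k: cosine(p, projected[k]))
        for p in patch_feats
    ]

# Hypothetical mapping from a 3-D text space to a 2-D patch space.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
text_embs = [[1.0, 0.0, 5.0], [0.0, 1.0, 5.0]]     # two class prompts
patch_feats = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]  # three image patches

labels = segment(patch_feats, text_embs, W)
print(labels)
```

Only the mapping is trained in this picture, so both backbones can stay frozen, matching the abstract's claim that no backbone fine-tuning is needed.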
[206] MEIL-NeRF: Memory-Efficient Incremental Learning of Neural Radiance Fields
Jaeyoung Chung, Kanggeon Lee, Sungyong Baik, Kyoung Mu Lee
Main category: cs.CV
TL;DR: MEIL-NeRF is a memory-efficient incremental learning algorithm for Neural Radiance Fields that prevents catastrophic forgetting by using NeRF itself as a memory system and self-distillation with selectively queried rays.
Details
Motivation: NeRF faces challenges in practical applications like large-scale scenes and edge devices with limited memory, where incremental learning is needed but traditional methods suffer from catastrophic forgetting or memory scalability issues.
Method: The framework learns which rays to query NeRF to extract previous pixel values, then uses these extracted values to train NeRF in a self-distillation manner to prevent forgetting of previously seen data.
Result: MEIL-NeRF demonstrates constant memory consumption and competitive performance compared to previous incremental learning approaches.
Conclusion: The approach successfully addresses memory scalability while maintaining performance in incremental learning scenarios for NeRF applications.
Abstract: Building on the representation power of neural networks, neural radiance fields (NeRF) have recently emerged as one of the promising and widely applicable methods for 3D object and scene representation. However, NeRF faces challenges in practical applications, such as large-scale scenes and edge devices with a limited amount of memory, where data needs to be processed sequentially. Under such incremental learning scenarios, neural networks are known to suffer catastrophic forgetting: easily forgetting previously seen data after training with new data. We observe that previous incremental learning algorithms are limited by either low performance or memory scalability issues. As such, we develop a Memory-Efficient Incremental Learning algorithm for NeRF (MEIL-NeRF). MEIL-NeRF takes inspiration from NeRF itself in that a neural network can serve as a memory that provides the pixel RGB values, given rays as queries. Motivated by this, our framework learns which rays to query NeRF with in order to extract previous pixel values. The extracted pixel values are then used to train NeRF in a self-distillation manner to prevent catastrophic forgetting. As a result, MEIL-NeRF demonstrates constant memory consumption and competitive performance.
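Illustrative sketch: the self-distillation idea can be written as a two-term loss, where a frozen copy of the previously trained network supplies pseudo ground-truth colors for rays from past views. The code below shows only that loss structure. The paper learns which rays to query, whereas here `replay_rays` is simply given, and the models are hypothetical callables that map a ray to an RGB triple.

```python
def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def incremental_loss(model, frozen_model, new_batch, replay_rays, distill_weight=1.0):
    """Loss on new data plus a distillation term on rays from earlier views."""
    # Supervised term: fit the newly arrived (ray, rgb) pairs.
    new_term = sum(mse(model(ray), rgb) for ray, rgb in new_batch) / len(new_batch)
    # Distillation term: the frozen network acts as memory for old pixel colors.
    distill_term = sum(
        mse(model(ray), frozen_model(ray)) for ray in replay_rays
    ) / len(replay_rays)
    return new_term + distill_weight * distill_term

# Toy stand-ins (assumptions): constant-color "networks" and one ray each.
frozen = lambda ray: (0.5, 0.5, 0.5)
unchanged = lambda ray: (0.5, 0.5, 0.5)
drifted = lambda ray: (1.0, 1.0, 1.0)

new_batch = [((0.0, 0.0, 1.0), (0.5, 0.5, 0.5))]
replay_rays = [(0.0, 1.0, 0.0)]

loss_same = incremental_loss(unchanged, frozen, new_batch, replay_rays)
loss_drift = incremental_loss(drifted, frozen, new_batch, replay_rays)
print(loss_same, loss_drift)
```

Because old pixels are regenerated from the frozen network instead of being stored, memory use does not grow with the number of past images, which is consistent with the constant memory consumption reported in the abstract.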
[207] Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection
Ziqi Miao, Yi Ding, Lijun Li, Jing Shao
Main category: cs.CV
TL;DR: VisCo Attack is a novel vision-centric jailbreak method that uses contextual dialogue with visual strategies to exploit security vulnerabilities in multimodal LLMs, achieving 85% attack success rate on GPT-4o.
Details
Motivation: Existing visual attacks on MLLMs primarily use images as triggers with semantic ambiguity and lack realistic grounding. The paper aims to create more effective and realistic vision-centric jailbreak scenarios where visual information is essential.
Method: Proposes VisCo Attack with four vision-focused strategies to fabricate contextual dialogue, dynamically generating auxiliary images when needed. Includes automatic toxicity obfuscation and semantic refinement to create effective attack prompts for black-box MLLMs.
Result: Achieves toxicity score of 4.78 and Attack Success Rate (ASR) of 85% on MM-SafetyBench against GPT-4o, significantly outperforming baseline (toxicity 2.48, ASR 22.2%).
Conclusion: VisCo demonstrates that vision-centric jailbreak attacks are highly effective, revealing critical security vulnerabilities in MLLMs that need addressing for safe deployment in open-world environments.
Abstract: With the emergence of strong vision language capabilities, multimodal large language models (MLLMs) have demonstrated tremendous potential for real-world applications. However, the security vulnerabilities exhibited by the visual modality pose significant challenges to deploying such models in open-world environments. Recent studies have successfully induced harmful responses from target MLLMs by encoding harmful textual semantics directly into visual inputs. However, in these approaches, the visual modality primarily serves as a trigger for unsafe behavior, often exhibiting semantic ambiguity and lacking grounding in realistic scenarios. In this work, we define a novel setting: vision-centric jailbreak, where visual information serves as a necessary component in constructing a complete and realistic jailbreak context. Building on this setting, we propose the VisCo (Visual Contextual) Attack. VisCo fabricates contextual dialogue using four distinct vision-focused strategies, dynamically generating auxiliary images when necessary to construct a vision-centric jailbreak scenario. To maximize attack effectiveness, it incorporates automatic toxicity obfuscation and semantic refinement to produce a final attack prompt that reliably triggers harmful responses from the target black-box MLLMs. Specifically, VisCo achieves a toxicity score of 4.78 and an Attack Success Rate (ASR) of 85% on MM-SafetyBench against GPT-4o, significantly outperforming the baseline, which achieves a toxicity score of 2.48 and an ASR of 22.2%. Code: https://github.com/Dtc7w3PQ/Visco-Attack.
[208] Detection of Synthetic Face Images: Accuracy, Robustness, Generalization
Nela Petrzelkova, Jan Cech
Main category: cs.CV
TL;DR: A study shows that simple models can detect synthetic face images from specific generators with near-perfect accuracy, but they fail to generalize to unseen generators and are vulnerable to adversarial attacks.
Details
Motivation: To investigate the effectiveness of detecting synthetic face images generated by modern AI systems, including recent diffusion models, and understand the limitations of current detection methods.
Method: Collected FF5 dataset with five fake face generators, trained simple models with data augmentation for distortion handling, used YOLO architecture for partial manipulation localization, and tested generalization on unseen generators including Realistic Vision.
Result: Models achieved near-perfect accuracy on specific generators and could localize manipulated areas, but failed to generalize to unseen generators and were vulnerable to adversarial attacks. State-of-the-art methods also failed on newer generators.
Conclusion: While current detection methods work well on known generators, they lack generalization capability to unseen models, highlighting a significant limitation in synthetic image detection that needs to be addressed.
Abstract: An experimental study on detecting synthetic face images is presented. We collected a dataset, called FF5, of five fake face image generators, including recent diffusion models. We find that a simple model trained on a specific image generator can achieve near-perfect accuracy in separating synthetic and real images. The model handles common image distortions (reduced resolution, compression) by using data augmentation. Moreover, partial manipulations, where synthetic images are blended into real ones by inpainting, are identified and the area of the manipulation is localized by a simple model of YOLO architecture. However, the model turned out to be vulnerable to adversarial attacks and does not generalize to unseen generators. Failure to generalize to detect images produced by a newer generator also occurs for recent state-of-the-art methods, which we tested on Realistic Vision, a fine-tuned version of StabilityAI’s Stable Diffusion image generator.
[209] Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions
Pu Jian, Donglei Yu, Wen Yang, Shuo Ren, Jiajun Zhang
Main category: cs.CV
TL;DR: ClearVQA benchmark addresses interactive clarification for ambiguous questions in VQA by creating a comprehensive evaluation framework and overcoming VLMs’ tendency to answer rather than ask clarifying questions.
Details
Motivation: Users often pose ambiguous questions to visual language models due to varying expression habits, but existing research only rephrases questions rather than leveraging the interactive nature of user-VLM interactions where ambiguities can be clarified through feedback.
Method: Introducing the ClearVQA benchmark that targets three common categories of ambiguity in VQA context and encompasses various VQA scenarios to assess VLMs’ capacity for resolving ambiguities through interaction.
Result: The paper presents a new benchmark framework designed to evaluate VLMs’ ability to handle ambiguous questions through interactive clarification, addressing the gap in existing evaluation methods.
Conclusion: ClearVQA provides a necessary benchmark to advance research on interactive clarification in VQA, enabling VLMs to better handle ambiguous user questions through proper clarification mechanisms rather than just rephrasing.
Abstract: In the visual question answering (VQA) context, users often pose ambiguous questions to visual language models (VLMs) due to varying expression habits. Existing research addresses such ambiguities primarily by rephrasing questions. These approaches neglect the inherently interactive nature of user interactions with VLMs, where ambiguities can be clarified through user feedback. However, research on interactive clarification faces two major challenges: (1) Benchmarks are absent to assess VLMs’ capacity for resolving ambiguities through interaction; (2) VLMs are trained to prefer answering rather than asking, preventing them from seeking clarification. To overcome these challenges, we introduce the ClearVQA benchmark, which targets three common categories of ambiguity in the VQA context and encompasses various VQA scenarios.
[210] RingMo-Aerial: An Aerial Remote Sensing Foundation Model With Affine Transformation Contrastive Learning
Wenhui Diao, Haichen Yu, Kaiyue Kang, Tong Ling, Di Liu, Yingchao Feng, Hanbo Bi, Libo Ren, Xuexue Li, Yongqiang Mao, Xian Sun
Main category: cs.CV
TL;DR: RingMo-Aerial is a foundation model for aerial remote sensing vision that addresses unique viewing angle challenges through frequency-enhanced attention, affine-based contrastive learning, and an efficient adapter for various downstream tasks.
Details
Motivation: Existing ARS vision research focuses on task-specific algorithms with limited applicability. There's a gap in foundation model research for aerial remote sensing that can handle unique viewing angles and small object representation challenges.
Method: Proposes RingMo-Aerial with three key components: 1) Frequency-Enhanced Multi-Head Self-Attention (FE-MSA) for small-object representation, 2) Affine transformation-based contrastive learning for tilted viewing angle adaptation, 3) ARS-Adapter for efficient parameter fine-tuning across various tasks.
Result: Experimental results show RingMo-Aerial achieves state-of-the-art (SOTA) performance on multiple downstream aerial remote sensing vision tasks.
Conclusion: RingMo-Aerial demonstrates practicality and efficacy in enhancing ARS vision task performance, providing a comprehensive foundation model solution for aerial remote sensing applications.
Abstract: Aerial Remote Sensing (ARS) vision tasks present significant challenges due to the unique viewing angle characteristics. Existing research has primarily focused on algorithms for specific tasks, which have limited applicability in a broad range of ARS vision applications. This paper proposes RingMo-Aerial, aiming to fill the gap in foundation model research in the field of ARS vision. A Frequency-Enhanced Multi-Head Self-Attention (FE-MSA) mechanism is introduced to strengthen the model’s capacity for small-object representation. Complementarily, an affine transformation-based contrastive learning method improves its adaptability to the tilted viewing angles inherent in ARS tasks. Furthermore, the ARS-Adapter, an efficient parameter fine-tuning method, is proposed to improve the model’s adaptability and performance in various ARS vision tasks. Experimental results demonstrate that RingMo-Aerial achieves SOTA performance on multiple downstream tasks. This indicates the practicality and efficacy of RingMo-Aerial in enhancing the performance of ARS vision tasks.
[211] T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design
Jiachen Li, Qian Long, Jian Zheng, Xiaofeng Gao, Robinson Piramuthu, Wenhu Chen, William Yang Wang
Main category: cs.CV
TL;DR: T2V-Turbo-v2 enhances diffusion-based text-to-video models through consistency distillation with multiple supervision signals including high-quality data, reward feedback, and conditional guidance, achieving state-of-the-art results on VBench.
Details
Motivation: To improve text-to-video generation quality during post-training by distilling a consistency model from pretrained T2V models with enhanced supervision strategies.
Method: Integrates high-quality training data, reward model feedback, and conditional guidance into consistency distillation process, with tailored datasets and energy function design for teacher ODE solver.
Result: Achieves new state-of-the-art on VBench with Total score of 85.13, surpassing proprietary systems like Gen-3 and Kling, with improved motion quality metrics.
Conclusion: The approach demonstrates the effectiveness of comprehensive supervision signals and conditional guidance strategies in enhancing both visual quality and text-video alignment in T2V models.
Abstract: In this paper, we focus on enhancing a diffusion-based text-to-video (T2V) model during the post-training phase by distilling a highly capable consistency model from a pretrained T2V model. Our proposed method, T2V-Turbo-v2, introduces a significant advancement by integrating various supervision signals, including high-quality training data, reward model feedback, and conditional guidance, into the consistency distillation process. Through comprehensive ablation studies, we highlight the crucial importance of tailoring datasets to specific learning objectives and the effectiveness of learning from diverse reward models for enhancing both the visual quality and text-video alignment. Additionally, we highlight the vast design space of conditional guidance strategies, which centers on designing an effective energy function to augment the teacher ODE solver. We demonstrate the potential of this approach by extracting motion guidance from the training datasets and incorporating it into the ODE solver, showcasing its effectiveness in improving the motion quality of the generated videos with the improved motion-related metrics from VBench and T2V-CompBench. Empirically, our T2V-Turbo-v2 establishes a new state-of-the-art result on VBench, with a Total score of 85.13, surpassing proprietary systems such as Gen-3 and Kling.
[212] Adversarial Prompt Distillation for Vision-Language Models
Lin Luo, Xin Wang, Bojia Zi, Shihao Zhao, Xingjun Ma, Yu-Gang Jiang
Main category: cs.CV
TL;DR: APD is a bimodal knowledge distillation framework that enhances adversarial prompt tuning by optimizing prompts for both visual and textual modalities while distilling knowledge from a clean teacher CLIP model, achieving superior robustness and accuracy.
Details
Motivation: Existing adversarial prompt tuning methods are mostly single-modal (only visual or textual), limiting their effectiveness in either robustness or clean accuracy for vision-language models.
Method: Adversarial Prompt Distillation (APD) - a bimodal knowledge distillation framework that integrates adversarial prompt tuning with multi-modal knowledge transfer from a clean pre-trained teacher CLIP model.
Result: Extensive experiments show APD outperforms state-of-the-art APT methods in both adversarial robustness and clean accuracy across multiple benchmark datasets.
Conclusion: APD validates the possibility of using a non-robust teacher to improve generalization and robustness of fine-tuned VLMs, providing superior bimodal adversarial defense.
Abstract: Large pre-trained Vision-Language Models (VLMs) such as Contrastive Language-Image Pre-training (CLIP) have been shown to be susceptible to adversarial attacks, raising concerns about their deployment in safety-critical applications like autonomous driving and medical diagnosis. One promising approach for robustifying pre-trained VLMs is Adversarial Prompt Tuning (APT), which applies adversarial training during the process of prompt tuning. However, existing APT methods are mostly single-modal methods that design prompt(s) for only the visual or textual modality, limiting their effectiveness in either robustness or clean accuracy. In this work, we propose Adversarial Prompt Distillation (APD), a bimodal knowledge distillation framework that enhances APT by integrating it with multi-modal knowledge transfer. APD optimizes prompts for both visual and textual modalities while distilling knowledge from a clean pre-trained teacher CLIP model. Extensive experiments on multiple benchmark datasets demonstrate the superiority of our APD method over the current state-of-the-art APT methods in terms of both adversarial robustness and clean accuracy. The effectiveness of APD also validates the possibility of using a non-robust teacher to improve the generalization and robustness of fine-tuned VLMs.
[213] IAG: Input-aware Backdoor Attack on VLMs for Visual Grounding
Junxian Li, Beining Xu, Di Zhang
Main category: cs.CV
TL;DR: IAG is a novel input-aware backdoor attack method that manipulates VLMs to ground specific target objects regardless of user queries, achieving high attack success rates while maintaining stealth through minimal visual discrepancies.
Details
Motivation: Security issues in visual grounding tasks for VLMs remain underexplored, particularly regarding backdoor attacks that could manipulate model behavior to ground incorrect objects based on malicious triggers.
Method: Uses an adaptive trigger generator with text-conditional U-Net to embed semantic information of attack targets into images, employs reconstruction loss for stealth, and introduces unified attack data generation method.
Result: Achieves over 65% ASR@0.5 on InternVL-2.5-8B across various testing sets, successfully manipulates Ferret-7B and LLaVA-1.5-7B with minimal accuracy decrease on clean samples, demonstrating robustness and transferability.
Conclusion: IAG presents a feasible and effective backdoor attack method for VLMs, highlighting significant security vulnerabilities in visual grounding systems that require attention and potential defense mechanisms.
Abstract: Vision-language models (VLMs) have shown significant advancements in tasks such as visual grounding, where they localize specific objects in images based on natural language queries and images. However, security issues in visual grounding tasks for VLMs remain underexplored, especially in the context of backdoor attacks. In this paper, we introduce a novel input-aware backdoor attack method, IAG, designed to manipulate the grounding behavior of VLMs. This attack forces the model to ground a specific target object in the input image, regardless of the user’s query. We propose an adaptive trigger generator that embeds the semantic information of the attack target’s description into the original image using a text-conditional U-Net, thereby overcoming the open-vocabulary attack challenge. To ensure the attack’s stealthiness, we utilize a reconstruction loss to minimize visual discrepancies between poisoned and clean images. Additionally, we introduce a unified method for generating attack data. IAG is evaluated theoretically and empirically, demonstrating its feasibility and effectiveness. Notably, our ASR@0.5 on InternVL-2.5-8B reaches over 65% on various testing sets. IAG also shows promising potential in manipulating Ferret-7B and LLaVA-1.5-7B with very little accuracy decrease on clean samples. Extensive specific experiments, such as ablation study and potential defense, also indicate the robustness and transferability of our attack.
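As a rough illustration of the ASR@0.5 figure quoted above, the sketch below assumes the metric counts an attack as successful when the predicted box overlaps the attacker's target box with IoU of at least 0.5. That reading, the (x1, y1, x2, y2) box format, and the function names `box_iou` and `attack_success_rate` are assumptions made for illustration, not details taken from the paper.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes contribute no intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def attack_success_rate(predicted, targets, threshold=0.5):
    """Fraction of poisoned samples whose predicted box hits the attack target.

    Assumed definition: a sample counts as a success when
    IoU(predicted box, target box) >= threshold.
    """
    if not predicted:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(predicted, targets))
    return hits / len(predicted)
```

Under this reading, clean accuracy would be measured the same way on unpoisoned inputs, with the user's queried object as the reference box instead of the attack target.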
[214] 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso M de Melo, Alan Yuille
Main category: cs.CV
TL;DR: The paper introduces 3DSRBench, the first comprehensive benchmark for evaluating 3D spatial reasoning in large multi-modal models, featuring 2,772 annotated visual QA pairs across 12 question types and specialized evaluation strategies.
Details
Motivation: While LMMs have advanced in 2D image/video understanding, their 3D spatial reasoning capabilities on natural images remain understudied, limiting their applicability to fields like autonomous navigation, robotics, and AR/VR.
Method: Created 3DSRBench with manually annotated visual QA pairs, balanced data distribution, and FlipEval strategy. Includes two subsets with common and uncommon 6D camera viewpoints to test robustness.
Result: Benchmarking revealed LMMs’ limitations in height, orientation, location, and multi-object reasoning, with degraded performance on images from uncommon viewpoints.
Conclusion: 3DSRBench provides valuable insights for developing LMMs with stronger spatial reasoning abilities, highlighting current limitations and future research directions.
Abstract: 3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects within the 3D space. This allows models to develop a comprehensive understanding of the 3D scene, enabling their applicability to a broader range of areas, such as autonomous navigation, robotics, and AR/VR. While large multi-modal models (LMMs) have achieved remarkable progress in a wide range of image and video understanding tasks, their capabilities to perform 3D spatial reasoning on diverse natural images are less studied. In this work, we present the first comprehensive 3D spatial reasoning benchmark, 3DSRBench, with 2,772 manually annotated visual question-answer pairs across 12 question types. We conduct robust and thorough evaluation of 3D spatial reasoning abilities by balancing data distribution and adopting a novel FlipEval strategy. To further study the robustness of 3D spatial reasoning w.r.t. camera 3D viewpoints, our 3DSRBench includes two subsets with 3D spatial reasoning questions on paired images with common and uncommon viewpoints. We benchmark a wide range of open-sourced and proprietary LMMs, uncovering their limitations in various aspects of 3D awareness, such as height, orientation, location, and multi-object reasoning, as well as their degraded performance on images from uncommon 6D viewpoints. Our 3DSRBench provides valuable findings and insights about future development of LMMs with strong spatial reasoning abilities. Our project page is available at https://3dsrbench.github.io/.
[215] VARCO-VISION-2.0 Technical Report
Young-rok Cha, Jeongho Ju, SunYoung Park, Jong-Hyeon Lee, Younghyun Yu, Youngjune Kim
Main category: cs.CV
TL;DR: VARCO-VISION-2.0 is an improved bilingual Korean-English vision-language model that supports multi-image understanding, layout-aware OCR, and achieves strong performance with competitive benchmark results.
Details
Motivation: To develop an advanced bilingual vision-language model for Korean and English that improves upon previous models, supports complex multi-image inputs, and provides practical deployment options.
Method: Four-stage curriculum training with memory-efficient techniques, supporting multi-image understanding, layout-aware OCR (predicting text content and spatial location), and preference optimization for safety.
Result: Achieves enhanced multimodal alignment while preserving core language abilities, with strong spatial grounding and competitive results in both languages (the 14B model ranks 8th on the OpenCompass VLM leaderboard among models of comparable scale). Both the 14B model and a 1.7B version optimized for on-device deployment are available.
Conclusion: VARCO-VISION-2.0 advances bilingual VLM development with practical applications, offering both full-scale and lightweight models for different deployment needs.
Abstract: We introduce VARCO-VISION-2.0, an open-weight bilingual vision-language model (VLM) for Korean and English with improved capabilities compared to the previous model VARCO-VISION-14B. The model supports multi-image understanding for complex inputs such as documents, charts, and tables, and delivers layout-aware OCR by predicting both textual content and its spatial location. Trained with a four-stage curriculum with memory-efficient techniques, the model achieves enhanced multimodal alignment, while preserving core language abilities and improving safety via preference optimization. Extensive benchmark evaluations demonstrate strong spatial grounding and competitive results for both languages, with the 14B model achieving 8th place on the OpenCompass VLM leaderboard among models of comparable scale. Alongside the 14B-scale model, we release a 1.7B version optimized for on-device deployment. We believe these models advance the development of bilingual VLMs and their practical applications. Two variants of VARCO-VISION-2.0 are available at Hugging Face: a full-scale 14B model and a lightweight 1.7B model.
[216] IMPROVE: Iterative Model Pipeline Refinement and Optimization Leveraging LLM Experts
Eric Xue, Ke Chen, Zeyi Huang, Yuyang Ji, Haohan Wang
Main category: cs.CV
TL;DR: Iterative Refinement strategy for LLM-driven ML pipeline design that updates components one at a time based on training feedback, implemented in IMPROVE framework for object classification pipelines.
Details
Motivation: Existing LLM agents optimize entire ML pipelines in single steps, making it hard to attribute improvements to specific changes, leading to unstable optimization and slower convergence.
Method: Introduces Iterative Refinement strategy inspired by human ML experts - focuses on one component at a time rather than sweeping changes. Implemented in IMPROVE framework for object classification pipelines with theoretical evidence.
Result: Extensive evaluations across datasets show Iterative Refinement enables IMPROVE to consistently achieve better performance over existing zero-shot LLM-based approaches.
Conclusion: Iterative Refinement strategy provides more granular optimization of ML pipelines, leading to improved performance and convergence compared to single-step optimization approaches.
Abstract: Large language model (LLM) agents have emerged as a promising solution to automate the workflow of machine learning, but most existing methods share a common limitation: they attempt to optimize entire pipelines in a single step before evaluation, making it difficult to attribute improvements to specific changes. This lack of granularity leads to unstable optimization and slower convergence, limiting their effectiveness. To address this, we introduce Iterative Refinement, a novel strategy for LLM-driven ML pipeline design inspired by how human ML experts iteratively refine models, focusing on one component at a time rather than making sweeping changes all at once. By systematically updating individual components based on real training feedback, Iterative Refinement improves overall model performance. We also provide some theoretical evidence of the superior properties of Iterative Refinement. Further, we implement this strategy in IMPROVE, an end-to-end LLM agent framework for automating and optimizing object classification pipelines. Through extensive evaluations across datasets of varying sizes and domains, we demonstrate that Iterative Refinement enables IMPROVE to consistently achieve better performance over existing zero-shot LLM-based approaches.
[217] Semantic-ICP: Iterative Closest Point for Non-rigid Multi-Organ Point Cloud Registration
Wanwen Chen, Carson Studders, Jamie J. Y. Kwon, Emily H. T. Pang, Eitan Prisman, Septimiu E. Salcudean
Main category: cs.CV
TL;DR: SemICP is a semantic ICP method that incorporates anatomical labels and biomechanical energy constraints to improve point cloud registration in clinical applications.
Details
Motivation: Learning-based registration methods face generalizability and explainability issues, so classical ICP remains widely used in clinical applications. However, classical ICP ignores both the semantic meaning of anatomical points and the biomechanical constraints on the deformation.
Method: Novel semantic ICP method that handles multiple point labels, uses semantic labels to improve closest point matching robustness, and employs a novel point cloud deformation representation with linear elastic energy regularization based on biomechanical constraints.
Result: Experiments on trans-oral robotic surgery ultrasound-CT registration and two public Learn2reg challenge datasets show improved Hausdorff distance and mean surface distance compared to other point-matching-based registration methods.
Conclusion: The proposed SemICP method successfully addresses limitations of classical ICP by incorporating semantic information and biomechanical constraints, demonstrating superior performance in clinical point cloud registration tasks.
Abstract: Point cloud registration is important in computer-aided interventions (CAI). While learning-based point cloud registration methods have been developed, their clinical application is hampered by issues of generalizability and explainability. Therefore, classical point cloud registration methods, such as Iterative Closest Point (ICP), are still widely applied in CAI. ICP methods fail to consider that: (1) the points have well-defined semantic meaning, in that each point can be related to a specific anatomical label; (2) the deformation required for registration needs to follow biomechanical energy constraints. In this paper, we present a novel semantic ICP (SemICP) method that handles multiple point labels and uses linear elastic energy regularization. We use semantic labels to improve the robustness of the closest point matching and propose a novel point cloud deformation representation to apply explicit biomechanical energy regularization. Our experiments on a trans-oral robotic surgery ultrasound-computed tomography registration dataset and two public Learn2reg challenge datasets show that our method improves the Hausdorff distance and mean surface distance compared with other point-matching-based registration methods.
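To make the idea of label-aware closest point matching concrete, here is a minimal sketch in which each source point may only be matched to target points carrying the same anatomical label. It covers the matching step only and omits the deformation model and linear elastic regularization; the function name `semantic_closest_points` and the brute-force distance computation are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def semantic_closest_points(source, source_labels, target, target_labels):
    """Index of the nearest same-label target point for each source point.

    Returns -1 for source points whose label does not occur in the target.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    source_labels = np.asarray(source_labels)
    target_labels = np.asarray(target_labels)

    matches = np.full(len(source), -1, dtype=int)
    for label in np.unique(source_labels):
        src_idx = np.flatnonzero(source_labels == label)
        tgt_idx = np.flatnonzero(target_labels == label)
        if tgt_idx.size == 0:
            continue  # no candidate with this label; leave as unmatched
        # Pairwise squared distances, restricted to points sharing this label.
        diff = source[src_idx][:, None, :] - target[tgt_idx][None, :, :]
        d2 = np.sum(diff ** 2, axis=-1)
        matches[src_idx] = tgt_idx[np.argmin(d2, axis=1)]
    return matches
```

For example, a source point labeled as one organ is never matched to a geometrically closer target point that belongs to a different organ, which is the kind of mismatch unlabeled ICP can produce.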
[218] Neuro Symbolic Knowledge Reasoning for Procedural Video Question Answering
Thanh-Son Nguyen, Hong Yang, Tzeh Yuan Neoh, Hao Zhang, Ee Yeo Keat, Basura Fernando
Main category: cs.CV
TL;DR: PKR-QA is a new benchmark for procedural knowledge reasoning QA, built using a semi-automatically constructed procedural knowledge graph from instructional videos and enriched with commonsense knowledge, with a neurosymbolic approach for interpretable reasoning.
Details
Motivation: To address the need for structured reasoning over procedural tasks that require understanding step-by-step processes and procedural knowledge across diverse domains.
Method: Semi-automatic construction of procedural knowledge graph (PKG) using COIN instructional video dataset, ConceptNet commonsense knowledge, and LLM outputs with manual verification. Question-answer generation via graph traversal templates. Neurosymbolic Knowledge Module Learning (KML) approach combining neural modules and LLMs for structured reasoning.
Result: The proposed paradigm improves reasoning performance on PKR-QA benchmark and enables step-by-step reasoning traces that facilitate interpretability.
Conclusion: PKR-QA provides a valuable benchmark for procedural knowledge reasoning, and the neurosymbolic KML approach effectively handles structured reasoning with improved interpretability through step-by-step reasoning traces.
Abstract: We introduce PKR-QA (Procedural Knowledge Reasoning Question Answering), a new benchmark for question answering over procedural tasks that require structured reasoning. PKR-QA is constructed semi-automatically using a procedural knowledge graph (PKG), which encodes task-specific knowledge across diverse domains. The PKG is built by curating and linking information from the COIN instructional video dataset and the ontology, enriched with commonsense knowledge from ConceptNet and structured outputs from Large Language Models (LLMs), followed by manual verification. To generate question-answer pairs, we design graph traversal templates where each template is applied systematically over PKG. To enable interpretable reasoning, we propose a neurosymbolic approach called Knowledge Module Learning (KML), which learns procedural relations via neural modules and composes them for structured reasoning with LLMs. Experiments demonstrate that this paradigm improves reasoning performance on PKR-QA and enables step-by-step reasoning traces that facilitate interpretability. Code and dataset will be released soon https://github.com/LUNAProject22/KML.
[219] HierRelTriple: Guiding Indoor Layout Generation with Hierarchical Relationship Triplet Losses
Kaifan Sun, Bingchen Yang, Peter Wonka, Jun Xiao, Haiyong Jiang
Main category: cs.CV
TL;DR: HierRelTriple is a hierarchical triplet-based method for indoor spatial relationship learning that captures multi-object relationships through three levels (O2R, O2O, C2C) using geometric triplets and Delaunay Triangulation, improving spatial realism and reducing collisions.
Details
Motivation: Existing approaches rely on manual spatial rules or simplified pairwise representations, which fail to capture complex multi-object relationships and lead to overcrowded or physically implausible arrangements in indoor scenes.
Method: Proposes HierRelTriple framework that partitions functional regions and extracts three spatial relationship levels (object-to-region, object-to-object, corner-to-corner) as geometric triplets. Uses Delaunay Triangulation for spatial priors and integrates IoU loss between denoised and ground truth triplets into diffusion denoising process.
Result: Extensive experiments show HierRelTriple improves spatial-relation metrics by over 15% and substantially reduces collisions and boundary violations compared to state-of-the-art methods in unconditional layout synthesis, floorplan-conditioned generation, and scene rearrangement.
Conclusion: The hierarchical triplet-based approach with joint formulation of distances, orientations, and spatial relationships significantly enhances physical realism of generated indoor scenes and outperforms existing methods in spatial relationship modeling.
Abstract: We present a hierarchical triplet-based indoor relationship learning method, coined HierRelTriple, with a focus on spatial relationship learning. Existing approaches often depend on manually defined spatial rules or simplified pairwise representations, which fail to capture complex, multi-object relationships found in real scenarios and lead to overcrowded or physically implausible arrangements. We introduce HierRelTriple, a hierarchical relational triplets modeling framework that first partitions functional regions and then automatically extracts three levels of spatial relationships: object-to-region (O2R), object-to-object (O2O), and corner-to-corner (C2C). By representing these relationships as geometric triplets and employing approaches based on Delaunay Triangulation to establish spatial priors, we derive IoU loss between denoised and ground truth triplets and integrate them seamlessly into the diffusion denoising process. The introduction of the joint formulation of inter-object distances, angular orientations, and spatial relationships enhances the physical realism of the generated scenes. Extensive experiments on unconditional layout synthesis, floorplan-conditioned layout generation, and scene rearrangement demonstrate that HierRelTriple improves spatial-relation metrics by over 15% and substantially reduces collisions and boundary violations compared to state-of-the-art methods.
[220] HoloDx: Knowledge- and Data-Driven Multimodal Diagnosis of Alzheimer’s Disease
Qiuhui Chen, Jintao Wang, Gang Wang, Yi Hong
Main category: cs.CV
TL;DR: HoloDx is a knowledge- and data-driven framework that improves Alzheimer’s disease diagnosis by dynamically integrating domain knowledge from LLMs and clinical expertise with multimodal clinical data through specialized attention mechanisms.
Details
Motivation: Existing AD diagnosis methods struggle to fully utilize multimodal information and lack structured mechanisms to incorporate dynamic domain knowledge, limiting their effectiveness and interpretability.
Method: Proposes HoloDx framework with: 1) Knowledge injection module using knowledge-aware gated cross-attention to integrate domain-specific insights from LLMs and clinical expertise, 2) Memory injection module with prototypical memory attention to retain and retrieve subject-specific information for consistent decision-making.
Result: Evaluations on five AD datasets show HoloDx outperforms state-of-the-art methods, achieving superior diagnostic accuracy and strong generalization across diverse cohorts.
Conclusion: HoloDx effectively aligns prior knowledge with current subject data, enhancing interpretability, improving robustness, and demonstrating superior performance in AD diagnosis compared to existing approaches.
Abstract: Accurate diagnosis of Alzheimer’s disease (AD) requires effectively integrating multimodal data and clinical expertise. However, existing methods often struggle to fully utilize multimodal information and lack structured mechanisms to incorporate dynamic domain knowledge. To address these limitations, we propose HoloDx, a knowledge- and data-driven framework that enhances AD diagnosis by aligning domain knowledge with multimodal clinical data. HoloDx incorporates a knowledge injection module with a knowledge-aware gated cross-attention, allowing the model to dynamically integrate domain-specific insights from both large language models (LLMs) and clinical expertise. Also, a memory injection module with a designed prototypical memory attention enables the model to retain and retrieve subject-specific information, ensuring consistency in decision-making. By jointly leveraging these mechanisms, HoloDx enhances interpretability, improves robustness, and effectively aligns prior knowledge with current subject data. Evaluations on five AD datasets demonstrate that HoloDx outperforms state-of-the-art methods, achieving superior diagnostic accuracy and strong generalization across diverse cohorts. The source code will be released upon publication acceptance.
[221] Cross-Image Contrastive Decoding: Precise, Lossless Suppression of Language Priors in Large Vision-Language Models
Jianfei Zhao, Feng Zhang, Xin Sun, Lingxing Kong, Zhixing Tan, Chong Feng
Main category: cs.CV
TL;DR: CICD is a training-free method that uses unrelated images as contrastive inputs to reduce hallucinations in LVLMs by selectively suppressing language priors without compromising response quality.
Details
Motivation: Over-reliance on language priors causes hallucinations in LVLMs, leading to linguistically plausible but visually inconsistent outputs. Existing contrastive methods using perturbed images result in distorted distributions and excessive language prior suppression.
Method: Cross-Image Contrastive Decoding (CICD) uses unrelated images as contrastive visual inputs with a dynamic selection mechanism based on cross-image differences to selectively suppress language priors.
Result: Extensive experiments across multiple benchmarks and LVLMs confirm CICD’s effectiveness and generalizability, particularly in image captioning where language priors are dominant.
Conclusion: CICD effectively reduces hallucinations in LVLMs by using unrelated images for contrastive decoding and dynamic selective suppression of language priors, maintaining response quality while improving visual consistency.
Abstract: Over-reliance on language priors is a major cause of hallucinations in Large Vision-Language Models (LVLMs), often leading to outputs that are linguistically plausible but visually inconsistent. Recent studies have explored contrastive decoding as a training-free solution. However, these methods typically construct contrastive visual inputs by perturbing the original image, resulting in distorted contrastive distributions, incomplete contrastive signals, and excessive suppression of language priors. Motivated by the observation that language priors tend to remain consistent across different images, we propose Cross-Image Contrastive Decoding (CICD), a simple yet effective training-free method that uses unrelated images as contrastive visual inputs. To address the issue of over-suppressing language priors, which can negatively affect the quality of generated responses, we further introduce a dynamic selection mechanism based on the cross-image differences in model behavior. By selectively suppressing language priors, our method reduces hallucinations without compromising the model’s performance. Extensive experiments across multiple benchmarks and LVLMs confirm the effectiveness and generalizability of CICD, particularly in image captioning, where language priors are especially dominant.
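The core contrast can be sketched as follows, assuming the commonly used contrastive decoding rule in which next-token scores are (1 + alpha) times the log-probabilities conditioned on the original image minus alpha times those conditioned on an unrelated image. The exact scoring rule and the dynamic selection mechanism are not specified in this summary, so the formula, the `alpha` parameter, and the function names are assumptions for illustration only.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())


def contrastive_next_token(logits_original, logits_unrelated, alpha=1.0):
    """Pick the next token by contrasting two image-conditioned distributions.

    Tokens that are likely under BOTH images (a language prior) are pushed
    down; tokens that are likely only with the original image are pushed up.
    """
    lp_orig = log_softmax(logits_original)
    lp_other = log_softmax(logits_unrelated)
    scores = (1.0 + alpha) * lp_orig - alpha * lp_other
    return int(np.argmax(scores))
```

Setting `alpha=0.0` recovers ordinary greedy decoding on the original image, which gives a simple baseline for checking how much the contrast changes the output.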
[222] Think Before You Diffuse: LLMs-Guided Physics-Aware Video Generation
Ke Zhang, Cihan Xiao, Jiacong Xu, Yiqun Mei, Vishal M. Patel
Main category: cs.CV
TL;DR: DiffPhy is a framework that enhances video diffusion models to generate physically-correct videos by using LLMs to reason about physical context from text prompts and incorporating it through novel training objectives.
Details
Motivation: Current video diffusion models struggle with synthesizing correct physical effects in generated videos due to the complexity of real-world motions and dynamics, making physics learning from data challenging.
Method: Leverages LLMs to explicitly reason about the physical context of text prompts, uses an MLLM as supervisory signal, introduces novel training objectives for physical correctness and semantic consistency, and creates a high-quality physical video dataset for fine-tuning.
Result: Extensive experiments on public benchmarks show that DiffPhy produces state-of-the-art results across diverse physics-related scenarios.
Conclusion: DiffPhy successfully enables physically-correct and photo-realistic video generation by fine-tuning pre-trained video diffusion models with explicit physical reasoning from LLMs.
Abstract: Recent video diffusion models have demonstrated their great capability in generating visually-pleasing results, while synthesizing the correct physical effects in generated videos remains challenging. The complexity of real-world motions, interactions, and dynamics introduces great difficulties when learning physics from data. In this work, we propose DiffPhy, a generic framework that enables physically-correct and photo-realistic video generation by fine-tuning a pre-trained video diffusion model. Our method leverages large language models (LLMs) to explicitly reason about a comprehensive physical context from the text prompt and use it to guide the generation. To incorporate physical context into the diffusion model, we leverage a multimodal large language model (MLLM) as a supervisory signal and introduce a set of novel training objectives that jointly enforce physical correctness and semantic consistency with the input text. We also establish a high-quality physical video dataset containing diverse physical actions and events to facilitate effective finetuning. Extensive experiments on public benchmarks demonstrate that DiffPhy is able to produce state-of-the-art results across diverse physics-related scenarios. Our project page is available at https://bwgzk-keke.github.io/DiffPhy/
[223] WorldExplorer: Towards Generating Fully Navigable 3D Scenes
Manuel-Andreas Schneider, Lukas Höllein, Matthias Nießner
Main category: cs.CV
TL;DR: WorldExplorer is a novel method for generating fully navigable 3D scenes from text using autoregressive video trajectory generation and 3D Gaussian Splatting, enabling stable exploration from multiple viewpoints without artifacts.
Details
Motivation: Existing text-to-3D generation methods produce stretched-out and noisy artifacts when moving beyond central or panoramic perspectives, limiting scene exploration capabilities.
Method: The approach initializes with 360° panorama images, then uses video diffusion models to generate multiple videos along predefined trajectories. It employs scene memory for conditioning on prior views and collision-detection to prevent degenerate results, finally fusing views via 3D Gaussian Splatting optimization.
Result: WorldExplorer produces high-quality scenes that remain stable under large camera motion, enabling realistic and unrestricted exploration without the artifacts present in prior approaches.
Conclusion: This represents a significant step toward generating immersive and truly explorable virtual 3D environments from text descriptions.
Abstract: Generating 3D worlds from text is a highly anticipated goal in computer vision. Existing works are limited by the degree of exploration they allow inside a scene, i.e., they produce stretched-out and noisy artifacts when moving beyond central or panoramic perspectives. To this end, we propose WorldExplorer, a novel method based on autoregressive video trajectory generation, which builds fully navigable 3D scenes with consistent visual quality across a wide range of viewpoints. We initialize our scenes by creating multi-view consistent images corresponding to a 360 degree panorama. Then, we expand it by leveraging video diffusion models in an iterative scene generation pipeline. Concretely, we generate multiple videos along short, pre-defined trajectories that explore the scene in depth, including motion around objects. Our novel scene memory conditions each video on the most relevant prior views, while a collision-detection mechanism prevents degenerate results, like moving into objects. Finally, we fuse all generated views into a unified 3D representation via 3D Gaussian Splatting optimization. Compared to prior approaches, WorldExplorer produces high-quality scenes that remain stable under large camera motion, enabling for the first time realistic and unrestricted exploration. We believe this marks a significant step toward generating immersive and truly explorable virtual 3D environments.
[224] MedEBench: Diagnosing Reliability in Text-Guided Medical Image Editing
Minghao Liu, Zhitao He, Zhiyuan Fan, Qingyun Wang, Yi R. Fung
Main category: cs.CV
TL;DR: MedEBench is a new benchmark for evaluating text-guided medical image editing, featuring 1,182 clinically curated image-prompt pairs across 70 tasks and 13 anatomical regions, with comprehensive evaluation metrics and diagnostic error analysis.
Details
Motivation: Text-guided image editing has advanced in natural images but lacks standardized evaluation in medical imaging, despite its potential to revolutionize clinical practices through personalized surgical planning, medical education, and patient communication.
Method: Developed MedEBench benchmark with clinically curated dataset, evaluation framework measuring Editing Accuracy, Context Preservation, and Visual Quality, plus diagnostic error analysis using attention alignment and IoU between model attention maps and ROI masks.
Result: Comprehensive comparison of seven state-of-the-art models revealed consistent failure patterns and mislocalization issues where models focus on incorrect anatomical regions.
Conclusion: MedEBench establishes a foundation for developing more reliable and clinically effective text-guided medical image editing tools by providing standardized evaluation and diagnostic capabilities.
Abstract: Text-guided image editing has seen significant progress in natural image domains, but its application in medical imaging remains limited and lacks standardized evaluation frameworks. Such editing could revolutionize clinical practices by enabling personalized surgical planning, enhancing medical education, and improving patient communication. To bridge this gap, we introduce MedEBench, a robust benchmark designed to diagnose reliability in text-guided medical image editing. MedEBench consists of 1,182 clinically curated image-prompt pairs covering 70 distinct editing tasks and 13 anatomical regions. It contributes in three key areas: (1) a clinically grounded evaluation framework that measures Editing Accuracy, Context Preservation, and Visual Quality, complemented by detailed descriptions of intended edits and corresponding Region-of-Interest (ROI) masks; (2) a comprehensive comparison of seven state-of-the-art models, revealing consistent patterns of failure; and (3) a diagnostic error analysis technique that leverages attention alignment, using Intersection-over-Union (IoU) between model attention maps and ROI masks to identify mislocalization issues, where models erroneously focus on incorrect anatomical regions. MedEBench sets the stage for developing more reliable and clinically effective text-guided medical image editing tools.
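The attention-alignment check described above compares where a model attends with where the edit should happen. The following is a minimal sketch of such an IoU score, not the benchmark's own code: the function name, the min-max normalization, and the 0.5 threshold used to binarize the attention map are all assumptions made for illustration.

```python
import numpy as np

def attention_roi_iou(attention, roi_mask, threshold=0.5):
    """IoU between a thresholded attention map and a binary ROI mask.

    The normalization and threshold are illustrative assumptions, not
    values taken from the MedEBench paper.
    """
    attention = np.asarray(attention, dtype=float)
    roi = np.asarray(roi_mask, dtype=bool)
    # Min-max normalize to [0, 1]; a constant map is treated as "no attention".
    span = attention.max() - attention.min()
    if span > 0:
        norm = (attention - attention.min()) / span
    else:
        norm = np.zeros_like(attention)
    attended = norm >= threshold
    union = np.logical_or(attended, roi).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(attended, roi).sum() / union)

# Toy 2x2 example: attention covers the top row, the ROI is the top-left pixel.
attn = [[0.9, 0.8], [0.1, 0.0]]
roi = [[1, 0], [0, 0]]
score = attention_roi_iou(attn, roi)
```

A low score on a sample whose edit failed would point to mislocalization, that is, attention landing on the wrong anatomical region.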
[225] AMF-MedIT: An Efficient Align-Modulation-Fusion Framework for Medical Image-Tabular Data
Congjing Yu, Jing Ye, Yang Liu, Xiaodong Zhang, Zhiyong Zhang
Main category: cs.CV
TL;DR: AMF-MedIT is an efficient multimodal fusion framework that integrates medical images and tabular data using adaptive modulation and fusion to handle dimension discrepancies and dynamic modality balancing, with a specialized tabular encoder for noisy medical data.
Details
Motivation: Multimodal medical analysis faces challenges with cross-modal feature dimension discrepancies, varying modality contributions, and noise from high-dimensional tabular data, especially in data-scarce conditions.
Method: Proposes AMF-MedIT framework with Adaptive Modulation and Fusion module for dimension harmonization and dynamic modality balancing, plus FT-Mamba tabular encoder with selective mechanism for noisy data handling. Uses self-supervised learning, feature masks, and magnitude/leakage losses.
Result: Achieves superior accuracy, robustness, and data efficiency across multimodal classification tasks. Demonstrates effectiveness in handling clinical noise scenarios and enhances interpretability of multimodal pretraining.
Conclusion: The framework provides reliable and efficient clinical AI applications by effectively integrating medical images and tabular data while addressing key challenges in multimodal fusion under data-scarce conditions.
Abstract: Multimodal medical analysis combining image and tabular data has gained increasing attention. However, effective fusion remains challenging due to cross-modal discrepancies in feature dimensions and modality contributions, as well as the noise from high-dimensional tabular inputs. To address these problems, we present AMF-MedIT, an efficient Align-Modulation-Fusion framework for medical image and tabular data integration, particularly under data-scarce conditions. Built upon a self-supervised learning strategy, we introduce the Adaptive Modulation and Fusion (AMF) module, a novel, streamlined fusion paradigm that harmonizes dimension discrepancies and dynamically balances modality contributions. It integrates prior knowledge to guide the allocation of modality contributions in the fusion and employs feature masks together with magnitude and leakage losses to adjust the dimensionality and magnitude of unimodal features. Additionally, we develop FT-Mamba, a powerful tabular encoder leveraging a selective mechanism to handle noisy medical tabular data efficiently. Extensive experiments, including simulations of clinical noise, demonstrate that AMF-MedIT achieves superior accuracy, robustness, and data efficiency across multimodal classification tasks. Interpretability analyses further reveal how FT-Mamba shapes multimodal pretraining and enhances the image encoder’s attention, highlighting the practical value of our framework for reliable and efficient clinical artificial intelligence applications.
[226] Taming Anomalies with Down-Up Sampling Networks: Group Center Preserving Reconstruction for 3D Anomaly Detection
Hanzhe Liang, Jie Zhang, Tao Dai, Linlin Shen, Jinbao Wang, Can Gao
Main category: cs.CV
TL;DR: DUS-Net: A reconstruction-based 3D anomaly detection method using down-up sampling with noise generation and group center preservation to handle high-precision point clouds.
Details
Motivation: Existing reconstruction-based methods struggle with high-precision point clouds due to large scale and complex structure, requiring better geometric preservation.
Method: Proposes Down-Up Sampling Network (DUS-Net) with Noise Generation module, Down-sampling Network for anomaly-free center point cloud, and Up-sampling Network with multi-scale feature fusion.
Result: Achieves SOTA performance: Object-level AUROC 79.9% (Real3D-AD) and 79.5% (Anomaly-ShapeNet), Point-level AUROC 71.2% and 84.7% respectively.
Conclusion: DUS-Net effectively reconstructs high-precision point clouds for 3D anomaly detection by preserving geometric structure through group center construction and noise-augmented training.
Abstract: Reconstruction-based methods have demonstrated very promising results for 3D anomaly detection. However, these methods face great challenges in handling high-precision point clouds due to the large scale and complex structure. In this study, a Down-Up Sampling Network (DUS-Net) is proposed to reconstruct high-precision point clouds for 3D anomaly detection by preserving the group center geometric structure. The DUS-Net first introduces a Noise Generation module to generate noisy patches, which facilitates the diversity of training data and strengthens the feature representation for reconstruction. Then, a Down-sampling Network (Down-Net) is developed to learn an anomaly-free center point cloud from patches with noise injection. Subsequently, an Up-sampling Network (Up-Net) is designed to reconstruct high-precision point clouds by fusing multi-scale up-sampling features. Our method leverages group centers for construction, enabling the preservation of geometric structure and providing a more precise point cloud. Extensive experiments demonstrate the effectiveness of our proposed method, achieving state-of-the-art (SOTA) performance with an Object-level AUROC of 79.9% and 79.5%, and a Point-level AUROC of 71.2% and 84.7% on the Real3D-AD and Anomaly-ShapeNet datasets, respectively.
[227] ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way
Rajarshi Roy, Devleena Das, Ankesh Banerjee, Arjya Bhattacharjee, Kousik Dasgupta, Subarna Tripathi
Main category: cs.CV
TL;DR: ByDeWay is a training-free framework that enhances MLLM performance through depth-based scene segmentation and prompting, reducing hallucinations and improving spatial reasoning without parameter changes.
Details
Motivation: Multimodal Large Language Models often struggle with spatial reasoning and grounding, leading to hallucinations. Existing methods require training or parameter modifications, which limits flexibility and compatibility with black-box models.
Method: Uses Layered-Depth-Based Prompting (LDP) with monocular depth estimation to segment scenes into closest, mid-range, and farthest layers. Generates region-specific captions with a grounded vision-language model and appends these depth-aware captions to image-question prompts.
Result: Consistent improvements on hallucination-sensitive (POPE) and reasoning-intensive (GQA) benchmarks across multiple MLLMs. The method reduces hallucinations and enhances spatial reasoning in a zero-training setting.
Conclusion: Depth-aware prompting through scene segmentation effectively improves MLLM performance without training. The framework is lightweight, modular, and compatible with black-box models, demonstrating the value of spatial context enrichment for multimodal reasoning.
Abstract: We introduce ByDeWay, a training-free framework designed to enhance the performance of Multimodal Large Language Models (MLLMs). ByDeWay uses a novel prompting strategy called Layered-Depth-Based Prompting (LDP), which improves spatial reasoning and grounding without modifying any model parameters. It segments the scene into closest, mid-range, and farthest layers using monocular depth estimation, then generates region-specific captions with a grounded vision-language model. These structured, depth-aware captions are appended to the image-question prompt, enriching it with spatial context. This guides MLLMs to produce more grounded and less hallucinated responses. Our method is lightweight, modular, and compatible with black-box MLLMs. Experiments on hallucination-sensitive (POPE) and reasoning-intensive (GQA) benchmarks show consistent improvements across multiple MLLMs, validating the effectiveness of depth-aware prompting in a zero-training setting.
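The layering and prompt-assembly steps described above can be sketched as follows. This is only an illustration under stated assumptions: the depth map is taken as given (no depth estimator or captioning model is called), smaller values are assumed to mean closer, the layers are split at the one-third and two-thirds depth quantiles, and the function names and prompt layout are hypothetical rather than the paper's.

```python
import numpy as np

def split_depth_layers(depth):
    """Split a depth map into three boolean masks by depth terciles.

    Assumes smaller depth values are closer to the camera; the tercile
    split is an illustrative choice, not the paper's stated rule.
    """
    depth = np.asarray(depth, dtype=float)
    lo, hi = np.quantile(depth, [1 / 3, 2 / 3])
    closest = depth <= lo
    farthest = depth > hi
    mid = ~(closest | farthest)
    return {"closest": closest, "mid-range": mid, "farthest": farthest}

def build_prompt(question, layer_captions):
    """Append one caption line per depth layer to the question (hypothetical layout)."""
    lines = [f"{name}: {caption}" for name, caption in layer_captions.items()]
    return "\n".join(lines) + "\n" + f"Question: {question}"

# Toy depth map; the captions would come from a captioning model in practice.
layers = split_depth_layers(np.arange(9).reshape(3, 3))
prompt = build_prompt(
    "What is behind the table?",
    {"closest": "a wooden table", "mid-range": "two chairs", "farthest": "a window"},
)
```

Because the captions are plain text added to the prompt, this style of enrichment needs no access to model weights, which is what makes it usable with black-box MLLMs.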
[228] SlumpGuard: An AI-Powered Real-Time System for Automated Concrete Slump Prediction via Video Analysis
Youngmin Kim, Giyeong Oh, Kwangsoo Youm, Youngjae Yu
Main category: cs.CV
TL;DR: AI-powered video system automates concrete workability assessment by analyzing flow from truck chute, replacing manual slump tests for real-time quality control.
Details
Motivation: Traditional slump testing is manual, time-consuming, and inconsistent, limiting real-time monitoring capabilities for concrete workability assessment on construction sites.
Method: Developed SlumpGuard - an AI-powered video-based system that automatically analyzes concrete flow from truck chute to assess workability without manual intervention.
Result: System enables full-batch inspection and improves both accuracy and efficiency of quality control, with empirical results from real-world deployment demonstrating effectiveness.
Conclusion: SlumpGuard provides a practical solution for modern concrete quality assurance, offering automated real-time workability assessment that overcomes limitations of traditional manual testing methods.
Abstract: Concrete workability is essential for construction quality, with the slump test being the most common on-site method for its assessment. However, traditional slump testing is manual, time-consuming, and prone to inconsistency, limiting its applicability for real-time monitoring. To address these challenges, we propose SlumpGuard, an AI-powered, video-based system that automatically analyzes concrete flow from the truck chute to assess workability in real time. Our system enables full-batch inspection without manual intervention, improving both the accuracy and efficiency of quality control. We present the system design, the construction of a dedicated dataset, and empirical results from real-world deployment, demonstrating the effectiveness of SlumpGuard as a practical solution for modern concrete quality assurance.
[229] Test-Time Canonicalization by Foundation Models for Robust Perception
Utkarsh Singhal, Ryan Feng, Stella X. Yu, Atul Prakash
Main category: cs.CV
TL;DR: FOCAL is a test-time robustness framework that transforms inputs into optimal views using foundation model priors, improving robustness to various transformations without retraining.
Details
Motivation: Existing approaches rely on specialized architectures or predefined data augmentations, limiting adaptability to diverse real-world viewing conditions.
Method: At inference time, FOCAL explores transformed images and selects the one with highest likelihood under foundation model priors, inspired by human mental rotation.
Result: Applied to CLIP and SAM models, FOCAL significantly boosts robustness across 2D/3D rotations, contrast/lighting shifts, and day-night changes.
Conclusion: FOCAL reframes invariance as test-time optimization, offering a general and scalable approach to robustness without architectural changes.
Abstract: Perception in the real world requires robustness to diverse viewing conditions. Existing approaches often rely on specialized architectures or training with predefined data augmentations, limiting adaptability. Taking inspiration from mental rotation in human vision, we propose FOCAL, a test-time robustness framework that transforms the input into the most typical view. At inference time, FOCAL explores a set of transformed images and chooses the one with the highest likelihood under foundation model priors. This test-time optimization boosts robustness while requiring no retraining or architectural changes. Applied to models like CLIP and SAM, it significantly boosts robustness across a wide range of transformations, including 2D and 3D rotations, contrast and lighting shifts, and day-night changes. We also explore potential applications in active vision. By reframing invariance as a test-time optimization problem, FOCAL offers a general and scalable approach to robustness. Our code is available at: https://github.com/sutkarsh/focal.
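The select-the-most-typical-view idea can be illustrated with a small search over candidate transforms. This sketch is not the FOCAL implementation: the `canonicalize` helper is hypothetical, the candidate set is limited to four 90-degree rotations, and the scoring function is a toy stand-in for a likelihood computed from a foundation model.

```python
import numpy as np

def canonicalize(image, transforms, score_fn):
    """Return the transformed image with the highest score, and its index.

    score_fn stands in for a foundation-model likelihood; here it is
    any callable mapping an image to a float.
    """
    candidates = [t(image) for t in transforms]
    scores = [score_fn(c) for c in candidates]
    best = int(np.argmax(scores))
    return candidates[best], best

# Candidate views: rotations by 0, 90, 180 and 270 degrees.
rotations = [lambda x, k=k: np.rot90(x, k) for k in range(4)]

# Toy prior: "typical" images are bright along the top row.
def toy_score(x):
    return float(x[0].sum())

image = np.array([[0, 0], [1, 1]])  # bright row is at the bottom
canonical, index = canonicalize(image, rotations, toy_score)
```

In this toy case the 180-degree rotation wins because it moves the bright row to the top. A downstream model would then be run on `canonical` instead of the raw input.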
[230] FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning
Jiajun Cao, Qizhe Zhang, Peidong Jia, Xuhui Zhao, Bo Lan, Xiaoan Zhang, Zhuo Li, Xiaobao Wei, Sixiang Chen, Liyun Li, Xianming Liu, Ming Lu, Yang Wang, Shanghang Zhang
Main category: cs.CV
TL;DR: FastDriveVLA is a novel reconstruction-based visual token pruning framework for autonomous driving VLAs that prioritizes foreground information through MAE-style pixel reconstruction, achieving SOTA performance with reduced computational costs.
Details
Motivation: Current visual token pruning methods in VLMs perform poorly in autonomous driving scenarios. Human drivers focus on relevant foreground areas, so retaining foreground visual tokens is essential for effective decision-making while reducing computational costs.
Method: Proposes FastDriveVLA with ReconPruner - a plug-and-play visual token pruner that uses MAE-style pixel reconstruction with adversarial foreground-background reconstruction strategy. Trained on nuScenes-FG dataset (241K image-mask pairs with annotated foreground regions).
Result: Achieves state-of-the-art results on nuScenes open-loop planning benchmark across different pruning ratios. Once trained, ReconPruner can be seamlessly applied to different VLA models with the same visual encoder without retraining.
Conclusion: The reconstruction-based approach effectively prioritizes foreground information for autonomous driving VLAs, significantly reducing computational costs while maintaining performance, demonstrating the importance of foreground-focused token pruning in driving scenarios.
Abstract: Vision-Language-Action (VLA) models have demonstrated significant potential in complex scene understanding and action reasoning, leading to their increasing adoption in end-to-end autonomous driving systems. However, the long visual token sequences of VLA models greatly increase computational costs. Current visual token pruning methods in Vision-Language Models (VLMs) rely on either visual token similarity or visual-text attention, but both have shown poor performance in autonomous driving scenarios. Given that human drivers concentrate on relevant foreground areas while driving, we assert that retaining visual tokens containing this foreground information is essential for effective decision-making. Inspired by this, we propose FastDriveVLA, a novel reconstruction-based vision token pruning framework designed specifically for autonomous driving. FastDriveVLA includes a plug-and-play visual token pruner called ReconPruner, which prioritizes foreground information through MAE-style pixel reconstruction. A novel adversarial foreground-background reconstruction strategy is designed to train ReconPruner for the visual encoder of VLA models. Once trained, ReconPruner can be seamlessly applied to different VLA models with the same visual encoder without retraining. To train ReconPruner, we also introduce a large-scale dataset called nuScenes-FG, consisting of 241K image-mask pairs with annotated foreground regions. Our approach achieves state-of-the-art results on the nuScenes open-loop planning benchmark across different pruning ratios.
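Once a pruner has assigned each visual token a saliency score, the pruning step itself reduces to keeping the highest-scoring tokens at a chosen ratio. The sketch below shows only that selection step under assumptions: the scores are supplied directly (ReconPruner's reconstruction-based scoring is not reproduced), and `prune_tokens` is a hypothetical helper name.

```python
import numpy as np

def prune_tokens(tokens, scores, prune_ratio):
    """Keep the highest-scoring tokens, preserving their original order.

    tokens: array of shape (num_tokens, dim); scores: one float per token.
    The scores are assumed to come from a separately trained pruner.
    """
    tokens = np.asarray(tokens)
    scores = np.asarray(scores, dtype=float)
    n_keep = max(1, int(round(tokens.shape[0] * (1.0 - prune_ratio))))
    # Indices of the n_keep largest scores, re-sorted into sequence order.
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return tokens[keep], keep

tokens = np.arange(8).reshape(4, 2)    # four toy tokens of dimension 2
scores = [0.1, 0.9, 0.3, 0.8]          # hypothetical foreground saliency
kept, kept_idx = prune_tokens(tokens, scores, prune_ratio=0.5)
```

Keeping the surviving tokens in their original order matters when the downstream model relies on positional information.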
[231] Sample-Aware Test-Time Adaptation for Medical Image-to-Image Translation
Irene Iele, Francesco Di Feola, Valerio Guarrasi, Paolo Soda
Main category: cs.CV
TL;DR: A novel Test-Time Adaptation framework for medical image translation that dynamically adjusts to out-of-distribution samples using reconstruction-based domain shift quantification and selective feature modification.
Details
Motivation: Current image-to-image translation methods suffer performance degradation with out-of-distribution samples, limiting their real-world applicability in medical imaging tasks like CT denoising and MRI conversion.
Method: Proposes a TTA framework with Reconstruction Module to quantify domain shift and Dynamic Adaptation Block that selectively modifies pretrained model features only for samples requiring adaptation.
Result: Shows consistent improvements over baseline translation models and prior TTA methods on low-dose CT denoising and T1 to T2 MRI translation tasks.
Conclusion: Dynamic, sample-specific adaptation outperforms uniform adaptation approaches, offering improved model resilience for real-world medical imaging applications.
Abstract: Image-to-image translation has emerged as a powerful technique in medical imaging, enabling tasks such as image denoising and cross-modality conversion. However, it suffers from limitations in handling out-of-distribution samples without causing performance degradation. To address this limitation, we propose a novel Test-Time Adaptation (TTA) framework that dynamically adjusts the translation process based on the characteristics of each test sample. Our method introduces a Reconstruction Module to quantify the domain shift and a Dynamic Adaptation Block that selectively modifies the internal features of a pretrained translation model to mitigate the shift without compromising the performance on in-distribution samples that do not require adaptation. We evaluate our approach on two medical image-to-image translation tasks: low-dose CT denoising and T1 to T2 MRI translation, showing consistent improvements over both the baseline translation model without TTA and prior TTA methods. Our analysis highlights the limitations of state-of-the-art methods that uniformly apply the adaptation to both out-of-distribution and in-distribution samples, demonstrating that dynamic, sample-specific adjustment offers a promising path to improve model resilience in real-world scenarios. The code is available at: https://github.com/Sample-Aware-TTA/Code.
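The sample-aware idea, adapting only when a sample looks out-of-distribution, can be illustrated with a simple gate on reconstruction error. This is a deliberately simplified sketch: the paper's Dynamic Adaptation Block modifies internal features of the translation model, whereas here the decision is reduced to choosing between two callables. The function names, the mean-squared-error measure, and the threshold are all assumptions.

```python
import numpy as np

def reconstruction_error(x, reconstruct):
    """Mean squared error between a sample and its reconstruction."""
    x = np.asarray(x, dtype=float)
    return float(np.mean((x - reconstruct(x)) ** 2))

def translate_sample(x, translate, adapted_translate, reconstruct, threshold):
    """Use the adapted path only when the reconstruction error is high.

    Returns (output, was_adapted). The threshold is a hypothetical
    hyperparameter, not a value from the paper.
    """
    if reconstruction_error(x, reconstruct) > threshold:
        return adapted_translate(x), True
    return translate(x), False

# Toy stand-ins: the "reconstruction module" only models values in [0, 1].
reconstruct = lambda x: np.clip(x, 0.0, 1.0)
translate = lambda x: np.asarray(x, dtype=float) * 2.0
adapted_translate = lambda x: np.clip(x, 0.0, 1.0) * 2.0

in_dist = np.array([0.2, 0.8])
out_dist = np.array([2.0, 2.0])
_, adapted_a = translate_sample(in_dist, translate, adapted_translate, reconstruct, 0.1)
_, adapted_b = translate_sample(out_dist, translate, adapted_translate, reconstruct, 0.1)
```

In-distribution samples pass through the unmodified model, so their quality is not affected by the adaptation mechanism.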
[232] SA-3DGS: A Self-Adaptive Compression Method for 3D Gaussian Splatting
Liheng Zhang, Weihao Yu, Zubo Lu, Haozhi Gu, Jin Huang
Main category: cs.CV
TL;DR: SA-3DGS is a compression method for 3D Gaussian Splatting that reduces storage costs by 66x while maintaining rendering quality through importance scoring, pruning, and codebook optimization.
Details
Motivation: Current 3D Gaussian Splatting methods require large numbers of Gaussian points, leading to high storage demands that limit practical deployment. Existing compression methods struggle to identify truly insignificant points, causing quality degradation.
Method: Proposes SA-3DGS with three modules: importance scoring to identify insignificant Gaussians for pruning, importance-aware clustering to compress attributes into codebooks, and codebook repair using contextual information to recover original attributes.
Result: Achieves up to 66x compression while maintaining or improving rendering quality on benchmark datasets. The pruning approach also improves other pruning-based methods like LightGaussian.
Conclusion: SA-3DGS effectively reduces storage costs for 3D Gaussian Splatting while preserving quality, demonstrating strong generalization and performance improvements over existing methods.
Abstract: Recent advancements in 3D Gaussian Splatting have enhanced efficient and high-quality novel view synthesis. However, representing scenes requires a large number of Gaussian points, leading to high storage demands and limiting practical deployment. The latest methods facilitate the compression of Gaussian models but struggle to identify truly insignificant Gaussian points in the scene, leading to a decline in subsequent Gaussian pruning, compression quality, and rendering performance. To address this issue, we propose SA-3DGS, a method that significantly reduces storage costs while maintaining rendering quality. SA-3DGS learns an importance score to automatically identify the least significant Gaussians in scene reconstruction, thereby enabling effective pruning and redundancy reduction. Next, the importance-aware clustering module compresses Gaussian attributes more accurately into the codebook, improving the codebook’s expressive capability while reducing model size. Finally, the codebook repair module leverages contextual scene information to repair the codebook, thereby recovering the original Gaussian point attributes and mitigating the degradation in rendering quality caused by information loss. Experimental results on several benchmark datasets show that our method achieves up to 66x compression while maintaining or even improving rendering quality. The proposed Gaussian pruning approach is not only adaptable to but also improves other pruning-based methods (e.g., LightGaussian), showcasing excellent performance and strong generalization ability.
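The first two stages, pruning by importance and replacing attributes with codebook indices, can be sketched in a few lines. This is an illustration under assumptions rather than the SA-3DGS method: the importance scores are given instead of learned, the codebook is fixed instead of produced by importance-aware clustering, and the repair stage is omitted. Function names are hypothetical.

```python
import numpy as np

def prune_by_importance(attributes, scores, keep_ratio):
    """Keep the Gaussians with the highest importance scores."""
    attributes = np.asarray(attributes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    n_keep = max(1, int(round(len(scores) * keep_ratio)))
    keep = np.sort(np.argsort(scores)[::-1][:n_keep])
    return attributes[keep], keep

def quantize(attributes, codebook):
    """Map each attribute vector to the index of its nearest codebook entry."""
    attributes = np.asarray(attributes, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    dists = np.linalg.norm(attributes[:, None, :] - codebook[None, :, :], axis=-1)
    return np.argmin(dists, axis=1)

attrs = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [11.0, 11.0]]
scores = [0.9, 0.1, 0.8, 0.2]              # hypothetical importance scores
kept, kept_idx = prune_by_importance(attrs, scores, keep_ratio=0.5)
codes = quantize(kept, [[0.0, 0.0], [10.0, 10.0]])
```

Storage then consists of the codebook plus one small integer per surviving Gaussian, which is where the size reduction in codebook-based compression comes from.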
[233] SIFThinker: Spatially-Aware Image Focus for Visual Reasoning
Zhangquan Chen, Ruihui Zhao, Chuwei Luo, Mingze Sun, Xinlei Yu, Yangyang Kang, Ruqi Huang
Main category: cs.CV
TL;DR: SIFThinker is a spatially-aware multimodal framework that uses depth-enhanced bounding boxes and natural language to iteratively refine visual attention, improving spatial understanding and fine-grained perception in MLLMs.
Details
Motivation: Current MLLMs struggle with complex visual tasks like spatial understanding and fine-grained perception, lacking the ability to iteratively refine focus on relevant regions using spatial cues.
Method: Uses a reverse-expansion-forward-inference strategy to generate image-text chains of thought, creates SIF-50K dataset, and implements GRPO-SIF reinforced training with depth-informed visual grounding for dynamic attention correction.
Result: Outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception while maintaining strong general capabilities.
Conclusion: The spatially-aware “think-with-images” framework effectively addresses MLLM limitations in complex visual tasks through iterative attention refinement with spatial cues.
Abstract: Current multimodal large language models (MLLMs) still face significant challenges in complex visual tasks (e.g., spatial understanding, fine-grained perception). Prior methods have tried to incorporate visual reasoning; however, they fail to leverage attention correction with spatial cues to iteratively refine their focus on prompt-relevant regions. In this paper, we introduce SIFThinker, a spatially-aware “think-with-images” framework that mimics human visual perception. Specifically, SIFThinker enables attention correction and image region focusing by interleaving depth-enhanced bounding boxes and natural language. Our contributions are twofold: First, we introduce a reverse-expansion-forward-inference strategy that facilitates the generation of interleaved image-text chains of thought for process-level supervision, which in turn leads to the construction of the SIF-50K dataset. Second, we propose GRPO-SIF, a reinforced training paradigm that integrates depth-informed visual grounding into a unified reasoning pipeline, teaching the model to dynamically correct and focus on prompt-relevant regions. Extensive experiments demonstrate that SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception, while maintaining strong general capabilities, highlighting the effectiveness of our method. Code: https://github.com/zhangquanchen/SIFThinker.
[234] Plane Detection and Ranking via Model Information Optimization
Daoxin Zhong, Jun Li, Meng Yee Michael Chuah
Main category: cs.CV
TL;DR: A novel plane detection framework using model information optimization to address RANSAC’s false positive issues, with improved accuracy and real-world applicability through neural network acceleration.
Details
Motivation: RANSAC-based plane detection suffers from false positives due to ambiguous inlier thresholds, especially in complex scenes with unknown numbers of planes. The paper aims to provide an objective mechanism for determining true plane count and preventing false detections.
Method: Treats depth readings as discrete random variables constrained by ground truth planes. Generates models through random sub-sampling and calculates information for each model using depth sensor physics and noise models. The model with least information is selected as most likely ground truth. Uses neural network segmentation for acceleration.
Result: The algorithm estimates plane parameters more accurately than default Open3D RANSAC plane segmentation in synthetic data experiments. The framework provides objective plane count determination and prevents false positives while ranking plane quality.
Conclusion: The proposed information optimization framework effectively addresses RANSAC’s limitations, providing more accurate plane detection with objective criteria for determining true plane numbers and preventing false positives in complex real-world scenes.
Abstract: Plane detection from depth images is a crucial subtask with broad robotic applications, often accomplished by iterative methods such as Random Sample Consensus (RANSAC). While RANSAC is a robust strategy with strong probabilistic guarantees, the ambiguity of its inlier threshold criterion makes it susceptible to false positive plane detections. This issue is particularly prevalent in complex real-world scenes, where the true number of planes is unknown and multiple planes coexist. In this paper, we aim to address this limitation by proposing a generalised framework for plane detection based on model information optimization. Building on previous works, we treat the observed depth readings as discrete random variables, with their probability distributions constrained by the ground truth planes. Various models containing different candidate plane constraints are then generated through repeated random sub-sampling to explain our observations. By incorporating the physics and noise model of the depth sensor, we can calculate the information for each model, and the model with the least information is accepted as the most likely ground truth. This information optimization process serves as an objective mechanism for determining the true number of planes and preventing false positive detections. Additionally, the quality of each detected plane can be ranked by summing the information reduction of inlier points for each plane. We validate these properties through experiments with synthetic data and find that our algorithm estimates plane parameters more accurately than the default Open3D RANSAC plane segmentation. Furthermore, we accelerate our algorithm by partitioning the depth map using neural network segmentation, which enhances its ability to generate more realistic plane parameters in real-world data.
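Selecting the candidate model with the least information can be illustrated with a toy description-length score. The sketch below is an assumption-heavy stand-in, not the paper's formulation: it uses a Gaussian noise model with unit-width depth bins, a uniform code for readings not explained by any plane, and a fixed per-plane cost (`bits_per_plane`) so that adding planes is not free. All names and constants are hypothetical.

```python
import math

def model_information(depths, predicted, n_planes, sigma, n_levels, bits_per_plane):
    """Bits needed to describe the depth readings under one candidate model.

    predicted[i] is the depth a plane of the model predicts for reading i,
    or None if no plane in the model explains that reading.
    """
    outlier_bits = math.log2(n_levels)   # uniform code over discrete depth levels
    total = bits_per_plane * n_planes    # assumed cost of stating the planes
    for z, z_hat in zip(depths, predicted):
        if z_hat is None:
            total += outlier_bits
            continue
        # Gaussian sensor noise around the predicted depth, capped at probability 1.
        density = math.exp(-0.5 * ((z - z_hat) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        prob = min(max(density, 1e-300), 1.0)
        total += min(-math.log2(prob), outlier_bits)
    return total

def select_model(depths, candidates, **noise):
    """Return the index of the candidate with the least information, plus all scores."""
    scores = [
        model_information(depths, c["predicted"], c["n_planes"], **noise)
        for c in candidates
    ]
    best = min(range(len(candidates)), key=scores.__getitem__)
    return best, scores

depths = [5.0] * 8
candidates = [
    {"predicted": [None] * 8, "n_planes": 0},   # no plane: every reading is unexplained
    {"predicted": [5.0] * 8, "n_planes": 1},    # one plane that fits all readings
]
best, scores = select_model(depths, candidates, sigma=1.0, n_levels=256, bits_per_plane=24)
```

With these toy numbers the one-plane model costs fewer bits than the empty model, so it is selected. A spurious extra plane would have to save more bits on its inliers than its own cost to be accepted, which is the intuition behind using information to suppress false positives.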
[235] ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving
Xianda Guo, Ruijun Zhang, Yiqun Duan, Ruilin Wang, Matteo Poggi, Keyuan Zhou, Wenzhao Zheng, Wenke Huang, Gangwei Xu, Mike Horton, Yuan Si, Qin Zou, Hao Zhao, Long Chen
Main category: cs.CV
TL;DR: ROVR is a new large-scale, diverse depth dataset for autonomous driving that exposes limitations of current models and datasets, showing severe cross-dataset generalization failures despite near-saturation performance on existing benchmarks.
Details
Motivation: Existing depth datasets like KITTI, nuScenes, and DDAD have limitations in diversity and scalability, and benchmark performance is approaching saturation, creating need for new large-scale datasets to support foundation models and multi-modal learning.
Method: Created ROVR dataset with 200K high-resolution frames across highway, rural, and urban scenarios, including day/night and adverse weather conditions, using lightweight acquisition pipeline with sparse but statistically sufficient ground truth.
Result: Benchmarking shows severe cross-dataset generalization failures - models with near-ceiling accuracy on KITTI degrade drastically on ROVR, and even when trained on ROVR, current methods fall short of saturation.
Conclusion: ROVR establishes a demanding new platform for advancing depth estimation, highlighting unique challenges of scene diversity, dynamic environments, and sparse ground truth, enabling development of models with stronger real-world robustness.
Abstract: Depth estimation is a fundamental task for 3D scene understanding in autonomous driving, robotics, and augmented reality. Existing depth datasets, such as KITTI, nuScenes, and DDAD, have advanced the field but suffer from limitations in diversity and scalability. As benchmark performance on these datasets approaches saturation, there is an increasing need for a new generation of large-scale, diverse, and cost-efficient datasets to support the era of foundation models and multi-modal learning. We present ROVR, a large-scale, diverse, and cost-efficient depth dataset designed to capture the complexity of real-world driving. ROVR comprises 200K high-resolution frames across highway, rural, and urban scenarios, spanning day/night and adverse weather conditions. A lightweight acquisition pipeline ensures scalable collection, while sparse but statistically sufficient ground truth supports robust training. Benchmarking with state-of-the-art monocular depth models reveals severe cross-dataset generalization failures: models achieving near-ceiling accuracy on KITTI degrade drastically on ROVR, and even when trained on ROVR, current methods fall short of saturation. These results highlight the unique challenges posed by ROVR (scene diversity, dynamic environments, and sparse ground truth), establishing it as a demanding new platform for advancing depth estimation and building models with stronger real-world robustness. Extensive ablation studies provide a more intuitive understanding of our dataset across different scenarios, lighting conditions, and generalization ability.
[236] PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt Rewriting
Linqing Wang, Ximing Xing, Yiji Cheng, Zhiyuan Zhao, Jiale Tao, Qixun Wang, Ruihuang Li, Comi Chen, Xin Li, Mingrui Wu, Xinchi Deng, Chunyu Wang, Qinglin Lu
Main category: cs.CV
TL;DR: PromptEnhancer is a universal prompt rewriting framework that improves text-to-image generation by enhancing prompts through reinforcement learning with a specialized reward model, without modifying the base T2I model’s weights.
Details
Motivation: Current text-to-image diffusion models often fail to accurately render complex prompts involving attribute binding, negation, and compositional relationships, leading to mismatches between user intent and generated images.
Method: A Chain-of-Thought rewriter trained through reinforcement learning guided by an AlignEvaluator reward model that provides fine-grained feedback based on 24 key points derived from common T2I failure modes.
Result: Extensive experiments on HunyuanImage 2.1 show significant improvements in image-text alignment across various semantic and compositional challenges.
Conclusion: PromptEnhancer effectively enhances prompt quality for better T2I generation without model modifications, and introduces a new human preference benchmark for future research.
Abstract: Recent advancements in text-to-image (T2I) diffusion models have demonstrated remarkable capabilities in generating high-fidelity images. However, these models often struggle to faithfully render complex user prompts, particularly in aspects like attribute binding, negation, and compositional relationships. This leads to a significant mismatch between user intent and the generated output. To address this challenge, we introduce PromptEnhancer, a novel and universal prompt rewriting framework that enhances any pretrained T2I model without requiring modifications to its weights. Unlike prior methods that rely on model-specific fine-tuning or implicit reward signals like image-reward scores, our framework decouples the rewriter from the generator. We achieve this by training a Chain-of-Thought (CoT) rewriter through reinforcement learning, guided by a dedicated reward model we term the AlignEvaluator. The AlignEvaluator is trained to provide explicit and fine-grained feedback based on a systematic taxonomy of 24 key points, which are derived from a comprehensive analysis of common T2I failure modes. By optimizing the CoT rewriter to maximize the reward from our AlignEvaluator, our framework learns to generate prompts that are more precisely interpreted by T2I models. Extensive experiments on the HunyuanImage 2.1 model demonstrate that PromptEnhancer significantly improves image-text alignment across a wide range of semantic and compositional challenges. Furthermore, we introduce a new, high-quality human preference benchmark to facilitate future research in this direction.
[237] CoRe-GS: Coarse-to-Refined Gaussian Splatting with Semantic Object Focus
Hannah Schieber, Dominik Frischmann, Victor Schaack, Simon Boche, Angela Schoellig, Stefan Leutenegger, Daniel Roth
Main category: cs.CV
TL;DR: CoRe-GS is a semantic point-of-interest focused extension of Gaussian Splatting that selectively refines relevant scene elements instead of uniform optimization, reducing training time by 75% while improving quality in critical areas.
Details
Motivation: Mobile reconstruction needs to support time-critical tasks like tele-guidance and disaster response, but full high-fidelity reconstruction is computationally expensive and often unnecessary when only specific points of interest matter for timely decision making.
Method: CoRe-GS first produces a fast segmentation-ready Gaussian Splatting representation, then selectively refines splats belonging to semantically relevant POIs detected during data acquisition, focusing optimization only on important areas.
Result: The method reduces training time to 25% compared to full semantic Gaussian Splatting while improving novel view synthesis quality in the most important areas. Validated on both real-world (SCRREAM) and synthetic (NeRDS 360) datasets.
Conclusion: Prioritizing points of interest enables faster and higher-quality mobile reconstruction tailored to operational needs, making it suitable for time-critical applications where only specific scene elements matter.
Abstract: Mobile reconstruction has the potential to support time-critical tasks such as tele-guidance and disaster response, where operators must quickly gain an accurate understanding of the environment. Full high-fidelity scene reconstruction is computationally expensive and often unnecessary when only specific points of interest (POIs) matter for timely decision making. We address this challenge with CoRe-GS, a semantic POI-focused extension of Gaussian Splatting (GS). Instead of optimizing every scene element uniformly, CoRe-GS first produces a fast segmentation-ready GS representation and then selectively refines splats belonging to semantically relevant POIs detected during data acquisition. This targeted refinement reduces training time to 25% compared to full semantic GS while improving novel view synthesis quality in the areas that matter most. We validate CoRe-GS on both real-world (SCRREAM) and synthetic (NeRDS 360) datasets, demonstrating that prioritizing POIs enables faster and higher-quality mobile reconstruction tailored to operational needs.
[238] TinyDef-DETR: A DETR-based Framework for Defect Detection in Transmission Lines from UAV Imagery
Jiaming Cui, Shuai Zhou, Feng Shen
Main category: cs.CV
TL;DR: TinyDef-DETR is a DETR-based framework for accurate UAV-based transmission line defect detection, featuring edge-enhanced backbone, detail-preserving downsampling, multi-scale attention, and improved loss function for small targets.
Details
Motivation: Automated defect detection from UAV imagery is challenging due to small defect size, ambiguity, and complex backgrounds in transmission line inspections.
Method: DETR-based framework with four key components: edge-enhanced ResNet backbone, stride-free space-to-depth module, cross-stage dual-domain multi-scale attention, and Focaler-Wise-SIoU regression loss.
Result: Achieves superior detection performance and strong generalization on public and real-world datasets while maintaining modest computational overhead.
Conclusion: TinyDef-DETR is an effective and efficient solution for UAV-based transmission line defect detection, particularly for small and ambiguous targets.
Abstract: Automated defect detection from UAV imagery of transmission lines is a challenging task due to the small size, ambiguity, and complex backgrounds of defects. This paper proposes TinyDef-DETR, a DETR-based framework designed to achieve accurate and efficient detection of transmission line defects from UAV-acquired images. The model integrates four major components: an edge-enhanced ResNet backbone to strengthen boundary-sensitive representations, a stride-free space-to-depth module to enable detail-preserving downsampling, a cross-stage dual-domain multi-scale attention mechanism to jointly model global context and local cues, and a Focaler-Wise-SIoU regression loss to improve the localization of small and difficult targets. Together, these designs effectively mitigate the limitations of conventional detectors. Extensive experiments on both public and real-world datasets demonstrate that TinyDef-DETR achieves superior detection performance and strong generalization capability, while maintaining modest computational overhead. The accuracy and efficiency of TinyDef-DETR make it a suitable method for UAV-based transmission line defect detection, particularly in scenarios involving small and ambiguous targets.
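The abstract does not spell out how its stride-free space-to-depth module is built, so the following is only a sketch of the generic space-to-depth rearrangement that such modules are named after, written for nested Python lists in H x W x C layout. The function name and the `block` parameter are illustrative assumptions. Each `block` x `block` spatial patch is folded into the channel axis, so resolution drops without discarding any pixel values, unlike strided convolution or pooling.

```python
def space_to_depth(feature_map, block=2):
    """Rearrange an H x W x C nested list into (H/block) x (W/block) x (C*block*block)."""
    height = len(feature_map)
    width = len(feature_map[0])
    if height % block or width % block:
        raise ValueError("height and width must be divisible by block")
    out = []
    for top in range(0, height, block):
        row = []
        for left in range(0, width, block):
            cell = []
            # Concatenate the channels of every pixel in the patch, row-major.
            for dy in range(block):
                for dx in range(block):
                    cell.extend(feature_map[top + dy][left + dx])
            row.append(cell)
        out.append(row)
    return out
```

For a 2 x 2 single-channel input, `space_to_depth([[[1], [2]], [[3], [4]]])` yields one spatial position holding all four values, `[[[1, 2, 3, 4]]]`.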
[239] BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models
Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, Shanghang Zhang
Main category: cs.CV
TL;DR: BranchGRPO improves efficiency and alignment in image/video generation by using branching tree rollouts, reward fusion, and pruning strategies to address inefficiencies in existing GRPO methods.
Details
Motivation: Existing Group Relative Policy Optimization (GRPO) variants are inefficient due to sequential rollouts, many sampling steps, and unreliable credit assignment with sparse terminal rewards that don't capture varying decision criticality during denoising.
Method: BranchGRPO restructures rollout process into a branching tree with shared prefixes, introduces reward fusion and depth-wise advantage estimator for dense step-level signals, and implements pruning strategies that cut gradient computation while preserving forward rollouts.
Result: On HPDv2.1 image alignment, BranchGRPO improves alignment scores by up to 16% over DanceGRPO while reducing per-iteration training time by nearly 55%. The hybrid BranchGRPO-Mix variant trains 4.7x faster than DanceGRPO without degrading alignment. On WanX video generation, it achieves higher Video-Align scores with sharper, temporally consistent frames.
Conclusion: BranchGRPO significantly improves both efficiency and alignment quality in image and video generation tasks compared to previous GRPO methods, demonstrating the effectiveness of branching tree structures and improved reward assignment.
Abstract: Recent progress in aligning image and video generative models with Group Relative Policy Optimization (GRPO) has improved human preference alignment, but existing variants remain inefficient due to sequential rollouts and large numbers of sampling steps, and they suffer from unreliable credit assignment: sparse terminal rewards are uniformly propagated across timesteps, failing to capture the varying criticality of decisions during denoising. In this paper, we present BranchGRPO, a method that restructures the rollout process into a branching tree, where shared prefixes amortize computation and pruning removes low-value paths and redundant depths. BranchGRPO introduces three contributions: (1) a branching scheme that amortizes rollout cost through shared prefixes while preserving exploration diversity; (2) a reward fusion and depth-wise advantage estimator that transforms sparse terminal rewards into dense step-level signals; and (3) pruning strategies that cut gradient computation but leave forward rollouts and exploration unaffected. On HPDv2.1 image alignment, BranchGRPO improves alignment scores by up to 16% over DanceGRPO, while reducing per-iteration training time by nearly 55%. A hybrid variant, BranchGRPO-Mix, further accelerates training to 4.7x faster than DanceGRPO without degrading alignment. On WanX video generation, it further achieves higher Video-Align scores with sharper and temporally consistent frames compared to DanceGRPO. Code is available at https://fredreic1849.github.io/BranchGRPO-Webpage/.
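For context, the group-relative advantage that GRPO-style methods compute from terminal rewards can be sketched as follows. This shows only the standard within-group normalization, where each sample's reward is compared with the mean and standard deviation of its group. It is not BranchGRPO's reward fusion, depth-wise estimator, or pruning, whose details the abstract does not give. The function name and the `eps` stabilizer are illustrative assumptions.

```python
from statistics import fmean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each terminal reward against the mean and std of its sampling group."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]
```

In the plain scheme, the single advantage of a rollout is applied to every denoising step of that rollout, which is the uniform credit assignment the abstract criticizes.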
[240] Implicit Neural Representations of Intramyocardial Motion and Strain
Andrew Bell, Yan Kit Choi, Steffen E Petersen, Andrew King, Muhummad Sohaib Nazir, Alistair A Young
Main category: cs.CV
TL;DR: INR-based method for automatic quantification of intramyocardial motion and strain from tagging MRI, achieving state-of-the-art accuracy and significant speed improvements over baselines.
Details
Motivation: Automatic quantification of intramyocardial motion and strain from tagging MRI is important but challenging, requiring accurate and scalable solutions for large cardiac MRI datasets.
Method: Uses implicit neural representations (INRs) conditioned on learned latent codes to predict continuous left ventricular displacement without requiring inference-time optimization.
Result: Achieved best tracking accuracy (2.14 mm RMSE) and lowest combined error in global circumferential (2.86%) and radial (6.42%) strain on 452 UK Biobank test cases, while being ~380× faster than the most accurate baseline.
Conclusion: INR-based models are suitable for accurate and scalable analysis of myocardial strain in large cardiac MRI datasets.
Abstract: Automatic quantification of intramyocardial motion and strain from tagging MRI remains an important but challenging task. We propose a method using implicit neural representations (INRs), conditioned on learned latent codes, to predict continuous left ventricular (LV) displacement, without requiring inference-time optimisation. Evaluated on 452 UK Biobank test cases, our method achieved the best tracking accuracy (2.14 mm RMSE) and the lowest combined error in global circumferential (2.86%) and radial (6.42%) strain compared to three deep learning baselines. In addition, our method is ~380× faster than the most accurate baseline. These results highlight the suitability of INR-based models for accurate and scalable analysis of myocardial strain in large CMR datasets.
[241] Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention
Junhao Xing, Ryohei Miyakawa, Yang Yang, Xinpeng Liu, Risa Shinoda, Hiroaki Santo, Yosuke Toda, Fumio Okura
Main category: cs.CV
TL;DR: ZeroPlantSeg is a zero-shot method for segmenting entire rosette-shaped plant individuals from top-view images by combining foundation segmentation models for leaf extraction and vision-language models for structural reasoning, achieving state-of-the-art performance without training.
Details
Motivation: Existing foundation segmentation models can extract individual leaves but struggle with segmenting entire plant individuals consisting of multiple overlapping leaves. Current hierarchical segmentation methods require annotated training datasets that are species-specific and labor-intensive to create.
Method: Integrates a foundation segmentation model to extract leaf instances and a vision-language model to reason about plant structures for extracting complete plant individuals, all without additional training (zero-shot approach).
Result: Outperforms existing zero-shot methods and achieves better cross-domain performance than supervised methods across multiple plant species, growth stages, and shooting environments.
Conclusion: ZeroPlantSeg provides an effective zero-shot solution for plant individual segmentation that eliminates the need for annotated training data while maintaining strong performance across diverse conditions.
Abstract: Foundation segmentation models achieve reasonable leaf instance extraction from top-view crop images without training (i.e., zero-shot). However, segmenting entire plant individuals, each consisting of multiple overlapping leaves, remains challenging. This problem is referred to as a hierarchical segmentation task, typically requiring annotated training datasets, which are often species-specific and require notable human labor. To address this, we introduce ZeroPlantSeg, a zero-shot segmentation method for rosette-shaped plant individuals from top-view images. We integrate a foundation segmentation model, extracting leaf instances, and a vision-language model, reasoning about plants’ structures to extract plant individuals without additional training. Evaluations on datasets with multiple plant species, growth stages, and shooting environments demonstrate that our method surpasses existing zero-shot methods and achieves better cross-domain performance than supervised methods. Implementations are available at https://github.com/JunhaoXing/ZeroPlantSeg.
[242] ALL-PET: A Low-resource and Low-shot PET Foundation Model in Projection Domain
Bin Huang, Kang Chen, Bingxuan Li, Huafeng Liu, Qiegen Liu
Main category: cs.CV
TL;DR: ALL-PET is a low-resource PET foundation model that uses latent diffusion with innovative mask augmentation and attention mechanisms to achieve high-quality sinogram generation with only 500 samples, enabling multiple PET imaging tasks efficiently.
Details
Motivation: Building large-scale foundation models for PET imaging is challenging due to limited labeled data access and insufficient computational resources. Data scarcity and efficiency limitations hinder progress in this domain.
Method: Proposes ALL-PET with three key innovations: 1) Radon mask augmentation strategy generating 200k+ diverse training samples, 2) positive/negative mask constraints for geometric consistency, and 3) transparent medical attention mechanism for lesion-focused guidance in projection data.
Result: ALL-PET achieves high-quality sinogram generation using only 500 samples, with performance comparable to models trained on larger datasets. It generalizes across multiple tasks including low-dose reconstruction, attenuation correction, delayed-frame prediction, and tracer separation, while operating efficiently with memory under 24GB.
Conclusion: The proposed ALL-PET framework successfully addresses data scarcity and computational limitations in PET imaging through innovative mask augmentation strategies and geometry-driven attention mechanisms, enabling effective foundation modeling with minimal resources.
Abstract: Building a large-scale foundation model for PET imaging is hindered by limited access to labeled data and insufficient computational resources. To overcome data scarcity and efficiency limitations, we propose ALL-PET, a low-resource, low-shot PET foundation model operating directly in the projection domain. ALL-PET leverages a latent diffusion model (LDM) with three key innovations. First, we design a Radon mask augmentation strategy (RMAS) that generates over 200,000 structurally diverse training samples by projecting randomized image-domain masks into sinogram space, significantly improving generalization with minimal data. This is extended by a dynamic multi-mask (DMM) mechanism that varies mask quantity and distribution, enhancing data diversity without added model complexity. Second, we implement positive/negative mask constraints to embed strict geometric consistency, reducing parameter burden while preserving generation quality. Third, we introduce transparent medical attention (TMA), a parameter-free, geometry-driven mechanism that enhances lesion-related regions in raw projection data. Lesion-focused attention maps are derived from coarse segmentation, covering both hypermetabolic and hypometabolic areas, and projected into sinogram space for physically consistent guidance. The system supports clinician-defined ROI adjustments, ensuring flexible, interpretable, and task-adaptive emphasis aligned with PET acquisition physics. Experimental results show that ALL-PET achieves high-quality sinogram generation using only 500 samples, with performance comparable to models trained on larger datasets. ALL-PET generalizes across tasks including low-dose reconstruction, attenuation correction, delayed-frame prediction, and tracer separation, operating efficiently with memory use under 24GB.
[243] Hierarchical MLANet: Multi-level Attention for 3D Face Reconstruction From Single Images
Danling Cao
Main category: cs.CV
TL;DR: MLANet uses a hierarchical CNN with multi-level attention mechanisms for 3D face reconstruction from single in-the-wild images, evaluated on the AFLW2000-3D and MICC Florence benchmark datasets.
Details
Motivation: Lack of ground-truth labeled datasets and complexity of real-world environments make 3D face reconstruction from 2D in-the-wild images challenging.
Method: Hierarchical Multi-Level Attention Network (MLANet) with pre-trained backbone, multi-level attention mechanisms, semi-supervised training using 3DMM parameters and differentiable renderer for end-to-end training.
Result: Extensive experiments on AFLW2000-3D and MICC Florence datasets show effectiveness in 3D face reconstruction and alignment tasks, evaluated both quantitatively and qualitatively.
Conclusion: The proposed MLANet effectively addresses 3D face reconstruction challenges from single in-the-wild images through hierarchical attention mechanisms and semi-supervised learning.
Abstract: Recovering 3D face models from 2D in-the-wild images has gained considerable attention in the computer vision community due to its wide range of potential applications. However, the lack of ground-truth labeled datasets and the complexity of real-world environments remain significant challenges. In this chapter, we propose a convolutional neural network-based approach, the Hierarchical Multi-Level Attention Network (MLANet), for reconstructing 3D face models from single in-the-wild images. Our model predicts detailed facial geometry, texture, pose, and illumination parameters from a single image. Specifically, we employ a pre-trained hierarchical backbone network and introduce multi-level attention mechanisms at different stages of 2D face image feature extraction. A semi-supervised training strategy is employed, incorporating 3D Morphable Model (3DMM) parameters from publicly available datasets along with a differentiable renderer, enabling an end-to-end training process. Extensive experiments, including both comparative and ablation studies, were conducted on two benchmark datasets, AFLW2000-3D and MICC Florence, focusing on 3D face reconstruction and 3D face alignment tasks. The effectiveness of the proposed method was evaluated both quantitatively and qualitatively.
[244] Compressed Video Quality Enhancement: Classifying and Benchmarking over Standards
Xiem HoangVan, Dang BuiDinh, Sang NguyenQuang, Wen-Hsiao Peng
Main category: cs.CV
TL;DR: This paper presents a comprehensive survey on compressed video quality enhancement (CVQE) methods, addressing limitations in existing surveys by providing a novel taxonomy, unified benchmarking framework, and systematic analysis of performance-complexity trade-offs.
Details
Motivation: Existing CVQE surveys lack systematic classification linking methods to specific standards and artifacts, insufficient comparative analysis across architectural paradigms, and underdeveloped benchmarking practices, creating gaps in the field.
Method: The paper introduces three key contributions: 1) a novel taxonomy classifying CVQE methods across architectural paradigms, coding standards, and compressed-domain feature utilization; 2) a unified benchmarking framework with modern compression protocols and standard test sequences; 3) systematic analysis of reconstruction performance vs. computational complexity trade-offs.
Result: The survey establishes a foundation for consistent assessment and informed model selection in CVQE research and deployment, providing comprehensive classification and benchmarking standards for the field.
Conclusion: This comprehensive review addresses critical gaps in CVQE literature by providing systematic classification, fair evaluation standards, and analysis of key trade-offs, highlighting promising directions for future research in compressed video quality enhancement.
Abstract: Compressed video quality enhancement (CVQE) is crucial for improving user experience with lossy video codecs like H.264/AVC, H.265/HEVC, and H.266/VVC. While deep learning based CVQE has driven significant progress, existing surveys still suffer from limitations: lack of systematic classification linking methods to specific standards and artifacts, insufficient comparative analysis of architectural paradigms across coding types, and underdeveloped benchmarking practices. To address these gaps, this paper presents three key contributions. First, it introduces a novel taxonomy classifying CVQE methods across architectural paradigms, coding standards, and compressed-domain feature utilization. Second, it proposes a unified benchmarking framework integrating modern compression protocols and standard test sequences for fair multi-criteria evaluation. Third, it provides a systematic analysis of the critical trade-offs between reconstruction performance and computational complexity observed in state-of-the-art methods and highlights promising directions for future research. This comprehensive review aims to establish a foundation for consistent assessment and informed model selection in CVQE research and deployment.
[245] Leveraging Geometric Priors for Unaligned Scene Change Detection
Ziling Liu, Ziwei Chen, Mingqi Gao, Jinyu Yang, Feng Zheng
Main category: cs.CV
TL;DR: Introducing geometric priors to unaligned scene change detection for better handling of viewpoint variations and occlusions, achieving superior performance without training.
Details
Motivation: Current methods rely solely on 2D visual cues which fail under large viewpoint changes and lack explicit geometric reasoning, limiting their ability to handle occlusions and identify visual overlaps reliably.
Method: Proposes a training-free framework that integrates geometric priors with visual foundation model representations to enable reliable change detection under viewpoint misalignment.
Result: Achieves superior and robust performance through extensive evaluation on PSCD, ChangeSim, and PASLCD datasets.
Conclusion: Geometric priors effectively address core challenges of unaligned scene change detection, providing reliable identification of visual overlaps, robust correspondence establishment, and explicit occlusion detection.
Abstract: Unaligned Scene Change Detection aims to detect scene changes between image pairs captured at different times without assuming viewpoint alignment. To handle viewpoint variations, current methods rely solely on 2D visual cues to establish cross-image correspondence to assist change detection. However, large viewpoint changes can alter visual observations, causing appearance-based matching to drift or fail. Additionally, supervision limited to 2D change masks from small-scale SCD datasets restricts the learning of generalizable multi-view knowledge, making it difficult to reliably identify visual overlaps and handle occlusions. This lack of explicit geometric reasoning represents a critical yet overlooked limitation. In this work, we introduce geometric priors for the first time to address the core challenges of unaligned SCD, for reliable identification of visual overlaps, robust correspondence establishment, and explicit occlusion detection. Building on these priors, we propose a training-free framework that integrates them with the powerful representations of a visual foundation model to enable reliable change detection under viewpoint misalignment. Through extensive evaluation on the PSCD, ChangeSim, and PASLCD datasets, we demonstrate that our approach achieves superior and robust performance. Our code will be released at https://github.com/ZilingLiu/GeoSCD.
[246] SFGNet: Semantic and Frequency Guided Network for Camouflaged Object Detection
Dezhen Wang, Haixiang Zhao, Xiang Shen, Sheng Miao
Main category: cs.CV
TL;DR: SFGNet is a novel network for camouflaged object detection that incorporates semantic prompts and frequency-domain features with Multi-Band Fourier Module and Interactive Structure Enhancement Block to improve boundary perception and handle complex backgrounds.
Details
Motivation: Most existing camouflaged object detection studies overlook semantic differences among textual prompts of different targets and fine-grained frequency features, leading to suboptimal performance in detecting objects that blend into their surroundings.
Method: Proposed Semantic and Frequency Guided Network (SFGNet) with Multi-Band Fourier Module (MBFM) to handle complex backgrounds and blurred boundaries, and Interactive Structure Enhancement Block (ISEB) to ensure structural integrity and boundary details.
Result: Extensive experiments on three COD benchmark datasets demonstrate that SFGNet significantly outperforms state-of-the-art approaches.
Conclusion: The incorporation of semantic prompts and frequency-domain features with specialized modules effectively improves camouflaged object detection performance, particularly in handling complex backgrounds and preserving boundary details.
Abstract: Camouflaged object detection (COD) aims to segment objects that blend into their surroundings. However, most existing studies overlook the semantic differences among textual prompts of different targets as well as fine-grained frequency features. In this work, we propose a novel Semantic and Frequency Guided Network (SFGNet), which incorporates semantic prompts and frequency-domain features to capture camouflaged objects and improve boundary perception. We further design a Multi-Band Fourier Module (MBFM) to enhance the ability of the network in handling complex backgrounds and blurred boundaries. In addition, we design an Interactive Structure Enhancement Block (ISEB) to ensure structural integrity and boundary details in the predictions. Extensive experiments conducted on three COD benchmark datasets demonstrate that our method significantly outperforms state-of-the-art approaches. The core code of the model is available at the following link: https://github.com/winter794444/SFGNetICASSP2026.
[247] Disentangling Content from Style to Overcome Shortcut Learning: A Hybrid Generative-Discriminative Learning Framework
Siming Fu, Sijun Dong, Xiaoliang Meng
Main category: cs.CV
TL;DR: HyGDL is a hybrid generative-discriminative learning framework that addresses shortcut learning in SSL by achieving explicit content-style disentanglement through analytical vector projection and style-conditioned reconstruction.
Details
Motivation: Self-supervised learning suffers from shortcut learning where models exploit superficial features like texture instead of intrinsic structure, which hinders generalization to unseen domains. Existing methods fail to address the underlying learning mechanism that fosters shortcut dependency.
Method: HyGDL uses a single encoder with analytical style definition via vector projection. It employs: (1) self-distillation for style-invariant content learning, (2) analytical projection to decompose representations into orthogonal content and style vectors, and (3) style-conditioned reconstruction for end-to-end supervision based on the Invariance Pre-training Principle.
Result: The framework demonstrates superior performance on benchmarks designed to diagnose shortcut learning, showing it learns truly robust representations compared to prior methods.
Conclusion: HyGDL provides a principled approach to content-style disentanglement that addresses shortcut learning at its core, enabling better generalization through explicit separation of invariant content from style variations.
Abstract: Despite the remarkable success of Self-Supervised Learning (SSL), its generalization is fundamentally hindered by Shortcut Learning, where models exploit superficial features like texture instead of intrinsic structure. We experimentally verify this flaw within the generative paradigm (e.g., MAE) and argue it is a systemic issue also affecting discriminative methods, identifying it as the root cause of their failure on unseen domains. While existing methods often tackle this at a surface level by aligning or separating domain-specific features, they fail to alter the underlying learning mechanism that fosters shortcut dependency. To address this at its core, we propose HyGDL (Hybrid Generative-Discriminative Learning Framework), a hybrid framework that achieves explicit content-style disentanglement. Our approach is guided by the Invariance Pre-training Principle: forcing a model to learn an invariant essence by systematically varying a bias (e.g., style) at the input while keeping the supervision signal constant. HyGDL operates on a single encoder and analytically defines style as the component of a representation that is orthogonal to its style-invariant content, derived via vector projection. This is operationalized through a synergistic design: (1) a self-distillation objective learns a stable, style-invariant content direction; (2) an analytical projection then decomposes the representation into orthogonal content and style vectors; and (3) a style-conditioned reconstruction objective uses these vectors to restore the image, providing end-to-end supervision. Unlike prior methods that rely on implicit heuristics, this principled disentanglement allows HyGDL to learn truly robust representations, demonstrating superior performance on benchmarks designed to diagnose shortcut learning.
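The projection step described in the abstract can be illustrated with ordinary vector projection: the content part is the component of a representation along a content direction, and the style part is the orthogonal remainder. The sketch below assumes plain Python lists as vectors and a given content direction. The function names are hypothetical, and how HyGDL learns the direction or uses the two parts is not covered.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def decompose(representation, content_direction):
    """Split a vector into its projection on `content_direction` and the orthogonal remainder."""
    norm_sq = dot(content_direction, content_direction)
    if norm_sq == 0.0:
        raise ValueError("content direction must be non-zero")
    scale = dot(representation, content_direction) / norm_sq
    content = [scale * c for c in content_direction]
    # What is left after removing the content component is orthogonal to it.
    style = [r - p for r, p in zip(representation, content)]
    return content, style
```

By construction, `content` and `style` sum back to the original representation and their inner product is zero, which is what makes the split analytical rather than learned.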
[248] DUAL-VAD: Dual Benchmarks and Anomaly-Focused Sampling for Video Anomaly Detection
Seoik Jung, Taekyung Song, Joshua Jordan Daniel, JinYoung Lee, SungJun Lee
Main category: cs.CV
TL;DR: This paper introduces a softmax-based frame allocation strategy for video anomaly detection that prioritizes anomaly-dense segments while maintaining full-video coverage, and constructs two complementary benchmarks for comprehensive evaluation.
Details
Motivation: Existing video anomaly detection benchmarks are limited to either frame-level or video-level tasks, restricting a holistic view of model generalization capabilities.
Method: Proposes a softmax-based frame allocation strategy that balances sampling across temporal scales by focusing on anomaly-dense segments while ensuring full-video coverage. Constructs two benchmarks: image-based for frame-level reasoning and video-based for temporally localized segments with abnormality scoring.
Result: Experiments on UCF-Crime show improvements at both frame and video levels. Ablation studies confirm clear advantages of anomaly-focused sampling over uniform and random baselines.
Conclusion: The proposed approach provides a more comprehensive evaluation framework for video anomaly detection by addressing both frame-level and video-level tasks through balanced temporal sampling.
Abstract: Video Anomaly Detection (VAD) is critical for surveillance and public safety. However, existing benchmarks are limited to either frame-level or video-level tasks, restricting a holistic view of model generalization. This work first introduces a softmax-based frame allocation strategy that prioritizes anomaly-dense segments while maintaining full-video coverage, enabling balanced sampling across temporal scales. Building on this process, we construct two complementary benchmarks. The image-based benchmark evaluates frame-level reasoning with representative frames, while the video-based benchmark extends to temporally localized segments and incorporates an abnormality scoring task. Experiments on UCF-Crime demonstrate improvements at both the frame and video levels, and ablation studies confirm clear advantages of anomaly-focused sampling over uniform and random baselines.
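One way a softmax-based allocation of this kind could work is sketched below. The summary does not give the paper's exact formula, so the per-segment anomaly scores, the `temperature` parameter, and the minimum-frames-per-segment rule are all assumptions made for illustration.

```python
import math


def allocate_frames(scores, budget, min_per_segment=1, temperature=1.0):
    """Distribute a frame budget over segments in proportion to softmax(scores).

    Hypothetical sketch: every segment receives at least min_per_segment frames
    (full-video coverage); the rest of the budget follows the softmax weights
    (anomaly-dense segments get more frames).
    """
    n = len(scores)
    if budget < n * min_per_segment:
        raise ValueError("budget too small to cover every segment")
    top = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    remaining = budget - n * min_per_segment
    raw = [remaining * e / total for e in exps]
    alloc = [min_per_segment + math.floor(r) for r in raw]
    # Hand out frames lost to rounding, largest fractional part first.
    leftover = budget - sum(alloc)
    by_fraction = sorted(range(n), key=lambda i: raw[i] - math.floor(raw[i]), reverse=True)
    for i in by_fraction[:leftover]:
        alloc[i] += 1
    return alloc


# Made-up anomaly scores for four segments; the second is the most anomalous.
allocation = allocate_frames([0.1, 0.9, 0.2, 0.1], budget=16, temperature=0.1)
```

A lower temperature concentrates the budget on the highest-scoring segments, while a higher one approaches uniform sampling, which is the baseline the ablation compares against.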
[249] MSMA: Multi-Scale Feature Fusion For Multi-Attribute 3D Face Reconstruction From Unconstrained Images
Danling Cao
Main category: cs.CV
TL;DR: Proposes MSMA framework for 3D face reconstruction from unconstrained images using multi-scale feature fusion and multi-attribute learning with large-kernel attention to improve feature extraction.
Details
Motivation: Existing 3D face reconstruction methods struggle with diverse unconstrained conditions, require extensive 3D data, and fail to capture detailed multi-scale features under varying facial attributes, leading to incomplete reconstructions.
Method: Multi-Scale Feature Fusion with Multi-Attribute (MSMA) framework that integrates multi-scale feature fusion with multi-attribute learning and uses large-kernel attention module for precise feature extraction across scales.
Result: Achieves results comparable to state-of-the-art methods on MICC Florence, Facewarehouse and custom datasets, and in some cases surpasses SOTA performance across challenging conditions.
Conclusion: The proposed MSMA framework effectively addresses limitations of existing methods by better capturing detailed and multi-scale features for accurate 3D facial parameter estimation from single 2D images.
Abstract: Reconstructing a 3D face from a single unconstrained image remains a challenging problem due to diverse conditions in unconstrained environments. Recently, learning-based methods have achieved notable results by effectively capturing complex facial structures and details across varying conditions. However, learning-based methods for 3D face reconstruction typically require substantial amounts of 3D facial data, which is difficult and costly to obtain. Consequently, to reduce reliance on labeled 3D face datasets, many existing approaches employ projection-based losses between generated and input images to constrain model training. Nonetheless, despite these advancements, existing approaches frequently struggle to capture detailed and multi-scale features under diverse facial attributes and conditions, leading to incomplete or less accurate reconstructions. In this paper, we propose a Multi-Scale Feature Fusion with Multi-Attribute (MSMA) framework for 3D face reconstruction from unconstrained images. Our method integrates multi-scale feature fusion with a focus on multi-attribute learning and leverages a large-kernel attention module to enhance the precision of feature extraction across scales, enabling accurate 3D facial parameter estimation from a single 2D image. Comprehensive experiments on the MICC Florence, Facewarehouse and custom-collected datasets demonstrate that our approach achieves results on par with current state-of-the-art methods, and in some instances, surpasses SOTA performance across challenging conditions.
[250] Enriched text-guided variational multimodal knowledge distillation network (VMD) for automated diagnosis of plaque vulnerability in 3D carotid artery MRI
Bo Cao, Fan Yu, Mengmeng Feng, SenHao Zhang, Xin Meng, Yue Zhang, Zhen Qian, Jie Lu
Main category: cs.CV
TL;DR: VMD method uses variation inference and multimodal knowledge distillation to leverage radiologists’ domain knowledge for automated carotid plaque vulnerability diagnosis from 3D MRI images and radiology reports.
Details
Motivation: Diagnosing carotid plaque vulnerability from 3D MRI images is challenging for both radiologists and conventional 3D vision networks. Clinical practice uses multimodal approaches combining various imaging modalities and domain expertise, suggesting the need for multimodal diagnostic networks.
Method: Variation inference and Multimodal knowledge Distillation (VMD) strategy that harnesses cross-modality prior knowledge from limited image annotations and radiology reports to enhance diagnostic accuracy for unannotated 3D MRI images.
Result: Conducted in-depth experiments on in-house collected dataset and verified the effectiveness of the proposed VMD strategy.
Conclusion: The VMD method provides an effective approach to automate carotid plaque vulnerability diagnosis by leveraging radiologists’ domain knowledge through multimodal learning and knowledge distillation techniques.
Abstract: Multimodal learning has attracted much attention in recent years due to its ability to effectively utilize data features from a variety of different modalities. Diagnosing the vulnerability of atherosclerotic plaques directly from carotid 3D MRI images is relatively challenging for both radiologists and conventional 3D vision networks. In clinical practice, radiologists assess patient conditions using a multimodal approach that incorporates various imaging modalities and domain-specific expertise, paving the way for the creation of multimodal diagnostic networks. In this paper, we have developed an effective strategy to leverage radiologists’ domain knowledge to automate the diagnosis of carotid plaque vulnerability through Variation inference and Multimodal knowledge Distillation (VMD). This method excels in harnessing cross-modality prior knowledge from limited image annotations and radiology reports within training data, thereby enhancing the diagnostic network’s accuracy for unannotated 3D MRI images. We conducted in-depth experiments on the dataset collected in-house and verified the effectiveness of the VMD strategy we proposed.
[251] RailSafeNet: Visual Scene Understanding for Tram Safety
Ondřej Valach, Ivan Gruber
Main category: cs.CV
TL;DR: RailSafeNet is a real-time AI framework that uses monocular video to detect track intrusions and assess collision risks by combining semantic segmentation, object detection, and distance assessment.
Details
Motivation: Tram-human interaction safety is critical in urban areas where collisions can cause serious injuries or fatalities. The paper aims to improve safety for pedestrians, drivers, cyclists, pets, and tram passengers using AI technologies.
Method: The framework fuses semantic segmentation (SegFormer B3 model), object detection (YOLOv8), and a rule-based Distance Assessor. It identifies rails, locates nearby objects, and classifies risk by comparing projected distances with the standard 1435mm rail gauge using only monocular video.
Result: On the RailSem19 dataset, the system achieved 65% IoU for semantic segmentation and 75.6% mAP for object detection at IoU threshold 0.50. The framework provides accurate, annotation-light scene understanding for real-time warnings.
Conclusion: RailSafeNet delivers effective real-time safety monitoring that can warn tram drivers before dangerous situations escalate, using only monocular video input without heavy annotation requirements.
Abstract: Tram-human interaction safety is an important challenge, given that trams frequently operate in densely populated areas, where collisions can range from minor injuries to fatal outcomes. This paper addresses the issue from the perspective of designing a solution leveraging digital image processing, deep learning, and artificial intelligence to improve the safety of pedestrians, drivers, cyclists, pets, and tram passengers. We present RailSafeNet, a real-time framework that fuses semantic segmentation, object detection and a rule-based Distance Assessor to highlight track intrusions. Using only monocular video, the system identifies rails, localises nearby objects and classifies their risk by comparing projected distances with the standard 1435mm rail gauge. Experiments on the diverse RailSem19 dataset show that a class-filtered SegFormer B3 model achieves 65% intersection-over-union (IoU), while a fine-tuned YOLOv8 attains 75.6% mean average precision (mAP) calculated at an intersection over union (IoU) threshold of 0.50. RailSafeNet therefore delivers accurate, annotation-light scene understanding that can warn drivers before dangerous situations escalate. Code available at https://github.com/oValach/RailSafeNet.
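The idea of using the 1435mm gauge as a scale reference can be sketched as follows. This is not the RailSafeNet implementation: the function names, the risk thresholds, and the simplification that the rails and the object are measured on the same image row (so one pixel-to-millimetre scale applies) are all assumptions for illustration.

```python
RAIL_GAUGE_MM = 1435.0  # standard gauge, as stated in the abstract


def lateral_distance_mm(object_x_px, left_rail_x_px, right_rail_x_px):
    """Estimate how far an object is from the track, in millimetres.

    Hypothetical helper: the known gauge converts pixels to millimetres
    on the image row where the object touches the ground.
    """
    gauge_px = right_rail_x_px - left_rail_x_px
    if gauge_px <= 0:
        raise ValueError("right rail must lie to the right of the left rail")
    mm_per_px = RAIL_GAUGE_MM / gauge_px
    if left_rail_x_px <= object_x_px <= right_rail_x_px:
        return 0.0  # object is between the rails
    nearest_rail = left_rail_x_px if object_x_px < left_rail_x_px else right_rail_x_px
    return abs(object_x_px - nearest_rail) * mm_per_px


def classify_risk(distance_mm, danger_mm=500.0, warning_mm=1500.0):
    """Map a distance to a risk label; both thresholds are made-up values."""
    if distance_mm <= danger_mm:
        return "danger"
    if distance_mm <= warning_mm:
        return "warning"
    return "safe"


# Rails detected 200 px apart; object 100 px to the right of the right rail.
distance = lateral_distance_mm(400.0, 100.0, 300.0)
risk = classify_risk(distance)
```

Because the scale is recomputed per image row, this kind of rule needs no camera calibration or depth annotation, which fits the annotation-light design the abstract describes.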
cs.AI
[252] V-Math: An Agentic Approach to the Vietnamese National High School Graduation Mathematics Exams
Duong Q. Nguyen, Quy P. Nguyen, Nguyen Van Nhon, Quang-Thinh Bui, H. Nguyen-Xuan
Main category: cs.AI
TL;DR: V-Math is an autonomous AI framework with three specialized agents that helps Vietnamese high school students prepare for national math exams through personalized practice and assists teachers in generating compliant exam questions.
Details
Motivation: To address the need for scalable and equitable mathematics preparation for Vietnamese high school students taking national graduation exams, while reducing teacher workload in creating exam-aligned practice materials.
Method: Developed a framework with three AI agents: 1) specification-matrix-conditioned question generator, 2) solver/explainer for step-by-step reasoning, and 3) personalized tutor that adapts to student performance. The system supports both student practice modes and teacher-oriented question generation features.
Result: Preliminary evaluations show V-Math produces matrix-aligned exams with high solution accuracy, delivers coherent explanations, and enhances variety of practice materials. The system successfully generates compliant exam questions and builds diverse question banks.
Conclusion: V-Math demonstrates potential to support scalable, equitable mathematics preparation aligned with national standards while empowering teachers through AI-assisted exam creation, reducing manual workload and enriching instructional resources.
Abstract: This paper develops an autonomous agentic framework called V-Math that aims to assist Vietnamese high school students in preparing for the National High School Graduation Mathematics Exams (NHSGMEs). The salient framework integrates three specialized AI agents: a specification-matrix-conditioned question generator, a solver/explainer for detailed step-by-step reasoning, and a personalized tutor that adapts to student performance. Beyond enabling self-paced student practice, V-Math supports teachers by generating innovative, compliant exam questions and building diverse, high-quality question banks. This reduces manual workload and enriches instructional resources. We describe the system architecture, focusing on practice modes for learners and teacher-oriented features for question generation. Preliminary evaluations demonstrate that V-Math produces matrix-aligned exams with high solution accuracy, delivers coherent explanations, and enhances the variety of practice materials. These results highlight its potential to support scalable, equitable mathematics preparation aligned with national standards while also empowering teachers through AI-assisted exam creation.
[253] DISPLIB: a library of train dispatching problems
Oddvar Kloster, Bjørnar Luteberget, Carlo Mannino, Giorgio Sartor
Main category: cs.AI
TL;DR: Introduces DISPLIB - a common problem definition and file format for train re-routing and re-scheduling problems to enable reproducibility and performance comparisons across different optimization algorithms.
Details
Motivation: Current optimization algorithms for train dispatching are tied to specific industrial cases with little code/data sharing, hindering reproducibility and preventing fair performance comparisons between different approaches.
Method: Created a standardized problem definition and file format (DISPLIB) that captures main features of train re-routing and re-scheduling. Gathered real-world problem instances and developed a reference solver implementation.
Result: Established an open repository of industrial problem instances and reference implementation, enabling researchers to work on train dispatching without industrial connections and perform empirical solver comparisons.
Conclusion: DISPLIB provides a foundation for reproducible research and fair performance evaluation in train dispatching optimization, similar to established communities for MILP, SAT, TSP, and VRP problems.
Abstract: Optimization-based decision support systems have a significant potential to reduce delays, and thus improve efficiency on the railways, by automatically re-routing and re-scheduling trains after delays have occurred. The operations research community has dedicated a lot of effort to developing optimization algorithms for this problem, but each study is typically tightly connected with a specific industrial use case. Code and data are seldom shared publicly. This fact hinders reproducibility, and has led to a proliferation of papers describing algorithms for more or less compatible problem definitions, without any real opportunity for readers to assess their relative performance. Inspired by the successful communities around MILP, SAT, TSP, VRP, etc., we introduce a common problem definition and file format, DISPLIB, which captures all the main features of train re-routing and re-scheduling. We have gathered problem instances from multiple real-world use cases and made them openly available. In this paper, we describe the problem definition, the industrial instances, and a reference solver implementation. This allows any researcher or developer to work on the train dispatching problem without an industrial connection, and enables the research community to perform empirical comparisons between solvers. All materials are available online at https://displib.github.io.
[254] InPhyRe Discovers: Large Multimodal Models Struggle in Inductive Physical Reasoning
Gautam Sreekumar, Vishnu Naresh Boddeti
Main category: cs.AI
TL;DR: InPhyRe is the first benchmark to evaluate inductive physical reasoning in LMMs, revealing they struggle with applying parametric knowledge to novel physical scenarios and suffer from language bias.
Details
Motivation: Current LMMs encode physical laws as parametric knowledge but cannot adapt to unseen physical environments that violate these laws, which is crucial for safety-critical applications where human-like adaptive reasoning is needed.
Method: Proposed InPhyRe benchmark using algorithmically generated synthetic collision videos to test LMMs’ ability to predict collision outcomes in novel physical scenarios that violate universal physical laws seen during training.
Result: Evaluation of 13 LMMs showed they: (1) struggle to apply parametric knowledge to reasoning, (2) have weak inductive reasoning when demonstrations violate physical laws, and (3) suffer from language bias, largely ignoring visual inputs.
Conclusion: LMMs currently lack trustworthy inductive physical reasoning capabilities, questioning their reliability for visual-based reasoning in safety-critical applications where adaptation to novel physical environments is required.
Abstract: Large multimodal models (LMMs) encode universal physical laws observed during training, such as momentum conservation, as parametric knowledge. It allows LMMs to answer physical reasoning queries, such as the outcome of a potential collision event from visual input. However, since parametric knowledge includes only the physical laws seen during training, it is insufficient for reasoning when the inference scenario violates these physical laws. In contrast, humans possess the skill to adapt their physical reasoning to unseen physical environments from a few visual examples. This ability, which we refer to as inductive physical reasoning, is indispensable for LMMs if they are to replace human agents in safety-critical applications. Despite its importance, existing visual benchmarks evaluate only the parametric knowledge in LMMs, and not inductive physical reasoning. To this end, we propose InPhyRe, the first visual question answering benchmark to measure inductive physical reasoning in LMMs. InPhyRe evaluates LMMs on their ability to predict the outcome of collision events in algorithmically generated synthetic collision videos. By inspecting 13 LMMs, InPhyRe informs us that (1) LMMs struggle to apply their limited parametric knowledge about universal physical laws to reasoning, (2) inductive physical reasoning in LMMs is weak when demonstration samples violate universal physical laws, and (3) inductive physical reasoning in LMMs suffers from language bias and largely ignores the visual inputs, questioning the trustworthiness of LMMs regarding visual inputs.
[255] LLMAP: LLM-Assisted Multi-Objective Route Planning with User Preferences
Liangqi Yuan, Dong-Jun Han, Christopher G. Brinton, Sabine Brunswicker
Main category: cs.AI
TL;DR: A novel LLM-assisted route planning system that combines LLM parsing with multi-step graph search to handle natural language preferences and complex constraints for optimal route planning.
Details
Motivation: Current LLM-based route planning approaches struggle with handling extensive map data or understanding natural language preferences, and face challenges with heterogeneous spatio-temporal user distributions.
Method: Uses LLM-as-Parser to comprehend natural language and extract preferences, combined with Multi-Step Graph construction with iterative Search (MSGS) algorithm for optimal route finding with multi-objective optimization.
Result: Superior performance with guarantees across multiple constraints, demonstrated through extensive experiments with 1,000 routing prompts across 14 countries and 27 cities.
Conclusion: The LLMAP system effectively bridges the gap between natural language understanding and optimal route planning, handling complex constraints and achieving robust performance globally.
Abstract: The rise of large language models (LLMs) has made natural language-driven route planning an emerging research area that encompasses rich user objectives. Current research exhibits two distinct approaches: direct route planning using LLM-as-Agent and graph-based searching strategies. However, LLMs in the former approach struggle to handle extensive map data, while the latter shows limited capability in understanding natural language preferences. Additionally, a more critical challenge arises from the highly heterogeneous and unpredictable spatio-temporal distribution of users across the globe. In this paper, we introduce a novel LLM-Assisted route Planning (LLMAP) system that employs an LLM-as-Parser to comprehend natural language, identify tasks, and extract user preferences and recognize task dependencies, coupled with a Multi-Step Graph construction with iterative Search (MSGS) algorithm as the underlying solver for optimal route finding. Our multi-objective optimization approach adaptively tunes objective weights to maximize points of interest (POI) quality and task completion rate while minimizing route distance, subject to three key constraints: user time limits, POI opening hours, and task dependencies. We conduct extensive experiments using 1,000 routing prompts sampled with varying complexity across 14 countries and 27 cities worldwide. The results demonstrate that our approach achieves superior performance with guarantees across multiple constraints.
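The objective and constraints described above can be made concrete with a small sketch. It is not the MSGS algorithm: the `Stop` record, the fixed weights (the paper tunes them adaptively), and the hour-based time representation are hypothetical choices used only to show how the three constraints and the three objective terms fit together.

```python
from dataclasses import dataclass


@dataclass
class Stop:
    task: str       # task served by this point of interest (POI)
    quality: float  # POI quality, assumed to lie in [0, 1]
    arrival: float  # planned arrival time, in hours
    opens: float    # opening time, in hours
    closes: float   # closing time, in hours


def is_feasible(stops, time_limit, dependencies, start=0.0):
    """Check the three constraints: opening hours, task dependencies, time limit.

    dependencies maps a task to the set of tasks that must come before it.
    """
    done = set()
    for stop in stops:
        if not (stop.opens <= stop.arrival <= stop.closes):
            return False
        if not dependencies.get(stop.task, set()) <= done:
            return False
        done.add(stop.task)
    return not stops or stops[-1].arrival - start <= time_limit


def route_score(stops, distance_km, n_tasks, weights=(1.0, 1.0, 0.1)):
    """Weighted sum: reward POI quality and task completion, penalize distance."""
    w_quality, w_completion, w_distance = weights
    quality = sum(s.quality for s in stops) / len(stops) if stops else 0.0
    completion = len({s.task for s in stops}) / n_tasks
    return w_quality * quality + w_completion * completion - w_distance * distance_km


# Made-up two-task request: visit a cafe before the bank, within three hours.
route = [Stop("coffee", 0.8, 9.0, 8.0, 12.0), Stop("bank", 0.6, 10.0, 9.0, 17.0)]
deps = {"bank": {"coffee"}}
feasible = is_feasible(route, time_limit=3.0, dependencies=deps, start=8.5)
score = route_score(route, distance_km=5.0, n_tasks=2)
```

In this framing a solver would enumerate or search candidate routes, discard infeasible ones, and keep the highest-scoring route; the parser's job is to fill in the tasks, dependencies, and preferences from the user's prompt.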
[256] HLSMAC: A New StarCraft Multi-Agent Challenge for High-Level Strategic Decision-Making
Xingxing Hong, Yungong Wang, Dexin Jin, Ye Yuan, Ximing Huang, Zijian Wu, Wenxin Li
Main category: cs.AI
TL;DR: HLSMAC is a new StarCraft II benchmark with 12 scenarios based on classical stratagems, designed to evaluate high-level strategic decision-making in MARL beyond micromanagement.
Details
Motivation: Existing MARL benchmarks like SMAC focus primarily on micromanagement, limiting comprehensive evaluation of high-level strategic intelligence in multi-agent systems.
Method: Created 12 carefully designed StarCraft II scenarios based on the Thirty-Six Stratagems, each challenging agents with strategic elements like tactical maneuvering, timing coordination, and deception. Proposed novel metrics beyond win rate including ability utilization and advancement efficiency.
Result: The benchmark serves as a robust testbed for advancing multi-agent strategic decision-making, as demonstrated through comprehensive experiments with state-of-the-art MARL algorithms and LLM-based agents.
Conclusion: HLSMAC successfully addresses the gap in evaluating high-level strategic capabilities in MARL and opens up new avenues for assessing strategic decision-making in multi-agent systems.
Abstract: Benchmarks are crucial for assessing multi-agent reinforcement learning (MARL) algorithms. While StarCraft II-related environments have driven significant advances in MARL, existing benchmarks like SMAC focus primarily on micromanagement, limiting comprehensive evaluation of high-level strategic intelligence. To address this, we introduce HLSMAC, a new cooperative MARL benchmark with 12 carefully designed StarCraft II scenarios based on classical stratagems from the Thirty-Six Stratagems. Each scenario corresponds to a specific stratagem and is designed to challenge agents with diverse strategic elements, including tactical maneuvering, timing coordination, and deception, thereby opening up avenues for evaluating high-level strategic decision-making capabilities. We also propose novel metrics across multiple dimensions beyond conventional win rate, such as ability utilization and advancement efficiency, to assess agents’ overall performance within the HLSMAC environment. We integrate state-of-the-art MARL algorithms and LLM-based agents with our benchmark and conduct comprehensive experiments. The results demonstrate that HLSMAC serves as a robust testbed for advancing multi-agent strategic decision-making.
[257] Developing an aeroponic smart experimental greenhouse for controlling irrigation and plant disease detection using deep learning and IoT
Mohammadreza Narimani, Ali Hajiahmad, Ali Moghimi, Reza Alimardani, Shahin Rafiee, Amir Hossein Mirzabe
Main category: cs.AI
TL;DR: Development of a smart aeroponic greenhouse system combining IoT for environmental monitoring and AI for plant disease detection, achieving 92% accuracy in identifying drought stress and rust leaves using VGG-19 algorithm.
Details
Motivation: To improve crop production through continuous monitoring of environmental conditions and plant status, enabling prompt management decisions in greenhouse settings.
Method: Integrated IoT platform for environmental control and data collection, combined with AI framework using VGG-19, InceptionResNetV2, and InceptionV3 algorithms for disease detection from periodically captured plant images.
Result: IoT system successfully published real-time data (temperature, humidity, water flow, tank volume) and adjusted parameters for optimal growth. VGG-19 algorithm achieved highest accuracy (92%) in identifying drought stress and rust leaves compared to other algorithms.
Conclusion: The smart aeroponic greenhouse system effectively combines IoT and AI technologies to monitor plant health and environmental conditions, with VGG-19 showing superior performance in disease detection, providing valuable insights for informed management decisions.
Abstract: Controlling environmental conditions and monitoring plant status in greenhouses is critical to promptly making appropriate management decisions aimed at promoting crop production. The primary objective of this research study was to develop and test a smart aeroponic greenhouse on an experimental scale where the status of Geranium plants and environmental conditions are continuously monitored through the integration of the internet of things (IoT) and artificial intelligence (AI). An IoT-based platform was developed to control the environmental conditions of plants more efficiently and provide insights to users to make informed management decisions. In addition, we developed an AI-based disease detection framework using VGG-19, InceptionResNetV2, and InceptionV3 algorithms to analyze the images captured periodically after an intentional inoculation. The performance of the AI framework was compared with an expert’s evaluation of disease status. Preliminary results showed that the IoT system implemented in the greenhouse environment is able to publish data such as temperature, humidity, water flow, and volume of charge tanks online continuously to users and adjust the controlled parameters to provide an optimal growth environment for the plants. Furthermore, the results of the AI framework demonstrate that the VGG-19 algorithm distinguished drought-stressed and rust-affected leaves from healthy leaves with the highest accuracy (92%) among the tested algorithms.
[258] Agentic AI for Financial Crime Compliance
Henrik Axelsen, Valdemar Licht, Jan Damsgaard
Main category: cs.AI
TL;DR: Agentic AI system for financial crime compliance that automates workflows with explainability and regulatory alignment
Details
Motivation: Rising costs and complexity of financial crime compliance without measurable effectiveness improvements, and lack of regulatory-aligned AI solutions
Method: Action Design Research (ADR) process with fintech firm and regulatory stakeholders, using artifact-centric modeling with autonomous agents and task-specific model routing
Result: Developed reference architecture and real-world prototype demonstrating automated onboarding, monitoring, investigation, and reporting with explainability and traceability
Conclusion: Agentic AI with accountable governance structures can support transparency and institutional trust in regulated financial environments, extending IS literature on AI-enabled compliance
Abstract: The cost and complexity of financial crime compliance (FCC) continue to rise, often without measurable improvements in effectiveness. While AI offers potential, most solutions remain opaque and poorly aligned with regulatory expectations. This paper presents the design and deployment of an agentic AI system for FCC in digitally native financial platforms. Developed through an Action Design Research (ADR) process with a fintech firm and regulatory stakeholders, the system automates onboarding, monitoring, investigation, and reporting, emphasizing explainability, traceability, and compliance-by-design. Using artifact-centric modeling, it assigns clearly bounded roles to autonomous agents and enables task-specific model routing and audit logging. The contribution includes a reference architecture, a real-world prototype, and insights into how Agentic AI can reconfigure FCC workflows under regulatory constraints. Our findings extend IS literature on AI-enabled compliance by demonstrating how automation, when embedded within accountable governance structures, can support transparency and institutional trust in high-stakes, regulated environments.
[259] AIssistant: An Agentic Approach for Human–AI Collaborative Scientific Work on Reviews and Perspectives in Machine Learning
Sasi Kiran Gaddipati, Farhana Keya, Gollam Rabby, Sören Auer
Main category: cs.AI
TL;DR: AIssistant is an open-source Human-AI collaborative framework that simplifies scientific workflow creation, improving drafting efficiency and thematic consistency while maintaining human oversight for accuracy and rigor.
Details
Motivation: Current AI-assisted research tools are fragmented and lack human-centered workflows, creating a need for an integrated framework that supports end-to-end scientific paper creation with proper human oversight.
Method: Developed AIssistant framework with modular tools for literature synthesis, experimentation, citation management, and LaTeX generation. Evaluated through three layers: independent human review (NeurIPS standards), automated LLM review (GPT-5), and program chair oversight.
Result: The system improves drafting efficiency and thematic consistency. However, limitations include hallucinated citations, difficulty adapting to dynamic paper structures, and incomplete multimodal content integration.
Conclusion: Human-AI collaboration remains essential for maintaining factual correctness, methodological soundness, and ethical compliance in scientific research, despite AIssistant’s effectiveness in streamlining workflow creation.
Abstract: Advances in AI-assisted research have introduced powerful tools for literature retrieval, hypothesis generation, experimentation, and manuscript preparation. However, systems remain fragmented and lack human-centred workflows. To address these gaps, we introduce AIssistant, an agentic, open-source Human-AI collaborative framework designed to simplify the end-to-end creation of scientific workflows. Since our development is still in an early stage, we present here the first experiments with AIssistant for perspective and review research papers in machine learning. Our system integrates modular tools and agents for literature synthesis, section-wise experimentation, citation management, and automatic LaTeX paper text generation, while maintaining human oversight at every stage to ensure accuracy, coherence, and scholarly rigour. We conducted a comprehensive evaluation across three layers: (1) Independent Human Review, following NeurIPS double-blind standards; (2) Automated LLM Review, using GPT-5 as a scalable human review proxy; and (3) Program Chair Oversight, where the chair monitors the entire review process and makes final validation and acceptance decisions. The results demonstrate that AIssistant improves drafting efficiency and thematic consistency. Nonetheless, Human-AI collaboration remains essential for maintaining factual correctness, methodological soundness, and ethical compliance. Despite its effectiveness, we identify key limitations, including hallucinated citations, difficulty adapting to dynamic paper structures, and incomplete integration of multimodal content.
[260] Small Models, Big Results: Achieving Superior Intent Extraction through Decomposition
Danielle Cohen, Yoni Halpern, Noam Kahlon, Joel Oren, Omri Berkovitch, Sapir Caduri, Ido Dagan, Anatoly Efros
Main category: cs.AI
TL;DR: A novel decomposed approach for on-device intent inference from UI interactions using structured summarization and fine-tuned models, achieving better performance than large MLLMs while preserving privacy.
Details
Motivation: Small on-device models struggle with accurate intent inference from UI interaction trajectories, while large MLLMs require datacenter resources and compromise privacy. There's a need for privacy-preserving, low-cost, low-latency solutions.
Method: Two-step approach: 1) Structured interaction summarization to capture key information from each user action, 2) Intent extraction using a fine-tuned model operating on the aggregated summaries.
Result: The method improves intent understanding in resource-constrained models and even surpasses the base performance of large multi-modal large language models (MLLMs).
Conclusion: The decomposed approach enables effective on-device intent inference while maintaining privacy, low cost, and low latency, outperforming larger models in this specific application.
Abstract: Understanding user intents from UI interaction trajectories remains a challenging, yet crucial, frontier in intelligent agent development. While massive, datacenter-based, multi-modal large language models (MLLMs) possess greater capacity to handle the complexities of such sequences, smaller models, which can run on-device to provide a privacy-preserving, low-cost, and low-latency user experience, struggle with accurate intent inference. We address these limitations by introducing a novel decomposed approach: first, we perform structured interaction summarization, capturing key information from each user action. Second, we perform intent extraction using a fine-tuned model operating on the aggregated summaries. This method improves intent understanding in resource-constrained models, even surpassing the base performance of large MLLMs.
[261] Building Coding Agents via Entropy-Enhanced Multi-Turn Preference Optimization
Jiahao Yu, Zelei Cheng, Xian Wu, Xinyu Xing
Main category: cs.AI
TL;DR: A new entropy-enhanced framework adapts preference optimization algorithms to preserve output diversity for better test-time scaling in software engineering tasks, achieving state-of-the-art results among open-weight models on SWE-bench with models up to 106B parameters.
Details
Motivation: Current LLMs struggle with complex software engineering tasks that require multi-step reasoning and tool use. Standard alignment methods like DPO and KTO reduce output diversity, limiting test-time scaling effectiveness, and existing approaches don't handle multi-turn interactions well for coding agents.
Method: The framework augments preference objectives to explicitly preserve policy entropy and generalizes learning to optimize over multi-turn interactions rather than single-turn responses. It also includes a hybrid best-trajectory selection scheme combining a learned verifier with model-free approaches.
Result: Achieved new state-of-the-art results on the SWE-bench leaderboard among open-weight models. A 30B parameter model trained with the framework ranked 1st on SWE-bench Lite and 4th on SWE-bench Verified, surpassed only by models with over 10x more parameters.
Conclusion: The framework bridges the gap in multi-turn preference optimization for coding agents, preserving diversity for effective test-time scaling while achieving strong performance on complex software engineering tasks with relatively smaller models.
Abstract: Software engineering presents complex, multi-step challenges for Large Language Models (LLMs), requiring reasoning over large codebases and coordinated tool use. The difficulty of these tasks is exemplified by benchmarks like SWE-bench, where current LLMs still struggle to resolve real-world issues. A promising approach to enhance performance is test-time scaling (TTS), but its gains are heavily dependent on the diversity of model outputs. While standard alignment methods such as Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) are effective at aligning model outputs with human preferences, this process can come at the cost of reduced diversity, limiting the effectiveness of TTS. Additionally, existing preference optimization algorithms are typically designed for single-turn tasks and do not fully address the complexities of multi-turn reasoning and tool integration required for interactive coding agents. To bridge this gap, we introduce an entropy-enhanced framework that adapts existing preference optimization algorithms to the multi-turn, tool-assisted setting. The framework augments the preference objective to explicitly preserve policy entropy and generalizes learning to optimize over multi-turn interactions rather than single-turn responses. We validate the framework by fine-tuning a diverse suite of models from different families and sizes (up to 106B parameters). To maximize performance gains from TTS, we further propose a hybrid best-trajectory selection scheme combining a learned verifier model with model-free approaches. On the SWE-bench leaderboard, our approach establishes new state-of-the-art results among open-weight models. A 30B parameter model trained with the framework ranks 1st on SWE-bench Lite and 4th on SWE-bench Verified on the open-weight leaderboard, surpassed only by models with over 10x more parameters (e.g., >350B).
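To make the idea of an entropy-preserving preference objective concrete, here is a minimal sketch. It assumes a standard DPO loss over whole trajectories (log-probabilities summed across turns) with a generic entropy bonus subtracted from it. The names `trajectory_logp`, `entropy_regularized_loss`, and the weight `lam` are hypothetical, and this is not claimed to be the paper's actual objective.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def trajectory_logp(turn_logps):
    """Log-probability of a multi-turn trajectory: the sum over its turns."""
    return sum(turn_logps)

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)

def entropy(probs):
    """Shannon entropy of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_regularized_loss(pref_loss, action_probs, lam=0.05):
    """Assumed form: preference loss minus a weighted entropy bonus."""
    return pref_loss - lam * entropy(action_probs)

# Two-turn trajectories; all numbers are illustrative only.
chosen = trajectory_logp([-1.0, -0.5])
rejected = trajectory_logp([-1.2, -0.9])
base = dpo_loss(chosen, rejected, ref_chosen=-1.6, ref_rejected=-2.0)
peaked = entropy_regularized_loss(base, [0.97, 0.01, 0.01, 0.01])
diverse = entropy_regularized_loss(base, [0.25, 0.25, 0.25, 0.25])
```

With the same preference margin, the more diverse policy gets the lower loss, which is the behaviour an entropy term is meant to encourage.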
[262] Enhancing Physical Consistency in Lightweight World Models
Dingrui Wang, Zhexiao Sun, Zhouheng Li, Cheng Wang, Youlun Peng, Hongyuan Ye, Baha Zarrouki, Wei Li, Mattia Piccinini, Lei Xie, Johannes Betz
Main category: cs.AI
TL;DR: PIWM is a compact physics-informed world model that achieves better performance than larger baselines through Soft Mask training and Warm Start inference techniques.
Details
Motivation: Address the trade-off between model size and performance in world models - large models are computationally expensive while small models struggle with accurate physics predictions.
Method: Physics-Informed BEV World Model (PIWM) with Soft Mask training for improved dynamic object modeling and Warm Start inference technique for enhanced prediction quality.
Result: At 400M parameters, PIWM surpasses the baseline by 60.6% in weighted overall score. Even the smallest PIWM (130M) beats the largest baseline (400M) by 7.4% with 28% faster inference speed.
Conclusion: PIWM demonstrates that compact models can outperform larger counterparts in world modeling through physics-informed design and effective training/inference techniques.
Abstract: A major challenge in deploying world models is the trade-off between size and performance. Large world models can capture rich physical dynamics but require massive computing resources, making them impractical for edge devices. Small world models are easier to deploy but often struggle to learn accurate physics, leading to poor predictions. We propose the Physics-Informed BEV World Model (PIWM), a compact model designed to efficiently capture physical interactions in bird’s-eye-view (BEV) representations. PIWM uses Soft Mask during training to improve dynamic object modeling and future prediction. We also introduce a simple yet effective technique, Warm Start, for inference to enhance prediction quality with a zero-shot model. Experiments show that at the same parameter scale (400M), PIWM surpasses the baseline by 60.6% in weighted overall score. Moreover, even when compared with the largest baseline model (400M), the smallest PIWM (130M Soft Mask) achieves a 7.4% higher weighted overall score with a 28% faster inference speed.
[263] Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction
Ryan Lucas, Kayhan Behdin, Zhipeng Wang, Qingquan Song, Shao Tang, Rahul Mazumder
Main category: cs.AI
TL;DR: Reasoning-Aware Compression (RAC) improves pruning performance for reasoning language models by jointly reconstructing input activations and chain-of-thought traces, addressing the decode-dominated nature of reasoning tasks.
Details
Motivation: Standard LLM pruning methods focus on input reconstruction but perform poorly on reasoning models, causing performance loss and sometimes making models slower by producing more thinking tokens with worse quality.
Method: Introduces RAC - a simple drop-in fix that during pruning jointly reconstructs activations from both the input and the model’s on-policy chain-of-thought traces, integrating seamlessly with existing pruning workflows like SparseGPT.
Result: RAC significantly boosts pruning performance for reasoning models compared to standard compression techniques, which cause greater performance loss in reasoning tasks than typical language modeling.
Conclusion: Reasoning-aware compression that considers both input and chain-of-thought traces during pruning is essential for effectively compressing reasoning language models while maintaining performance.
Abstract: Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces during inference time which make them costly to deploy at scale. We show that using compression techniques such as neural network pruning produces greater performance loss than in typical language modeling tasks, and in some cases can make the model slower since they cause the model to produce more thinking tokens but with worse performance. We show that this is partly due to the fact that standard LLM pruning methods often focus on input reconstruction, whereas reasoning is a decode-dominated task. We introduce a simple, drop-in fix: during pruning we jointly reconstruct activations from the input and the model’s on-policy chain-of-thought traces. This “Reasoning-Aware Compression” (RAC) integrates seamlessly into existing pruning workflows such as SparseGPT, and boosts their performance significantly. Code reproducing the results in the paper can be found at: https://github.com/RyanLucas3/RAC
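The effect of the calibration set on pruning can be illustrated with a toy example. The sketch below swaps in a simpler activation-weighted magnitude score (|weight| times the input-feature norm, in the style of Wanda) in place of SparseGPT, and uses made-up activations. It is not the RAC implementation; it only shows how adding chain-of-thought activations to the calibration data can change which weights are kept.

```python
def column_norms(activations):
    """L2 norm of each input feature over all calibration tokens."""
    n_features = len(activations[0])
    return [sum(row[j] ** 2 for row in activations) ** 0.5
            for j in range(n_features)]

def prune_row(weights, norms, n_prune):
    """Zero the n_prune weights with the lowest |w| * activation-norm score."""
    scores = [abs(w) * n for w, n in zip(weights, norms)]
    drop = set(sorted(range(len(weights)), key=lambda j: scores[j])[:n_prune])
    return [0.0 if j in drop else w for j, w in enumerate(weights)]

# Hypothetical activations: feature 2 is silent on prompts but active
# while the model decodes its chain of thought.
prompt_acts = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]]
cot_acts = [[0.1, 0.1, 2.0], [0.0, 0.2, 1.8]]
weights = [0.5, 0.4, 0.3]

input_only = prune_row(weights, column_norms(prompt_acts), n_prune=1)
joint = prune_row(weights, column_norms(prompt_acts + cot_acts), n_prune=1)
```

Calibrating on prompts alone removes the weight on feature 2, while the joint calibration keeps it and removes the weight on feature 1 instead.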
[264] Agentic Lybic: Multi-Agent Execution System with Tiered Reasoning and Orchestration
Liangxuan Guo, Bin Zhu, Qingqian Tao, Kangning Liu, Xun Zhao, Xianzhe Qin, Jin Gao, Guangfu Hao
Main category: cs.AI
TL;DR: Agentic Lybic is a multi-agent system using finite-state machine architecture for desktop automation, achieving 57.07% success rate on OSWorld benchmark through dynamic orchestration and quality control.
Details
Motivation: Existing autonomous agents for desktop automation struggle with complex multi-step tasks due to poor coordination and inadequate quality control.
Method: A multi-agent system with FSM-based architecture comprising Controller, Manager, three specialized Workers (Technician, Operator, Analyst), and Evaluator, using dynamic routing for optimal execution strategy selection.
Result: Achieves state-of-the-art 57.07% success rate in 50 steps on OSWorld benchmark, substantially outperforming existing methods.
Conclusion: Principled multi-agent orchestration with continuous quality control provides superior reliability for generalized desktop automation in complex computing environments.
Abstract: Autonomous agents for desktop automation struggle with complex multi-step tasks due to poor coordination and inadequate quality control. We introduce Agentic Lybic, a novel multi-agent system where the entire architecture operates as a finite-state machine (FSM). This core innovation enables dynamic orchestration. Our system comprises four components: a Controller, a Manager, three Workers (Technician for code-based operations, Operator for GUI interactions, and Analyst for decision support), and an Evaluator. The critical mechanism is the FSM-based routing between these components, which provides flexibility and generalization by dynamically selecting the optimal execution strategy for each subtask. This principled orchestration, combined with robust quality gating, enables adaptive replanning and error recovery. Evaluated officially on the OSWorld benchmark, Agentic Lybic achieves a state-of-the-art 57.07% success rate in 50 steps, substantially outperforming existing methods. Results demonstrate that principled multi-agent orchestration with continuous quality control provides superior reliability for generalized desktop automation in complex computing environments.
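The FSM-based routing described above can be sketched as a small state machine. The component names come from the abstract; the transition rules, the `WORKER_FOR` mapping, and the `quality_check` callback are assumptions made for illustration, not the system's actual design.

```python
from enum import Enum, auto

class State(Enum):
    CONTROLLER = auto()
    MANAGER = auto()
    TECHNICIAN = auto()   # code-based operations
    OPERATOR = auto()     # GUI interactions
    ANALYST = auto()      # decision support
    EVALUATOR = auto()
    DONE = auto()

# Hypothetical routing table from subtask kind to worker state.
WORKER_FOR = {
    "code": State.TECHNICIAN,
    "gui": State.OPERATOR,
    "decision": State.ANALYST,
}

def run(subtasks, quality_check, max_steps=50):
    """Walk the FSM over a list of subtask kinds and return the visited states."""
    state, index, trace = State.CONTROLLER, 0, []
    for _ in range(max_steps):
        trace.append(state)
        if state is State.CONTROLLER:
            state = State.DONE if index >= len(subtasks) else State.MANAGER
        elif state is State.MANAGER:
            state = WORKER_FOR[subtasks[index]]
        elif state in WORKER_FOR.values():
            state = State.EVALUATOR
        elif state is State.EVALUATOR:
            # Quality gate: advance only on success, otherwise replan the subtask.
            if quality_check(index):
                index += 1
            state = State.CONTROLLER
        else:  # State.DONE
            break
    return trace

trace = run(["code", "gui"], quality_check=lambda i: True)
```

A failed quality check leaves `index` unchanged, so the same subtask is routed again on the next pass, which is one simple way to model replanning and error recovery.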
[265] Empowering Clinical Trial Design through AI: A Randomized Evaluation of PowerGPT
Yiwen Lu, Lu Li, Dazheng Zhang, Xinyao Jian, Tingyin Wang, Siqi Chen, Yuqing Lei, Jiayi Tong, Zhaohan Xi, Haitao Chu, Chongliang Luo, Alexis Ogdie, Brian Athey, Alparslan Turan, Michael Abramoff, Joseph C Cappelleri, Hua Xu, Yun Lu, Jesse Berlin, Daniel I. Sessler, David A. Asch, Xiaoqian Jiang, Yong Chen
Main category: cs.AI
TL;DR: PowerGPT is an AI system that combines large language models with statistical engines to automate test selection and sample size calculations for clinical trials, significantly improving completion rates, accuracy, and efficiency compared to traditional methods.
Details
Motivation: Sample size calculations for power analysis are critical but complex in clinical research, creating barriers for many researchers due to their reliance on statistical expertise.
Method: PowerGPT integrates large language models (LLMs) with statistical engines to automate test selection and sample size estimation in trial design. The system was evaluated in a randomized trial comparing its performance against traditional approaches.
Result: PowerGPT significantly improved task completion rates (99.3% vs. 88.9% for test selection, 99.3% vs. 77.8% for sample size calculation) and accuracy (94.1% vs. 55.4% in sample size estimation, p < 0.001), while reducing average completion time (4.0 vs. 9.3 minutes, p < 0.001). Gains were consistent across statistical tests and benefited both statisticians and non-statisticians.
Conclusion: PowerGPT represents a scalable AI-driven approach that enhances accessibility, efficiency, and accuracy in statistical power analysis for clinical research, and is already being deployed across multiple institutions.
Abstract: Sample size calculations for power analysis are critical for clinical research and trial design, yet their complexity and reliance on statistical expertise create barriers for many researchers. We introduce PowerGPT, an AI-powered system integrating large language models (LLMs) with statistical engines to automate test selection and sample size estimation in trial design. In a randomized trial to evaluate its effectiveness, PowerGPT significantly improved task completion rates (99.3% vs. 88.9% for test selection, 99.3% vs. 77.8% for sample size calculation) and accuracy (94.1% vs. 55.4% in sample size estimation, p < 0.001), while reducing average completion time (4.0 vs. 9.3 minutes, p < 0.001). These gains were consistent across various statistical tests and benefited both statisticians and non-statisticians, bridging expertise gaps. Already under deployment across multiple institutions, PowerGPT represents a scalable AI-driven approach that enhances accessibility, efficiency, and accuracy in statistical power analysis for clinical research.
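As an example of the kind of calculation such a system automates, the sketch below computes the per-group sample size for a two-sided, two-sample comparison of means using the textbook normal approximation, n = 2((z_{1-α/2} + z_{power}) / d)², where d is the standardized effect size. The function name is hypothetical, and nothing here reflects how PowerGPT itself is implemented.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Per-group n for a two-sided two-sample test (normal approximation)."""
    z = NormalDist()                       # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)     # critical value for the test
    z_power = z.inv_cdf(power)             # quantile matching the target power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

n_medium = sample_size_per_group(0.5)      # medium effect (Cohen's d = 0.5)
n_small = sample_size_per_group(0.2)       # small effect (Cohen's d = 0.2)
```

Under this approximation a medium effect needs 63 participants per group and a small effect 393. An exact t-based calculation gives slightly larger numbers.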
[266] Co-Alignment: Rethinking Alignment as Bidirectional Human-AI Cognitive Adaptation
Yubo Li, Weiyi Song
Main category: cs.AI
TL;DR: BiCA enables bidirectional co-alignment where humans and AI mutually adapt through learnable protocols, representation mapping, and KL-budget constraints, achieving superior collaboration and safety compared to traditional single-directional alignment.
Details
Motivation: Current AI alignment through RLHF treats human cognition as fixed while AI conforms to human preferences. The authors propose shifting from this single-directional paradigm to co-alignment where both humans and AI mutually adapt.
Method: Bidirectional Cognitive Alignment (BiCA) uses learnable protocols, representation mapping, and KL-budget constraints for controlled co-evolution between humans and AI systems.
Result: In collaborative navigation, BiCA achieved 85.5% success (vs 70.3% baseline), 230% better mutual adaptation, 332% better protocol convergence. Emergent protocols outperformed handcrafted ones by 84%, with 23% better out-of-distribution robustness and 46% synergy improvement.
Conclusion: Optimal collaboration exists at the intersection (not union) of human and AI capabilities, validating the shift from single-directional to co-alignment paradigms with bidirectional adaptation improving both performance and safety.
Abstract: Current AI alignment through RLHF follows a single-directional paradigm in which AI conforms to human preferences while human cognition is treated as fixed. We propose a shift to co-alignment through Bidirectional Cognitive Alignment (BiCA), where humans and AI mutually adapt. BiCA uses learnable protocols, representation mapping, and KL-budget constraints for controlled co-evolution. In collaborative navigation, BiCA achieved 85.5% success versus 70.3% baseline, with 230% better mutual adaptation and 332% better protocol convergence. Emergent protocols outperformed handcrafted ones by 84%, while bidirectional adaptation unexpectedly improved safety (+23% out-of-distribution robustness). The 46% synergy improvement demonstrates that optimal collaboration exists at the intersection, not union, of human and AI capabilities, validating the shift from single-directional to co-alignment paradigms.
[267] Physical Complexity of a Cognitive Artifact
Gülce Kardeş, David Krakauer, Joshua Grochow
Main category: cs.AI
TL;DR: The paper maps computational complexity concepts from the Soma Cube puzzle to cognitive problem-solving strategies, showing how various cognitive mechanisms reduce task difficulty by lowering branching factor and search complexity.
Details
Motivation: To bridge cognitive science and theoretical computer science by understanding how mechanisms of intelligence reduce task difficulty, using the Soma Cube puzzle as a concrete example to study how cognitive strategies modify computational complexity.
Method: Analyzed the Soma Cube puzzle’s branching factor through search tree outdegree measurements, then systematically layered cognitive strategies including preprocessing (chunking), value ordering (free-sorting), variable ordering (scaffolding), and pruning (inference) to refine trial-and-error search.
Result: Demonstrated quantitatively how different cognitive strategies reduce task difficulty by modifying complexity, and showed how competent artifact use reduces effective time complexity by exploiting physical constraints.
Conclusion: Proposes a model of intelligence as a library of algorithms that recruit both mental capabilities and material/physical constraints, showing how cognitive mechanisms systematically reduce computational complexity in problem-solving.
Abstract: Cognitive science and theoretical computer science both seek to classify and explain the difficulty of tasks. Mechanisms of intelligence are those that reduce task difficulty. Here we map concepts from the computational complexity of a physical puzzle, the Soma Cube, onto cognitive problem-solving strategies through a “Principle of Materiality”. By analyzing the puzzle’s branching factor, measured through search tree outdegree, we quantitatively assess task difficulty and systematically examine how different strategies modify complexity. We incrementally refine a trial-and-error search by layering preprocessing (cognitive chunking), value ordering (cognitive free-sorting), variable ordering (cognitive scaffolding), and pruning (cognitive inference). We discuss how the competent use of artifacts reduces effective time complexity by exploiting physical constraints and propose a model of intelligence as a library of algorithms that recruit the capabilities of both mind and matter.
[268] A Dimensionality-Reduced XAI Framework for Roundabout Crash Severity Insights
Rohit Chakraborty, Subasish Das
Main category: cs.AI
TL;DR: This study analyzes Ohio roundabout crashes using an explainable AI workflow that identifies four crash patterns and quantifies injury severity drivers through cluster analysis and SHAP interpretation.
Details
Motivation: Roundabouts reduce severe crashes but risk patterns vary by conditions, requiring better understanding of crash mechanisms to support safety improvements and countermeasure selection.
Method: Two-step explainable workflow: 1) Cluster Correspondence Analysis (CCA) to identify co-occurring factors and four crash patterns, 2) Tree-based severity model interpreted with SHAP to quantify injury drivers within and across patterns.
Result: Higher severity occurs when darkness, wet surfaces, and higher speeds coincide with fixed-object or angle events; lower severity in clear, low-speed settings. Pattern-specific explanations reveal mechanisms at entries (fail-to-yield), multi-lane circulation (improper maneuvers), and slow-downs (rear-end).
Conclusion: The workflow links pattern discovery with case-level explanations, supporting site screening, countermeasure selection, and audit-ready reporting. Provides a practical template for usable XAI in public safety analytics.
Abstract: Roundabouts reduce severe crashes, yet risk patterns vary by conditions. This study analyzes 2017-2021 Ohio roundabout crashes using a two-step, explainable workflow. Cluster Correspondence Analysis (CCA) identifies co-occurring factors and yields four crash patterns. A tree-based severity model is then interpreted with SHAP to quantify drivers of injury within and across patterns. Results show higher severity when darkness, wet surfaces, and higher posted speeds coincide with fixed-object or angle events, and lower severity in clear, low-speed settings. Pattern-specific explanations highlight mechanisms at entries (fail-to-yield, gap acceptance), within multi-lane circulation (improper maneuvers), and during slow-downs (rear-end). The workflow links pattern discovery with case-level explanations, supporting site screening, countermeasure selection, and audit-ready reporting. The contribution to Information Systems is a practical template for usable XAI in public safety analytics.
[269] zELO: ELO-inspired Training Method for Rerankers and Embedding Models
Nicholas Pipitone, Ghita Houir Alami, Advaith Avadhanam, Anton Kaminskyi, Ashley Khoo
Main category: cs.AI
TL;DR: zELO is a novel training method that treats ranking as a Thurstone model, enabling unsupervised training of state-of-the-art reranker models (zerank-1 and zerank-1-small) that outperform proprietary systems across multiple domains.
Details
Motivation: To develop a more efficient and effective approach to training retrieval models by leveraging unsupervised data and the statistical equivalence of ranking tasks to Thurstone models, avoiding the need for expensive annotated data.
Method: Uses zELO methodology based on Thurstone model analysis, trained on 112,000 queries with 100 documents per query using unsupervised data (unannotated queries and documents), requiring less than 10,000 H100-hours for end-to-end training.
Result: Achieved highest retrieval scores in finance, legal, code, and STEM domains, outperforming closed-source proprietary rerankers on both NDCG@10 and Recall metrics. Models maintained strong zero-shot performance on out-of-domain and private customer datasets.
Conclusion: zELO enables efficient unsupervised training of high-performance reranker models that surpass proprietary solutions across diverse domains while demonstrating excellent generalization capabilities without domain-specific training.
Abstract: We introduce a novel training methodology named zELO, which optimizes retrieval performance via the analysis that ranking tasks are statistically equivalent to a Thurstone model. Based on the zELO method, we use unsupervised data in order to train a suite of state-of-the-art open-weight reranker models: zerank-1 and zerank-1-small. These models achieve the highest retrieval scores in multiple domains, including finance, legal, code, and STEM, outperforming closed-source proprietary rerankers on both NDCG@10 and Recall. These models also demonstrate great versatility, maintaining their 0-shot performance on out-of-domain and private customer datasets. The training data included 112,000 queries and 100 documents per query, and the models were trained end-to-end from unannotated queries and documents in less than 10,000 H100-hours.
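To illustrate what treating ranking as a Thurstone model can look like, the sketch below fits one latent relevance score per document from pairwise outcomes, modelling P(i beats j) = Φ(s_i − s_j) and maximizing the log-likelihood by gradient ascent. The source does not say how pairwise preferences are obtained or optimized, so the comparison data, the unit-variance simplification, the L2 penalty, and the function name `fit_thurstone` are all assumptions.

```python
from statistics import NormalDist

def fit_thurstone(n_items, comparisons, steps=200, lr=0.1, l2=0.01):
    """Fit latent scores from (winner, loser) index pairs.

    Model: P(winner beats loser) = Phi(score[winner] - score[loser]).
    A small L2 penalty keeps the scores finite and centred.
    """
    std = NormalDist()
    scores = [0.0] * n_items
    for _ in range(steps):
        grad = [-l2 * s for s in scores]
        for winner, loser in comparisons:
            d = scores[winner] - scores[loser]
            g = std.pdf(d) / std.cdf(d)   # derivative of log Phi(d)
            grad[winner] += g
            grad[loser] -= g
        scores = [s + lr * gi for s, gi in zip(scores, grad)]
    return scores

# Noisy pairwise judgments for three documents retrieved for one query.
comparisons = [(0, 1), (0, 1), (1, 0),
               (1, 2), (1, 2), (2, 1),
               (0, 2), (0, 2)]
scores = fit_thurstone(3, comparisons)
ranking = sorted(range(3), key=lambda i: -scores[i])
```

The fitted scores can then serve as per-document regression targets for a pointwise reranker, which is one plausible way to turn pairwise signals into training data.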
[270] Human + AI for Accelerating Ad Localization Evaluation
Harshit Rajgarhia, Shivali Dalmia, Mengyang Zhao, Mukherji Abhishek, Kiran Ganesh
Main category: cs.AI
TL;DR: A framework combining automated components and human oversight for multilingual ad localization that preserves visual consistency and semantic accuracy across languages.
Details
Motivation: Advertisement localization requires more than text translation - it needs to maintain visual consistency, spatial alignment, and stylistic integrity across different languages and formats, which current approaches don't adequately address.
Method: Combines scene text detection, inpainting, machine translation, and text reimposition with human oversight to create a structured framework for accelerating ad localization evaluation workflows.
Result: Qualitative results across six locales show the approach produces semantically accurate and visually coherent localized advertisements suitable for real-world deployment.
Conclusion: This integrated framework successfully addresses the complexities of multilingual advertisement localization and is the first work to combine these specific techniques for accelerating ad localization workflows.
Abstract: Adapting advertisements for multilingual audiences requires more than simple text translation; it demands preservation of visual consistency, spatial alignment, and stylistic integrity across diverse languages and formats. We introduce a structured framework that combines automated components with human oversight to address the complexities of advertisement localization. To the best of our knowledge, this is the first work to integrate scene text detection, inpainting, machine translation (MT), and text reimposition specifically for accelerating ad localization evaluation workflows. Qualitative results across six locales demonstrate that our approach produces semantically accurate and visually coherent localized advertisements, suitable for deployment in real-world workflows.
[271] Redefining CX with Agentic AI: Minerva CQ Case Study
Garima Agrawal, Riccardo De Maria, Kiran Davuluri, Daniele Spera, Charlie Read, Cosimo Spera, Jack Garrett, Don Miller
Main category: cs.AI
TL;DR: Agentic AI system called Minerva CQ that proactively assists customer service agents in real-time, reducing cognitive load and improving efficiency through automated workflows and contextual reasoning.
Details
Motivation: Customer experience suffers from high handling times and low satisfaction due to agents' cognitive load from navigating fragmented systems and manual troubleshooting. Existing AI tools are reactive and lack deeper contextual reasoning.
Method: Developed Agentic AI - goal-driven, autonomous systems that identify customer intent, trigger modular workflows, maintain evolving context, and adapt dynamically to conversation state. Minerva CQ integrates real-time transcription, intent/sentiment detection, entity recognition, contextual retrieval, dynamic profiling, and partial summaries.
Result: Deployed in live production, Minerva CQ acts as an AI co-pilot delivering measurable improvements in agent efficiency and customer experience across multiple deployments.
Conclusion: Agentic AI systems like Minerva CQ provide proactive, real-time assistance that significantly enhances customer service operations by reducing cognitive load and improving both agent performance and customer satisfaction metrics.
Abstract: Despite advances in AI for contact centers, customer experience (CX) continues to suffer from high average handling time (AHT), low first-call resolution, and poor customer satisfaction (CSAT). A key driver is the cognitive load on agents, who must navigate fragmented systems, troubleshoot manually, and frequently place customers on hold. Existing AI-powered agent-assist tools are often reactive, driven by static rules, simple prompting, or retrieval-augmented generation (RAG) without deeper contextual reasoning. We introduce Agentic AI: goal-driven, autonomous, tool-using systems that proactively support agents in real time. Unlike conventional approaches, Agentic AI identifies customer intent, triggers modular workflows, maintains evolving context, and adapts dynamically to conversation state. This paper presents a case study of Minerva CQ, a real-time Agent Assist product deployed in voice-based customer support. Minerva CQ integrates real-time transcription, intent and sentiment detection, entity recognition, contextual retrieval, dynamic customer profiling, and partial conversational summaries, enabling proactive workflows and continuous context-building. Deployed in live production, Minerva CQ acts as an AI co-pilot, delivering measurable improvements in agent efficiency and customer experience across multiple deployments.
[272] Match Chat: Real Time Generative AI and Generative Computing for Tennis
Aaron Baughman, Gozde Akay, Eduardo Morales, Rahul Agarwal, Preetika Srivastava
Main category: cs.AI
TL;DR: Match Chat is a real-time AI assistant for tennis fans that combines GenAI and GenComp to provide instant match insights during live events, achieving 92.83% accuracy and 6.25s response time at scale.
Details
Motivation: To enhance the tennis fan experience by delivering instant, accurate responses to match-related queries during live events, making match data accessible through natural language interactions.
Method: Uses Agent-Oriented Architecture (AOA) combining rule engines, predictive models, and agents to pre-process queries before GenAI components. Features interactive prompt design and integrates Generative AI with Generative Computing techniques.
Result: Achieved 92.83% answer accuracy with 6.25s average response time under 120 RPS load. Served ~1 million users at Wimbledon and US Open with 100% uptime. 96.08% of queries used guided prompts for optimal user experience.
Conclusion: Demonstrates successful deployment of performant agentic systems in dynamic environments with key design patterns emphasizing speed, precision, and usability for real-time consumer-facing AI applications.
Abstract: We present Match Chat, a real-time, agent-driven assistant designed to enhance the tennis fan experience by delivering instant, accurate responses to match-related queries. Match Chat integrates Generative Artificial Intelligence (GenAI) with Generative Computing (GenComp) techniques to synthesize key insights during live tennis singles matches. The system debuted at the 2025 Wimbledon Championships and the 2025 US Open, where it provided about 1 million users with seamless access to streaming and static data through natural language queries. The architecture is grounded in an Agent-Oriented Architecture (AOA) combining rule engines, predictive models, and agents to pre-process and optimize user queries before passing them to GenAI components. The Match Chat system had an answer accuracy of 92.83% with an average response time of 6.25 seconds under loads of up to 120 requests per second (RPS). Over 96.08% of all queries were guided using interactive prompt design, contributing to a user experience that prioritized clarity, responsiveness, and minimal effort. The system was designed to mask architectural complexity, offering a frictionless and intuitive interface that required no onboarding or technical familiarity. Across both Grand Slam deployments, Match Chat maintained 100% uptime and supported nearly 1 million unique users, underscoring the scalability and reliability of the platform. This work introduces key design patterns for real-time, consumer-facing AI systems that emphasize speed, precision, and usability, and highlights a practical path for deploying performant agentic systems in dynamic environments.
[273] DaSAThco: Data-Aware SAT Heuristics Combinations Optimization via Large Language Models
Minyu Chen, Guoqiang Li
Main category: cs.AI
TL;DR: DaSAThco is a framework that uses LLMs to create adaptive heuristic ensembles for SAT solvers, enabling robust generalization across different problem types without re-optimization.
Details
Motivation: SAT solver performance depends on heuristics, but no single configuration works optimally for all problems. Existing automated methods require costly re-optimization for each new problem type and lack generalizability.
Method: Uses a Large Language Model guided by Problem Archetypes to generate diverse heuristic ensembles, then learns an adaptive selection mechanism to map instance features to appropriate heuristic configurations.
Result: DaSAThco achieves superior performance and demonstrates robust out-of-domain generalization where non-adaptive methods fail.
Conclusion: Provides a scalable and practical approach for automated algorithm design in complex configurable systems through adaptive heuristic mapping.
Abstract: The performance of Conflict-Driven Clause Learning solvers hinges on internal heuristics, yet the heterogeneity of SAT problems makes a single, universally optimal configuration unattainable. While prior automated methods can find specialized configurations for specific problem families, this dataset-specific approach lacks generalizability and requires costly re-optimization for new problem types. We introduce DaSAThco, a framework that addresses this challenge by learning a generalizable mapping from instance features to tailored heuristic ensembles, enabling a train-once, adapt-broadly model. Our framework uses a Large Language Model, guided by systematically defined Problem Archetypes, to generate a diverse portfolio of specialized heuristic ensembles and subsequently learns an adaptive selection mechanism to form the final mapping. Experiments show that DaSAThco achieves superior performance and, most notably, demonstrates robust out-of-domain generalization where non-adaptive methods show limitations. Our work establishes a more scalable and practical path toward automated algorithm design for complex, configurable systems.
[274] Analogy-Driven Financial Chain-of-Thought (AD-FCoT): A Prompting Approach for Financial Sentiment Analysis
Anmol Singhal, Navya Singhal
Main category: cs.AI
TL;DR: AD-FCoT is a prompting framework that combines analogical reasoning with chain-of-thought prompting to improve financial news sentiment analysis using LLMs, achieving better accuracy and market correlation without additional training.
Details
Motivation: Existing financial sentiment analysis methods struggle with capturing complex economic context and lack transparent reasoning, undermining reliability for market movement prediction.
Method: Analogy-Driven Financial Chain-of-Thought (AD-FCoT) prompting framework that guides LLMs to draw parallels between new events and historical scenarios with known outcomes, embedding analogies into structured step-by-step reasoning chains.
Result: Outperforms strong baselines in sentiment classification accuracy, achieves substantially higher correlation with market returns, and generates explanations that align with domain expertise.
Conclusion: AD-FCoT provides an effective, interpretable approach for financial sentiment analysis that leverages LLMs’ internal knowledge without requiring additional training data or fine-tuning, making it suitable for real-world financial analysis.
Abstract: Financial news sentiment analysis is crucial for anticipating market movements. With the rise of AI techniques such as Large Language Models (LLMs), which demonstrate strong text understanding capabilities, there has been renewed interest in enhancing these systems. Existing methods, however, often struggle to capture the complex economic context of news and lack transparent reasoning, which undermines their reliability. We propose Analogy-Driven Financial Chain-of-Thought (AD-FCoT), a prompting framework that integrates analogical reasoning with chain-of-thought (CoT) prompting for sentiment prediction on historical financial news. AD-FCoT guides LLMs to draw parallels between new events and relevant historical scenarios with known outcomes, embedding these analogies into a structured, step-by-step reasoning chain. To our knowledge, this is among the first approaches to explicitly combine analogical examples with CoT reasoning in finance. Operating purely through prompting, AD-FCoT requires no additional training data or fine-tuning and leverages the model’s internal financial knowledge to generate rationales that mirror human analytical reasoning. Experiments on thousands of news articles show that AD-FCoT outperforms strong baselines in sentiment classification accuracy and achieves substantially higher correlation with market returns. Its generated explanations also align with domain expertise, providing interpretable insights suitable for real-world financial analysis.
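As a rough illustration of how an analogy-driven chain-of-thought prompt might be assembled, the sketch below builds a single prompt string from a headline and a list of historical analogies. The template wording, the `HistoricalAnalogy` type, the `build_prompt` function, and the example events are all hypothetical; the paper's actual prompt is not reproduced in this summary. The sketch also assumes the caller supplies the analogies, whereas the abstract suggests the model may draw on its own internal knowledge to recall them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HistoricalAnalogy:
    """A past event paired with its known market outcome (hypothetical structure)."""
    event: str
    outcome: str


def build_prompt(headline: str, analogies: List[HistoricalAnalogy]) -> str:
    """Assemble an analogy-plus-chain-of-thought prompt as plain text."""
    lines = [
        "You are a financial analyst. Classify the sentiment of the news item "
        "as positive, negative, or neutral.",
        f"News: {headline}",
        "Historical analogies with known outcomes:",
    ]
    # List each analogy so the model can compare the new event against it.
    for index, analogy in enumerate(analogies, start=1):
        lines.append(f"{index}. Event: {analogy.event} | Outcome: {analogy.outcome}")
    # Fixed reasoning scaffold: facts, comparison, inference, then the label.
    lines += [
        "Reason step by step:",
        "Step 1: Identify the key economic facts in the news.",
        "Step 2: Compare them with each analogy, noting similarities and differences.",
        "Step 3: Infer the likely market reaction.",
        "Final answer: <positive|negative|neutral>",
    ]
    return "\n".join(lines)


# Illustrative inputs only; these are not taken from the paper's dataset.
example_analogies = [
    HistoricalAnalogy("Central bank raised rates unexpectedly", "Equities fell the next day"),
    HistoricalAnalogy("Central bank paused rate hikes", "Equities rallied"),
]
prompt = build_prompt("Central bank signals further rate increases", example_analogies)
print(prompt)
```

Because the approach operates purely through prompting, a string like this would be sent to an LLM as-is, with no fine-tuning step involved.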
[275] GBV-SQL: Guided Generation and SQL2Text Back-Translation Validation for Multi-Agent Text2SQL
Daojun Chen, Xi Wang, Shenyuan Ren, Qingzhi Ma, Pengpeng Zhao, An Liu
Main category: cs.AI
TL;DR: GBV-SQL is a multi-agent framework that uses SQL2Text back-translation validation to ensure generated SQL queries accurately reflect user intent, while also exposing systematic errors in benchmark datasets.
Details
Motivation: Current Text2SQL models produce syntactically valid queries that often misinterpret user intent, and existing benchmarks contain pervasive flaws in ground-truth data that obscure true model performance.
Method: Multi-agent framework with Guided Generation and SQL2Text Back-translation Validation, where a specialized agent translates generated SQL back to natural language to verify logical alignment with original questions.
Result: Achieves 63.23% execution accuracy on the BIRD benchmark (a 5.8% absolute improvement) and, after removing flawed examples, 96.5% (dev) and 97.6% (test) execution accuracy on the Spider benchmark.
Conclusion: Provides both a robust semantic validation framework and critical perspective on benchmark integrity, highlighting the need for more rigorous dataset curation in Text2SQL evaluation.
Abstract: While Large Language Models have significantly advanced Text2SQL generation, a critical semantic gap persists where syntactically valid queries often misinterpret user intent. To mitigate this challenge, we propose GBV-SQL, a novel multi-agent framework that introduces Guided Generation with SQL2Text Back-translation Validation. This mechanism uses a specialized agent to translate the generated SQL back into natural language, which verifies its logical alignment with the original question. Critically, our investigation reveals that current evaluation is undermined by a systemic issue: the poor quality of the benchmarks themselves. We introduce a formal typology for “Gold Errors”, which are pervasive flaws in the ground-truth data, and demonstrate how they obscure true model performance. On the challenging BIRD benchmark, GBV-SQL achieves 63.23% execution accuracy, a 5.8% absolute improvement. After removing flawed examples, GBV-SQL achieves 96.5% (dev) and 97.6% (test) execution accuracy on the Spider benchmark. Our work offers both a robust framework for semantic validation and a critical perspective on benchmark integrity, highlighting the need for more rigorous dataset curation.
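Execution accuracy, the metric reported above, counts a predicted query as correct when it returns the same result as the gold query on the target database. The following is a minimal sketch using Python's built-in `sqlite3` module. The `players` table, the query pairs, and the order-insensitive row comparison are illustrative assumptions; official BIRD and Spider evaluators may compare results differently.

```python
import sqlite3


def run_query(conn, sql):
    """Execute sql and return its rows in a canonical order, or None if it fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sort so that row order does not affect the comparison (a simplification).
    return sorted(rows, key=repr)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose results match."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = 0
    for predicted_sql, gold_sql in pairs:
        gold_rows = run_query(conn, gold_sql)
        predicted_rows = run_query(conn, predicted_sql)
        if gold_rows is not None and predicted_rows == gold_rows:
            correct += 1
    return correct / len(pairs)


# Hypothetical toy database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE players (name TEXT, country TEXT, titles INTEGER);
    INSERT INTO players VALUES ('Ana', 'ES', 3), ('Ben', 'US', 0), ('Chen', 'CN', 5);
    """
)

pairs = [
    # Different text, same result: counted as correct.
    ("SELECT name FROM players WHERE titles > 0",
     "SELECT name FROM players WHERE titles >= 1"),
    # Valid SQL that misreads "more than three titles" as "at least three".
    ("SELECT name FROM players WHERE titles >= 3",
     "SELECT name FROM players WHERE titles > 3"),
    # Invalid SQL: counted as incorrect.
    ("SELEC name FROM players",
     "SELECT name FROM players"),
]
accuracy = execution_accuracy(conn, pairs)
print(accuracy)
```

The second pair shows the semantic gap the abstract describes: the query runs without error yet answers a different question. Note also that the metric trusts the gold query, so a flawed gold query (a "Gold Error") penalizes a correct prediction.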
[276] Mob-based cattle weight gain forecasting using ML models
Muhammad Riaz Hasib Hossain, Rafiqul Islam, Shawn R McGrath, Md Zahidul Islam, David Lamb
Main category: cs.AI
TL;DR: Random Forest model outperforms SVR and LSTM for predicting cattle weight gain, achieving R²=0.973 when including weather and age factors, with an automated preprocessing tool developed for dataset preparation.
Details
Motivation: Forecasting mob-based cattle weight gain can help farmers optimize feeding strategies, make better breeding decisions, and mitigate risks from climate variability and market fluctuations.
Method: Used Random Forest model compared against SVR and LSTM models on 756 samples from 108 cattle with weather data (rainfall, temperature). Developed automated preprocessing tool for dataset generation.
Result: RF model performed best with R²=0.973, RMSE=0.040, MAE=0.033 when including both weather and age factors. Weather and age factors significantly improved prediction accuracy across all models.
Conclusion: RF is a robust tool for cattle weight gain forecasting, demonstrating the importance of age and climatic factors. The automated preprocessing tool provides a benchmark for future research.
Abstract: Forecasting mob-based cattle weight gain (MB CWG) may benefit large livestock farms, allowing farmers to refine their feeding strategies, make educated breeding choices, and reduce risks linked to climate variability and market fluctuations. In this paper, a novel technique termed MB CWG is proposed to forecast the one-month-ahead weight gain of herd-based cattle using historical data collected from the Charles Sturt University Farm. This research employs a Random Forest (RF) model, comparing its performance against Support Vector Regression (SVR) and Long Short-Term Memory (LSTM) models for monthly weight gain prediction. Four datasets were used to evaluate the performance of the models, using 756 samples from 108 herd-based cattle, along with weather data (rainfall and temperature) influencing CWG. The RF model performs better than the SVR and LSTM models across all datasets, achieving an R^2 of 0.973, RMSE of 0.040, and MAE of 0.033 when both weather and age factors were included. The results indicate that including both weather and age factors significantly improves the accuracy of weight gain predictions, with the RF model outperforming the SVR and LSTM models in all scenarios. These findings demonstrate the potential of RF as a robust tool for forecasting cattle weight gain in variable conditions, highlighting the influence of age and climatic factors on herd-based weight trends. This study has also developed an innovative automated pre-processing tool to generate a benchmark dataset for MB CWG predictive models. The tool is publicly available on GitHub and can assist in preparing datasets for current and future analytical research.
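For readers unfamiliar with the three reported metrics, the sketch below computes R^2, RMSE, and MAE from paired observed and predicted values using only the standard library. The function name `regression_metrics` and the sample numbers are illustrative assumptions and are not taken from the paper's dataset.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, and MAE for paired observed and predicted values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    # MAE: mean of absolute errors.
    mae = sum(abs(e) for e in errors) / n
    # RMSE: square root of the mean squared error.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    # R^2: 1 minus residual sum of squares over total sum of squares.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    r2 = 1.0 - ss_res / ss_tot
    return {"r2": r2, "rmse": rmse, "mae": mae}


# Hypothetical monthly weight gains (arbitrary units) and model predictions.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

RMSE and MAE are expressed in the units of the target variable, so the reported values of 0.040 and 0.033 are only interpretable relative to the scale of the weight gain data, which this summary does not state.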
[277] ECG-aBcDe: Overcoming Model Dependence, Encoding ECG into a Universal Language for Any LLM
Yong Xia, Jingxuan Li, YeTeng Sun, Jiarui Bu
Main category: cs.AI
TL;DR: ECG-aBcDe is a novel ECG encoding method that transforms ECG signals into a universal language interpretable by any LLM, enabling direct fine-tuning without architectural changes while improving interpretability and time-scale information capture.
Details
Motivation: Current LLM-based ECG analysis methods face challenges with transferability across different LLMs, difficulty capturing time-scale information due to Transformer limitations, and lack of interpretability which hinders clinical adoption.
Method: Developed ECG-aBcDe which transforms ECG signals into a universal ECG language, constructed a hybrid dataset of ECG language and natural language, enabling direct fine-tuning of pre-trained LLMs without architectural modifications.
Result: Achieved competitive performance on ROUGE-L and METEOR, with significant improvements in BLEU-4 (2.8x improvement in in-dataset evaluation reaching 42.58, and 3.9x improvement in cross-dataset evaluation reaching 30.76).
Conclusion: ECG-aBcDe presents a new paradigm for integrating ECG analysis with LLMs, providing strong evidence for feasibility through improved performance metrics and enhanced interpretability capabilities.
Abstract: Large Language Models (LLMs) hold significant promise for electrocardiogram (ECG) analysis, yet challenges remain regarding transferability, time-scale information learning, and interpretability. Current methods suffer from model-specific ECG encoders, hindering transfer across LLMs. Furthermore, LLMs struggle to capture crucial time-scale information inherent in ECGs due to Transformer limitations, and their black-box nature limits clinical adoption. To address these limitations, we introduce ECG-aBcDe, a novel ECG encoding method that transforms ECG signals into a universal ECG language readily interpretable by any LLM. By constructing a hybrid dataset of ECG language and natural language, ECG-aBcDe enables direct fine-tuning of pre-trained LLMs without architectural modifications, achieving “construct once, use anywhere” capability. Moreover, the bidirectional convertibility between ECG and ECG language of ECG-aBcDe allows for extracting attention heatmaps from ECG signals, significantly enhancing interpretability. Finally, ECG-aBcDe explicitly represents time-scale information, mitigating Transformer limitations. This work presents a new paradigm for integrating ECG analysis with LLMs. Compared with existing methods, our method achieves competitive performance on ROUGE-L and METEOR. Notably, it delivers significant improvements in BLEU-4, with improvements of 2.8 times and 3.9 times in in-dataset and cross-dataset evaluations, respectively, reaching scores of 42.58 and 30.76. These results provide strong evidence for the feasibility of the new paradigm.
[278] Learn to Relax with Large Language Models: Solving Nonlinear Combinatorial Optimization Problems via Bidirectional Coevolution
Beidan Liu, Zhengqiu Zhu, Chen Gao, Yong Zhao, Wei Qi, Quanjun Yin
Main category: cs.AI
TL;DR: AutoCO is an end-to-end automated constraint optimization method that uses LLMs to generate constraint relaxation strategies for nonlinear combinatorial optimization problems, combining evolutionary algorithms and Monte Carlo Tree Search for effective optimization.
Details
Motivation: Nonlinear combinatorial optimization problems are computationally challenging due to nonconvex, multi-modal solution spaces. Traditional methods require expert-driven iterative design and lack automation, while current LLM-based approaches only validate constraints rather than proactively designing strategies.
Method: Uses structured LLM reasoning to generate constraint relaxation strategies with a unified triple-representation scheme. Implements bidirectional coevolution combining Evolutionary Algorithms for local refinement and Monte Carlo Tree Search for global strategy exploration.
Result: Comprehensive experiments on three challenging NCOP benchmarks demonstrate AutoCO’s consistent effectiveness and superior performance over baseline methods.
Conclusion: AutoCO represents the first end-to-end automated constraint optimization method that successfully revolutionizes NCOP resolution through learning to relax with LLMs, achieving optimal balance between intensification and diversification.
Abstract: Nonlinear Combinatorial Optimization Problems (NCOPs) present a formidable computational hurdle in practice, as their nonconvex nature gives rise to multi-modal solution spaces that defy efficient optimization. Traditional constraint relaxation approaches rely heavily on expert-driven, iterative design processes that lack systematic automation and scalable adaptability. While recent Large Language Model (LLM)-based optimization methods show promise for autonomous problem-solving, they predominantly function as passive constraint validators rather than proactive strategy architects, failing to handle the sophisticated constraint interactions inherent to NCOPs. To address these limitations, we introduce the first end-to-end Automated Constraint Optimization (AutoCO) method, which revolutionizes NCOPs resolution through learning to relax with LLMs. Specifically, we leverage structured LLM reasoning to generate constraint relaxation strategies, which are dynamically evolving with algorithmic principles and executable code through a unified triple-representation scheme. We further establish a novel bidirectional (global-local) coevolution mechanism that synergistically integrates Evolutionary Algorithms for intensive local refinement with Monte Carlo Tree Search for systematic global strategy space exploration, ensuring optimal balance between intensification and diversification in fragmented solution spaces. Finally, comprehensive experiments on three challenging NCOP benchmarks validate AutoCO’s consistent effectiveness and superior performance over the baselines.
[279] Large Language Models Imitate Logical Reasoning, but at what Cost?
Lachlan McGinness, Peter Baumgartner
Main category: cs.AI
TL;DR: Longitudinal study shows LLM reasoning capabilities improved significantly from 2023-2025 through Chain of Thought prompting and thinking models, with neuro-symbolic approach reducing computational costs while maintaining performance.
Details
Motivation: To evaluate the evolution of reasoning capabilities in frontier Large Language Models over time and develop a more computationally efficient approach for logical reasoning tasks.
Method: Conducted longitudinal evaluation of three leading LLMs from Dec 2023 to June 2025 on PrOntoQA true/false questions, measured faithfulness to reasoning strategies. Developed neuro-symbolic architecture using smaller LLMs (<15B params) to translate problems into standardized form, then parsed into Z3 SMT solver programs.
Result: Performance improved from 2023-2024 due to hidden Chain of Thought prompting, and further improved 2024-2025 with thinking models. Neuro-symbolic approach maintained near-perfect performance while significantly reducing computational costs. FLOP estimation formula was accurate within 10%.
Conclusion: LLM reasoning capabilities have significantly improved over time through advanced prompting techniques, and neuro-symbolic approaches can achieve high performance with dramatically reduced computational requirements compared to pure LLM approaches.
Abstract: We present a longitudinal study which evaluates the reasoning capability of frontier Large Language Models over an eighteen month period. We measured the accuracy of three leading models from December 2023, September 2024 and June 2025 on true or false questions from the PrOntoQA dataset and their faithfulness to reasoning strategies provided through in-context learning. The improvement in performance from 2023 to 2024 can be attributed to hidden Chain of Thought prompting. The introduction of thinking models allowed for significant improvement in model performance between 2024 and 2025. We then present a neuro-symbolic architecture which uses LLMs of less than 15 billion parameters to translate the problems into a standardised form. We then parse the standardised forms of the problems into a program to be solved by Z3, an SMT solver, to determine the satisfiability of the query. We report the number of prompt and completion tokens as well as the computational cost in FLOPs for open source models. The neuro-symbolic approach significantly reduces the computational cost while maintaining near perfect performance. The common approximation that the number of inference FLOPs is double the product of the active parameters and total tokens was accurate within 10% for all experiments.
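The FLOP approximation mentioned at the end of the abstract (inference FLOPs are roughly double the product of active parameters and total tokens) can be written out directly. This is a minimal sketch; the function names, the 7-billion-parameter model size, and the token counts are illustrative assumptions, not figures from the paper.

```python
def estimate_inference_flops(active_params: float, prompt_tokens: int,
                             completion_tokens: int) -> float:
    """Approximate inference FLOPs as 2 * active parameters * total tokens."""
    total_tokens = prompt_tokens + completion_tokens
    return 2.0 * active_params * total_tokens


def within_tolerance(estimate: float, measured: float, tolerance: float = 0.10) -> bool:
    """Check whether the estimate lies within a relative tolerance of a measured value."""
    return abs(estimate - measured) <= tolerance * measured


# Hypothetical example: 7e9 active parameters, 500 prompt and 200 completion tokens.
flops = estimate_inference_flops(7e9, 500, 200)
print(f"{flops:.3e}")
```

Because the estimate scales linearly with total tokens, the cost of a thinking model that emits long reasoning traces grows with the number of completion tokens, which is why the paper reports both prompt and completion token counts alongside FLOPs.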
[280] Zero-shot Graph Reasoning via Retrieval Augmented Framework with LLMs
Hanqing Li, Kiran Sheena Jyothi, Henry Liang, Sharika Mahadevan, Diego Klabjan
Main category: cs.AI
TL;DR: GRRAF is a training-free method that combines RAG with LLMs’ code generation to solve graph reasoning tasks by storing graphs in databases and generating executable queries, achieving near-perfect accuracy on most tasks while scaling to large graphs.
Details
Motivation: Existing graph reasoning methods require extensive finetuning or predefined algorithms, limiting their flexibility and scalability. GRRAF aims to overcome these limitations with a training-free approach.
Method: Stores target graphs in graph databases, uses LLMs to generate executable code queries for information retrieval, and incorporates an error feedback loop with time-out mechanism for correctness and efficiency.
Result: Achieves 100% accuracy on most tasks (cycle detection, bipartite checks, shortest path, max flow), high performance on subgraph matching, and scales to graphs with 10,000 nodes with consistent token costs.
Conclusion: GRRAF provides an effective training-free solution for graph reasoning that overcomes limitations of existing methods, demonstrates excellent accuracy and scalability, and maintains efficiency across different graph sizes.
Abstract: We propose a new, training-free method, Graph Reasoning via Retrieval Augmented Framework (GRRAF), that harnesses retrieval-augmented generation (RAG) alongside the code-generation capabilities of large language models (LLMs) to address a wide range of graph reasoning tasks. In GRRAF, the target graph is stored in a graph database, and the LLM is prompted to generate executable code queries that retrieve the necessary information. This approach circumvents the limitations of existing methods that require extensive finetuning or depend on predefined algorithms, and it incorporates an error feedback loop with a time-out mechanism to ensure both correctness and efficiency. Experimental evaluations on the GraphInstruct dataset reveal that GRRAF achieves 100% accuracy on most graph reasoning tasks, including cycle detection, bipartite graph checks, shortest path computation, and maximum flow, while maintaining consistent token costs regardless of graph sizes. Imperfect but still very high performance is observed on subgraph matching. Notably, GRRAF scales effectively to large graphs with up to 10,000 nodes.
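The error feedback loop with a time-out described above can be sketched in a few lines. Everything here is a stand-in: `fake_llm` is a hypothetical stub in place of a real LLM call, the graph is a plain dictionary rather than a graph database, and the time-out is a simple wall-clock budget checked between attempts rather than a mechanism that interrupts a running query. The paper's actual prompts and database interface are not described in this summary.

```python
import time

# Toy adjacency-list graph standing in for a graph database.
GRAPH = {"a": ["b"], "b": ["c"], "c": ["a"], "d": []}


def fake_llm(task, feedback):
    """Hypothetical stand-in for an LLM: the first answer is buggy, the retry is fixed."""
    if feedback is None:
        return "result = len(graph.nodes)"  # bug: a dict has no attribute 'nodes'
    return "result = len(graph)"


def solve_with_feedback(task, graph, generate, max_attempts=3, time_budget_s=5.0):
    """Generate code, run it, and feed any error message back into the next attempt."""
    deadline = time.monotonic() + time_budget_s
    feedback = None
    for attempt in range(1, max_attempts + 1):
        if time.monotonic() > deadline:
            raise TimeoutError("time budget exhausted")
        code = generate(task, feedback)
        scope = {"graph": graph}
        try:
            # Running generated code with exec is unsafe outside a sandbox.
            exec(code, scope)
            return scope["result"], attempt
        except Exception as exc:
            # The error text becomes the feedback for the next generation attempt.
            feedback = f"{type(exc).__name__}: {exc}"
    raise RuntimeError(f"no working query after {max_attempts} attempts: {feedback}")


answer, attempts = solve_with_feedback("count the nodes", GRAPH, fake_llm)
print(answer, attempts)
```

Since the model only ever sees the task description and short error messages, never the graph itself, a design like this is consistent with the abstract's claim that token costs stay roughly constant as graphs grow.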
[281] H$^2$R: Hierarchical Hindsight Reflection for Multi-Task LLM Agents
Shicheng Ye, Chao Yu, Kaiqiang Ke, Chengdong Xu, Yinqi Wei
Main category: cs.AI
TL;DR: H$^2$R introduces hierarchical memory architecture with separate planning and execution memories for fine-grained knowledge transfer in LLM-based agents, outperforming previous methods.
Details
Motivation: Existing approaches treat prior experiences as monolithic units, leading to inefficient and coarse-grained knowledge transfer in multi-task scenarios for LLM-based agents.
Method: Proposes hierarchical memory architecture decoupling high-level planning memory from low-level execution memory, with Hierarchical Hindsight Reflection (H$^2$R) mechanism to distill reusable hierarchical knowledge from past interactions.
Result: Experimental results across two benchmarks show improved generalization and decision-making performance, outperforming prior baselines like Expel.
Conclusion: H$^2$R enables efficient fine-grained knowledge transfer through hierarchical memory separation, enhancing LLM-based agent performance in multi-task scenarios.
Abstract: Large language model (LLM)-based agents have shown strong potential in multi-task scenarios, owing to their ability to transfer knowledge across diverse tasks. However, existing approaches often treat prior experiences and knowledge as monolithic units, leading to inefficient and coarse-grained knowledge transfer. In this work, we propose a novel hierarchical memory architecture that enables fine-grained knowledge transfer by decoupling high-level planning memory from low-level execution memory. To construct and refine these hierarchical memories, we introduce Hierarchical Hindsight Reflection (H$^2$R), a mechanism that distills reusable and hierarchical knowledge from past agent-environment interactions. At test time, H$^2$R performs retrievals of high-level and low-level memories separately, allowing LLM-based agents to efficiently access and utilize task-relevant knowledge for new tasks. Experimental results across two benchmarks demonstrate that H$^2$R can improve generalization and decision-making performance, outperforming prior baselines such as Expel.
[282] LTA-thinker: Latent Thought-Augmented Training Framework for Large Language Models on Complex Reasoning
Jiaqi Wang, Binquan Ji, Haibo Luo, Yiyang Qi, Ruiting Li, Huiyan Wang, Yuantao Han, Cangyi Yang, Jiaxu Zhang, Feiliang Ren
Main category: cs.AI
TL;DR: LTA-Thinker is a novel framework that enhances complex reasoning in LLMs by improving latent thought distribution variance through learnable priors and multi-objective optimization, achieving SOTA performance.
Details
Motivation: To address the bottleneck in efficient generation and utilization of high-quality latent thoughts for complex reasoning, as current methods still struggle with distribution variance despite advances in test-time scaling techniques.
Method: Proposes a two-pronged approach: 1) Learnable prior architecture for latent thought generation to increase distribution variance, 2) Distribution-based directional optimization with Semantic Alignment Loss (KL divergence) and Reasoning Focus Loss (contrastive learning) combined with standard SFT.
Result: Achieves state-of-the-art performance across various baselines, demonstrates higher performance ceiling and better scaling effects compared to existing methods.
Conclusion: LTA-Thinker effectively enhances reasoning performance by optimizing latent thought distribution variance through innovative architectural and optimization techniques, providing a promising direction for complex reasoning in LLMs.
Abstract: Complex Reasoning in Large Language Models can be dynamically optimized using Test-Time Scaling (TTS) to mitigate Overthinking. Although methods such as Coconut, SoftCoT and its variant are effective in continuous latent space inference, the core bottleneck still lies in the efficient generation and utilization of high-quality Latent Thought. Drawing from the theory of SoftCoT++ that a larger variance in the generated Latent Thought distribution more closely approximates the golden truth distribution, we propose a Latent Thought-Augmented Training Framework, LTA-Thinker, which improves distributional variance and enhances reasoning performance from two perspectives. First, LTA-Thinker constructs a Latent Thought generation architecture based on a learnable prior. This architecture aims to increase the variance of the distribution of generated Latent Thought Vectors in order to simplify the overall structure and raise the performance ceiling. Second, LTA-Thinker introduces a distribution-based directional optimization paradigm that jointly constrains both distribution locality and distribution scale. This mechanism improves information efficiency and computational cost through a multi-objective co-training strategy, which combines standard Supervised Fine-Tuning (SFT) loss with two novel losses: Semantic Alignment Loss, which utilizes KL divergence to ensure that the Latent Thought is highly relevant to the semantics of the question; and Reasoning Focus Loss, which utilizes a contrastive learning mechanism to guide the model to focus on the most critical reasoning steps. Experiments show that LTA-Thinker achieves state-of-the-art (SOTA) performance among various baselines and demonstrates a higher performance ceiling and better scaling effects.
[283] Stochastic Streets: A Walk Through Random LLM Address Generation in four European Cities
Tairan Fu, David Campo-Nazareno, Javier Coronado-Blázquez, Javier Conde, Pedro Reviriego, Fabrizio Lombardi
Main category: cs.AI
TL;DR: LLMs struggle to generate random street addresses for European cities despite their general capabilities.
Details
Motivation: To test whether LLMs can generate realistic random street addresses, which requires understanding of geographic patterns and local naming conventions.
Method: Evaluated LLMs on their ability to generate random street addresses for four European cities, assessing realism and randomness.
Result: LLMs performed poorly at generating realistic random street addresses, showing limitations in geographic knowledge and pattern recognition.
Conclusion: While LLMs excel at many complex tasks, they have specific weaknesses in generating realistic random geographic data like street addresses.
Abstract: Large Language Models (LLMs) are capable of solving complex math problems or answering difficult questions on almost any topic, but can they generate random street addresses for European cities?
[284] Population Estimation using Deep Learning over Gandhinagar Urban Area
Jai Singla, Peal Jotania, Keivalya Pandya
Main category: cs.AI
TL;DR: Deep learning approach using satellite imagery and DEM data to estimate urban population with high accuracy, achieving F1-score of 0.9936 for Gandhinagar.
Details
Motivation: Traditional population estimation methods like surveys and censuses are expensive, time-consuming, and labor-intensive, requiring automated solutions for efficient urban planning and resource allocation.
Method: Combines CNN for building classification (residential vs non-residential) and ANN for population estimation, using high-resolution satellite imagery (0.3m), DEM data (0.5m), and vector boundaries on 48k building footprints.
Result: Achieved impressive F1-score of 0.9936 and estimated Gandhinagar population at 278,954, demonstrating high accuracy in automated population estimation.
Conclusion: The framework provides municipalities with a scalable, replicable tool for optimized resource management in rapidly urbanizing cities, showcasing AI-driven geospatial analytics’ efficiency in enhancing data-driven urban governance.
Abstract: Population estimation is crucial for various applications, from resource allocation to urban planning. Traditional methods such as surveys and censuses are expensive, time-consuming and also heavily dependent on human resources, requiring significant manpower for data collection and processing. In this study, a deep learning solution is proposed to estimate population using high-resolution (0.3 m) satellite imagery, Digital Elevation Models (DEM) of 0.5 m resolution, and vector boundaries. The proposed method combines a Convolutional Neural Network (CNN) architecture to classify buildings as residential or non-residential with an Artificial Neural Network (ANN) architecture to estimate the population. Approximately 48k building footprints over the Gandhinagar urban area are utilized, containing both residential and non-residential buildings, with the residential category further used for building-level population estimation. Experimental results on a large-scale dataset demonstrate the effectiveness of our model, achieving an impressive overall F1-score of 0.9936. The proposed system employs advanced geospatial analysis with high spatial resolution to estimate Gandhinagar population at 278,954. By integrating real-time data updates, standardized metrics, and infrastructure planning capabilities, this automated approach addresses critical limitations of conventional census-based methodologies. The framework provides municipalities with a scalable and replicable tool for optimized resource management in rapidly urbanizing cities, showcasing the efficiency of AI-driven geospatial analytics in enhancing data-driven urban governance.
[285] The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features
Jeremias Ferrao, Matthijs van der Lende, Ilija Lichkovski, Clement Neo
Main category: cs.AI
TL;DR: FSRL is a transparent alignment framework that uses lightweight adapters to steer LLM behavior through interpretable features from Sparse Autoencoders, offering comparable performance to RLHF while providing mechanistic insights into alignment.
Details
Motivation: Current RLHF methods induce opaque parameter changes, making it difficult to understand what models internalize during alignment. There's a need for more transparent and interpretable alignment approaches.
Method: Feature Steering with Reinforcement Learning (FSRL) trains lightweight adapters to modulate interpretable features from Sparse Autoencoders (SAEs) to steer model behavior, providing a more transparent alternative to traditional RLHF.
Result: FSRL demonstrates comparable effectiveness to current RLHF methods for preference optimization. Mechanistic analysis reveals the adapter systematically promotes style features over explicit alignment concepts, suggesting preference optimization rewards stylistic presentation as a proxy for quality.
Conclusion: FSRL provides both an effective alignment method and a diagnostic tool for understanding alignment mechanisms, offering interpretable model control and insights into what models actually learn during preference optimization.
Abstract: Aligning large language models is critical for their usability and safety. However, the prevailing approach of Reinforcement Learning from Human Feedback (RLHF) induces diffuse, opaque parameter changes, making it difficult to discern what the model has internalized. Hence, we introduce Feature Steering with Reinforcement Learning (FSRL), a transparent alignment framework that trains a lightweight adapter to steer behavior by modulating interpretable features from a Sparse Autoencoder (SAE). First, we demonstrate that FSRL is an effective method for preference optimization and is comparable with current RLHF methods. We then perform mechanistic analysis on the trained adapter, and find that its policy systematically promotes style features over explicit alignment concepts, suggesting that the preference optimization process rewards stylistic presentation as a proxy for quality. Ultimately, we hope that FSRL provides a tool for both interpretable model control and diagnosing the internal mechanisms of alignment.
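One common way to steer a model through SAE features is to add a scaled copy of each feature's decoder direction to the hidden state. The sketch below shows only that additive step in plain Python. The function name `steer`, the toy decoder directions, and the per-feature deltas are illustrative assumptions; the abstract does not specify how FSRL's adapter computes or applies its adjustments.

```python
def steer(hidden, decoder_dirs, feature_deltas):
    """Return hidden + sum_k feature_deltas[k] * decoder_dirs[k].

    hidden:         list of floats, a hidden activation (hypothetical values)
    decoder_dirs:   one decoder direction per SAE feature (hypothetical values)
    feature_deltas: how much to raise or lower each feature; in an FSRL-like
                    setup a trained adapter would produce these (assumption)
    """
    steered = list(hidden)  # copy so the input activation is left untouched
    for direction, delta in zip(decoder_dirs, feature_deltas):
        for i, component in enumerate(direction):
            steered[i] += delta * component
    return steered


hidden = [1.0, 0.0, -1.0]
decoder_dirs = [
    [1.0, 0.0, 0.0],  # toy "style" feature direction
    [0.0, 1.0, 0.0],  # toy "explicit concept" feature direction
]
steered = steer(hidden, decoder_dirs, [0.5, -2.0])
print(steered)
```

Because each adjustment is tied to a named feature, inspecting `feature_deltas` is what would make this kind of intervention readable, in contrast to diffuse weight updates.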
[286] Black-box Model Merging for Language-Model-as-a-Service with Massive Model Repositories
Shilian Chen, Jie Zhou, Tianyu Huai, Yujiang Lu, Junsong Li, Bihao Zhan, Qianjun Pan, Yutao Yang, Xin Li, Qin Chen, Hang Yan, Liang He
Main category: cs.AI
TL;DR: Evo-Merging enables black-box model merging of large language models using only API queries through evolutionary optimization with sparsity-based denoising and sign-aware scaling.
Details
Motivation: Existing model merging methods require access to model weights, but large LLMs like GPT-4 are often provided as black-box services through APIs, making traditional approaches infeasible.
Method: Proposes derivative-free optimization using evolutionary algorithms with two key components: sparsity-based denoising to filter irrelevant information and sign-aware scaling to compute optimal combination weights dynamically.
Result: Achieves state-of-the-art results on various tasks, significantly outperforming existing baselines in black-box model merging scenarios.
Conclusion: Evo-Merging provides an effective solution for merging black-box LLMs using only inference API queries, with theoretical justification and strong empirical performance.
Abstract: Model merging refers to the process of integrating multiple distinct models into a unified model that preserves and combines the strengths and capabilities of the individual models. Most existing approaches rely on task vectors to combine models, typically under the assumption that model parameters are accessible. However, for extremely large language models (LLMs) such as GPT-4, which are often provided solely as black-box services through API interfaces (Language-Model-as-a-Service), model weights are not available to end users. This presents a significant challenge, which we refer to as black-box model merging (BMM) with massive LLMs. To address this challenge, we propose a derivative-free optimization framework based on the evolutionary algorithm (Evo-Merging) that enables effective model merging using only inference-time API queries. Our method consists of two key components: (1) sparsity-based denoising, designed to identify and filter out irrelevant or redundant information across models, and (2) sign-aware scaling, which dynamically computes optimal combination weights for the relevant models based on their performance. We also provide a formal justification, along with a theoretical analysis, for our asymmetric sparsification. Extensive experimental evaluations demonstrate that our approach achieves state-of-the-art results on a range of tasks, significantly outperforming existing strong baselines.
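The derivative-free idea can be illustrated with a generic evolution strategy that only ever calls a scoring function. This is a minimal sketch, not Evo-Merging itself: it omits sparsity-based denoising and sign-aware scaling, and `toy_score` is a hypothetical stand-in for whatever task metric would be measured through inference-time API queries.

```python
import random


def evolve_weights(score_fn, dim, generations=50, offspring=8, sigma=0.2, seed=0):
    """(1+lambda) evolution strategy over combination weights.

    Only score_fn is queried, so no gradients or model weights are needed.
    """
    rng = random.Random(seed)
    best = [1.0 / dim] * dim  # start from a uniform combination
    best_score = score_fn(best)
    for _ in range(generations):
        # Mutate the current parent into several candidate weight vectors.
        children = [[w + rng.gauss(0.0, sigma) for w in best] for _ in range(offspring)]
        top_score, top_child = max(((score_fn(c), c) for c in children), key=lambda p: p[0])
        if top_score > best_score:  # elitism: replace the parent only on improvement
            best, best_score = top_child, top_score
    return best, best_score


def toy_score(weights):
    """Hypothetical black-box metric, highest at weights (0.7, 0.3)."""
    target = [0.7, 0.3]
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


start_score = toy_score([0.5, 0.5])
weights, final_score = evolve_weights(toy_score, dim=2)
print(weights, final_score)
```

Each call to `score_fn` corresponds to one round of black-box evaluation, so the number of generations and offspring directly sets the query budget.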
[287] Forget What’s Sensitive, Remember What Matters: Token-Level Differential Privacy in Memory Sculpting for Continual Learning
Bihao Zhan, Jie Zhou, Junsong Li, Yutao Yang, Shilian Chen, Qianjun Pan, Xin Li, Wen Wu, Xingjiao Wu, Qin Chen, Hang Yan, Liang He
Main category: cs.AI
TL;DR: PeCL framework combines token-level dynamic differential privacy and privacy-guided memory sculpting to protect sensitive information while preserving task-invariant knowledge in continual learning.
Details
Motivation: Traditional uniform DP approaches degrade model utility too much in continual learning, while CL models accumulate diverse sensitive information that needs targeted protection.
Method: Token-level dynamic DP strategy that adaptively allocates privacy budgets based on semantic sensitivity, plus privacy-guided memory sculpting to forget sensitive information while preserving task-invariant knowledge.
Result: PeCL achieves superior balance between privacy and utility, outperforming baselines with high accuracy on previous tasks while ensuring robust privacy protection.
Conclusion: The proposed framework effectively addresses privacy challenges in continual learning by selectively protecting sensitive information while maintaining model performance on learned tasks.
Abstract: Continual Learning (CL) models, while adept at sequential knowledge acquisition, face significant and often overlooked privacy challenges due to accumulating diverse information. Traditional privacy methods, like a uniform Differential Privacy (DP) budget, indiscriminately protect all data, leading to substantial model utility degradation and hindering CL deployment in privacy-sensitive areas. To overcome this, we propose a privacy-enhanced continual learning (PeCL) framework that forgets what’s sensitive and remembers what matters. Our approach first introduces a token-level dynamic Differential Privacy strategy that adaptively allocates privacy budgets based on the semantic sensitivity of individual tokens. This ensures robust protection for private entities while minimizing noise injection for non-sensitive, general knowledge. Second, we integrate a privacy-guided memory sculpting module. This module leverages the sensitivity analysis from our dynamic DP mechanism to intelligently forget sensitive information from the model’s memory and parameters, while explicitly preserving the task-invariant historical knowledge crucial for mitigating catastrophic forgetting. Extensive experiments show that PeCL achieves a superior balance between privacy preserving and model utility, outperforming baseline models by maintaining high accuracy on previous tasks while ensuring robust privacy.
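To make the token-level budget idea concrete, the sketch below maps a per-token sensitivity score to a per-token epsilon and then to a Gaussian noise scale. The linear mapping, the epsilon range, and the example scores are assumptions for illustration; the abstract does not give PeCL's actual allocation rule. The noise formula is the classical Gaussian-mechanism bound, which holds for epsilon below 1.

```python
import math


def token_epsilons(sensitivities, eps_min=0.1, eps_max=0.9):
    """Map sensitivity 1.0 to eps_min (strong privacy) and 0.0 to eps_max (weak)."""
    return [eps_max - s * (eps_max - eps_min) for s in sensitivities]


def gaussian_sigma(epsilon, delta=1e-5, clip_norm=1.0):
    """Classical Gaussian-mechanism noise scale for L2 sensitivity clip_norm."""
    return math.sqrt(2.0 * math.log(1.25 / delta)) * clip_norm / epsilon


tokens = ["Alice", "lives", "in", "Paris"]
scores = [1.0, 0.0, 0.0, 0.8]  # hypothetical semantic sensitivity per token
eps = token_epsilons(scores)
sigmas = [gaussian_sigma(e) for e in eps]
for tok, e, s in zip(tokens, eps, sigmas):
    print(f"{tok:>6}  epsilon={e:.2f}  sigma={s:.2f}")
```

Sensitive tokens receive a small epsilon and therefore more noise, while general-knowledge tokens are perturbed far less than they would be under one uniform budget.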
[288] Toward PDDL Planning Copilot
Yarin Benyamin, Argaman Mordoch, Shahaf S. Shperberg, Roni Stern
Main category: cs.AI
TL;DR: Planning Copilot integrates planning tools with LLMs via the Model Context Protocol (MCP), enabling reliable long-horizon planning without domain-specific fine-tuning. It outperforms the same LLMs used without planning tools and, in a limited qualitative comparison, GPT-5.
Details
Motivation: LLMs lack reliable long-horizon planning capabilities on their own, creating a gap for autonomous task performance that needs bridging.
Method: Uses the Model Context Protocol (MCP) to connect LLMs with external planning tools, allowing natural language instructions for syntax checking, planner selection, plan validation, and execution simulation.
Result: Empirical evaluation with three open-source LLMs shows Planning Copilot substantially outperforms the same LLMs without planning tools. In a limited qualitative comparison, it also outperforms GPT-5 despite relying on a much smaller LLM.
Conclusion: Dedicated planning tools integrated via MCP may be an effective way to enable LLMs to perform planning tasks reliably.
Abstract: Large Language Models (LLMs) are increasingly being used as autonomous agents capable of performing complicated tasks. However, they lack the ability to perform reliable long-horizon planning on their own. This paper bridges this gap by introducing the Planning Copilot, a chatbot that integrates multiple planning tools and allows users to invoke them through instructions in natural language. The Planning Copilot leverages the Model Context Protocol (MCP), a recently developed standard for connecting LLMs with external tools and systems. This approach allows using any LLM that supports MCP without domain-specific fine-tuning. Our Planning Copilot supports common planning tasks such as checking the syntax of planning problems, selecting an appropriate planner, calling it, validating the plan it generates, and simulating its execution. We empirically evaluate the ability of our Planning Copilot to perform these tasks using three open-source LLMs. The results show that the Planning Copilot substantially outperforms the same LLMs used without the planning tools. We also conducted a limited qualitative comparison of our tool against Chat GPT-5, a very recent commercial LLM. Our results show that our Planning Copilot significantly outperforms GPT-5 despite relying on a much smaller LLM. This suggests dedicated planning tools may be an effective way to enable LLMs to perform planning tasks.
[289] Data-driven Methods of Extracting Text Structure and Information Transfer
Shinichi Honna, Taichi Murayama, Akira Matsui
Main category: cs.AI
TL;DR: The paper tests the Anna Karenina Principle across different media types (novels, Wikipedia, research papers, movies) by analyzing structural patterns in text sequences. Results show medium-specific structural constraints determine success, while failure patterns vary by domain.
Details
Motivation: To empirically test the Anna Karenina Principle and its variations across different text-based media to understand how structural patterns relate to success and failure in different domains.
Method: Represent texts as sequences of functional blocks and assess convergence in transition order and position across novels, online encyclopedias, research papers, and movies.
Result: Structural principles vary by medium: novels follow reverse AKP in order, Wikipedia combines AKP with ordered patterns, academic papers display reverse AKP in order but remain noisy in position, and movies diverge by genre.
Conclusion: Success depends on medium-specific structural constraints, while failure assumes different forms across domains, demonstrating that structural patterns are context-dependent rather than universal.
Abstract: The Anna Karenina Principle (AKP) holds that success requires satisfying a small set of essential conditions, whereas failure takes diverse forms. We test AKP, its reverse, and two further patterns described as ordered and noisy across novels, online encyclopedias, research papers, and movies. Texts are represented as sequences of functional blocks, and convergence is assessed in transition order and position. Results show that structural principles vary by medium: novels follow reverse AKP in order, Wikipedia combines AKP with ordered patterns, academic papers display reverse AKP in order but remain noisy in position, and movies diverge by genre. Success therefore depends on structural constraints that are specific to each medium, while failure assumes different shapes across domains.
[290] A Visualized Framework for Event Cooperation with Generative Agents
Yuyang Tian, Shunqiang Mao, Wenchang Gao, Lanlan Qiu, Tianxing He
Main category: cs.AI
TL;DR: MiniAgentPro is a visualization platform for LLM-based agent societies that addresses limitations in event organization evaluation and physical environment integration, featuring a map editor and simulation player with comprehensive test scenarios.
Details
Motivation: Existing LLM agent frameworks lack systematic evaluation for event organization and visualized integration with physically grounded environments, limiting realistic spatial navigation and item interaction capabilities.
Method: Developed MiniAgentPro platform with intuitive map editor for environment customization and simulation player with smooth animations. Created comprehensive test set with 8 diverse event scenarios in basic and hard variants to assess agent capabilities.
Result: Evaluations using GPT-4o showed strong performance in basic settings but revealed coordination challenges in hard variants, demonstrating the platform’s effectiveness in identifying agent limitations.
Conclusion: MiniAgentPro provides a valuable visualization and evaluation framework for LLM-based agent societies, successfully identifying both capabilities and coordination challenges in complex physical environments.
Abstract: Large Language Models (LLMs) have revolutionized the simulation of agent societies, enabling autonomous planning, memory formation, and social interactions. However, existing frameworks often overlook systematic evaluations for event organization and lack visualized integration with physically grounded environments, limiting agents’ ability to navigate spaces and interact with items realistically. We develop MiniAgentPro, a visualization platform featuring an intuitive map editor for customizing environments and a simulation player with smooth animations. Based on this tool, we introduce a comprehensive test set comprising eight diverse event scenarios with basic and hard variants to assess agents’ ability. Evaluations using GPT-4o demonstrate strong performance in basic settings but highlight coordination challenges in hard variants.
[291] Reasoning with Preference Constraints: A Benchmark for Language Models in Many-to-One Matching Markets
Marylou Fauchard, Florian Carichon, Margarida Carvalho, Golnoosh Farnadi
Main category: cs.AI
TL;DR: LLMs struggle with matching problems requiring preferential constraints. New benchmark shows reasoning models outperform traditional ones, but no prompt strategy consistently works best, and iterative performance isn’t monotonic.
Details
Motivation: While LLMs excel at mathematical reasoning, their application to matching problems with preferential and structural constraints remains underexplored, creating a gap in understanding their capabilities for combinatorial optimization with preferences.
Method: Created a benchmark of 369 College Admission Problem instances to evaluate LLMs on feasibility, stability, and optimality. Tested multiple open-weight LLMs with various prompting strategies including Chain-of-Thought, In-Context Learning, and role-based prompting.
Result: LLMs struggle to consistently meet all evaluation criteria. Reasoning models (QwQ, GPT-oss) significantly outperform traditional models (Llama, Qwen, Mistral). No prompt strategy consistently performed best, and iterative prompting with auto-generated feedback showed non-monotonic performance that peaks early then declines.
Conclusion: This work provides new insights into LLM reasoning performance and prompt effectiveness for combinatorial optimization with preferential constraints, highlighting both the potential and limitations of current approaches.
Abstract: Recent advances in reasoning with large language models (LLMs) have demonstrated strong performance on complex mathematical tasks, including combinatorial optimization. Techniques such as Chain-of-Thought and In-Context Learning have further enhanced this capability, making LLMs both powerful and accessible tools for a wide range of users, including non-experts. However, applying LLMs to matching problems, which require reasoning under preferential and structural constraints, remains underexplored. To address this gap, we introduce a novel benchmark of 369 instances of the College Admission Problem, a canonical example of a matching problem with preferences, to evaluate LLMs across key dimensions: feasibility, stability, and optimality. We employ this benchmark to assess the performance of several open-weight LLMs. Our results first reveal that while LLMs can satisfy certain constraints, they struggle to meet all evaluation criteria consistently. They also show that reasoning LLMs, like QwQ and GPT-oss, significantly outperform traditional models such as Llama, Qwen or Mistral, defined here as models used without any dedicated reasoning mechanisms. Moreover, we observed that LLMs reacted differently to the various prompting strategies tested, which include Chain-of-Thought, In-Context Learning and role-based prompting, with no prompt consistently offering the best performance. Finally, we report the performances from iterative prompting with auto-generated feedback and show that they are not monotonic; they can peak early and then significantly decline in later attempts. Overall, this work offers a new perspective on model reasoning performance and the effectiveness of prompting strategies in combinatorial optimization problems with preferential constraints.
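The feasibility and stability criteria can be checked mechanically for a proposed many-to-one matching. The sketch below is an illustration under simplifying assumptions (complete, strict preference lists; the instance, names, and function signatures are hypothetical) and is not the benchmark's own evaluation code. A pair (student, college) blocks a matching when the student prefers that college to the current assignment and the college either has a free seat or prefers the student to someone it admitted.

```python
def is_feasible(matching, capacities):
    """Every assigned college must exist and stay within its capacity."""
    load = {c: 0 for c in capacities}
    for college in matching.values():
        if college is None:
            continue
        if college not in load:
            return False
        load[college] += 1
    return all(load[c] <= capacities[c] for c in capacities)


def blocking_pairs(matching, student_prefs, college_prefs, capacities):
    """List (student, college) pairs that would both rather be matched together."""
    admitted = {c: [s for s, m in matching.items() if m == c] for c in capacities}
    pairs = []
    for s, prefs in student_prefs.items():
        current = matching.get(s)
        # Colleges the student strictly prefers to the current assignment.
        better = prefs if current is None else prefs[:prefs.index(current)]
        for c in better:
            rank = college_prefs[c].index  # lower index means more preferred
            has_seat = len(admitted[c]) < capacities[c]
            displaces = any(rank(s) < rank(t) for t in admitted[c])
            if has_seat or displaces:
                pairs.append((s, c))
    return pairs


capacities = {"X": 1, "Y": 2}
student_prefs = {"a": ["X", "Y"], "b": ["X", "Y"], "c": ["Y", "X"]}
college_prefs = {"X": ["b", "a", "c"], "Y": ["a", "b", "c"]}

stable = {"a": "Y", "b": "X", "c": "Y"}
unstable = {"a": "X", "b": "Y", "c": "Y"}
print(blocking_pairs(stable, student_prefs, college_prefs, capacities))
print(blocking_pairs(unstable, student_prefs, college_prefs, capacities))
```

In the second matching, student `b` and college `X` prefer each other to their current partners, so the matching is feasible but not stable.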
[292] G-CSEA: A Graph-Based Conflict Set Extraction Algorithm for Identifying Infeasibility in Pseudo-Boolean Models
Kanishk Garg, Saranya D., Sanal Kumar, Saurabh Singh, Anupam Purwar
Main category: cs.AI
TL;DR: A graph-based algorithm (G-CSEA) for extracting irreducible infeasible subsets in workforce scheduling models with pseudo-Boolean constraints, inspired by SAT solver conflict analysis.
Details
Motivation: Traditional IIS extraction methods like Additive Deletion and QuickXplain require many solver calls, while dual ray analysis fails for pseudo-Boolean models when the relaxed problem is feasible but the original is infeasible.
Method: Proposes Graph-based Conflict Set Extraction Algorithm (G-CSEA) that constructs an implication graph during constraint propagation and traces contributing constraints across decision branches upon conflict detection.
Result: The method efficiently extracts conflict sets that can be minimized to produce Irreducible Infeasible Subsets (IISs), reducing the number of solver calls compared to existing approaches.
Conclusion: G-CSEA provides an effective alternative for identifying minimal constraint conflicts in workforce scheduling models with pseudo-Boolean constraints, overcoming limitations of traditional IIS extraction methods.
Abstract: Workforce scheduling involves a variety of rule-based constraints (such as shift limits, staffing policies, working hour restrictions, and many similar scheduling rules), which can interact in conflicting ways, leading to infeasible models. Identifying the underlying causes of such infeasibility is critical for resolving scheduling issues and restoring feasibility. A common diagnostic approach is to compute Irreducible Infeasible Subsets (IISs): minimal sets of constraints that are jointly infeasible but become feasible when any one is removed. We consider models formulated using pseudo-Boolean constraints with inequality relations over binary variables, which naturally encode scheduling logic. Existing IIS extraction methods such as Additive Deletion and QuickXplain rely on repeated feasibility checks, often incurring large numbers of solver calls. Dual ray analysis, while effective for LP-based models, may fail when the relaxed problem is feasible but the underlying pseudo-Boolean model is not. To address these limitations, we propose the Graph-based Conflict Set Extraction Algorithm (G-CSEA) to extract a conflict set, an approach inspired by Conflict-Driven Clause Learning (CDCL) in SAT solvers. Our method constructs an implication graph during constraint propagation and, upon detecting a conflict, traces all contributing constraints across both decision branches. The resulting conflict set can optionally be minimized using QuickXplain to produce an IIS.
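The IIS definition can be made concrete with the classic deletion filter, one of the repeated-feasibility-check baselines rather than G-CSEA itself. In this sketch a brute-force enumeration stands in for a real solver, and the constraint encoding and toy instance are assumptions chosen for illustration. Each constraint is a pair `(coeffs, bound)` meaning that the weighted sum of binary variables is at least `bound`.

```python
from itertools import product


def feasible(constraints, variables):
    """Brute-force check: does any 0/1 assignment satisfy every constraint?"""
    for bits in product((0, 1), repeat=len(variables)):
        x = dict(zip(variables, bits))
        if all(sum(a * x[v] for v, a in coeffs.items()) >= bound
               for coeffs, bound in constraints):
            return True
    return False


def deletion_filter(constraints, variables):
    """Shrink an infeasible constraint list to an irreducible infeasible subset."""
    if feasible(constraints, variables):
        raise ValueError("constraint set is feasible; there is no IIS")
    core = list(constraints)
    i = 0
    while i < len(core):
        trial = core[:i] + core[i + 1:]
        if feasible(trial, variables):
            i += 1        # removing it restores feasibility, so it belongs to the IIS
        else:
            core = trial  # still infeasible without it, so drop it for good
    return core


variables = ["x", "y"]
c1 = ({"x": 1, "y": 1}, 2)  # x + y >= 2: both shifts must be staffed
c2 = ({"x": -1}, 0)         # -x >= 0: shift x must stay empty
c3 = ({"y": 1}, 1)          # y >= 1: shift y must be staffed
core = deletion_filter([c1, c2, c3], variables)
print(core)
```

The filter makes one feasibility call per constraint, which is the solver-call cost that motivates extracting a smaller conflict set first.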
[293] Simulating Clinical AI Assistance using Multimodal LLMs: A Case Study in Diabetic Retinopathy
Nadim Barakat, William Lotter
Main category: cs.AI
TL;DR: This paper evaluates multimodal large language models (MLLMs) for diabetic retinopathy detection, comparing GPT-4o and MedGemma across different output formats and collaboration scenarios to assess their potential for clinical AI assistance.
Details
Motivation: Current FDA-cleared diabetic retinopathy screening systems provide only binary referral outputs, which may limit clinical trust and utility. The research aims to determine the most effective output format to enhance clinician-AI performance through scalable simulation.
Method: The study tested GPT-4o and MedGemma on IDRiD and Messidor-2 datasets through three experiments: baseline evaluation, simulated AI assistance with synthetic predictions, and actual AI-to-AI collaboration where GPT-4o incorporated MedGemma outputs.
Result: MedGemma outperformed GPT-4o at baseline with higher sensitivity and AUROC. GPT-4o showed near-perfect specificity but low sensitivity. In collaboration, GPT-4o achieved strong results (AUROC up to 0.96) when guided by MedGemma’s descriptive outputs, even without direct image access.
Conclusion: MLLMs can improve DR screening pipelines and serve as scalable simulators for studying clinical AI assistance. Open models like MedGemma are valuable in low-resource settings, while descriptive outputs enhance explainability and clinician trust.
Abstract: Diabetic retinopathy (DR) is a leading cause of blindness worldwide, and AI systems can expand access to fundus photography screening. Current FDA-cleared systems primarily provide binary referral outputs, where this minimal output may limit clinical trust and utility. Yet, determining the most effective output format to enhance clinician-AI performance is an empirical challenge that is difficult to assess at scale. We evaluated multimodal large language models (MLLMs) for DR detection and their ability to simulate clinical AI assistance across different output types. Two models were tested on IDRiD and Messidor-2: GPT-4o, a general-purpose MLLM, and MedGemma, an open-source medical model. Experiments included: (1) baseline evaluation, (2) simulated AI assistance with synthetic predictions, and (3) actual AI-to-AI collaboration where GPT-4o incorporated MedGemma outputs. MedGemma outperformed GPT-4o at baseline, achieving higher sensitivity and AUROC, while GPT-4o showed near-perfect specificity but low sensitivity. Both models adjusted predictions based on simulated AI inputs, but GPT-4o’s performance collapsed with incorrect ones, whereas MedGemma remained more stable. In actual collaboration, GPT-4o achieved strong results when guided by MedGemma’s descriptive outputs, even without direct image access (AUROC up to 0.96). These findings suggest MLLMs may improve DR screening pipelines and serve as scalable simulators for studying clinical AI assistance across varying output configurations. Open, lightweight models such as MedGemma may be especially valuable in low-resource settings, while descriptive outputs could enhance explainability and clinician trust in clinical workflows.
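Sensitivity and specificity, the two metrics contrasted above, follow directly from the confusion counts. The sketch below uses made-up labels and predictions (not data from the paper) to show how a model can reach perfect specificity while missing most positive cases.

```python
def sensitivity_specificity(labels, preds):
    """Return (sensitivity, specificity) for binary labels, where 1 means disease.

    Assumes both classes occur in labels, so neither denominator is zero.
    """
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical screening results: the model rarely flags anyone.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
preds = [1, 0, 0, 0, 0, 0, 0, 0]
sens, spec = sensitivity_specificity(labels, preds)
print(sens, spec)  # 0.25 1.0
```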
[294] A Scenario-Driven Cognitive Approach to Next-Generation AI Memory
Linyue Cai, Yuyang Cheng, Xiaoding Shao, Huiming Wang, Yong Zhao, Wei Zhang, Kang Li
Main category: cs.AI
TL;DR: Proposes COLMA, a cognitive layered memory architecture for AGI systems that addresses limitations in adaptability, multimodal integration, and continuous learning through scenario-driven design principles.
Details
Motivation: Current AI memory systems lack robustness and human-like qualities, and struggle with adaptability, multimodal integration, and continuous learning, limitations that hinder progress toward artificial general intelligence.
Method: Scenario-driven methodology that extracts functional requirements from cognitive scenarios to establish design principles, resulting in the COgnitive Layered Memory Architecture (COLMA) framework that integrates scenarios, memory processes, and storage mechanisms.
Result: Development of COLMA as a structured framework that provides a foundation for AI systems capable of lifelong learning and human-like reasoning.
Conclusion: COLMA contributes to the pragmatic development of AGI by offering a cohesive memory architecture that addresses key limitations in current systems and supports more human-like cognitive capabilities.
Abstract: As artificial intelligence advances toward artificial general intelligence (AGI), the need for robust and human-like memory systems has become increasingly evident. Current memory architectures often suffer from limited adaptability, insufficient multimodal integration, and an inability to support continuous learning. To address these limitations, we propose a scenario-driven methodology that extracts essential functional requirements from representative cognitive scenarios, leading to a unified set of design principles for next-generation AI memory systems. Based on this approach, we introduce the COgnitive Layered Memory Architecture (COLMA), a novel framework that integrates cognitive scenarios, memory processes, and storage mechanisms into a cohesive design. COLMA provides a structured foundation for developing AI systems capable of lifelong learning and human-like reasoning, thereby contributing to the pragmatic development of AGI.
[295] RepIt: Representing Isolated Targets to Steer Language Models
Vincent Siu, Nathan W. Henry, Nicholas Crispino, Yang Liu, Dawn Song, Chenguang Wang
Main category: cs.AI
TL;DR: RepIt is a simple, data-efficient framework for isolating concept-specific representations in LLMs, enabling precise interventions that suppress refusal on targeted concepts while preserving safety elsewhere.
Details
Motivation: Current activation steering methods in LLMs often have broader effects than desired, motivating the need for purer concept vectors to enable targeted interventions and understand model behavior at a more granular level.
Method: RepIt isolates concept-specific representations. The corrective signal localizes to just 100-200 neurons, and robust target representations can be extracted from as few as a dozen examples on a single A6000.
Result: Across five frontier LLMs, RepIt enables precise interventions that selectively suppress refusal on targeted concepts while preserving refusal elsewhere, allowing models to answer WMD-related questions while still scoring as safe on standard benchmarks.
Conclusion: Targeted interventions can counteract overgeneralization in LLMs, laying the foundation for more granular control of model behavior, though the efficiency raises concerns about potential misuse with modest compute and data.
Abstract: While activation steering in large language models (LLMs) is a growing area of research, methods can often incur broader effects than desired. This motivates isolation of purer concept vectors to enable targeted interventions and understand LLM behavior at a more granular level. We present RepIt, a simple and data-efficient framework for isolating concept-specific representations. Across five frontier LLMs, RepIt enables precise interventions: it selectively suppresses refusal on targeted concepts while preserving refusal elsewhere, producing models that answer WMD-related questions while still scoring as safe on standard benchmarks. We further show that the corrective signal localizes to just 100-200 neurons and that robust target representations can be extracted from as few as a dozen examples on a single A6000. This efficiency raises a dual concern: manipulations can be performed with modest compute and data to extend to underrepresented data-scarce topics while evading existing benchmarks. By disentangling refusal vectors with RepIt, this work demonstrates that targeted interventions can counteract overgeneralization, laying the foundation for more granular control of model behavior.
[296] Shapes of Cognition for Computational Cognitive Modeling
Marjorie McShane, Sergei Nirenburg, Sanjay Oruganti, Jesse English
Main category: cs.AI
TL;DR: Shapes of cognition is a new paradigm for modeling intelligent agents that uses remembered knowledge constellations to simplify real-world complexity through pattern recognition, analogy, and cognitive load minimization.
Details
Motivation: To create a computational cognitive modeling approach that enables Language-Endowed Intelligent Agents (LEIAs) to handle real-life complexity the way humans do: by expecting typical patterns, minimizing cognitive load, and recovering from atypical situations.
Method: Uses shapes (remembered constellations of sensory, linguistic, conceptual, episodic, and procedural knowledge) within a specific cognitive architecture. Includes shapes-based recovery methods for atypical outcomes like on-the-fly learning, human assistance, and situational understanding.
Result: Provides a concrete framework for building explainable, extensible, and trustworthy agent systems that can operate in critical domains. The approach gives new life to knowledge-based and hybrid AI systems.
Conclusion: Shapes-based modeling offers a specific yet broadly applicable paradigm for cognitive modeling that combines knowledge representation with practical implementation strategies, enabling the development of reliable intelligent agents that mimic human cognitive efficiency.
Abstract: Shapes of cognition is a new conceptual paradigm for the computational cognitive modeling of Language-Endowed Intelligent Agents (LEIAs). Shapes are remembered constellations of sensory, linguistic, conceptual, episodic, and procedural knowledge that allow agents to cut through the complexity of real life the same way as people do: by expecting things to be typical, recognizing patterns, acting by habit, reasoning by analogy, satisficing, and generally minimizing cognitive load to the degree situations permit. Atypical outcomes are treated using shapes-based recovery methods, such as learning on the fly, asking a human partner for help, or seeking an actionable, even if imperfect, situational understanding. Although shapes is an umbrella term, it is not vague: shapes-based modeling involves particular objectives, hypotheses, modeling strategies, knowledge bases, and actual models of wide-ranging phenomena, all implemented within a particular cognitive architecture. Such specificity is needed both to vet our hypotheses and to achieve our practical aims of building useful agent systems that are explainable, extensible, and worthy of our trust, even in critical domains. However, although the LEIA example of shapes-based modeling is specific, the principles can be applied more broadly, giving new life to knowledge-based and hybrid AI.
[297] Concurrent Linguistic Error Detection (CLED): a New Methodology for Error Detection in Large Language Models
Jinhua Zhu, Javier Conde, Zhen Gao, Pedro Reviriego, Shanshan Liu, Fabrizio Lombardi
Main category: cs.AI
TL;DR: CLED is a concurrent linguistic error detection scheme that uses linguistic features from LLM outputs to detect errors without needing access to internal model nodes, achieving high error detection with low overhead.
Details
Motivation: Large language models are widely adopted but their black-box nature prevents traditional error detection methods that require internal node access. Error detection is crucial for system reliability.
Method: Extracts linguistic features from LLM-generated text and feeds them to a concurrent classifier that detects errors based on text validity and deviation from normal patterns.
Result: CLED effectively detects most errors in both T5 (news summarization) and OPUS-MT (translation) models with low overhead, using the same linguistic feature set across different applications.
Conclusion: The proposed scheme provides an effective error detection method for black-box LLMs, offers flexibility through trade-offs between detection effectiveness and overhead, and demonstrates broad applicability beyond specific use cases.
Abstract: The wide adoption of Large language models (LLMs) makes their dependability a pressing concern. Detection of errors is the first step to mitigating their impact on a system and thus, efficient error detection for LLMs is an important issue. In many settings, the LLM is considered as a black box with no access to the internal nodes; this prevents the use of many error detection schemes that need access to the model’s internal nodes. An interesting observation is that the output of LLMs in error-free operation should be valid and normal text. Therefore, when the text is not valid or differs significantly from normal text, it is likely that there is an error. Based on this observation we propose to perform Concurrent Linguistic Error Detection (CLED); this scheme extracts some linguistic features of the text generated by the LLM and feeds them to a concurrent classifier that detects errors. Since the proposed error detection mechanism relies only on the outputs of the model, it can be used on LLMs in which there is no access to the internal nodes. The proposed CLED scheme has been evaluated on the T5 model when used for news summarization and on the OPUS-MT model when used for translation. In both cases, the same set of linguistic features has been used for error detection to illustrate the applicability of the proposed scheme beyond a specific case. The results show that CLED can detect most of the errors at a low overhead penalty. The use of the concurrent classifier also enables a trade-off between error detection effectiveness and its associated overhead, thus giving the designer flexibility.
[298] Federated Cross-Training Learners for Robust Generalization under Data Heterogeneity
Zhuang Qi, Lei Meng, Ruohan Zhang, Yu Wang, Xin Qi, Xiangxu Meng, Han Yu, Qiang Yang
Main category: cs.AI
TL;DR: FedCT is a federated learning cross-training scheme that uses multi-view knowledge distillation to address feature space heterogeneity and improve model generalization across clients with different data distributions.
Details
Motivation: Federated learning faces challenges with misaligned optimization goals and feature space heterogeneity due to inherent differences in data distributions across clients, even after cross-training.
Method: Proposes FedCT with three modules: consistency-aware knowledge broadcasting for optimal model assignment, multi-view knowledge-guided representation learning using fused prototypical knowledge from global and local views, and mixup-based feature augmentation to increase feature space diversity.
Result: Extensive experiments on four datasets show FedCT outperforms state-of-the-art methods by alleviating knowledge forgetting from both local and global views.
Conclusion: FedCT effectively preserves client-specific characteristics while ensuring feature alignment across clients through multi-view knowledge distillation, achieving superior performance in federated learning scenarios.
Abstract: Federated learning benefits from cross-training strategies, which enable models to train on data from distinct sources to improve generalization capability. However, due to inherent differences in data distributions, the optimization goals of local models remain misaligned, and this mismatch continues to manifest as feature space heterogeneity even after cross-training. We argue that knowledge distillation from the personalized view preserves client-specific characteristics and expands the local knowledge base, while distillation from the global view provides consistent semantic anchors that facilitate feature alignment across clients. To achieve this goal, this paper presents a cross-training scheme, termed FedCT, which includes three main modules. The consistency-aware knowledge broadcasting module aims to optimize model assignment strategies, which enhances collaborative advantages between clients and achieves an efficient federated learning process. The multi-view knowledge-guided representation learning module leverages fused prototypical knowledge from both global and local views to enhance the preservation of local knowledge before and after model exchange, as well as to ensure consistency between local and global knowledge. The mixup-based feature augmentation module aggregates rich information to further increase the diversity of feature spaces, which enables the model to better discriminate complex samples. Extensive experiments were conducted on four datasets in terms of performance comparison, ablation study, in-depth analysis and case study. The results demonstrated that FedCT alleviates knowledge forgetting from both local and global views, which enables it to outperform state-of-the-art methods.
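For readers unfamiliar with mixup, the following Python sketch shows the generic technique applied to feature vectors: two samples and their one-hot labels are blended with a coefficient drawn from a Beta distribution. This is the standard mixup recipe, not FedCT's actual module; the function name, the `alpha` value, and the toy vectors are assumptions for illustration.

```python
import random
from typing import List, Tuple


def mixup_features(feat_a: List[float], feat_b: List[float],
                   label_a: List[float], label_b: List[float],
                   alpha: float = 0.2,
                   rng: random.Random = random.Random(0)
                   ) -> Tuple[List[float], List[float], float]:
    """Blend two feature vectors and their soft labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    mixed_feat = [lam * x + (1.0 - lam) * y for x, y in zip(feat_a, feat_b)]
    mixed_label = [lam * p + (1.0 - lam) * q for p, q in zip(label_a, label_b)]
    return mixed_feat, mixed_label, lam


# Toy example: two 3-dimensional features from different classes.
mixed_feat, mixed_label, lam = mixup_features(
    [1.0, 0.0, 2.0], [0.0, 4.0, 2.0],
    [1.0, 0.0], [0.0, 1.0],
)
```

Each mixed feature lies on the line segment between the two inputs, so the augmented points fill in regions of feature space that neither client's raw data covers, which is the kind of added diversity the abstract attributes to this module.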
[299] Overcoming classic challenges for artificial neural networks by providing incentives and practice
Kazuki Irie, Brenden M. Lake
Main category: cs.AI
TL;DR: Metalearning approaches address classic ANN weaknesses by providing explicit incentives and practice opportunities, contrasting with conventional methods that hope desired behaviors emerge indirectly.
Details
Motivation: To overcome key weaknesses in artificial neural network models compared to human cognitive abilities, particularly addressing the Problem of Incentive and Practice.
Method: Using metalearning to provide machines with both incentives to improve specific skills and opportunities to practice those skills, with applications to systematic generalization, catastrophic forgetting, few-shot learning, and multi-step reasoning.
Result: Metalearning helps address four classic ANN challenges, and large language models incorporate aspects of this framework (sequence prediction with feedback on diverse data), explaining some of their successes.
Conclusion: The framework shows promise for understanding human development and whether natural environments provide appropriate incentives and practice for making challenging generalizations.
Abstract: Since the earliest proposals for artificial neural network (ANN) models of the mind and brain, critics have pointed out key weaknesses in these models compared to human cognitive abilities. Here we review recent work that uses metalearning to overcome several classic challenges, which we characterize as addressing the Problem of Incentive and Practice – that is, providing machines with both incentives to improve specific skills and opportunities to practice those skills. This explicit optimization contrasts with more conventional approaches that hope the desired behaviour will emerge through optimizing related but different objectives. We review applications of this principle to addressing four classic challenges for ANNs: systematic generalization, catastrophic forgetting, few-shot learning and multi-step reasoning. We also discuss how large language models incorporate key aspects of this metalearning framework (namely, sequence prediction with feedback trained on diverse data), which helps to explain some of their successes on these classic challenges. Finally, we discuss the prospects for understanding aspects of human development through this framework, and whether natural environments provide the right incentives and practice for learning how to make challenging generalizations.
[300] Probing LLM Hallucination from Within: Perturbation-Driven Approach via Internal Knowledge
Seongmin Lee, Hsiang Hsu, Chun-Fu Chen, Duen Horng Chau
Main category: cs.AI
TL;DR: SHINE is a novel hallucination probing method that classifies LLM-generated text into three categories without external knowledge, supervised training, or LLM fine-tuning, achieving state-of-the-art performance in hallucination detection.
Details
Motivation: Current hallucination detection methods rely on external knowledge, LLM fine-tuning, or large labeled datasets, and fail to distinguish between different types of hallucinations, which is crucial for improving detection performance.
Method: SHINE uses hallucination probing by perturbing key entities in prompts to differentially affect LLM’s generation of three text types (aligned, misaligned, fabricated), requiring no external knowledge, supervised training, or LLM fine-tuning.
Result: SHINE is effective at hallucination probing across three modern LLMs and achieves state-of-the-art performance in hallucination detection, outperforming seven competing methods across four datasets and four LLMs.
Conclusion: The approach underscores the importance of hallucination probing for accurate detection, showing that distinguishing between different hallucination types significantly enhances detection performance without requiring external resources.
Abstract: LLM hallucination, where unfaithful text is generated, presents a critical challenge for LLMs’ practical applications. Current detection methods often resort to external knowledge, LLM fine-tuning, or supervised training with large hallucination-labeled datasets. Moreover, these approaches do not distinguish between different types of hallucinations, which is crucial for enhancing detection performance. To address such limitations, we introduce hallucination probing, a new task that classifies LLM-generated text into three categories: aligned, misaligned, and fabricated. Driven by our novel discovery that perturbing key entities in prompts affects LLM’s generation of these three types of text differently, we propose SHINE, a novel hallucination probing method that does not require external knowledge, supervised training, or LLM fine-tuning. SHINE is effective in hallucination probing across three modern LLMs, and achieves state-of-the-art performance in hallucination detection, outperforming seven competing methods across four datasets and four LLMs, underscoring the importance of probing for accurate detection.
[301] CredID: Credible Multi-Bit Watermark for Large Language Models Identification
Haoyu Jiang, Xuhong Wang, Ping Yi, Shanzhe Lei, Yilun Lin
Main category: cs.AI
TL;DR: CredID is a multi-party watermarking framework with TTP coordination that enables privacy-preserving LLM text identification while maintaining text quality and achieving high accuracy.
Details
Motivation: Address privacy and security concerns in LLMs due to lack of identity recognition, while overcoming limitations of current watermarking algorithms in text quality, information capacity, and robustness.
Method: Proposes a multi-party credible watermarking framework involving trusted third party (TTP) and multiple LLM vendors. Uses seed-based watermark generation without sharing user prompts, and coordinated extraction/verification. Also introduces a novel multi-bit watermarking algorithm.
Result: Experiments show enhanced watermark credibility and efficiency without compromising text quality. Achieved highly accurate identification among multiple LLM vendors.
Conclusion: CredID provides a credible watermarking solution that preserves vendor privacy while addressing current limitations in LLM text identification and watermarking techniques.
Abstract: Large Language Models (LLMs) are widely used in complex natural language processing tasks but raise privacy and security concerns due to the lack of identity recognition. This paper proposes a multi-party credible watermarking framework (CredID) involving a trusted third party (TTP) and multiple LLM vendors to address these issues. In the watermark embedding stage, vendors request a seed from the TTP to generate watermarked text without sending the user’s prompt. In the extraction stage, the TTP coordinates each vendor to extract and verify the watermark from the text. This provides a credible watermarking scheme while preserving vendor privacy. Furthermore, current watermarking algorithms struggle with text quality, information capacity, and robustness, making it challenging to meet the diverse identification needs of LLMs. Thus, we propose a novel multi-bit watermarking algorithm and an open-source toolkit to facilitate research. Experiments show our CredID enhances watermark credibility and efficiency without compromising text quality. Additionally, we successfully utilized this framework to achieve highly accurate identification among multiple LLM vendors.
[302] Robust Decision-Making Via Free Energy Minimization
Allahkaram Shafiei, Hozefa Jesawada, Karl Friston, Giovanni Russo
Main category: cs.AI
TL;DR: DR-FREE is a distributionally robust free energy model that enables autonomous agents to maintain optimal performance despite training-environment mismatches, outperforming state-of-the-art methods in ambiguous navigation tasks.
Details
Motivation: Current autonomous agents fail catastrophically when faced with minor mismatches between training and environmental conditions, creating a critical need for robustness in real-world deployments.
Method: Combines robust free energy principle extension with novel resolution engine to produce optimal-yet-robust policies through free energy minimization, featuring explicit soft-max structure and Bayesian belief updating.
Result: DR-FREE successfully enables robot navigation in ambiguous obstacle-filled environments where state-of-the-art free energy models fail, demonstrating superior robustness across all experiments.
Conclusion: The approach represents a milestone for deploying robust agents in multi-agent settings and may provide biological plausibility for how natural agents survive in unpredictable environments with minimal training.
Abstract: Despite their groundbreaking performance, state-of-the-art autonomous agents can misbehave when training and environmental conditions become inconsistent, with minor mismatches leading to undesirable behaviors or even catastrophic failures. Robustness towards these training/environment ambiguities is a core requirement for intelligent agents and its fulfillment is a long-standing challenge when deploying agents in the real world. Here, we introduce a Distributionally Robust Free Energy model (DR-FREE) that instills this core property by design. It directly wires robustness into the agent decision-making mechanisms via free energy minimization. By combining a robust extension of the free energy principle with a novel resolution engine, DR-FREE returns a policy that is optimal-yet-robust against ambiguity. The policy has an explicit, soft-max, structure that reveals the mechanistic role of ambiguity on optimal decisions and requisite Bayesian belief updating. We evaluate DR-FREE on an experimental testbed involving real rovers navigating an ambiguous environment filled with obstacles. Across all the experiments, DR-FREE enables robots to successfully navigate towards their goal even when, in contrast, state-of-the-art free energy models fail. In short, DR-FREE can tackle scenarios that elude previous methods: this milestone may inspire both deployment in multi-agent settings and, at a perhaps deeper level, the quest for a biologically plausible explanation of how natural agents – with little or no training – survive in capricious environments.
[303] Comprehend, Divide, and Conquer: Feature Subspace Exploration via Multi-Agent Hierarchical Reinforcement Learning
Weiliang Zhang, Xiaohan Huang, Yi Du, Ziyue Qiao, Qingqing Long, Zhen Meng, Yuanchun Zhou, Meng Xiao
Main category: cs.AI
TL;DR: HRLFS is a novel hierarchical reinforcement learning approach for feature selection that uses LLM-based feature clustering to reduce agent complexity and improve efficiency compared to traditional one-agent-per-feature methods.
Details
Motivation: Current RL-based feature selection methods face challenges with complex datasets due to inefficient one-agent-per-feature paradigms and dataset complexities, limiting their scalability and performance.
Method: Uses LLM-based hybrid state extractor to capture mathematical and semantic features, clusters features, and constructs hierarchical agents for each cluster and sub-cluster to optimize feature subspace exploration.
Result: Extensive experiments show HRLFS improves downstream ML performance with iterative feature subspace exploration while accelerating runtime by reducing agent count compared to contemporary approaches.
Conclusion: HRLFS provides an efficient, scalable, and robust feature selection solution that addresses the limitations of traditional RL methods through hierarchical agent organization and LLM-enhanced feature characterization.
Abstract: Feature selection aims to preprocess the target dataset, find an optimal and most streamlined feature subset, and enhance the downstream machine learning task. Among filter, wrapper, and embedded-based approaches, the reinforcement learning (RL)-based subspace exploration strategy provides a novel objective optimization-directed perspective and promising performance. Nevertheless, even with improved performance, current reinforcement learning approaches face challenges similar to conventional methods when dealing with complex datasets. These challenges stem from the inefficient paradigm of using one agent per feature and the inherent complexities present in the datasets. This observation motivates us to investigate and address the above issue and propose a novel approach, namely HRLFS. Our methodology initially employs a Large Language Model (LLM)-based hybrid state extractor to capture each feature’s mathematical and semantic characteristics. Based on this information, features are clustered, facilitating the construction of hierarchical agents for each cluster and sub-cluster. Extensive experiments demonstrate the efficiency, scalability, and robustness of our approach. Compared to contemporary or the one-feature-one-agent RL-based approaches, HRLFS improves the downstream ML performance with iterative feature subspace exploration while accelerating total run time by reducing the number of agents involved.
[304] Random Rule Forest (RRF): Interpretable Ensembles of LLM-Generated Questions for Predicting Startup Success
Ben Griffin, Diego Vidaurre, Ugur Koyluoglu, Joseph Ternasky, Fuat Alican, Yigit Ihlamur
Main category: cs.AI
TL;DR: RRF is a new ensemble method that uses LLMs to generate simple YES/NO questions as weak learners, achieving significant improvements in predicting startup success while maintaining interpretability.
Details
Motivation: Predicting rare outcomes like startup success requires models that are both accurate and interpretable for high-stakes decision making in venture capital.
Method: Random Rule Forest (RRF) uses LLMs to generate natural language YES/NO questions as weak learners, combined through threshold-based voting to create an interpretable ensemble predictor.
Result: On a dataset of 9,892 founders, RRF achieved a 6.9x improvement over a random baseline (8x with added expert-crafted questions) and an F0.5 of 0.121 versus 0.086 for the best baseline (+41% relative).
Conclusion: RRF successfully combines LLM creativity with ensemble learning rigor to deliver high-precision, interpretable predictions suitable for high-stakes domains like venture capital.
Abstract: Predicting rare outcomes such as startup success is central to venture capital, demanding models that are both accurate and interpretable. We introduce Random Rule Forest (RRF), a lightweight ensemble method that uses a large language model (LLM) to generate simple YES/NO questions in natural language. Each question functions as a weak learner, and their responses are combined using a threshold-based voting rule to form a strong, interpretable predictor. Applied to a dataset of 9,892 founders, RRF achieves a 6.9x improvement over a random baseline on held-out data; adding expert-crafted questions lifts this to 8x and highlights the value of human-LLM collaboration. Compared with zero- and few-shot baselines across three LLM architectures, RRF attains an F0.5 of 0.121, versus 0.086 for the best baseline (+0.035 absolute, +41% relative). By combining the creativity of LLMs with the rigor of ensemble learning, RRF delivers interpretable, high-precision predictions suitable for decision-making in high-stakes domains.
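The voting rule and the reported F0.5 arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not give the exact voting rule, so "predict success when at least `threshold` questions answer YES" is a guess, and the three questions and the founder record are invented placeholders. The F-beta formula is the standard one, and the two F0.5 values are taken from the abstract.

```python
from typing import Callable, Dict, List

Question = Callable[[Dict[str, object]], bool]


def rrf_predict(founder: Dict[str, object],
                questions: List[Question],
                threshold: int) -> bool:
    """Assumed rule: positive when at least `threshold` questions answer YES."""
    yes_votes = sum(1 for q in questions if q(founder))
    return yes_votes >= threshold


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical YES/NO questions standing in for LLM-generated ones.
questions: List[Question] = [
    lambda f: int(f.get("prior_exits", 0)) > 0,
    lambda f: int(f.get("years_experience", 0)) >= 10,
    lambda f: bool(f.get("technical_degree", False)),
]
founder = {"prior_exits": 1, "years_experience": 12, "technical_degree": False}
prediction = rrf_predict(founder, questions, threshold=2)

# Reproduce the reported gap between RRF and the best baseline.
rrf_f05, baseline_f05 = 0.121, 0.086
absolute_gain = rrf_f05 - baseline_f05
relative_gain = absolute_gain / baseline_f05
```

Since every vote traces back to a readable question, the prediction can be explained by listing which questions answered YES, and F0.5 is a natural metric here because it rewards precision over recall.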
[305] Contra4: Evaluating Contrastive Cross-Modal Reasoning in Audio, Video, Image, and 3D
Artemis Panagopoulou, Le Xue, Honglu Zhou, Silvio Savarese, Ran Xu, Caiming Xiong, Chris Callison-Burch, Mark Yatskar, Juan Carlos Niebles
Main category: cs.AI
TL;DR: Contra4 is a new dataset for evaluating contrastive cross-modal reasoning across image, audio, video, and 3D modalities, showing current multimodal models struggle with selecting the most relevant modality for natural language queries.
Details
Motivation: Real-world decision-making requires identifying which modality contains the most relevant information for a query, but current multimodal models lack clear evaluation of their contrastive reasoning capabilities across modalities.
Method: Created Contra4 dataset with natural language questions and multiple candidate modality instances, using human-annotated captions with mixture-of-models round-trip-consistency filtering to ensure high-quality supervision.
Result: State-of-the-art models achieve only 56% accuracy overall and 42% in four-modality settings, despite task-specific fine-tuning improving performance by 56% relative to baseline.
Conclusion: Current multimodal models have significant limitations in contrastive cross-modal reasoning, highlighting a fundamental capability gap that needs to be addressed for effective real-world decision-making systems.
Abstract: Real-world decision-making often begins with identifying which modality contains the most relevant information for a given query. While recent multimodal models have made impressive progress in processing diverse inputs, it remains unclear whether they can reason contrastively across multiple modalities to select the one that best satisfies a natural language prompt. We argue this capability is foundational, especially in retrieval-augmented and decision-time contexts, where systems must evaluate multiple signals and identify which one conveys the relevant information. To evaluate this skill, we introduce Contra4, a dataset for contrastive cross-modal reasoning across four modalities: image, audio, video, and 3D. Each example presents a natural language question alongside multiple candidate modality instances, and the model must select the one that semantically aligns with the prompt. Contra4 combines human-annotated captions with a mixture-of-models round-trip-consistency filter to ensure high-quality supervision, resulting in 174k training examples and a manually verified test set of 2.3k samples. While task-specific fine-tuning improves performance by 56% relative to baseline, state-of-the-art models still achieve only 56% absolute accuracy overall and 42% in four-modality settings, underscoring a significant limitation in current multimodal models.
[306] Small Language Models are the Future of Agentic AI
Peter Belcak, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Celine Lin, Pavlo Molchanov
Main category: cs.AI
TL;DR: Small language models (SLMs) are more suitable and economical than large language models (LLMs) for repetitive specialized tasks in agentic AI systems, with heterogeneous systems handling general conversation needs.
Details
Motivation: The rise of agentic AI systems requires specialized, repetitive task performance where large language models are overkill and economically inefficient, creating a need for more targeted and cost-effective solutions.
Method: The paper presents arguments based on current SLM capabilities, common agentic system architectures, and deployment economics, plus proposes an LLM-to-SLM agent conversion algorithm.
Result: The analysis shows SLMs are sufficiently powerful and more economical for many agentic applications, with heterogeneous systems being optimal for scenarios requiring general conversational abilities.
Conclusion: Small language models represent the future of agentic AI due to their suitability, efficiency, and cost-effectiveness for specialized repetitive tasks, with significant operational and economic impacts expected from this shift.
Abstract: Large language models (LLMs) are often praised for exhibiting near-human performance on a wide range of tasks and valued for their ability to hold a general conversation. The rise of agentic AI systems is, however, ushering in a mass of applications in which language models perform a small number of specialized tasks repetitively and with little variation. Here we lay out the position that small language models (SLMs) are sufficiently powerful, inherently more suitable, and necessarily more economical for many invocations in agentic systems, and are therefore the future of agentic AI. Our argumentation is grounded in the current level of capabilities exhibited by SLMs, the common architectures of agentic systems, and the economy of LM deployment. We further argue that in situations where general-purpose conversational abilities are essential, heterogeneous agentic systems (i.e., agents invoking multiple different models) are the natural choice. We discuss the potential barriers for the adoption of SLMs in agentic systems and outline a general LLM-to-SLM agent conversion algorithm. Our position, formulated as a value statement, highlights the significance of the operational and economic impact even a partial shift from LLMs to SLMs is to have on the AI agent industry. We aim to stimulate the discussion on the effective use of AI resources and hope to advance the efforts to lower the costs of AI of the present day. Calling for both contributions to and critique of our position, we commit to publishing all such correspondence at https://research.nvidia.com/labs/lpr/slm-agents.
[307] TRiSM for Agentic AI: A Review of Trust, Risk, and Security Management in LLM-based Agentic Multi-Agent Systems
Shaina Raza, Ranjan Sapkota, Manoj Karkee, Christos Emmanouilidis
Main category: cs.AI
TL;DR: This paper presents a comprehensive review of Trust, Risk, and Security Management (TRiSM) for LLM-based Agentic Multi-Agent Systems, proposing a risk taxonomy, novel metrics for assessment, and strategies for responsible development.
Details
Motivation: Agentic AI systems built on LLMs are transforming enterprise and societal domains, but they introduce unique trust, risk, and security challenges that require specialized frameworks for safe and responsible deployment.
Method: The authors adapt and extend the AI TRiSM framework for Agentic AI, structured around explainability, ModelOps, security, privacy, and lifecycle governance. They propose a risk taxonomy and introduce two novel metrics: Component Synergy Score (CSS) and Tool Utilization Efficacy (TUE).
Result: The review provides a structured analysis framework, risk taxonomy capturing unique Agentic AI threats, and practical assessment metrics to evaluate inter-agent collaboration quality and tool use efficiency.
Conclusion: The paper concludes with a research roadmap for responsible Agentic AI development, emphasizing the need to align emerging systems with TRiSM principles to ensure safety, transparency, and accountability in operation.
Abstract: Agentic AI systems, built upon large language models (LLMs) and deployed in multi-agent configurations, are redefining intelligence, autonomy, collaboration, and decision-making across enterprise and societal domains. This review presents a structured analysis of Trust, Risk, and Security Management (TRiSM) in the context of LLM-based Agentic Multi-Agent Systems (AMAS). We begin by examining the conceptual foundations of Agentic AI and highlight its architectural distinctions from traditional AI agents. We then adapt and extend the AI TRiSM framework for Agentic AI, structured around key pillars: Explainability, ModelOps, Security, Privacy and their Lifecycle Governance, each contextualized to the challenges of AMAS. A risk taxonomy is proposed to capture the unique threats and vulnerabilities of Agentic AI, ranging from coordination failures to prompt-based adversarial manipulation. To support practical assessment in Agentic AI works, we introduce two novel metrics: the Component Synergy Score (CSS), which quantifies the quality of inter-agent collaboration, and the Tool Utilization Efficacy (TUE), which evaluates the efficiency of tool use within agent workflows. We further discuss strategies for improving explainability in Agentic AI, as well as approaches to enhancing security and privacy through encryption, adversarial robustness, and regulatory compliance. The review concludes with a research roadmap for the responsible development and deployment of Agentic AI, highlighting key directions to align emerging systems with TRiSM principles, ensuring safety, transparency, and accountability in their operation.
[308] Neuromorphic Computing with Multi-Frequency Oscillations: A Bio-Inspired Approach to Artificial Intelligence
Boheng Liu, Ziyu Li, Xia Wu
Main category: cs.AI
TL;DR: A brain-inspired tripartite architecture with specialized perceptual, auxiliary, and executive systems, enhanced by multi-frequency neural oscillations and synaptic adaptation, achieves superior performance with 2.18% accuracy improvement and 48.44% computation reduction compared to state-of-the-art temporal processing approaches.
Details
Motivation: Artificial neural networks lack flexible, generalizable intelligence due to divergence from biological cognition, specifically overlooking functional specialization of neural regions and temporal dynamics critical for coordinating specialized systems.
Method: Proposes a tripartite brain-inspired architecture with functionally specialized perceptual, auxiliary, and executive systems, integrated with temporal dynamics through multi-frequency neural oscillation simulation and synaptic dynamic adaptation mechanisms.
Result: Initial evaluations show superior performance: 2.18% accuracy improvement, 48.44% reduction in required computation iterations, and higher correlation with human confidence patterns compared to state-of-the-art temporal processing approaches.
Conclusion: The architecture establishes a theoretical foundation for brain-like intelligence across cognitive domains and potentially bridges the gap between artificial and biological intelligence, though currently demonstrated only on visual processing tasks.
Abstract: Despite remarkable capabilities, artificial neural networks exhibit limited flexible, generalizable intelligence. This limitation stems from their fundamental divergence from biological cognition that overlooks both neural regions’ functional specialization and the temporal dynamics critical for coordinating these specialized systems. We propose a tripartite brain-inspired architecture comprising functionally specialized perceptual, auxiliary, and executive systems. Moreover, the integration of temporal dynamics through the simulation of multi-frequency neural oscillation and synaptic dynamic adaptation mechanisms enhances the architecture, thereby enabling more flexible and efficient artificial cognition. Initial evaluations demonstrate superior performance compared to state-of-the-art temporal processing approaches, with 2.18% accuracy improvements while reducing required computation iterations by 48.44%, and achieving higher correlation with human confidence patterns. Though currently demonstrated on visual processing tasks, this architecture establishes a theoretical foundation for brain-like intelligence across cognitive domains, potentially bridging the gap between artificial and biological intelligence.
[309] Explaining Tournament Solutions with Minimal Supports
Clément Contet, Umberto Grandi, Jérôme Mengin
Main category: cs.AI
TL;DR: The paper studies certified explanations for tournament winners by identifying minimal sub-tournaments where a candidate is guaranteed to win regardless of how the rest of the tournament is completed.
Details
Motivation: To provide formal, certified explanations for why certain candidates win tournaments under various voting rules, addressing a central concept in explainable AI.
Method: The authors identify minimal supports (sub-tournaments) where a candidate is a necessary winner and analyze this for multiple tournament rules including top cycle, uncovered set, Copeland, Borda, maximin, and weighted uncovered set.
Result: For all rules except weighted uncovered set, polynomial-time algorithms are presented to compute minimal supports. For weighted uncovered set, the problem is NP-complete. The paper also determines the size of smallest minimal supports for each rule.
Conclusion: Minimal supports provide compact, certified, and intuitive explanations for tournament winners, making them valuable tools for formal explainable AI in tournament settings.
Abstract: Tournaments are widely used models to represent pairwise dominance between candidates, alternatives, or teams. We study the problem of providing certified explanations for why a candidate appears among the winners under various tournament rules. To this end, we identify minimal supports, minimal sub-tournaments in which the candidate is guaranteed to win regardless of how the rest of the tournament is completed (that is, the candidate is a necessary winner of the sub-tournament). This notion corresponds to an abductive explanation for the question, “Why does the winner win the tournament?”, a central concept in formal explainable AI. We focus on common tournament solutions: the top cycle, the uncovered set, the Copeland rule, the Borda rule, the maximin rule, and the weighted uncovered set. For each rule we determine the size of the smallest minimal supports, and we present polynomial-time algorithms to compute them for all but the weighted uncovered set, for which the problem is NP-complete. Finally, we show how minimal supports can serve to produce compact, certified, and intuitive explanations.
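The notion of a necessary winner can be illustrated with a small brute-force check in Python for the Copeland rule. This is not one of the paper's polynomial-time algorithms: it enumerates every completion of the undecided matches, so it is exponential and suitable only for tiny examples. The function names and the representation of a partial tournament as a set of `(winner, loser)` pairs are assumptions made for this sketch.

```python
from itertools import combinations, product
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (winner, loser)


def copeland_winners(candidates: List[str], beats: Iterable[Edge]) -> Set[str]:
    """Candidates with the maximum number of pairwise wins."""
    scores = {c: 0 for c in candidates}
    for winner, _loser in beats:
        scores[winner] += 1
    best = max(scores.values())
    return {c for c, s in scores.items() if s == best}


def is_necessary_copeland_winner(candidates: List[str],
                                 known: Set[Edge],
                                 target: str) -> bool:
    """True if `target` is a Copeland winner in every completion of `known`."""
    decided = {frozenset(edge) for edge in known}
    open_pairs = [p for p in combinations(candidates, 2)
                  if frozenset(p) not in decided]
    for choice in product((0, 1), repeat=len(open_pairs)):
        completion = set(known)
        for (a, b), flip in zip(open_pairs, choice):
            completion.add((b, a) if flip else (a, b))
        if target not in copeland_winners(candidates, completion):
            return False
    return True


candidates = ["a", "b", "c", "d"]
support = {("a", "b"), ("a", "c"), ("a", "d")}
supported = is_necessary_copeland_winner(candidates, support, "a")
# Dropping any single edge breaks the guarantee, so this support is minimal.
still_supported = is_necessary_copeland_winner(
    candidates, support - {("a", "d")}, "a")
```

In this toy case the three wins of `a` form an explanation of the kind the abstract describes: they alone certify that `a` wins, whatever the outcomes among `b`, `c`, and `d`.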
[310] Executable Ontologies: Synthesizing Event Semantics with Dataflow Architecture
Aleksandr Boldachev
Main category: cs.AI
TL;DR: Boldsea is an architecture for modeling complex dynamic systems using executable ontologies that integrate event semantics with dataflow to overcome limitations of traditional BPM systems and object-oriented semantic technologies.
Details
Motivation: To address the limitations of traditional Business Process Management systems and object-oriented semantic technologies by creating a more dynamic and flexible approach to modeling complex systems.
Method: Developed the BSL (boldsea Semantic Language) with formal BNF grammar and created the boldsea-engine architecture that directly interprets semantic models as executable algorithms without compilation.
Result: The approach enables runtime modification of event models, ensures temporal transparency, and seamlessly merges data and business logic within a unified semantic framework.
Conclusion: Boldsea provides an effective architecture for executable ontologies that can dynamically control process execution while overcoming traditional system limitations.
Abstract: This paper presents boldsea, Boldachev’s semantic-event approach – an architecture for modeling complex dynamic systems using executable ontologies – semantic models that act as dynamic structures, directly controlling process execution. We demonstrate that integrating event semantics with a dataflow architecture addresses the limitations of traditional Business Process Management (BPM) systems and object-oriented semantic technologies. The paper presents the formal BSL (boldsea Semantic Language), including its BNF grammar, and outlines the boldsea-engine’s architecture, which directly interprets semantic models as executable algorithms without compilation. It enables the modification of event models at runtime, ensures temporal transparency, and seamlessly merges data and business logic within a unified semantic framework.
[311] When Safe Unimodal Inputs Collide: Optimizing Reasoning Chains for Cross-Modal Safety in Multimodal Large Language Models
Wei Cai, Shujuan Liu, Jian Zhao, Ziyan Shi, Yusheng Zhao, Yuchen Yuan, Tianle Zhang, Chi Zhang, Xuelong Li
Main category: cs.AI
TL;DR: MLLMs have implicit reasoning risks where safe unimodal inputs combine into harmful multimodal outputs. The paper introduces SSUI dataset and SRPO training framework to align MLLM reasoning with safety values, achieving SOTA results.
Details
Motivation: MLLMs are vulnerable to implicit reasoning risks where innocuous individual inputs synergistically form harmful multimodal outputs, due to difficulty maintaining safety alignment through long-chain reasoning.
Method: Introduces Safe-Semantics-but-Unsafe-Interpretation (SSUI) dataset with interpretable reasoning paths, and Safety-aware Reasoning Path Optimization (SRPO) training framework to align MLLM reasoning with human safety values.
Result: SRPO-trained models achieve state-of-the-art results on safety benchmarks including the proposed Reasoning Path Benchmark (RSBench), significantly outperforming both open-source and top-tier commercial MLLMs.
Conclusion: The proposed SSUI dataset and SRPO framework effectively address implicit reasoning risks in MLLMs by aligning internal reasoning processes with safety values, demonstrating superior performance on safety benchmarks.
Abstract: Multimodal Large Language Models (MLLMs) are susceptible to the implicit reasoning risk, wherein innocuous unimodal inputs synergistically assemble into risky multimodal data that produce harmful outputs. We attribute this vulnerability to the difficulty of MLLMs maintaining safety alignment through long-chain reasoning. To address this issue, we introduce Safe-Semantics-but-Unsafe-Interpretation (SSUI), the first dataset featuring interpretable reasoning paths tailored for such a cross-modal challenge. A novel training framework, Safety-aware Reasoning Path Optimization (SRPO), is also designed based on the SSUI dataset to align the MLLM’s internal reasoning process with human safety values. Experimental results show that our SRPO-trained models achieve state-of-the-art results on key safety benchmarks, including the proposed Reasoning Path Benchmark (RSBench), significantly outperforming both open-source and top-tier commercial MLLMs.
cs.SD
[312] An Adaptive CMSA for Solving the Longest Filled Common Subsequence Problem with an Application in Audio Querying
Marko Djukanovic, Christian Blum, Aleksandar Kartelj, Ana Nikolikj, Guenther Raidl
Main category: cs.SD
TL;DR: This paper introduces an adaptive CMSA framework for solving the NP-hard Longest Filled Common Subsequence problem, achieving state-of-the-art performance on large instances and proposing novel applications in song identification.
Details
Motivation: Existing approaches for the LFCS problem have only been evaluated on small instances, lacking insights into scalability. The authors aim to address this gap by creating larger benchmarks and developing a scalable solution.
Method: An adaptive Construct, Merge, Solve, Adapt (CMSA) framework that iteratively generates subproblems via component-based construction and refines them using feedback from prior iterations, with subproblems solved using an external black-box solver.
Result: The proposed adaptive CMSA outperforms five leading methods, solving 1,486 out of 1,510 instances with known optimal solutions, achieving over 99.9% optimal solution quality and demonstrating exceptional scalability.
Conclusion: The adaptive CMSA framework provides state-of-the-art performance for LFCS problems, offers novel applications in song identification, and includes empirical explainability analysis to identify critical problem features affecting algorithm performance.
Abstract: This paper addresses the Longest Filled Common Subsequence (LFCS) problem, a challenging NP-hard problem with applications in bioinformatics, including gene mutation prediction and genomic data reconstruction. Existing approaches, including exact, metaheuristic, and approximation algorithms, have primarily been evaluated on small-sized instances, which offer limited insights into their scalability. In this work, we introduce a new benchmark dataset with significantly larger instances and demonstrate that existing datasets lack the discriminative power needed to meaningfully assess algorithm performance at scale. To solve large instances efficiently, we utilize an adaptive Construct, Merge, Solve, Adapt (CMSA) framework that iteratively generates promising subproblems via component-based construction and refines them using feedback from prior iterations. Subproblems are solved using an external black-box solver. Extensive experiments on both standard and newly introduced benchmarks show that the proposed adaptive CMSA achieves state-of-the-art performance, outperforming five leading methods. Notably, on 1,510 problem instances with known optimal solutions, our approach solves 1,486 of them – achieving over 99.9% optimal solution quality and demonstrating exceptional scalability. We additionally propose a novel application of LFCS for song identification from degraded audio excerpts as an engineering contribution, using real-world energy-profile instances from popular music. Finally, we conducted an empirical explainability analysis to identify critical feature combinations influencing algorithm performance, revealing the key problem features that contribute to the success or failure of the approaches across different instance types.
[313] A Traditional Approach to Symbolic Piano Continuation
Christian Zhou-Zheng, John Backsund, Dun Li Chan, Alex Coventry, Avid Eslami, Jyotin Goel, Xingwen Han, Danysh Soomro, Galen Wei
Main category: cs.SD
TL;DR: A simple next-token prediction approach for piano music continuation that aims to outperform large foundation models using better data and fundamentals rather than complex architectures.
Details
Motivation: To demonstrate that simpler approaches remain more effective than large foundation models for constrained, single-instrument music generation tasks like piano music continuation.
Method: Uses a simple, unaugmented next-token-prediction objective on tokenized raw MIDI data without sophisticated architectural modifications.
Result: The approach is submitted for the MIREX 2025 Symbolic Music Generation challenge, with model weights and code released publicly.
Conclusion: Simple traditional methods with better data and fundamentals can outperform complex large foundation models for specific constrained music generation tasks.
Abstract: We present a traditional approach to symbolic piano music continuation for the MIREX 2025 Symbolic Music Generation challenge. While computational music generation has recently focused on developing large foundation models with sophisticated architectural modifications, we argue that simpler approaches remain more effective for constrained, single-instrument tasks. We thus return to a simple, unaugmented next-token-prediction objective on tokenized raw MIDI, aiming to outperform large foundation models by using better data and better fundamentals. We release model weights and code at https://github.com/christianazinn/mirex2025.
[314] Omni-CLST: Error-aware Curriculum Learning with guided Selective chain-of-Thought for audio question answering
Jinghua Zhao, Hang Su, Lichun Fan, Zhenbo Luo, Jian Luan, Hui Wang, Haoqin Sun, Yong Qin
Main category: cs.SD
TL;DR: Omni-CLST is an error-aware curriculum learning framework with selective chain-of-thought for audio question answering that achieves state-of-the-art performance on benchmark datasets.
Details
Motivation: To improve audio question answering by efficiently leveraging existing high-quality datasets through difficulty-based organization and focused reasoning on challenging cases.
Method: Uses error-aware curriculum learning to organize samples by difficulty, guided thought dropout mechanism for focused reasoning on challenging cases, and integrates with GRPO training to learn more effectively from informative samples.
Result: Achieves 73.80% accuracy on MMAU-mini and 64.30% accuracy on MMAR, establishing new state-of-the-art performance on MMAR benchmark.
Conclusion: Omni-CLST demonstrates robust performance and strong generalization capabilities in multimodal audio-language understanding tasks.
Abstract: We propose Omni-CLST, an error-aware Curriculum Learning framework with guided Selective Chain-of-Thought for audio question answering. The framework efficiently leverages existing high-quality datasets through two key strategies: an error-aware curriculum that organizes samples by difficulty, and a guided thought dropout mechanism that focuses reasoning on challenging cases. Integrated with GRPO training, these strategies enable the model to learn more effectively from informative samples. Experiments on MMAU-mini and MMAR demonstrate that Omni-CLST achieves competitive accuracy (73.80% on MMAU-mini) and establishes a new state of the art (64.30% on MMAR), highlighting its robustness and generalization capability in multimodal audio-language understanding.
[315] More Similar than Dissimilar: Modeling Annotators for Cross-Corpus Speech Emotion Recognition
James Tavernor, Emily Mower Provost
Main category: cs.SD
TL;DR: Proposes using inter-annotator similarity from pre-trained models to identify similar annotators for new users, enabling low-cost personalization in speech emotion recognition without requiring extensive new training data.
Details
Motivation: Current speech emotion recognition systems predict consensus values but struggle with individual annotator predictions. Adapting models to new annotators requires substantial labeled data, which is impractical for real-world deployment.
Method: Leverage inter-annotator similarity by using a model pre-trained on large annotator population to identify similar previously seen annotators. Use limited enrollment data from new annotators to make predictions based on similar existing annotators.
Result: The approach significantly outperforms other off-the-shelf methods, demonstrating effective performance with minimal new data requirements.
Conclusion: This method enables extremely low-cost personalization and lightweight emotion adaptation, making it practical for real-world deployment in speech emotion recognition systems.
Abstract: Speech emotion recognition systems often predict a consensus value generated from the ratings of multiple annotators. However, these models have limited ability to predict the annotation of any one person. Alternatively, models can learn to predict the annotations of all annotators. Adapting such models to new annotators is difficult as new annotators must individually provide sufficient labeled training data. We propose to leverage inter-annotator similarity by using a model pre-trained on a large annotator population to identify a similar, previously seen annotator. Given a new, previously unseen, annotator and limited enrollment data, we can make predictions for a similar annotator, enabling off-the-shelf annotation of unseen data in target datasets, providing a mechanism for extremely low-cost personalization. We demonstrate our approach significantly outperforms other off-the-shelf approaches, paving the way for lightweight emotion adaptation, practical for real-world deployment.
[316] Osu2MIR: Beat Tracking Dataset Derived From Osu! Data
Ziyun Liu, Chris Donahue
Main category: cs.SD
TL;DR: Osu! rhythm game beatmaps provide reliable beat/downbeat annotations for underrepresented music genres, with a pipeline to extract high-quality data for MIR research.
Details
Motivation: To explore Osu! community-created beatmaps as an alternative source of diverse beat annotations, particularly for underrepresented music genres like anime, Vocaloid, and video game music.
Method: Developed a pipeline to extract annotations from Osu! beatmaps, manually analyzed timing point reliability, and partitioned data into meaningful subsets based on timing point spacing.
Result: Beatmaps with single timing points or widely spaced multiple timing points (≥5s apart) provide reliable annotations, while closely spaced timing points (<5s apart) require curation. High consistency observed across multiple annotations of same songs.
Conclusion: Osu! data represents a scalable, diverse, community-driven resource for MIR research, with released pipeline and high-quality subset osu2beat2025 to support further exploration.
Abstract: In this work, we explore the use of Osu!, a community-based rhythm game, as an alternative source of beat and downbeat annotations. Osu! beatmaps are created and refined by a large, diverse community and span underrepresented genres such as anime, Vocaloid, and video game music. We introduce a pipeline for extracting annotations from Osu! beatmaps and partition them into meaningful subsets. Through manual analysis, we find that beatmaps with a single timing point or widely spaced multiple timing points (>=5 seconds apart) provide reliable annotations, while closely spaced timing points (<5 seconds apart) often require additional curation. We also observe high consistency across multiple annotations of the same song. This study demonstrates the potential of Osu! data as a scalable, diverse, and community-driven resource for MIR research. We release our pipeline and a high-quality subset osu2beat2025 to support further exploration: https://github.com/ziyunliu4444/osu2mir.
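The partition described above can be sketched as a small classifier over a beatmap's timing points. This is an illustrative assumption, not the released pipeline: the function name, the input format (timing-point offsets in seconds), and the choice to classify by the smallest gap between consecutive timing points are all hypothetical; only the 5-second threshold comes from the abstract.

```python
def classify_beatmap(timing_points_s, min_gap_s=5.0):
    """Assign a beatmap to one of three subsets by timing-point spacing.

    timing_points_s: offsets (in seconds) of the beatmap's timing points.
    Returns 'single', 'widely_spaced', or 'closely_spaced'.
    """
    if not timing_points_s:
        raise ValueError("beatmap has no timing points")
    points = sorted(timing_points_s)
    if len(points) == 1:
        return "single"
    # Assumption: one gap below the threshold is enough to flag the map.
    smallest_gap = min(b - a for a, b in zip(points, points[1:]))
    if smallest_gap >= min_gap_s:
        return "widely_spaced"
    return "closely_spaced"


# Hypothetical examples: the first two would count as reliable annotations,
# the third would be set aside for additional curation.
labels = [
    classify_beatmap([0.5]),
    classify_beatmap([0.0, 5.0, 30.0]),
    classify_beatmap([0.0, 2.0, 30.0]),
]
```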
[317] Timbre-Adaptive Transcription: A Lightweight Architecture with Associative Memory for Dynamic Instrument Separation
Ruigang Li, Yongxu Zhu
Main category: cs.SD
TL;DR: Lightweight deep clustering framework for multi-timbre transcription with timbre-agnostic backbone and novel associative memory mechanism, achieving state-of-the-art performance with fewer parameters and minimal training data.
Details
Motivation: Existing multi-timbre transcription models struggle with generalization beyond pre-trained instruments and have rigid source-count constraints, limiting their practical application.
Method: Uses a timbre-agnostic backbone with half the parameters of comparable models, combined with a novel associative memory mechanism that mimics human auditory cognition through attention-based clustering for dynamic encoding of unseen timbres.
Result: Outperforms existing models on public benchmarks, demonstrates promising timbre discrimination, and achieves adaptive polyphonic separation with only 12.5 minutes of training data using a cost-effective synthetic dataset method.
Conclusion: Provides an efficient framework for timbre-related music transcription and explores new directions for timbre-aware separation through cognitive-inspired architectures, offering improved generalization and flexibility.
Abstract: Existing multi-timbre transcription models struggle with generalization beyond pre-trained instruments and rigid source-count constraints. We address these limitations with a lightweight deep clustering solution featuring: 1) a timbre-agnostic backbone achieving state-of-the-art performance with only half the parameters of comparable models, and 2) a novel associative memory mechanism that mimics human auditory cognition to dynamically encode unseen timbres via attention-based clustering. Our biologically-inspired framework enables adaptive polyphonic separation with minimal training data (12.5 minutes), supported by a new synthetic dataset method offering cost-effective, high-precision multi-timbre generation. Experiments show the timbre-agnostic transcription model outperforms existing models on public benchmarks, while the separation module demonstrates promising timbre discrimination. This work provides an efficient framework for timbre-related music transcription and explores new directions for timbre-aware separation through cognitive-inspired architectures.
[318] Beyond Bars: Distribution of Edit Operations in Historical Prints
Adrian Nachtwey, Fabian C. Moss, Anna Viktoria Katrin Plaksin
Main category: cs.SD
TL;DR: A method for comparative music corpus studies using bar sampling instead of full digitization, evaluated on Beethoven’s Bagatelles to find representative sampling approaches.
Details
Motivation: To reduce time-consuming digitization processes in musicology and enable large-scale analyses with statistically sound results for studying editorial practices.
Method: Three different sampling methods for selecting representative bars from musical sources, evaluated using Beethoven’s Bagatelles Op. 33 as a case study.
Result: Identified the most effective sampling method for finding representative samples that capture differences in musical sources.
Conclusion: This sampling approach offers significant value to musicological research and contributes to understanding 19th-century editorial practices and scholarly editing of historical musical works.
Abstract: In this paper, we present a method for conducting comparative corpus studies in musicology that reduces the time-consuming digitization process. Instead of encoding whole corpora of musical sources, we suggest sampling bars from these sources. We address the challenge of selecting representative samples and evaluate three different sampling methods. We used Beethoven’s Bagatelles Op. 33 as a case study to find the method that works best in finding samples representative with respect to differences. We believe that this approach offers significant value to musicological research by enabling large-scale analyses and thereby statistically sound results. Moreover, we believe our work to be a valuable step toward understanding nineteenth-century editorial practices and enriching the field of scholarly editing of historical musical works.
[319] A Lightweight Pipeline for Noisy Speech Voice Cloning and Accurate Lip Sync Synthesis
Javeria Amir, Farwa Attaria, Mah Jabeen, Umara Noor, Zahid Rashid
Main category: cs.SD
TL;DR: A modular pipeline combining Tortoise TTS for zero-shot voice cloning and lightweight GAN for real-time lip sync, enabling high-fidelity speech generation in noisy/low-resource environments.
Details
Motivation: Current voice cloning and talking head methods require large datasets and clean studio recordings, which are infeasible in noisy or low-resource environments.
Method: Modular pipeline using Tortoise TTS (transformer-based latent diffusion model) for zero-shot voice cloning with few samples, plus lightweight GAN architecture for robust real-time lip synchronization.
Result: Enables high-fidelity voice cloning and realistic lip sync in challenging environments with minimal training data requirements.
Conclusion: The solution reduces reliance on massive pre-training, supports emotionally expressive speech generation, and works in noisy/unconstrained scenarios with easy extensibility for future multimodal applications.
Abstract: Recent developments in voice cloning and talking head generation demonstrate impressive capabilities in synthesizing natural speech and realistic lip synchronization. Current methods are typically trained on large-scale datasets with computationally intensive processes and clean, studio-recorded inputs, which is infeasible in noisy or low-resource environments. In this paper, we introduce a new modular pipeline comprising Tortoise text-to-speech, a transformer-based latent diffusion model that can perform high-fidelity zero-shot voice cloning given only a few training samples. We use a lightweight generative adversarial network architecture for robust real-time lip synchronization. The solution reduces reliance on massive pre-training and supports the generation of emotionally expressive speech and lip synchronization in noisy and unconstrained scenarios. The modular structure of the pipeline allows easy extension to future multi-modal and text-guided voice modulation, and it could be used in real-world systems.
[320] Improving Anomalous Sound Detection with Attribute-aware Representation from Domain-adaptive Pre-training
Xin Fang, Guirui Zhong, Qing Wang, Fan Chu, Lei Wang, Mengui Qian, Mingqi Cai, Jiangzhao Wu, Jianqing Gao, Jun Du
Main category: cs.SD
TL;DR: Proposes agglomerative hierarchical clustering for pseudo-attribute label assignment using domain-adaptive pre-trained models to address missing machine attribute labels in anomalous sound detection, achieving state-of-the-art performance.
Details
Motivation: Addressing the laborious and impractical nature of exhaustive machine attribute label collection in anomalous sound detection, where typically only normal data is available for training.
Method: Uses agglomerative hierarchical clustering to assign pseudo-attribute labels from representations of a domain-adaptive pre-trained model, followed by supervised fine-tuning for machine attribute classification.
Result: Achieves new state-of-the-art performance on DCASE 2025 Challenge dataset, significantly outperforming previous top-ranking systems.
Conclusion: The proposed approach effectively handles missing attribute labels through pseudo-label assignment and model adaptation, demonstrating substantial performance improvements in anomalous sound detection.
Abstract: Anomalous Sound Detection (ASD) is often formulated as a machine attribute classification task, a strategy necessitated by the common scenario where only normal data is available for training. However, the exhaustive collection of machine attribute labels is laborious and impractical. To address the challenge of missing attribute labels, this paper proposes an agglomerative hierarchical clustering method for the assignment of pseudo-attribute labels using representations derived from a domain-adaptive pre-trained model, which are expected to capture machine attribute characteristics. We then apply model adaptation to this pre-trained model through supervised fine-tuning for machine attribute classification, resulting in a new state-of-the-art performance. Evaluation on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge dataset demonstrates that our proposed approach yields significant performance gains, ultimately outperforming our previous top-ranking system in the challenge.
[321] The CCF AATC 2025: Speech Restoration Challenge
Junan Zhang, Mengyao Zhu, Xin Xu, Hui Bu, Zhenhua Ling, Zhizheng Wu
Main category: cs.SD
TL;DR: The Speech Restoration Challenge 2025 addresses the need for robust speech enhancement algorithms that can handle multiple simultaneous distortions including acoustic degradations, signal-chain artifacts, and enhancement model artifacts.
Details
Motivation: Real-world speech communication suffers from multiple co-existing distortions that current single-target enhancement algorithms cannot effectively handle, creating a need for more comprehensive solutions.
Method: The challenge involves creating a comprehensive dataset with three degradation types: complex acoustic degradations (noise/reverberation), signal-chain artifacts (MP3 compression), and secondary artifacts from pre-processing models, with evaluation of both objective performance and model complexity.
Result: The paper introduces the challenge framework including task design, dataset creation methodology, and evaluation protocol, but does not present specific algorithm results as it’s a challenge announcement.
Conclusion: This challenge aims to advance speech restoration research by providing a standardized framework for developing algorithms that can handle realistic multi-distortion scenarios, with comprehensive evaluation metrics.
Abstract: Real-world speech communication is often hampered by a variety of distortions that degrade quality and intelligibility. While many speech enhancement algorithms target specific degradations like noise or reverberation, they often fall short in realistic scenarios where multiple distortions co-exist and interact. To spur research in this area, we introduce the Speech Restoration Challenge as part of the China Computer Federation (CCF) Advanced Audio Technology Competition (AATC) 2025. This challenge focuses on restoring speech signals affected by a composite of three degradation types: (1) complex acoustic degradations including non-stationary noise and reverberation; (2) signal-chain artifacts such as those from MP3 compression; and (3) secondary artifacts introduced by other pre-processing enhancement models. We describe the challenge’s background, the design of the task, the comprehensive dataset creation methodology, and the detailed evaluation protocol, which assesses both objective performance and model complexity. Homepage: https://ccf-aatc.org.cn/.
[322] GLAD: Global-Local Aware Dynamic Mixture-of-Experts for Multi-Talker ASR
Yujie Guo, Jiaming Zhou, Yuhang Jia, Shiwan Zhao, Yong Qin
Main category: cs.SD
TL;DR: GLAD Mixture-of-Experts dynamically fuses global speaker context and local acoustic features to improve multi-talker speech recognition, outperforming existing methods on challenging overlapping speech scenarios.
Details
Motivation: End-to-end multi-talker ASR struggles with accurately transcribing overlapping speech, especially under high-overlap conditions, requiring better speaker-aware modeling.
Method: Proposed Global-Local Aware Dynamic (GLAD) Mixture-of-Experts that dynamically fuses speaker-aware global information and fine-grained local features to guide expert selection for speaker-specific routing.
Result: Experiments on LibriSpeechMix show GLAD outperforms existing MTASR approaches, particularly in challenging multi-talker scenarios with high overlap.
Conclusion: This is the first work to apply Mixture-of-Experts to end-to-end MTASR with a global-local fusion strategy, demonstrating effective handling of overlapping speech through dynamic speaker-aware routing.
Abstract: End-to-end multi-talker automatic speech recognition (MTASR) faces significant challenges in accurately transcribing overlapping speech, especially under high-overlap conditions. To address these challenges, we propose Global-Local Aware Dynamic (GLAD) Mixture-of-Experts, which dynamically fuses speaker-aware global information and fine-grained local features to guide expert selection. This mechanism enables speaker-specific routing by leveraging both global context and local acoustic cues. Experiments on LibriSpeechMix show that GLAD outperforms existing MTASR approaches, particularly in challenging multi-talker scenarios. To the best of our knowledge, this is the first work to apply Mixture-of-Experts (MoE) to end-to-end MTASR with a global-local fusion strategy. Our code and training dataset can be found at https://github.com/NKU-HLT/GLAD.
[323] UTI-LLM: A Personalized Articulatory-Speech Therapy Assistance System Based on Multimodal Large Language Model
Yudong Yang, Xiaokang Liu, Shaofeng Zhao, Rongfeng Su, Nan Yan, Lan Wang
Main category: cs.SD
TL;DR: MLLM-based speech therapy system using ultrasound tongue imaging and speech signals for precise articulatory feedback, addressing limitations of traditional methods through multimodal fusion and a specialized dataset.
Details
Motivation: Traditional speech therapy systems lack real-time accessibility and articulatory motion feedback, while existing MLLMs face challenges with articulatory information fusion and domain-specific data scarcity for speech rehabilitation applications.
Method: Proposed system synergistically combines ultrasound tongue imaging and speech signals, constructs high-quality UTI-speech dialogue dataset for fine-tuning, and implements spatiotemporal fusion training strategy for fine-grained articulatory analysis.
Result: The method enables precise, interactive articulatory feedback through enhanced multimodal fusion of ultrasound videos and speech signals, facilitating detailed articulatory impairment analysis.
Conclusion: The MLLM-based approach with specialized dataset and spatiotemporal fusion strategy effectively addresses key limitations in speech therapy, providing actionable feedback for speech rehabilitation assistance.
Abstract: Speech therapy plays a critical role in treating speech disorders caused by neurological impairments such as stroke. However, traditional manual and computer-assisted systems are limited in real-time accessibility and articulatory motion feedback, constraining their practical utility. Recent advances in multimodal large language models (MLLMs) have demonstrated significant potential in healthcare, particularly through their ability to integrate multimodal data for adaptive assessment and therapeutic feedback. Nevertheless, challenges including insufficient acquisition and fusion of articulatory information, inadequate parsing of articulatory organ motion trajectories, and the scarcity of high-quality domain-specific datasets hinder the application of MLLMs in speech therapy. To address these limitations, we propose an MLLM-based speech rehabilitation assistance system that synergistically leverages ultrasound tongue imaging (UTI) and speech signals to deliver precise, interactive articulatory feedback. We construct a high-quality domain-specific dataset comprising UTI-speech dialogue pairs. This dataset facilitates fine-tuning to enhance the model’s clinical adaptability. Building on this dataset, our method employs a spatiotemporal fusion training strategy for ultrasound videos and speech signals, enabling fine-grained articulatory impairment analysis and ultimately generating actionable feedback.
[324] Can Large Audio Language Models Understand Audio Well? Speech, Scene and Events Understanding Benchmark for LALMs
Han Yin, Jung-Woo Choi
Main category: cs.SD
TL;DR: SSEU-Bench is a new audio understanding benchmark that addresses energy differences between speech/non-speech components and enables joint understanding of speech, scene, and events, with Chain-of-Thought improving LALM performance.
Details
Motivation: Existing benchmarks don't adequately address real-world audio characteristics like varying energy levels between speech/non-speech components or joint understanding of multiple audio elements within the same clip.
Method: Introduces SSEU-Bench benchmark with independent and joint understanding settings for speech, scene, and events, and uses Chain-of-Thought prompting to decompose complex tasks into simpler steps.
Result: Some LALMs underperform on joint understanding tasks, but Chain-of-Thought effectively improves their joint audio understanding performance.
Conclusion: SSEU-Bench provides a more comprehensive evaluation framework for audio understanding that better reflects real-world scenarios, and Chain-of-Thought is an effective method for enhancing LALM performance on complex joint tasks.
Abstract: Recently, Large Audio Language Models (LALMs) have progressed rapidly, demonstrating their strong efficacy in universal audio understanding through cross-modal integration. To evaluate the LALM’s audio understanding performance, researchers have proposed different benchmarks. However, key aspects for real-world interactions are underexplored in existing benchmarks, i.e., audio signals typically contain both speech and non-speech components, and energy levels of these components can vary significantly across different scenarios. Moreover, most benchmarks do not consider the joint understanding of speech, scene, and events within the same audio clip. In this work, we introduce SSEU-Bench, the first versatile audio understanding benchmark that explicitly accounts for energy differences between speech and non-speech audio, with both independent and joint understanding settings for speech, scene, and events. Furthermore, we demonstrate that some LALMs tend to underperform on certain tasks in a joint understanding setting. To address this issue, we introduce Chain-of-Thought, which effectively improves the LALM’s joint audio understanding performance by decomposing complex tasks into simpler reasoning steps.
[325] SwinSRGAN: Swin Transformer-based Generative Adversarial Network for High-Fidelity Speech Super-Resolution
Jiajun Yuan, Xiaochen Wang, Yuhang Xiao, Yulin Wu, Chenhao Hu, Xueyang Lv
Main category: cs.SD
TL;DR: SwinSRGAN is an end-to-end speech super-resolution framework using Swin Transformer-based U-Net that operates on MDCT magnitudes, achieving real-time 48kHz upsampling with improved objective metrics and cross-dataset generalization.
Details
Motivation: Existing speech SR systems suffer from representation mismatch in two-stage pipelines, CNN-only generators cause over-smoothing, and diffusion/flow models are computationally expensive with limited robustness across domains.
Method: Uses Swin Transformer-based U-Net on MDCT magnitudes with hybrid adversarial scheme combining time-domain MPD/MSD discriminators and multi-band MDCT discriminator. Includes sparse-aware regularizer on arcsinh-compressed MDCT to preserve transients.
Result: Reduces objective error and improves ABX preference scores on standard benchmarks. Outperforms NVSR and mdctGAN in zero-shot tests on HiFi-TTS without fine-tuning, demonstrating strong generalization.
Conclusion: SwinSRGAN provides an effective end-to-end solution for speech super-resolution that handles various sampling rates, operates in real-time, and shows robust performance across different datasets with strong generalization capabilities.
Abstract: Speech super-resolution (SR) reconstructs high-frequency content from low-resolution speech signals. Existing systems often suffer from representation mismatch in two-stage mel-vocoder pipelines and from over-smoothing of hallucinated high-band content by CNN-only generators. Diffusion and flow models are computationally expensive, and their robustness across domains and sampling rates remains limited. We propose SwinSRGAN, an end-to-end framework operating on Modified Discrete Cosine Transform (MDCT) magnitudes. It is a Swin Transformer-based U-Net that captures long-range spectro-temporal dependencies, trained with a hybrid adversarial scheme that combines time-domain MPD/MSD discriminators with a multi-band MDCT discriminator specialized for the high-frequency band. We employ a sparse-aware regularizer on arcsinh-compressed MDCT to better preserve transient components. The system upsamples inputs at various sampling rates to 48 kHz in a single pass and operates in real time. On standard benchmarks, SwinSRGAN reduces objective error and improves ABX preference scores. In zero-shot tests on HiFi-TTS without fine-tuning, it outperforms NVSR and mdctGAN, demonstrating strong generalization across datasets.
[326] Contrastive timbre representations for musical instrument and synthesizer retrieval
Gwendal Le Vaillant, Yannick Molle
Main category: cs.SD
TL;DR: Contrastive learning framework for musical instrument retrieval that works with both single- and multi-instrument audio, outperforming previous methods especially for multi-instrument mixtures.
Details
Motivation: Efficient retrieval of specific instrument timbres from audio mixtures is challenging in digital music production. Current methods have limitations in handling both single- and multi-instrument sounds effectively.
Method: Proposes a contrastive learning framework with techniques to generate realistic positive/negative sound pairs for virtual musical instruments (samplers and synthesizers), addressing limitations in common audio data augmentation methods.
Result: For single-instrument retrieval from 3,884 instruments, contrastive approaches are competitive with classification pre-training methods. For multi-instrument retrieval, the framework achieves 81.7% top-1 and 95.7% top-5 accuracies for three-instrument mixtures, outperforming related works.
Conclusion: The contrastive learning framework effectively enables direct querying of instrument databases using both single- and multi-instrument audio, with particularly strong performance on complex multi-instrument retrieval tasks.
Abstract: Efficiently retrieving specific instrument timbres from audio mixtures remains a challenge in digital music production. This paper introduces a contrastive learning framework for musical instrument retrieval, enabling direct querying of instrument databases using a single model for both single- and multi-instrument sounds. We propose techniques to generate realistic positive/negative pairs of sounds for virtual musical instruments, such as samplers and synthesizers, addressing limitations in common audio data augmentation methods. The first experiment focuses on instrument retrieval from a dataset of 3,884 instruments, using single-instrument audio as input. Contrastive approaches are competitive with previous works based on classification pre-training. The second experiment considers multi-instrument retrieval with a mixture of instruments as audio input. In this case, the proposed contrastive framework outperforms related works, achieving 81.7% top-1 and 95.7% top-5 accuracies for three-instrument mixtures.
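The top-1 and top-5 figures above are retrieval accuracies: a query counts as a hit when the correct database entry appears among the k most similar items. The following is a minimal sketch of that metric for the single-label case, assuming cosine similarity between embedding vectors. The function names and the tiny hand-made vectors are illustrative only and are not taken from the paper, which may score multi-instrument mixtures differently.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_accuracy(queries, database, true_idx, k):
    """Fraction of queries whose true database item is among the k most similar items."""
    hits = 0
    for query, target in zip(queries, true_idx):
        # Rank all database items by decreasing similarity to the query.
        ranked = sorted(range(len(database)),
                        key=lambda i: cosine(query, database[i]),
                        reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(queries)

# Hypothetical 3-dimensional "timbre embeddings" (toy values, not real data).
database = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
queries = [[0.9, 0.1, 0.0], [0.6, 0.5, 0.0], [0.1, 0.0, 0.8]]
true_idx = [0, 0, 2]

top1 = top_k_accuracy(queries, database, true_idx, k=1)
top2 = top_k_accuracy(queries, database, true_idx, k=2)
print(top1, top2)
```

In this toy setup the second query is closest to a different item, so it misses at k=1 but is recovered at k=2, which is why top-k accuracy can only grow as k increases.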
[327] TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition
Minh N. H. Nguyen, Anh Nguyen Tran, Dung Truong Dinh, Nam Van Vo
Main category: cs.SD
TL;DR: A novel Two-Stage Phoneme-Centric (TSPC) model for Vietnamese-English code-switching ASR that uses an extended Vietnamese phoneme set as intermediate representation, achieving 20.8% WER with reduced training resources.
Details
Motivation: Code-switching presents challenges for ASR systems, especially for Vietnamese-English pairs due to distinct phonological features and sound recognition ambiguity. Existing methods fail to capture subtle phonological shifts in CS scenarios.
Method: Two-Stage Phoneme-Centric (TSPC) architecture using phoneme-centric approach with extended Vietnamese phoneme set as intermediate representation for mixed-lingual modeling.
Result: TSPC outperforms existing baselines including PhoWhisper-base, achieving significantly lower word error rate of 20.8% with reduced training resources.
Conclusion: The phonetic-based two-stage architecture enables phoneme adaptation and language conversion, enhancing ASR performance in complex Vietnamese-English code-switching scenarios.
Abstract: Code-switching (CS) presents a significant challenge for general Automatic Speech Recognition (ASR) systems. Existing methods often fail to capture the subtle phonological shifts inherent in CS scenarios. The challenge is particularly difficult for language pairs like Vietnamese and English, where both distinct phonological features and the ambiguity arising from similar sound recognition are present. In this paper, we propose a novel architecture for Vietnamese-English CS ASR, a Two-Stage Phoneme-Centric model (TSPC). The TSPC employs a phoneme-centric approach, built upon an extended Vietnamese phoneme set as an intermediate representation to facilitate mixed-lingual modeling. Experimental results demonstrate that TSPC consistently outperforms existing baselines, including PhoWhisper-base, in Vietnamese-English CS ASR, achieving a significantly lower word error rate of 20.8% with reduced training resources. Furthermore, the phonetic-based two-stage architecture enables phoneme adaptation and language conversion to enhance ASR performance in complex CS Vietnamese-English ASR scenarios.
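Word error rate (WER), the metric reported above, is conventionally the word-level edit distance between hypothesis and reference divided by the number of reference words. A minimal sketch of that standard definition follows. The function name and the toy English strings are illustrative, and the paper's exact text normalization and scoring setup are not given in the summary.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dp[i][j] holds the edit distance between ref[:i] and hyp[:j].
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match or substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# Toy example: one substitution ("book" -> "look") and one deletion ("today").
wer = word_error_rate("please book a meeting room today",
                      "please look a meeting room")
print(round(wer, 3))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.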
cs.LG
[328] PowerGrow: Feasible Co-Growth of Structures and Dynamics for Power Grid Synthesis
Xinyu He, Chenhan Xiao, Haoran Li, Ruizhong Qiu, Zhe Xu, Yang Weng, Jingrui He, Hanghang Tong
Main category: cs.LG
TL;DR: PowerGrow is a co-generative framework that synthesizes realistic power grid test cases with topology, branch attributes, bus properties, and dynamic load profiles while maintaining physical feasibility and computational efficiency.
Details
Motivation: Modern power systems are becoming increasingly dynamic with changing topologies and time-varying loads, but publicly available test cases remain scarce due to security concerns and anonymization challenges, creating a need for generative tools.
Method: Uses dependence decomposition to factorize the complex joint distribution into conditional distributions over grid topologies, time-series bus loads, and system attributes. Implements hierarchical graph beta-diffusion for structural synthesis and temporal autoencoder for time-series data embedding.
Result: Outperforms prior diffusion models in fidelity and diversity, achieves 98.9% power flow convergence rate, and shows improved N-1 contingency resilience.
Conclusion: PowerGrow demonstrates ability to generate operationally valid and realistic power grid scenarios while significantly reducing computational overhead.
Abstract: Modern power systems are becoming increasingly dynamic, with changing topologies and time-varying loads driven by renewable energy variability, electric vehicle adoption, and active grid reconfiguration. Despite these changes, publicly available test cases remain scarce, due to security concerns and the significant effort required to anonymize real systems. Such limitations call for generative tools that can jointly synthesize grid structure and nodal dynamics. However, modeling the joint distribution of network topology, branch attributes, bus properties, and dynamic load profiles remains a major challenge, while preserving physical feasibility and avoiding prohibitive computational costs. We present PowerGrow, a co-generative framework that significantly reduces computational overhead while maintaining operational validity. The core idea is dependence decomposition: the complex joint distribution is factorized into a chain of conditional distributions over feasible grid topologies, time-series bus loads, and other system attributes, leveraging their mutual dependencies. By constraining the generation process at each stage, we implement a hierarchical graph beta-diffusion process for structural synthesis, paired with a temporal autoencoder that embeds time-series data into a compact latent space, improving both training stability and sample fidelity. Experiments across benchmark settings show that PowerGrow not only outperforms prior diffusion models in fidelity and diversity but also achieves a 98.9% power flow convergence rate and improved N-1 contingency resilience. This demonstrates its ability to generate operationally valid and realistic power grid scenarios.
[329] Scaling Up Data Parallelism in Decentralized Deep Learning
Bing Xie, Junqi Yin, Zhenyu Zhou, Sarp Oral, Feiyi Wang
Main category: cs.LG
TL;DR: DBench benchmarking framework reveals decentralized learning scalability issues and parameter variance sensitivity. Proposed Ada approach dynamically adapts communication graphs to achieve convergence rates comparable to centralized learning on 1008 GPUs.
Details
Motivation: Decentralized learning lacks production readiness due to stability, scalability, and generality issues in large-scale DNN training. This work aims to understand decentralized data parallel training at scale to enable production deployment.
Method: Introduced DBench benchmarking framework to compare centralized and decentralized training. Developed methodology to analyze correlations between model accuracy and parameter tensor variances across different communication graphs and training scales. Proposed Ada - a decentralized adaptive approach that dynamically adjusts communication graphs during SGD training.
Result: Found that: (1) Decentralized training has scalability/generality issues, as centralized training does; (2) Model accuracy correlates with number of connections in communication graph; (3) Accuracy is surprisingly sensitive to parameter tensor variances. Ada achieved best convergence rates and comparable accuracy to centralized learning, even on 1008 GPUs training ResNet50 for ImageNet-1K.
Conclusion: Decentralized learning can achieve production-level performance through adaptive communication graph optimization. Ada demonstrates that dynamic graph adaptation enables decentralized training to match centralized learning performance at large scales.
Abstract: Although it has been extensively explored in theory, decentralized learning is not yet green-lighted for production use, largely due to a lack of stability, scalability, and generality in large-scale DNN training. To shed light on the production use of decentralized learning, this work studies decentralized data parallel training at scale. To this end, we introduce a benchmarking framework, namely DBench, to host both centralized and decentralized DNN training. Building upon DBench, we introduce a benchmarking methodology to uncover the correlations between model accuracy and the variances of parameter tensors by varying communication graphs and training scales. Based on the benchmarking results, we observe that (1) similar to centralized learning, decentralized data parallel training also presents the issues of scalability and generality when the training scales up; (2) the model accuracy of decentralized learning is correlated to the number of connections in a communication graph; (3) the model accuracy of decentralized learning is surprisingly sensitive to the variance of parameter tensors across model replicas. Building upon these observations, we propose Ada, a decentralized adaptive approach that performs large-scale DNN training following a decentralized SGD method and adapts the communication graph in use dynamically throughout training iterations. We apply Ada to large-scale training and observe that Ada can obtain the best convergence rates consistently in decentralized DNN training, and deliver equally or comparably good model accuracy for all sample applications as centralized learning does, even when training ResNet50 for ImageNet-1K on the scale of 1008 GPUs.
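The link between the number of connections in the communication graph and the variance of parameters across replicas can be illustrated with plain gossip averaging. The sketch below is not Ada or DBench. It assumes one scalar "parameter" per replica, uniform neighbour averaging, and two hypothetical graphs (a ring and a complete graph), all chosen purely for illustration.

```python
def ring_graph(n):
    """Each replica talks to its two ring neighbours."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def complete_graph(n):
    """Each replica talks to every other replica."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}

def gossip_round(values, graph):
    """Every replica replaces its value with the mean of itself and its neighbours."""
    return [
        (values[i] + sum(values[j] for j in graph[i])) / (1 + len(graph[i]))
        for i in range(len(values))
    ]

def variance(values):
    """Population variance of the values held by the replicas."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

n = 8
initial = [float(i) for i in range(n)]  # toy starting parameters, one per replica
ring_vals, full_vals = initial, initial
for _ in range(3):
    ring_vals = gossip_round(ring_vals, ring_graph(n))
    full_vals = gossip_round(full_vals, complete_graph(n))

var_ring = variance(ring_vals)
var_full = variance(full_vals)
print(var_ring, var_full)
```

With the same number of rounds, the sparsely connected ring leaves residual disagreement between replicas while the fully connected graph removes it, at the price of far more communication per round. An adaptive scheme trades off between these extremes during training.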
[330] MEUV: Achieving Fine-Grained Capability Activation in Large Language Models via Mutually Exclusive Unlock Vectors
Xin Tong, Zhi Lin, Jingya Wang, Meng Han, Bo Jin
Main category: cs.LG
TL;DR: MEUV framework creates topic-specific unlock vectors to bypass LLM safety filters for legitimate security applications while minimizing cross-topic leakage.
Details
Motivation: Current LLM safety alignment blocks both malicious and legitimate requests in high-stakes domains like policing and defense, requiring fine-grained control over sensitive capabilities.
Method: Mutually Exclusive Unlock Vectors (MEUV) factorizes refusal direction into topic-aligned orthogonal vectors using multi-task objective with differential-ablation margin, cross-topic penalties, and orthogonality constraints.
Result: Achieves 87%+ attack success rate on Gemma-2-2B, LLaMA-3-8B, and Qwen-7B while reducing cross-topic leakage by up to 90% compared to single-direction baselines. Shows language-agnostic transfer between Chinese and English.
Conclusion: Fine-grained topic-level capability activation is feasible with minimal utility loss, enabling controlled LLM deployment in security-sensitive domains.
Abstract: Large language models (LLMs) enforce safety alignment to reliably refuse malicious requests, yet the same blanket safeguards also block legitimate uses in policing, defense, and other high-stakes settings. Earlier “refusal-direction” edits can bypass those layers, but they rely on a single vector that indiscriminately unlocks all hazardous topics, offering no semantic control. We introduce Mutually Exclusive Unlock Vectors (MEUV), a lightweight framework that factorizes the monolithic refusal direction into topic-aligned, nearly orthogonal vectors, each dedicated to one sensitive capability. MEUV is learned in a single epoch with a multi-task objective that blends a differential-ablation margin, cross-topic and orthogonality penalties, and several auxiliary terms. On bilingual malicious-prompt benchmarks, MEUV achieves an attack success rate of no less than 87% on Gemma-2-2B, LLaMA-3-8B, and Qwen-7B, yet cuts cross-topic leakage by up to 90% compared with the best single-direction baseline. Vectors trained in Chinese transfer almost unchanged to English (and vice versa), suggesting a language-agnostic refusal subspace. The results show that fine-grained, topic-level capability activation is achievable with minimal utility loss, paving the way for controlled LLM deployment in security-sensitive domains.
[331] Accelerating Privacy-Preserving Federated Learning in Large-Scale LEO Satellite Systems
Binquan Guo, Junteng Cao, Marie Siew, Binbin Chen, Tony Q. S. Quek, Zhu Han
Main category: cs.LG
TL;DR: Proposes a dynamic scheduling framework for federated learning over satellite networks that reduces training round times by 14-42% compared to traditional methods.
Details
Motivation: Large-scale LEO satellite systems enable AI model training across distributed regions, but privacy constraints prevent raw data aggregation. Federated learning preserves privacy but faces challenges with satellite network dynamics and limited bandwidth.
Method: Discrete temporal graph-based on-demand scheduling framework that dynamically allocates communication resources to accelerate parameter exchange in federated learning.
Result: Simulation results show 14.20% to 41.48% reduction in overall training round times, with greater acceleration for larger models and more clients.
Conclusion: The proposed scheduling approach effectively addresses satellite network bottlenecks and scales well for large-scale federated learning applications.
Abstract: Large-scale low-Earth-orbit (LEO) satellite systems are increasingly valued for their ability to enable rapid and wide-area data exchange, thereby facilitating the collaborative training of artificial intelligence (AI) models across geographically distributed regions. Due to privacy concerns and regulatory constraints, raw data collected at remote clients cannot be centrally aggregated, posing a major obstacle to traditional AI training methods. Federated learning offers a privacy-preserving alternative by training local models on distributed devices and exchanging only model parameters. However, the dynamic topology and limited bandwidth of satellite systems will hinder timely parameter aggregation and distribution, resulting in prolonged training times. To address this challenge, we investigate the problem of scheduling federated learning over satellite networks and identify key bottlenecks that impact the overall duration of each training round. We propose a discrete temporal graph-based on-demand scheduling framework that dynamically allocates communication resources to accelerate federated learning. Simulation results demonstrate that the proposed approach achieves significant performance gains over traditional statistical multiplexing-based model exchange strategies, reducing overall round times by 14.20% to 41.48%. Moreover, the acceleration effect becomes more pronounced for larger models and higher numbers of clients, highlighting the scalability of the proposed approach.
[332] TripOptimizer: Generative 3D Shape Optimization and Drag Prediction using Triplane VAE Networks
Parsa Vatani, Mohamed Elrefaie, Farhad Nazarpour, Faez Ahmed
Main category: cs.LG
TL;DR: TripOptimizer is a differentiable deep learning framework using triplane-based VAE for aerodynamic shape optimization from point cloud data, achieving up to 11.8% drag reduction while being robust to geometric imperfections.
Details
Motivation: Traditional CFD-based aerodynamic shape optimization is computationally expensive and restricts design space exploration, especially with non-watertight meshes that challenge adjoint-based methods.
Method: Uses Variational Autoencoder with triplane-based implicit neural representation for 3D geometry reconstruction and drag prediction, trained on DrivAerNet++ dataset (8,000 vehicle geometries with RANS-computed drag coefficients). Optimization modifies encoder parameters to steer geometry towards target drag.
Result: Achieved drag coefficient reductions up to 11.8% in optimized designs, validated by independent high-fidelity CFD simulations with 150+ million cells. Framework handles non-watertight meshes robustly.
Conclusion: TripOptimizer enables agile aerodynamic optimization workflow, reducing reliance on computationally intensive CFD simulations, particularly valuable during early design stages.
Abstract: The computational cost of traditional Computational Fluid Dynamics-based Aerodynamic Shape Optimization severely restricts design space exploration. This paper introduces TripOptimizer, a fully differentiable deep learning framework for rapid aerodynamic analysis and shape optimization directly from vehicle point cloud data. TripOptimizer employs a Variational Autoencoder featuring a triplane-based implicit neural representation for high-fidelity 3D geometry reconstruction and a drag coefficient prediction head. Trained on DrivAerNet++, a large-scale dataset of 8,000 unique vehicle geometries with corresponding drag coefficients computed via Reynolds-Averaged Navier-Stokes simulations, the model learns a latent representation that encodes aerodynamically salient geometric features. We propose an optimization strategy that modifies a subset of the encoder parameters to steer an initial geometry towards a target drag value, and demonstrate its efficacy in case studies where optimized designs achieved drag coefficient reductions up to 11.8%. These results were subsequently validated by using independent, high-fidelity Computational Fluid Dynamics simulations with more than 150 million cells. A key advantage of the implicit representation is its inherent robustness to geometric imperfections, enabling optimization of non-watertight meshes, a significant challenge for traditional adjoint-based methods. The framework enables a more agile Aerodynamic Shape Optimization workflow, reducing reliance on computationally intensive CFD simulations, especially during early design stages.
[333] A Physics-Informed Neural Networks-Based Model Predictive Control Framework for $SIR$ Epidemics
Aiping Zhong, Baike She, Philip E. Paré
Main category: cs.LG
TL;DR: A PINN-based MPC framework for SIR epidemic models that jointly estimates states and parameters in real-time using only noisy infected state data, with novel algorithms for different knowledge assumptions.
Details
Motivation: Existing MPC approaches for epidemic control either assume measurable states with learned parameters or known parameters with learned states, but not joint estimation of both from limited noisy data.
Method: Proposes MPC-PINNs for SIR models with control, MPC-LS-PINNs with log-scaled loss for noise robustness, and MPC-SI-PINNs using integral operators and state coupling. Extends framework for different knowledge assumptions about recovery rate or basic reproduction number.
Result: The proposed methods effectively reconstruct complete epidemic state information and simultaneously estimate states and parameters while generating optimal control strategies.
Conclusion: The framework demonstrates effectiveness in joint real-time estimation of epidemic states and parameters under different knowledge assumptions using only noisy infected state measurements.
Abstract: This work introduces a physics-informed neural networks (PINNs)-based model predictive control (MPC) framework for susceptible-infected-recovered ($SIR$) spreading models. Existing studies in MPC design for epidemic control often assume either 1) measurable states of the dynamics, where the parameters are learned, or 2) known parameters of the model, where the states are learned. In this work, we address the joint real-time estimation of states and parameters within the MPC framework using only noisy infected states, under the assumption that 1) only the recovery rate is known, or 2) only the basic reproduction number is known. Under the first assumption, we propose MPC-PINNs and two novel PINNs algorithms, all of which are integrated into the MPC framework. First, we introduce MPC-PINNs, which are designed for $SIR$ models with control. We then propose log-scaled PINNs (MPC-LS-PINNs), which incorporate a log-scaled loss function to improve robustness against noise. Next, we present split-integral PINNs (MPC-SI-PINNs), which leverage integral operators and state coupling in the neural network training process to effectively reconstruct the complete epidemic state information. Building upon these methods, we further extend our framework for the second assumption. We establish the necessary conditions and extend our PINNs algorithms, where MPC-SI-PINNs are simplified as split-PINNs (MPC-S-PINNs). By incorporating these algorithms into the MPC framework, we simultaneously estimate the epidemic states and parameters while generating optimal control strategies. Experimental results demonstrate the effectiveness of the proposed methods under different settings.
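For orientation, the uncontrolled $SIR$ dynamics that such a framework builds on can be written as dS/dt = -beta*S*I, dI/dt = beta*S*I - gamma*I, dR/dt = gamma*I, with basic reproduction number beta/gamma. The sketch below integrates these equations with a forward Euler step. It is a plain simulation, not the PINN or MPC method from the paper, it omits the control input, and all numerical values (recovery rate, reproduction number, initial fractions, step size) are assumptions chosen for illustration.

```python
def simulate_sir(beta, gamma, s0, i0, rec0, dt, steps):
    """Forward-Euler integration of the normalized SIR model (fractions sum to 1)."""
    s, i, r = s0, i0, rec0
    trajectory = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i * dt   # flow from S to I during dt
        new_recoveries = gamma * i * dt      # flow from I to R during dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        trajectory.append((s, i, r))
    return trajectory

# Hypothetical parameters: if the recovery rate and the basic reproduction
# number are both fixed, the transmission rate follows from beta = R0 * gamma.
gamma = 0.1
basic_reproduction_number = 2.5
beta = basic_reproduction_number * gamma

trajectory = simulate_sir(beta, gamma, s0=0.99, i0=0.01, rec0=0.0, dt=0.1, steps=1000)
peak_infected = max(i for _, i, _ in trajectory)
final_s, final_i, final_r = trajectory[-1]
print(peak_infected, final_s)
```

The relation beta = R0 * gamma shows why knowing only one of the recovery rate or the basic reproduction number leaves one parameter to be estimated from data, which is the setting the abstract describes.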
[334] Flexible Multimodal Neuroimaging Fusion for Alzheimer’s Disease Progression Prediction
Benjamin Burns, Yuan Xue, Douglas W. Scharre, Xia Ning
Main category: cs.LG
TL;DR: PerM-MoE is a novel multimodal method that uses independent routers for each modality to improve Alzheimer’s disease progression prediction when many modalities are missing during inference.
Details
Motivation: Existing multimodal models fail to make accurate predictions when many modalities are missing during inference, which is common in clinical settings for Alzheimer's disease progression prediction.
Method: PerM-MoE uses independent routers for each modality instead of a single router, creating a sparse mixture-of-experts approach that handles high modality missingness more effectively.
Result: PerM-MoE outperforms state-of-the-art Flex-MoE and unimodal models in most variations of modality missingness and demonstrates more effective utility of experts.
Conclusion: The independent router approach in PerM-MoE increases multimodal model flexibility and improves prediction accuracy under high modality missingness conditions common in clinical practice.
Abstract: Alzheimer’s disease (AD) is a progressive neurodegenerative disease with high inter-patient variance in rate of cognitive decline. AD progression prediction aims to forecast patient cognitive decline and benefits from incorporating multiple neuroimaging modalities. However, existing multimodal models fail to make accurate predictions when many modalities are missing during inference, as is often the case in clinical settings. To increase multimodal model flexibility under high modality missingness, we introduce PerM-MoE, a novel sparse mixture-of-experts method that uses independent routers for each modality in place of the conventional, single router. Using T1-weighted MRI, FLAIR, amyloid beta PET, and tau PET neuroimaging data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), we evaluate PerM-MoE, state-of-the-art Flex-MoE, and unimodal neuroimaging models on predicting two-year change in Clinical Dementia Rating-Sum of Boxes (CDR-SB) scores under varying levels of modality missingness. PerM-MoE outperforms the state of the art in most variations of modality missingness and demonstrates more effective utility of experts than Flex-MoE.
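The idea of giving each modality its own router can be sketched in a few lines. This is not the PerM-MoE architecture: the linear routers, the toy experts, the top-k gating, and the averaging over available modalities are all assumptions made for illustration, and the names `route` and `per_modality_moe` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(features, router_weights, top_k):
    """Return {expert_index: gate} for the top_k experts selected by one router."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in router_weights]
    probs = softmax(scores)
    chosen = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:top_k]
    norm = sum(probs[e] for e in chosen)
    return {e: probs[e] / norm for e in chosen}

def per_modality_moe(sample, routers, experts, top_k=1):
    """Combine expert outputs using a separate router for each available modality."""
    outputs = []
    for modality, features in sample.items():
        if features is None:  # missing modality: its router is simply skipped
            continue
        gates = route(features, routers[modality], top_k)
        outputs.append(sum(g * experts[e](features) for e, g in gates.items()))
    if not outputs:
        raise ValueError("at least one modality must be present")
    return sum(outputs) / len(outputs)

# Toy experts and hand-set router weights (one weight row per expert).
experts = [lambda x: sum(x), lambda x: max(x), lambda x: min(x)]
routers = {
    "mri": [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
    "pet": [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]],
}

partial = per_modality_moe({"mri": [2.0, 1.0], "pet": None}, routers, experts)
full = per_modality_moe({"mri": [2.0, 1.0], "pet": [1.0, 5.0]}, routers, experts)
print(partial, full)
```

Because each router only ever sees its own modality, dropping one modality leaves the routing decisions for the remaining ones unchanged, which is one intuition for why independent routers might cope better with missing inputs than a single shared router.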
[335] Learning to Route: Per-Sample Adaptive Routing for Multimodal Multitask Prediction
Marzieh Ajirak, Oded Bein, Ellen Rose Bowen, Dora Kanellopoulos, Avital Falk, Faith M. Gunning, Nili Solomonov, Logan Grosenick
Main category: cs.LG
TL;DR: A unified framework for adaptive routing in multitask, multimodal prediction that dynamically selects modality processing pathways and task-sharing strategies per sample, particularly for psychotherapy applications with heterogeneous data.
Details
Motivation: Address data heterogeneity and task interactions in psychotherapy settings where structured assessments and unstructured clinician notes coexist with missing data and correlated outcomes like depression and anxiety.
Method: Routing-based architecture with multiple modality paths (raw and fused text/numeric representations) that learns to route inputs through optimal expert combinations. Task-specific predictions use shared or independent heads based on routing decisions, trained end-to-end.
Result: Outperforms fixed multitask and single-task baselines on both synthetic data and real-world psychotherapy notes. Learned routing provides interpretable insights into modality relevance and task structure.
Conclusion: Enables per-subject adaptive information processing for personalized healthcare, addressing data heterogeneity and task correlations. Could improve mental health outcomes, treatment precision, and cost-effectiveness through personalized interventions.
Abstract: We propose a unified framework for adaptive routing in multitask, multimodal prediction settings where data heterogeneity and task interactions vary across samples. Motivated by applications in psychotherapy where structured assessments and unstructured clinician notes coexist with partially missing data and correlated outcomes, we introduce a routing-based architecture that dynamically selects modality processing pathways and task-sharing strategies on a per-sample basis. Our model defines multiple modality paths, including raw and fused representations of text and numeric features, and learns to route each input through the most informative expert combination. Task-specific predictions are produced by shared or independent heads depending on the routing decision, and the entire system is trained end-to-end. We evaluate the model on both synthetic data and real-world psychotherapy notes predicting depression and anxiety outcomes. Our experiments show that our method consistently outperforms fixed multitask or single-task baselines, and that the learned routing policy provides interpretable insights into modality relevance and task structure. This addresses critical challenges in personalized healthcare by enabling per-subject adaptive information processing that accounts for data heterogeneity and task correlations. Applied to psychotherapy, this framework could improve mental health outcomes, enhance treatment assignment precision, and increase clinical cost-effectiveness through personalized intervention strategies.
[336] Neural Diffeomorphic-Neural Operator for Residual Stress-Induced Deformation Prediction
Changqing Liu, Kaining Dai, Zhiwei Zhao, Tianyi Wu, Yingguang Li
Main category: cs.LG
TL;DR: A novel neural diffeomorphic-neural operator (NDNO) framework is proposed to efficiently predict machining deformation in structural components with varying geometries by mapping complex 3D shapes to a common reference domain.
Details
Motivation: Accurate prediction of machining deformation is crucial for dimensional precision, but conventional numerical methods are computationally expensive for diverse geometries. Neural operators show promise but face limitations when applied across changing geometric domains.
Method: A diffeomorphic neural network maps complex 3D geometries to a common reference domain with smoothness and invertibility constraints. A neural operator is then trained on this reference domain to learn deformation fields induced by residual stresses.
Result: The method achieves high accuracy and efficiency in predicting both main-direction and multi-direction deformation fields across parts with diverse geometries, including different component types, dimensions, and features.
Conclusion: The NDNO framework provides an effective and computationally efficient solution for deformation prediction in structural components with varying geometries, enabling rapid adaptation to different shapes while maintaining accuracy.
Abstract: Accurate prediction of machining deformation in structural components is essential for ensuring dimensional precision and reliability. Such deformation often originates from residual stress fields, whose distribution and influence vary significantly with geometric complexity. Conventional numerical methods for modeling the coupling between residual stresses and deformation are computationally expensive, particularly when diverse geometries are considered. Neural operators have recently emerged as a powerful paradigm for efficiently solving partial differential equations, offering notable advantages in accelerating residual stress-deformation analysis. However, their direct application across changing geometric domains faces theoretical and practical limitations. To address this challenge, a novel framework based on diffeomorphic embedding neural operators named neural diffeomorphic-neural operator (NDNO) is introduced. Complex three-dimensional geometries are explicitly mapped to a common reference domain through a diffeomorphic neural network constrained by smoothness and invertibility. The neural operator is then trained on this reference domain, enabling efficient learning of deformation fields induced by residual stresses. Once trained, both the diffeomorphic neural network and the neural operator demonstrate efficient prediction capabilities, allowing rapid adaptation to varying geometries. The proposed method thus provides an effective and computationally efficient solution for deformation prediction in structural components subject to varying geometries. The proposed method is validated to predict both main-direction and multi-direction deformation fields, achieving high accuracy and efficiency across parts with diverse geometries including component types, dimensions and features.
[337] Profiling LoRA/QLoRA Fine-Tuning Efficiency on Consumer GPUs: An RTX 4060 Case Study
MSR Avinash
Main category: cs.LG
TL;DR: Systematic profiling study shows LoRA/QLoRA fine-tuning of the Qwen2.5-1.5B-Instruct model is feasible on an 8 GB RTX 4060 GPU, with paged optimizers boosting throughput by up to 25% and sequence lengths up to 2048 tokens achievable.
Details
Motivation: To explore the efficiency of parameter-efficient fine-tuning techniques (LoRA/QLoRA) on consumer-grade GPUs with strict 8 GB VRAM limits, which remains underexplored despite the popularity of these methods.
Method: Controlled profiling study using Qwen2.5-1.5B-Instruct model on NVIDIA RTX 4060, systematically varying batch size, sequence length, optimizer choice (AdamW vs PagedAdamW), and precision (fp16 vs bf16), measuring throughput, time efficiency, VRAM usage, and energy consumption.
Result: Paged optimizers improved throughput by up to 25% (628 tokens/s vs 500 tokens/s baseline), bf16 degraded efficiency compared to fp16, and sequence lengths up to 2048 tokens were feasible despite 8GB constraints.
Conclusion: This first systematic case study demonstrates that efficient LLM fine-tuning is achievable on consumer GPUs, providing reproducible benchmarks and practical guidelines for resource-constrained researchers and practitioners.
Abstract: Fine-tuning large language models (LLMs) with parameter-efficient techniques such as LoRA and QLoRA has enabled adaptation of foundation models on modest hardware. Yet the efficiency of such training on consumer-grade GPUs, especially under strict 8 GB VRAM limits, remains underexplored. We present a controlled profiling study of LoRA/QLoRA fine-tuning using the Qwen2.5-1.5B-Instruct model on a single NVIDIA RTX 4060. Across three representative configurations, we systematically vary batch size, sequence length, optimizer choice (AdamW vs. PagedAdamW), and precision (fp16 vs. bf16). We report throughput (tokens/s), time per 10k tokens, and VRAM footprint, alongside energy estimates derived from GPU board power limits. Our results show that paged optimizers improve throughput by up to 25% (628 tok/s vs. 500 tok/s baseline), while bf16 degrades efficiency relative to fp16. Despite 8 GB constraints, sequence lengths up to 2048 tokens were feasible using parameter-efficient strategies. To our knowledge, this is the first systematic case study of LLM fine-tuning efficiency on consumer GPUs, providing reproducible benchmarks and practical guidelines for resource-constrained researchers and practitioners.
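The throughput figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below recomputes the relative gain and the "time per 10k tokens" metric from the two throughput numbers in the abstract. The energy helper follows the abstract's description of estimating energy from a board power limit, but the 100 W value is a hypothetical placeholder and not a figure from the paper.

```python
def speedup_percent(new_tok_s: float, base_tok_s: float) -> float:
    """Relative throughput gain of `new_tok_s` over `base_tok_s`, in percent."""
    return (new_tok_s / base_tok_s - 1.0) * 100.0


def seconds_per_10k_tokens(tok_s: float) -> float:
    """Wall-clock seconds needed to process 10,000 tokens at a given throughput."""
    return 10_000.0 / tok_s


def energy_joules(power_watts: float, seconds: float) -> float:
    """Upper-bound energy estimate: assumes the GPU draws `power_watts` throughout."""
    return power_watts * seconds


# Throughput values reported in the abstract (tokens per second).
paged_tok_s, baseline_tok_s = 628.0, 500.0

gain = speedup_percent(paged_tok_s, baseline_tok_s)
t_paged = seconds_per_10k_tokens(paged_tok_s)
t_base = seconds_per_10k_tokens(baseline_tok_s)

# Hypothetical board power limit, used only to illustrate the calculation.
ASSUMED_POWER_W = 100.0
e_paged = energy_joules(ASSUMED_POWER_W, t_paged)
e_base = energy_joules(ASSUMED_POWER_W, t_base)

print(f"throughput gain: {gain:.1f}%")
print(f"time per 10k tokens: {t_paged:.2f} s (paged) vs {t_base:.2f} s (baseline)")
print(f"energy per 10k tokens at {ASSUMED_POWER_W:.0f} W: {e_paged:.0f} J vs {e_base:.0f} J")
```

Because the energy estimate is power multiplied by time, any throughput gain translates directly into a proportional energy saving under a fixed power assumption.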
[338] RL Fine-Tuning Heals OOD Forgetting in SFT
Hangzhan Jin, Sitao Luan, Sicheng Lyu, Guillaume Rabusseau, Reihaneh Rabbany, Doina Precup, Mohammad Hamdaqa
Main category: cs.LG
TL;DR: The study challenges the ‘SFT memorizes, RL generalizes’ claim, showing SFT causes OOD forgetting that peaks early, and RL acts as OOD restoration rather than creating new generalization. The key mechanism is rotation of singular vectors, not singular value changes.
Details
Motivation: To understand the evolution and mechanisms behind the synergy of SFT and RL in two-stage fine-tuning, as the common belief about their roles is oversimplified and inconclusive.
Method: Used SVD analysis on parameter matrices, manually edited them, and observed impacts on model performance to uncover the underlying mechanisms of forgetting and restoration processes.
Result: Found that OOD performance peaks early in SFT then declines (forgetting), RL restores lost ability rather than creating new generalization, and the key mechanism is rotation of singular vectors (not singular value changes).
Conclusion: The findings re-identify SFT and RL roles in two-stage fine-tuning and discover singular vector rotation as the key mechanism, with RL having recovery boundaries depending on SFT training duration.
Abstract: The two-stage fine-tuning paradigm of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has empirically shown better reasoning performance than one-stage SFT for the post-training of Large Language Models (LLMs). However, the evolution and mechanism behind the synergy of SFT and RL are still under-explored and inconclusive. In our study, we find the well-known claim “SFT memorizes, RL generalizes” is over-simplified, and discover that: (1) OOD performance peaks at the early stage of SFT and then declines (OOD forgetting), and the best SFT checkpoint cannot be captured by training/test loss; (2) the subsequent RL stage does not generate fundamentally better OOD capability; instead, it plays an OOD restoration role, recovering the reasoning ability lost during SFT; (3) the recovery ability has boundaries, i.e., if SFT trains for too short or too long, RL cannot recover the lost OOD ability; (4) to uncover the underlying mechanisms behind the forgetting and restoration process, we employ SVD analysis on parameter matrices, manually edit them, and observe their impacts on model performance. Unlike the common belief that the shift of model capacity mainly results from the changes of singular values, we find that they are actually quite stable throughout fine-tuning. Instead, the OOD behavior strongly correlates with the rotation of singular vectors. Our findings re-identify the roles of SFT and RL in the two-stage fine-tuning and discover the rotation of singular vectors as the key mechanism. Reversing the rotations induced by SFT shows recovery from forgetting, whereas imposing the SFT parameter directions onto an RL-tuned model results in performance degradation. Code is available at https://github.com/xiaodanguoguo/RL_Heals_SFT
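To make the distinction between singular values and singular vectors concrete, here is a small NumPy sketch. It is an illustration only, not the paper's analysis procedure: it applies a random orthogonal transform to a toy weight matrix, which leaves every singular value unchanged while rotating the left singular vectors, and measures that rotation with principal angles. The matrix sizes and the helper `principal_angles` are choices made for this example.

```python
import numpy as np


def principal_angles(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Principal angles (radians) between the column spaces of A and B.

    Both inputs are assumed to have orthonormal columns.
    """
    cosines = np.linalg.svd(A.T @ B, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))


rng = np.random.default_rng(0)

# Toy "weight matrix" and its SVD.
W = rng.standard_normal((6, 4))
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# Random orthogonal matrix from a QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))

# Rotating the output space changes the left singular vectors (U -> Q @ U)
# but not the singular values.
W_rot = Q @ W
U_rot, S_rot, _ = np.linalg.svd(W_rot, full_matrices=False)

# Rotation of the top-2 left singular subspace.
angles = principal_angles(U[:, :2], U_rot[:, :2])

print("singular values unchanged:", np.allclose(S, S_rot))
print("principal angles (degrees):", np.degrees(angles))
```

Two matrices can therefore share an identical singular value spectrum and still compute very different functions, which is why a spectrum-only comparison of checkpoints could miss the kind of change the abstract describes.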
[339] Data-Driven Discovery of Emergent Dynamics in Reaction-Diffusion Systems from Sparse and Noisy Observations
Saumitra Dwivedi, Ricardo da Silva Torres, Ibrahim A. Hameed, Gunnar Tufte, Anniken Susanne T. Karlsen
Main category: cs.LG
TL;DR: The paper presents a data-driven framework (DRSALife) to learn Soft Artificial Life models from observed data for reaction-diffusion systems without prior physics knowledge, achieving 74% accuracy in predicting emergent dynamics with robustness to noise and sparsity.
Details
Motivation: To address the challenge of system identification in reaction-diffusion systems when there is no prior knowledge of underlying physics, by learning Soft ALife models from observed data.
Method: Uses the Data-driven Rulesets for Soft Artificial Life (DRSALife) framework to learn Agent-based and Cellular Automata models from observed data, testing on noisy and sparse datasets.
Result: Achieved 74% accuracy in predicting emergent dynamics, demonstrated robustness to Gaussian noise and temporal sparsity, and successfully identified underlying PDE structure and parameters.
Conclusion: The DRSALife framework effectively learns Soft ALife rulesets for reaction-diffusion systems without prior physics knowledge, showing promising results for system identification and prediction of emergent dynamics.
Abstract: Data-driven discovery of emergent dynamics is gaining popularity, particularly in the context of reaction-diffusion systems. These systems are widely studied across various fields, including neuroscience, ecology, epidemiology, and several other subject areas that deal with emergent dynamics. A current challenge in the discovery process relates to system identification when there is no prior knowledge of the underlying physics. We attempt to address this challenge by learning Soft Artificial Life (Soft ALife) models, such as Agent-based and Cellular Automata (CA) models, from observed data for reaction-diffusion systems. In this paper, we present findings on the applicability of a conceptual framework, the Data-driven Rulesets for Soft Artificial Life (DRSALife) model, to learn Soft ALife rulesets that accurately represent emergent dynamics in a reaction-diffusion system from observed data. This model has demonstrated promising results for Elementary CA Rule 30, Game of Life, and Vicsek Flocking problems in recent work. To our knowledge, this is one of the few studies that explore machine-based Soft ALife ruleset learning and system identification for reaction-diffusion dynamics without any prior knowledge of the underlying physics. Moreover, we provide comprehensive findings from experiments investigating the potential effects of using noisy and sparse observed datasets on learning emergent dynamics. Additionally, we successfully identify the structure and parameters of the underlying partial differential equations (PDEs) representing these dynamics. Experimental results demonstrate that the learned models are able to predict the emergent dynamics with good accuracy (74%) and exhibit quite robust performance when subjected to Gaussian noise and temporal sparsity.
[340] Interpretable Data Mining of Follicular Thyroid Cancer Ultrasound Features Using Enhanced Association Rules
Songlin Zhou, Tao Zhou, Xin Li, Stephen Shing-Toung Yau
Main category: cs.LG
TL;DR: Improved association rule mining with SHAP-inspired metrics to identify clinical indicators for preoperative diagnosis of follicular thyroid cancer from 1673 cases.
Details
Motivation: Follicular thyroid cancer lacks distinctive ultrasound signs and is harder to diagnose preoperatively than papillary thyroid cancer, with fewer established clinical studies.
Method: Retrospective analysis of 1673 cases (2010-2023) using improved association rule mining with novel analytical metrics inspired by SHAP method from interpretable machine learning.
Result: Identified strong malignant associations beyond common indicators: nodule-in-nodule pattern, trabecular pattern, low TSH scores, and combination with Hashimoto’s thyroiditis.
Conclusion: Multiple clinical indications should be considered for accurate preoperative diagnosis of follicular thyroid cancer, with identified associations serving as references for clinicians.
Abstract: Purpose: Thyroid cancer is a common cancer. Papillary thyroid cancer and follicular thyroid cancer are the two most common types of thyroid cancer. Follicular thyroid cancer lacks distinctive ultrasound signs and is more difficult to diagnose preoperatively than the more prevalent papillary thyroid cancer, and the clinical studies associated with it are less well established. We aimed to analyze the clinical data of follicular thyroid cancer based on a novel data mining tool to identify some clinical indications that may help in preoperative diagnosis. Methods: We performed a retrospective analysis based on case data collected by the Department of General Surgery of Peking University Third Hospital between 2010 and 2023. Unlike traditional statistical methods, we improved the association rule mining, a classical data mining method, and proposed new analytical metrics reflecting the malignant association between clinical indications and cancer, drawing on the idea of the SHAP method in interpretable machine learning. Results: The dataset was preprocessed to contain 1673 cases (in terms of nodes rather than patients), of which 1414 were benign and 259 were malignant nodes. Our analysis pointed out that in addition to some common indicators (e.g., irregular or lobulated nodal margins, uneven thickness halo, hypoechogenicity), there were also some indicators with strong malignant associations, such as nodule-in-nodule pattern, trabecular pattern, and low TSH scores. In addition, our results suggest that the combination of Hashimoto’s thyroiditis may also have a strong malignant association. Conclusion: In the preoperative diagnosis of nodules suspected of follicular thyroid cancer, multiple clinical indications should be considered for a more accurate diagnosis. The diverse malignant associations identified in our study may serve as a reference for clinicians in related fields.
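For readers unfamiliar with association rule mining, the classical metrics for a rule "feature present implies malignant" are support, confidence, and lift. The sketch below computes them using the dataset totals from the abstract (1673 cases, 259 malignant). The per-feature counts are hypothetical placeholders chosen for illustration, and the paper's improved SHAP-inspired metrics are not reproduced here.

```python
def rule_metrics(n_total: int, n_antecedent: int, n_consequent: int, n_both: int):
    """Classical association-rule metrics for the rule antecedent -> consequent.

    support    = P(antecedent and consequent)
    confidence = P(consequent | antecedent)
    lift       = confidence / P(consequent)
    """
    support = n_both / n_total
    confidence = n_both / n_antecedent
    lift = confidence / (n_consequent / n_total)
    return support, confidence, lift


# Totals reported in the abstract.
N_TOTAL, N_MALIGNANT = 1673, 259
base_rate = N_MALIGNANT / N_TOTAL

# Hypothetical counts for one ultrasound feature (NOT from the paper):
# 100 cases show the feature, 40 of which are malignant.
n_feature, n_feature_and_malignant = 100, 40

support, confidence, lift = rule_metrics(
    N_TOTAL, n_feature, N_MALIGNANT, n_feature_and_malignant
)

print(f"baseline malignancy rate: {base_rate:.3f}")
print(f"support={support:.4f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 means the feature is associated with malignancy more often than the baseline rate alone would predict. With only about 15% of cases malignant, lift is usually more informative than raw confidence.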
[341] InJecteD: Analyzing Trajectories and Drift Dynamics in Denoising Diffusion Probabilistic Models for 2D Point Cloud Generation
Sanyam Jain, Khuram Naveed, Illia Oleksiienko, Alexandros Iosifidis, Ruben Pauwels
Main category: cs.LG
TL;DR: InJecteD is a framework for interpreting DDPMs by analyzing denoising trajectories in 2D point cloud generation, using statistical metrics to quantify trajectory properties and reveal distinct denoising phases.
Details
Motivation: To enhance transparency of denoising diffusion models and support human-AI collaboration by enabling practitioners to debug and refine generative models through trajectory analysis.
Method: Analyzes sample trajectories during denoising process using statistical metrics (Wasserstein distance, cosine similarity) to quantify displacement, velocity, clustering, and drift field dynamics. Applied to three datasets with simplified DDPM architecture and customizable embeddings.
Result: Reveals distinct denoising phases: initial noise exploration, rapid shape formation, and final refinement. Fourier-based embeddings improve trajectory stability and reconstruction quality. Shows dataset-specific behaviors (bullseye concentric convergence vs. dino complex contour formation).
Conclusion: InJecteD provides valuable insights into DDPM behavior through trajectory analysis, demonstrating that embedding choices significantly impact model performance and stability, with Fourier embeddings performing best.
Abstract: This work introduces InJecteD, a framework for interpreting Denoising Diffusion Probabilistic Models (DDPMs) by analyzing sample trajectories during the denoising process of 2D point cloud generation. We apply this framework to three datasets from the Datasaurus Dozen (bullseye, dino, and circle) using a simplified DDPM architecture with customizable input and time embeddings. Our approach quantifies trajectory properties, including displacement, velocity, clustering, and drift field dynamics, using statistical metrics such as Wasserstein distance and cosine similarity. By enhancing model transparency, InJecteD supports human-AI collaboration by enabling practitioners to debug and refine generative models. Experiments reveal distinct denoising phases: initial noise exploration, rapid shape formation, and final refinement, with dataset-specific behaviors, for example, bullseye’s concentric convergence vs. dino’s complex contour formation. We evaluate four model configurations, varying embeddings and noise schedules, demonstrating that Fourier-based embeddings improve trajectory stability and reconstruction quality.
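The trajectory properties named in the abstract can be sketched on a toy example. The code below is a minimal illustration under stated assumptions, not the InJecteD implementation: it takes an array of 2D point positions over denoising steps and computes net displacement, per-step speed, and the cosine similarity between consecutive step directions. The straight-line trajectories and the function name `trajectory_metrics` are invented for this example.

```python
import numpy as np


def trajectory_metrics(traj: np.ndarray):
    """Simple per-point trajectory statistics.

    traj has shape (T, N, 2): N two-dimensional points tracked over T steps.
    Returns:
      displacement: (N,) distance between the first and last position
      speed:        (T-1, N) length of each step
      cosine:       (T-2, N) cosine similarity between consecutive steps
    """
    steps = np.diff(traj, axis=0)
    speed = np.linalg.norm(steps, axis=-1)
    displacement = np.linalg.norm(traj[-1] - traj[0], axis=-1)
    a, b = steps[:-1], steps[1:]
    denom = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    cosine = np.sum(a * b, axis=-1) / np.maximum(denom, 1e-12)
    return displacement, speed, cosine


rng = np.random.default_rng(0)

# Toy stand-in for a denoising run: points start as Gaussian noise and move
# in straight lines toward targets on a unit circle over 5 steps.
start = rng.standard_normal((10, 2))
theta = np.linspace(0.0, 2.0 * np.pi, 10, endpoint=False)
target = np.stack([np.cos(theta), np.sin(theta)], axis=1)
alphas = np.linspace(0.0, 1.0, 5)[:, None, None]
traj = (1.0 - alphas) * start + alphas * target  # shape (5, 10, 2)

displacement, speed, cosine = trajectory_metrics(traj)
print("mean displacement:", displacement.mean())
print("mean cosine between consecutive steps:", cosine.mean())
```

On straight-line paths the cosine similarity is 1 at every step. In a real denoising run, values well below 1 early on would be consistent with the "noise exploration" phase the abstract describes, with values approaching 1 as points commit to a direction.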
[342] Why and How Auxiliary Tasks Improve JEPA Representations
Jiacan Yu, Siyi Chen, Mingrui Liu, Nono Horiuchi, Vladimir Braverman, Zicheng Xu, Dan Haramati, Randall Balestriero
Main category: cs.LG
TL;DR: JEPA with auxiliary regression head prevents representation collapse and ensures distinct latents for non-equivalent observations in deterministic MDPs.
Details
Motivation: Understand JEPA behavior in visual representation learning and model-based RL, as its theoretical properties remain poorly characterized.
Method: Theoretical analysis of JEPA variant with auxiliary regression head trained jointly with latent dynamics, plus controlled ablations in counting environment.
Result: Proved No Unhealthy Representation Collapse theorem: non-equivalent observations map to distinct latent representations when both losses reach zero.
Conclusion: Joint training with auxiliary function that encodes proper equivalence relations improves JEPA encoder representations.
Abstract: Joint-Embedding Predictive Architecture (JEPA) is increasingly used for visual representation learning and as a component in model-based RL, but its behavior remains poorly understood. We provide a theoretical characterization of a simple, practical JEPA variant that has an auxiliary regression head trained jointly with latent dynamics. We prove a No Unhealthy Representation Collapse theorem: in deterministic MDPs, if training drives both the latent-transition consistency loss and the auxiliary regression loss to zero, then any pair of non-equivalent observations, i.e., those that do not have the same transition dynamics or auxiliary label, must map to distinct latent representations. Thus, the auxiliary task anchors which distinctions the representation must preserve. Controlled ablations in a counting environment corroborate the theory and show that training the JEPA model jointly with the auxiliary head generates a richer representation than training them separately. Our work indicates a path to improve JEPA encoders: training them with an auxiliary function that, together with the transition dynamics, encodes the right equivalence relations.
[343] Representation Learning on Large Non-Bipartite Transaction Networks using GraphSAGE
Mihir Tare, Clemens Rattasits, Yiming Wu, Euan Wielewski
Main category: cs.LG
TL;DR: GraphSAGE applied to banking transaction networks for scalable fraud detection and customer segmentation, showing improved performance over traditional methods.
Details
Motivation: Financial institutions need scalable tools to analyze dynamic transactional networks, but traditional graph embedding methods struggle with real-world banking data complexity and scalability.
Method: Used GraphSAGE inductive Graph Neural Network framework on non-bipartite heterogeneous transaction networks, constructed from anonymized customer and merchant transactions to generate node embeddings.
Result: Embeddings revealed interpretable clusters aligned with geographic/demographic attributes and improved money mule detection by better prioritizing high-risk accounts.
Conclusion: GraphSAGE provides scalable, inductive framework for banking networks with practical applications in fraud detection and customer insights, offering a blueprint for financial organizations.
Abstract: Financial institutions increasingly require scalable tools to analyse complex transactional networks, yet traditional graph embedding methods struggle with dynamic, real-world banking data. This paper demonstrates the practical application of GraphSAGE, an inductive Graph Neural Network framework, to non-bipartite heterogeneous transaction networks within a banking context. Unlike transductive approaches, GraphSAGE scales well to large networks and can generalise to unseen nodes which is critical for institutions working with temporally evolving transactional data. We construct a transaction network using anonymised customer and merchant transactions and train a GraphSAGE model to generate node embeddings. Our exploratory work on the embeddings reveals interpretable clusters aligned with geographic and demographic attributes. Additionally, we illustrate their utility in downstream classification tasks by applying them to a money mule detection model where using these embeddings improves the prioritisation of high-risk accounts. Beyond fraud detection, our work highlights the adaptability of this framework to banking-scale networks, emphasising its inductive capability, scalability, and interpretability. This study provides a blueprint for financial organisations to harness graph machine learning for actionable insights in transactional ecosystems.
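The inductive property mentioned in the abstract comes from GraphSAGE learning shared aggregation weights rather than one embedding per node. The NumPy sketch below shows a single layer with a mean aggregator on a tiny homogeneous toy graph. It is a simplification for illustration: the weights are random rather than trained, the paper's network is heterogeneous and far larger, and the function name `sage_mean_layer` is invented here.

```python
import numpy as np


def sage_mean_layer(h, neighbors, W_self, W_neigh):
    """One GraphSAGE-style layer with a mean aggregator.

    h:         (N, d_in) node features
    neighbors: dict mapping node id -> list of neighbor ids
    W_self, W_neigh: (d_in, d_out) weights shared by all nodes
    Using two weight matrices is equivalent to concatenating the node's own
    features with the neighbor mean and applying one larger matrix.
    """
    out = np.zeros((h.shape[0], W_self.shape[1]))
    for v in range(h.shape[0]):
        nbrs = neighbors.get(v, [])
        agg = h[nbrs].mean(axis=0) if nbrs else np.zeros(h.shape[1])
        out[v] = np.maximum(h[v] @ W_self + agg @ W_neigh, 0.0)  # ReLU
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-12)  # L2-normalise each embedding


rng = np.random.default_rng(0)
W_self = rng.standard_normal((4, 3))   # stand-ins for trained weights
W_neigh = rng.standard_normal((4, 3))

# Path graph 0 - 1 - 2 - 3 with random 4-dimensional features.
h = rng.standard_normal((4, 4))
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
emb_old = sage_mean_layer(h, neighbors, W_self, W_neigh)

# A previously unseen node 4 appears, connected to node 0.
# No retraining is needed: the same weights embed the new node.
h_new = np.vstack([h, rng.standard_normal((1, 4))])
neighbors_new = {**neighbors, 0: [1, 4], 4: [0]}
emb_new = sage_mean_layer(h_new, neighbors_new, W_self, W_neigh)

print("embedding of new node:", emb_new[4])
```

With one layer, only the new node and its direct neighbor get different embeddings; nodes 1 to 3 are unchanged. A transductive method with a per-node embedding table would instead need retraining to cover node 4.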
[344] Quantum-Inspired Stacked Integrated Concept Graph Model (QISICGM) for Diabetes Risk Prediction
Kenneth G. Young II
Main category: cs.LG
TL;DR: QISICGM is a quantum-inspired ML framework for diabetes risk prediction that reports an out-of-fold F1 of 0.8933 and AUC of 0.8699 using a stacked ensemble and quantum-inspired feature representations, with CPU inference at 8.5 rows per second.
Details
Motivation: To develop an accurate and efficient AI system for diabetes risk prediction that addresses class imbalance issues while incorporating quantum-inspired techniques for improved feature representation and computational efficiency.
Method: Uses PIMA dataset augmented with 2,000 synthetic samples, integrates self-improving concept graph with stacked ensemble (RF, ET, transformers, CNNs, FFNNs), and employs quantum-inspired phase feature mapping and neighborhood sequence modeling.
Result: Achieved OOF F1 score of 0.8933 and AUC of 0.8699, outperforming traditional methods, with CPU-efficient inference at 8.5 rows per second.
Conclusion: QISICGM represents a potential benchmark for AI-assisted clinical triage, emphasizing trustworthy AI through calibration, interpretability, and open-source reproducibility in diabetes prediction applications.
Abstract: The Quantum-Inspired Stacked Integrated Concept Graph Model (QISICGM) is an innovative machine learning framework that harnesses quantum-inspired techniques to predict diabetes risk with exceptional accuracy and efficiency. Utilizing the PIMA Indians Diabetes dataset augmented with 2,000 synthetic samples to mitigate class imbalance (total: 2,768 samples, 1,949 positives), QISICGM integrates a self-improving concept graph with a stacked ensemble comprising Random Forests (RF), Extra Trees (ET), transformers, convolutional neural networks (CNNs), and feed-forward neural networks (FFNNs). This approach achieves an out-of-fold (OOF) F1 score of 0.8933 and an AUC of 0.8699, outperforming traditional methods. Quantum-inspired elements, such as phase feature mapping and neighborhood sequence modeling, enrich feature representations, enabling CPU-efficient inference at 8.5 rows per second. This paper presents a detailed architecture, theoretical foundations, code insights, and performance evaluations, including visualizations from the outputs subfolder. The open-source implementation (v1.0.0) is available at https://github.com/keninayoung/QISICGM, positioning QISICGM as a potential benchmark for AI-assisted clinical triage in diabetes and beyond. Ultimately, this work emphasizes trustworthy AI through calibration, interpretability, and open-source reproducibility.
[345] Explainable Fraud Detection with GNNExplainer and Shapley Values
Ngoc Hieu Dao
Main category: cs.LG
TL;DR: Developing an explainable AI system for financial fraud detection to meet transparency requirements and support fraud investigations.
Details
Motivation: Increasing digital payments raise fraud risks, while regulators and society demand more transparent AI systems for reliability verification. Fraud analysts need understandable explanations to conduct effective investigations.
Method: The paper focuses on developing an explainable fraud detector. The title points to GNNExplainer and Shapley values, but the abstract does not detail the technical method.
Result: Not specified in the abstract; the paper proposes to develop a solution but does not present completed results.
Conclusion: Explainable AI systems are needed to address the dual challenges of increasing fraud risks and regulatory demands for transparency in financial fraud detection.
Abstract: The risk of financial fraud is increasing as digital payments are used more and more frequently. Although the use of artificial intelligence systems for fraud detection is widespread, society and regulators have raised the standards for these systems’ transparency for reliability verification purposes. To increase their effectiveness in conducting fraud investigations, fraud analysts also benefit from having concise and understandable explanations. To solve these challenges, the paper will concentrate on developing an explainable fraud detector.
[346] Research on Short-Video Platform User Decision-Making via Multimodal Temporal Modeling and Reinforcement Learning
Jinmeiyang Wang, Jing Dong, Li Zhou
Main category: cs.LG
TL;DR: MT-DQN model combines Transformer, TGNN, and DQN for short-video recommendation, outperforming traditional models with significant F1-score and NDCG improvements while reducing MSE/MAE compared to Vanilla-DQN.
Details
Motivation: Address challenges in predicting user behavior and optimizing recommendation strategies in short-video environments where traditional models may be insufficient.
Method: Integrates Transformer architecture, Temporal Graph Neural Network (TGNN), and Deep Q-Network (DQN) to create a unified model for recommendation optimization.
Result: Compared with Concat-Modal, MT-DQN improves average F1-score by 10.97% and average NDCG@5 by 8.3%. Compared with Vanilla-DQN, it reduces MSE by 34.8% and MAE by 26.5%.
Conclusion: MT-DQN shows superior performance but faces deployment challenges due to computational cost and latency sensitivity, requiring future architectural optimization for real-world applications.
Abstract: This paper proposes the MT-DQN model, which integrates a Transformer, Temporal Graph Neural Network (TGNN), and Deep Q-Network (DQN) to address the challenges of predicting user behavior and optimizing recommendation strategies in short-video environments. Experiments demonstrated that MT-DQN consistently outperforms traditional concatenated models, such as Concat-Modal, achieving an average F1-score improvement of 10.97% and an average NDCG@5 improvement of 8.3%. Compared to the classic reinforcement learning model Vanilla-DQN, MT-DQN reduces MSE by 34.8% and MAE by 26.5%. Nonetheless, we also recognize challenges in deploying MT-DQN in real-world scenarios, such as its computational cost and latency sensitivity during online inference, which will be addressed through future architectural optimization.
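Since NDCG@5 is one of the headline metrics above, a short reference computation may help. The sketch below uses the standard definition with linear gain, where DCG@k sums each relevance score divided by log2 of its 1-based rank plus one, and NDCG@k divides by the DCG of the ideal ordering. The relevance lists are made-up examples, and the abstract does not say which gain variant the authors used.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    # enumerate() is 0-based, so rank i+1 has discount log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG@k normalised by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Made-up relevance scores, listed in the order a recommender ranked the items.
perfect_ranking = [3, 2, 1, 0, 0]
poor_ranking = [0, 1, 3]

print("NDCG@5, perfect ranking:", ndcg_at_k(perfect_ranking, 5))
print("NDCG@5, poor ranking:   ", round(ndcg_at_k(poor_ranking, 5), 4))
```

The abstract does not state whether the 8.3% NDCG@5 improvement is relative or in absolute points, so both readings remain possible when comparing against other work.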
[347] Deriving the Scaled-Dot-Function via Maximum Likelihood Estimation and Maximum Entropy Approach
Jiyong Ma
Main category: cs.LG
TL;DR: Maximum likelihood estimation approach for determining value vectors in transformers, modeling sequences as Gaussian distributions with time-dependent variance and mean.
Details
Motivation: To provide a new explanation for the scaled-dot-product and softmax functions in transformer architectures through statistical modeling.
Method: Model value, key, and query vectors as Gaussian distributions where variance depends on time step, key vector, and query vector, while mean depends on time step and value vector.
Result: The analysis offers an alternative statistical perspective on transformer attention mechanisms.
Conclusion: This Gaussian distribution modeling approach provides a novel explanation for transformer attention functions, complementing existing maximum entropy approaches.
Abstract: In this paper, we present a maximum likelihood estimation approach to determine the value vector in transformer models. We model the sequence of value vectors, key vectors, and the query vector as a sequence of Gaussian distributions. The variance in each Gaussian distribution depends on the time step, the corresponding key vector, and the query vector. The mean value in each Gaussian distribution depends on the time step, and the corresponding value vector. This analysis may offer a new explanation of the scaled-dot-product function or softmax function used in transformer architectures [1]. Another explanation, inspired by [4], is based on the maximum entropy approach in natural language processing [5]. In this approach, a query vector and key vectors are used to derive the feature functions for the maximum entropy model.
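One way to see how a Gaussian model can lead to softmax attention is sketched below. The parameterisation is an assumption made for this illustration, since the abstract does not give the paper's exact form: each value vector v_t is treated as a noisy observation of a common output with isotropic variance exp(-q·k_t/√d). The maximum likelihood estimate of a common mean under heteroscedastic Gaussian noise is the precision-weighted average, which under this assumption is exactly scaled dot-product attention.

```python
import numpy as np


def softmax_attention(q, K, V):
    """Standard scaled dot-product attention for a single query vector."""
    d = q.shape[0]
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())  # subtract max for stability
    weights /= weights.sum()
    return weights @ V


def gaussian_mle_mean(q, K, V):
    """MLE of a shared mean when v_t ~ N(mean, sigma_t^2 * I).

    ASSUMPTION for this sketch: precision 1 / sigma_t^2 = exp(q . k_t / sqrt(d)).
    The MLE is then the precision-weighted average of the value vectors.
    """
    d = q.shape[0]
    precision = np.exp(K @ q / np.sqrt(d))
    return (precision[:, None] * V).sum(axis=0) / precision.sum()


rng = np.random.default_rng(0)
T, d, d_v = 5, 8, 4
q = rng.standard_normal(d)
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d_v))

out_attention = softmax_attention(q, K, V)
out_mle = gaussian_mle_mean(q, K, V)
print("max abs difference:", np.abs(out_attention - out_mle).max())
```

The two functions agree because normalising the precisions by their sum is the softmax of the scaled scores. A key that aligns strongly with the query corresponds to a low-variance, and therefore heavily weighted, observation.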
[348] Prediction of Stocks Index Price using Quantum GANs
Sangram Deshpande, Gopal Ramesh Dahale, Sai Nandan Morapakula, Uday Wad
Main category: cs.LG
TL;DR: QGANs outperform classical models in stock price prediction with better accuracy and convergence speed using quantum computing.
Details
Motivation: Financial markets are complex with high volatility that traditional models struggle to capture, requiring novel quantum-enhanced approaches.
Method: Implemented QGAN model using AWS Braket SV1 simulator and Stocks index price data, comparing against classical LSTM and GAN models.
Result: QGANs generated synthetic data closely resembling actual market behavior and outperformed classical models in both convergence speed and prediction accuracy.
Conclusion: Quantum computing integration shows promise for financial forecasting, offering speed and precision advantages with important implications for market analysis.
Abstract: This paper investigates the application of Quantum Generative Adversarial Networks (QGANs) for stock price prediction. Financial markets are inherently complex, marked by high volatility and intricate patterns that traditional models often fail to capture. QGANs, leveraging the power of quantum computing, offer a novel approach by combining the strengths of generative models with quantum machine learning techniques. We implement a QGAN model tailored for stock price prediction and evaluate its performance using historical stock market data. Our results demonstrate that QGANs can generate synthetic data closely resembling actual market behavior, leading to enhanced prediction accuracy. The experiment was conducted using the Stocks index price data and the AWS Braket SV1 simulator for training the QGAN circuits. The quantum-enhanced model outperforms classical Long Short-Term Memory (LSTM) and GAN models in terms of convergence speed and prediction accuracy. This research represents a key step toward integrating quantum computing in financial forecasting, offering potential advantages in speed and precision over traditional methods. The findings suggest important implications for traders, financial analysts, and researchers seeking advanced tools for market analysis.
[349] C3DE: Causal-Aware Collaborative Neural Controlled Differential Equation for Long-Term Urban Crowd Flow Prediction
Yuting Liu, Qiang Zhou, Hanzhe Li, Chenqi Gong, Jingjing Gu
Main category: cs.LG
TL;DR: C3DE uses neural controlled differential equations with causal awareness to predict long-term urban crowd flow, addressing sampling errors and spurious correlations between POI evolution and crowd dynamics.
Details
Motivation: Long-term urban crowd flow prediction suffers from cumulative sampling errors and the challenge of modeling multi-timescale asynchronous dynamics between crowd flow and POI distribution with latent spurious causality.
Method: Proposes Causal-aware Collaborative neural CDE (C3DE) with dual-path NCDE backbone to capture asynchronous evolution, dynamic correction mechanism with counterfactual-based causal effect estimator, and predictor for fused collaborative signals.
Result: Extensive experiments on three real-world datasets demonstrate superior performance, particularly in cities with notable flow fluctuations.
Conclusion: C3DE effectively models long-term crowd flow dynamics by addressing sampling errors and spurious correlations through causal-aware collaborative neural differential equations.
Abstract: Long-term urban crowd flow prediction suffers significantly from cumulative sampling errors, due to increased sequence lengths and sampling intervals, which inspired us to leverage Neural Controlled Differential Equations (NCDEs) to mitigate this issue. However, regarding the crucial influence of Points of Interest (POIs) evolution on long-term crowd flow, the multi-timescale asynchronous dynamics between crowd flow and POI distribution, coupled with latent spurious causality, poses challenges to applying NCDEs for long-term urban crowd flow prediction. To this end, we propose Causal-aware Collaborative neural CDE (C3DE) to model the long-term dynamic of crowd flow. Specifically, we introduce a dual-path NCDE as the backbone to effectively capture the asynchronous evolution of collaborative signals across multiple time scales. Then, we design a dynamic correction mechanism with the counterfactual-based causal effect estimator to quantify the causal impact of POIs on crowd flow and minimize the accumulation of spurious correlations. Finally, we leverage a predictor for long-term prediction with the fused collaborative signals of POI and crowd flow. Extensive experiments on three real-world datasets demonstrate the superior performance of C3DE, particularly in cities with notable flow fluctuations.
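To make the NCDE idea concrete, here is a minimal sketch of an explicit Euler discretization of a controlled differential equation dz = f(z) dX. It is not the C3DE architecture: the function name `ncde_euler`, the toy vector fields, and the use of raw observation differences instead of an interpolated control path are all assumptions made for illustration.

```python
def ncde_euler(z0, path, vector_field):
    """Integrate dz = f(z) dX with explicit Euler steps along an observed path.

    z0:           initial hidden state, a list of h floats
    path:         observations X_0..X_K, each a list of d floats
    vector_field: maps z (length h) to an h x d matrix (list of rows);
                  in a neural CDE this would be a learned network (assumed here)
    """
    z = list(z0)
    for x_prev, x_next in zip(path, path[1:]):
        # Increment of the control path between consecutive observations.
        dx = [b - a for a, b in zip(x_prev, x_next)]
        F = vector_field(z)
        # z <- z + f(z) @ dx
        z = [zi + sum(fij * dxj for fij, dxj in zip(row, dx))
             for zi, row in zip(z, F)]
    return z


# Toy check 1: with an identity vector field the state simply tracks the path.
identity_field = lambda z: [[1.0, 0.0], [0.0, 1.0]]
z_end = ncde_euler([0.0, 0.0], [[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]], identity_field)

# Toy check 2: a state-dependent field f(z) = [[z]] gives z <- z * (1 + dx) per step.
growth = ncde_euler([1.0], [[0.0], [0.5], [1.0]], lambda z: [[z[0]]])
```

Because each update depends only on the increment of the path, unevenly spaced observations need no special handling in this sketch.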
[350] Spontaneous Kolmogorov-Arnold Geometry in Shallow MLPs
Michael Freedman, Michael Mulligan
Main category: cs.LG
TL;DR: The paper investigates whether Kolmogorov-Arnold (KA) geometry emerges naturally in single hidden layer neural networks during training, rather than being engineered into the architecture like in KA-Networks (KANs).
Details
Motivation: To understand how neural networks organically learn to prepare input data for downstream processing and to learn about the emergence of KA geometry to potentially accelerate learning through timely hyperparameter interventions.
Method: Quantify KA geometry through statistical properties of the exterior powers of the Jacobian matrix J(x), including number of zero rows and various observables for minor statistics that measure scale and axis alignment of J(x).
Result: KA geometry often emerges when training vanilla single hidden layer neural networks, and the study provides understanding of where this geometry occurs in the space of function complexity and model hyperparameters.
Conclusion: The research demonstrates that Kolmogorov-Arnold geometry can emerge organically in conventional shallow MLPs during training, representing the “flip side” of engineered KA-Networks where KA is built into the architecture.
Abstract: The Kolmogorov-Arnold (KA) representation theorem constructs universal, but highly non-smooth inner functions (the first layer map) in a single (non-linear) hidden layer neural network. Such universal functions have a distinctive local geometry, a “texture,” which can be characterized by the inner function’s Jacobian $J({\mathbf{x}})$, as $\mathbf{x}$ varies over the data. It is natural to ask if this distinctive KA geometry emerges through conventional neural network optimization. We find that indeed KA geometry often is produced when training vanilla single hidden layer neural networks. We quantify KA geometry through the statistical properties of the exterior powers of $J(\mathbf{x})$: number of zero rows and various observables for the minor statistics of $J(\mathbf{x})$, which measure the scale and axis alignment of $J(\mathbf{x})$. This leads to a rough understanding for where KA geometry occurs in the space of function complexity and model hyperparameters. The motivation is first to understand how neural networks organically learn to prepare input data for later downstream processing and, second, to learn enough about the emergence of KA geometry to accelerate learning through a timely intervention in network hyperparameters. This research is the “flip side” of KA-Networks (KANs). We do not engineer KA into the neural network, but rather watch KA emerge in shallow MLPs.
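The two kinds of Jacobian statistics mentioned above can be sketched for a single hidden layer. The snippet assumes a ReLU activation (the abstract does not state which activation is used) and computes only the raw ingredients: the number of zero rows of $J(\mathbf{x})$ and its 2x2 minors, which are the entries of the second exterior power. The specific observables built from these minors in the paper are not reproduced here, and all function names are hypothetical.

```python
from itertools import combinations


def relu_layer_jacobian(W, b, x):
    """Jacobian of x -> relu(W x + b): row i is W[i] if unit i is active, else zeros."""
    J = []
    for row, bias in zip(W, b):
        pre_activation = sum(w * xi for w, xi in zip(row, x)) + bias
        J.append(list(row) if pre_activation > 0 else [0.0] * len(row))
    return J


def count_zero_rows(J, tol=1e-12):
    """Number of rows whose entries are all (numerically) zero."""
    return sum(all(abs(v) <= tol for v in row) for row in J)


def two_by_two_minors(J):
    """All 2x2 minors of J, i.e. the entries of its second exterior power."""
    n_rows, n_cols = len(J), len(J[0])
    minors = []
    for i, k in combinations(range(n_rows), 2):
        for j, l in combinations(range(n_cols), 2):
            minors.append(J[i][j] * J[k][l] - J[i][l] * J[k][j])
    return minors


# Toy layer: three hidden units, two inputs; the third unit is inactive at x.
W = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
b = [0.0, 0.0, -5.0]
J = relu_layer_jacobian(W, b, [1.0, 1.0])
```

In practice these quantities would be collected over many data points $\mathbf{x}$ and summarized statistically, as the abstract describes.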
[351] Integrating Attention-Enhanced LSTM and Particle Swarm Optimization for Dynamic Pricing and Replenishment Strategies in Fresh Food Supermarkets
Xianchen Liu, Tianhui Zhang, Xinyu Zhang, Lingmin Hou, Zhen Guo, Yuanhao Tian, Yang Liu
Main category: cs.LG
TL;DR: Combines LSTM with attention mechanism for sales/pricing/spoilage prediction and PSO for optimization to maximize profits while reducing food waste in fresh food supermarkets.
Details
Motivation: To address the challenge of dynamic pricing and inventory management for perishable goods, maximizing profitability while reducing food waste in supermarket operations.
Method: Uses LSTM network with attention mechanism for 7-day sales volume, pricing trends, and spoilage rate predictions. These predictions feed into PSO algorithm that optimizes pricing and replenishment strategies with cost-plus pricing for dynamic adjustments.
Result: The framework maximizes profits while adhering to inventory constraints, reduces food waste, and provides interpretable insights through attention mechanism for better decision-making.
Conclusion: This integrated approach bridges predictive modeling with optimization, offering a scalable solution for dynamic pricing and inventory management in perishable goods retail.
Abstract: This paper presents a novel approach to optimizing pricing and replenishment strategies in fresh food supermarkets by combining Long Short-Term Memory (LSTM) networks with Particle Swarm Optimization (PSO). The LSTM model, enhanced with an attention mechanism, is used to predict sales volumes, pricing trends, and spoilage rates over a seven-day period. The predictions generated by the LSTM model serve as inputs for the PSO algorithm, which iteratively optimizes pricing and replenishment strategies to maximize profitability while adhering to inventory constraints. The integration of cost-plus pricing allows for dynamic adjustments based on fixed and variable costs, ensuring real-time adaptability to market fluctuations. The framework not only maximizes profits but also reduces food waste, contributing to more sustainable supermarket operations. The attention mechanism enhances the interpretability of the LSTM model by identifying key time points and factors influencing sales, improving decision-making accuracy. This methodology bridges the gap between predictive modeling and optimization, offering a scalable solution for dynamic pricing and inventory management in fresh food retail and other industries dealing with perishable goods.
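The optimization half of this pipeline can be illustrated with a bare-bones particle swarm. The sketch below is not the paper's model: the one-dimensional toy profit function (a margin times a linear demand curve), its coefficients, and the PSO hyperparameters are all hypothetical, and the LSTM forecasts that would normally supply the demand estimate are replaced by fixed constants.

```python
import random


def profit(price, unit_cost=2.0, a=100.0, b=10.0):
    """Toy profit: margin times a linear demand curve a - b * price (hypothetical)."""
    return (price - unit_cost) * (a - b * price)


def pso_maximize(f, lo, hi, n_particles=30, n_iters=200,
                 w=0.7, c1=1.4, c2=1.4, seed=0):
    """Maximize a 1-D function on [lo, hi] with a basic global-best PSO."""
    rng = random.Random(seed)
    pos = [rng.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest = pos[:]                       # best position seen by each particle
    pbest_val = [f(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g], pbest_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            # Inertia + pull toward personal best + pull toward global best.
            vel[i] = (w * vel[i]
                      + c1 * r1 * (pbest[i] - pos[i])
                      + c2 * r2 * (gbest - pos[i]))
            # Clamp to the feasible price range.
            pos[i] = min(hi, max(lo, pos[i] + vel[i]))
            val = f(pos[i])
            if val > pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i], val
                if val > gbest_val:
                    gbest, gbest_val = pos[i], val
    return gbest, gbest_val


best_price, best_profit = pso_maximize(profit, lo=2.0, hi=10.0)
```

For this toy curve the profit is 160 - 10 (p - 6)^2, so the swarm should settle near a price of 6. A realistic version would search jointly over prices and replenishment quantities and penalize violations of the inventory constraints.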
[352] FEDONet : Fourier-Embedded DeepONet for Spectrally Accurate Operator Learning
Arth Sojitra, Mrigank Dhingra, Omer San
Main category: cs.LG
TL;DR: FEDONet enhances DeepONets with Fourier-embedded trunk networks using random Fourier features to better capture complex spatial structures in PDEs, achieving 2-3x accuracy improvements across multiple PDE benchmarks.
Details
Motivation: Standard DeepONets with fully connected linear layers in trunk networks have limitations in capturing complex spatial structures inherent in various partial differential equations.
Method: Introduces Fourier-embedded trunk networks within DeepONet architecture using random Fourier feature mappings to enrich spatial representation capabilities.
Result: FEDONet demonstrates superior performance with average relative L2 performance gains of 2-3x compared to traditional DeepONet across multiple PDE datasets including Poisson, Burgers’, Lorenz systems, and others.
Conclusion: Fourier embeddings significantly enhance neural operator learning for PDE surrogate modeling, providing a robust and broadly applicable methodology.
Abstract: Deep Operator Networks (DeepONets) have recently emerged as powerful data-driven frameworks for learning nonlinear operators, particularly suited for approximating solutions to partial differential equations (PDEs). Despite their promising capabilities, the standard implementation of DeepONets, which typically employs fully connected linear layers in the trunk network, can encounter limitations in capturing complex spatial structures inherent to various PDEs. To address this, we introduce Fourier-embedded trunk networks within the DeepONet architecture, leveraging random Fourier feature mappings to enrich spatial representation capabilities. Our proposed Fourier-embedded DeepONet, FEDONet, demonstrates superior performance compared to the traditional DeepONet across a comprehensive suite of PDE-driven datasets, including the two-dimensional Poisson equation, Burgers’ equation, the Lorenz-63 chaotic system, Eikonal equation, Allen-Cahn equation, Kuramoto-Sivashinsky equation, and the Lorenz-96 system. Empirical evaluations of FEDONet consistently show significant improvements in solution reconstruction accuracy, with average relative L2 performance gains ranging between 2-3x compared to the DeepONet baseline. This study highlights the effectiveness of Fourier embeddings in enhancing neural operator learning, offering a robust and broadly applicable methodology for PDE surrogate modeling.
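A random Fourier feature mapping of the kind referred to above is commonly written as γ(x) = [cos(2πBx), sin(2πBx)] with the entries of B drawn from a zero-mean Gaussian. The sketch below implements that common form; whether FEDONet uses exactly this scaling, and the names `make_fourier_matrix`, `fourier_embed`, and `sigma`, are assumptions rather than details taken from the paper.

```python
import math
import random


def make_fourier_matrix(in_dim, n_features, sigma, seed=0):
    """Sample a fixed (non-trainable) projection matrix B with N(0, sigma^2) entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(in_dim)]
            for _ in range(n_features)]


def fourier_embed(x, B):
    """Map coordinates x to [cos(2*pi*B x), sin(2*pi*B x)]."""
    projections = [2.0 * math.pi * sum(b * xi for b, xi in zip(row, x))
                   for row in B]
    return ([math.cos(p) for p in projections]
            + [math.sin(p) for p in projections])


# Embed a 2-D query location into 8 features before it enters the trunk network.
B = make_fourier_matrix(in_dim=2, n_features=4, sigma=1.0)
z = fourier_embed([0.3, -0.7], B)
```

The embedded vector would replace the raw coordinates as the trunk input; larger `sigma` values emphasize higher spatial frequencies.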
[353] Linear Dimensionality Reduction for Word Embeddings in Tabular Data Classification
Liam Ressel, Hamza A. A. Gardi
Main category: cs.LG
TL;DR: PCA with optimal subspace dimension outperforms raw word embeddings, while regularized LDA and novel Partitioned-LDA method significantly improve classification performance for salary prediction using high-dimensional word embeddings in tabular data.
Details
Motivation: The challenge involves classifying salary categories using tabular data with 300-dimensional word embeddings, creating high dimensionality issues with limited training samples. Linear dimensionality reduction methods for this specific problem remain underexplored.
Method: Studied PCA and LDA for dimensionality reduction. Proposed Partitioned-LDA which splits embeddings into equal blocks and performs LDA separately on each to reduce covariance matrix size. Applied shrinkage regularization to address covariance estimation errors.
Result: PCA with appropriate subspace dimension outperformed raw embeddings. Regularized LDA performed well even with only 2 dimensions. Partitioned-LDA outperformed regular LDA and achieved top-10 accuracy on competition leaderboard when combined with shrinkage.
Conclusion: The proposed Partitioned-LDA method effectively enhances word embedding performance in tabular data classification with limited training samples, providing an efficient solution for high-dimensional feature reduction.
Abstract: The Engineers’ Salary Prediction Challenge requires classifying salary categories into three classes based on tabular data. The job description is represented as a 300-dimensional word embedding incorporated into the tabular features, drastically increasing dimensionality. Additionally, the limited number of training samples makes classification challenging. Linear dimensionality reduction of word embeddings for tabular data classification remains underexplored. This paper studies Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). We show that PCA, with an appropriate subspace dimension, can outperform raw embeddings. LDA without regularization performs poorly due to covariance estimation errors, but applying shrinkage improves performance significantly, even with only two dimensions. We propose Partitioned-LDA, which splits embeddings into equal-sized blocks and performs LDA separately on each, thereby reducing the size of the covariance matrices. Partitioned-LDA outperforms regular LDA and, combined with shrinkage, achieves top-10 accuracy on the competition public leaderboard. This method effectively enhances word embedding performance in tabular data classification with limited training samples.
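The block-wise idea can be sketched with NumPy: split the embedding columns into equal-sized blocks, fit a shrinkage-regularized LDA on each block, and concatenate the per-block projections. This is an illustrative reading of the abstract, not the authors' code. The shrinkage target (a scaled identity), the fixed shrinkage value, and every function name are assumptions.

```python
import numpy as np


def fit_shrinkage_lda(X, y, n_components, shrinkage):
    """Return a (d, n_components) LDA projection using a shrunk within-class covariance."""
    n, d = X.shape
    overall_mean = X.mean(axis=0)
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        Sw += (Xc - mu).T @ (Xc - mu)
        diff = (mu - overall_mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    Sw /= n
    Sb /= n
    # Shrink the covariance estimate toward a scaled identity (assumed target).
    Sw = (1.0 - shrinkage) * Sw + shrinkage * (np.trace(Sw) / d) * np.eye(d)
    # Whiten with Sw^{-1/2}, then take the top eigenvectors of the whitened Sb.
    w, V = np.linalg.eigh(Sw)
    Sw_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    evals, U = np.linalg.eigh(Sw_inv_sqrt @ Sb @ Sw_inv_sqrt)
    top = U[:, np.argsort(evals)[::-1][:n_components]]
    return Sw_inv_sqrt @ top


def fit_partitioned_lda(X, y, n_blocks, n_components, shrinkage):
    """Fit one small LDA per column block; returns a list of (column_indices, projection)."""
    blocks = np.array_split(np.arange(X.shape[1]), n_blocks)
    return [(idx, fit_shrinkage_lda(X[:, idx], y, n_components, shrinkage))
            for idx in blocks]


def transform_partitioned_lda(X, projections):
    """Project each block and concatenate the results column-wise."""
    return np.hstack([X[:, idx] @ W for idx, W in projections])
```

With three classes each block contributes at most two discriminant directions, so a 300-dimensional embedding split into, say, 10 blocks would shrink to 20 features while each covariance matrix stays 30 x 30 instead of 300 x 300.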
[354] Unsupervised Atomic Data Mining via Multi-Kernel Graph Autoencoders for Machine Learning Force Fields
Hong Sun, Joshua A. Vita, Amit Samanta, Vincenzo Lordi
Main category: cs.LG
TL;DR: MEAGraph is an unsupervised graph autoencoder model that uses multi-kernel edge attention to identify and group similar atomic environments in chemical datasets, enabling effective pruning to remove sampling bias without information loss.
Details
Motivation: Traditional dataset generation techniques in computational chemistry often oversample regions of potential energy surfaces, creating sampling bias that is difficult to identify and remove due to high-dimensional atomic descriptors and misalignment with human intuition.
Method: The Multi-kernel Edge Attention-based Graph Autoencoder (MEAGraph) combines multiple linear kernel transformations with attention-based message passing to capture geometric sensitivity and group similar atomic environments without requiring labels or extensive training.
Result: MEAGraph successfully groups similar atomic environments in niobium, tantalum, and iron datasets, allowing basic pruning techniques to effectively remove sampling bias while preserving chemical diversity.
Conclusion: MEAGraph provides an effective unsupervised approach for representation learning, clustering, and dataset optimization in computational chemistry, enabling better force field training through bias-free chemically diverse datasets.
Abstract: Constructing a chemically diverse dataset while avoiding sampling bias is critical to training efficient and generalizable force fields. However, in computational chemistry and materials science, many common dataset generation techniques are prone to oversampling regions of the potential energy surface. Furthermore, these regions can be difficult to identify and isolate from each other or may not align well with human intuition, making it challenging to systematically remove bias in the dataset. While traditional clustering and pruning (down-sampling) approaches can be useful for this, they can often lead to information loss or a failure to properly identify distinct regions of the potential energy surface due to difficulties associated with the high dimensionality of atomic descriptors. In this work, we introduce the Multi-kernel Edge Attention-based Graph Autoencoder (MEAGraph) model, an unsupervised approach for analyzing atomic datasets. MEAGraph combines multiple linear kernel transformations with attention-based message passing to capture geometric sensitivity and enable effective dataset pruning without relying on labels or extensive training. Demonstrated applications on niobium, tantalum, and iron datasets show that MEAGraph efficiently groups similar atomic environments, allowing for the use of basic pruning techniques for removing sampling bias. This approach provides an effective method for representation learning and clustering that can be used for data analysis, outlier detection, and dataset optimization.
[355] Enhancing Smart Farming Through Federated Learning: A Secure, Scalable, and Efficient Approach for AI-Driven Agriculture
Ritesh Janga, Rushit Dave
Main category: cs.LG
TL;DR: Federated learning framework for crop disease detection in Minnesota farms that maintains data privacy while achieving high accuracy through collaborative model updates.
Details
Motivation: Address the tension between growing demand for data-driven agriculture and farmers' privacy concerns about sharing sensitive operational data, providing a secure solution for smart farming.
Method: Data collection from Minnesota farms, local deep learning algorithms with transfer learning, and central aggregation server for model refinement using federated learning approach.
Result: Anticipates improved disease detection accuracy, better generalization across agricultural scenarios, reduced communication/training costs, and earlier disease identification (requires empirical validation).
Conclusion: Bridges advanced ML techniques with practical privacy needs of farmers, offering a secure federated learning solution that could revolutionize smart farming while maintaining data confidentiality.
Abstract: The agricultural sector is undergoing a transformation with the integration of advanced technologies, particularly in data-driven decision-making. This work proposes a federated learning framework for smart farming, aiming to develop a scalable, efficient, and secure solution for crop disease detection tailored to the environmental and operational conditions of Minnesota farms. By maintaining sensitive farm data locally and enabling collaborative model updates, our proposed framework seeks to achieve high accuracy in crop disease classification without compromising data privacy. We outline a methodology involving data collection from Minnesota farms, application of local deep learning algorithms, transfer learning, and a central aggregation server for model refinement, aiming to achieve improved accuracy in disease detection, good generalization across agricultural scenarios, lower costs in communication and training time, and earlier identification and intervention against diseases in future implementations. The methodology and anticipated outcomes set the stage for empirical validation in subsequent studies. This work comes at a time when the growing demand for data-driven interpretations in agriculture must be weighed against the privacy concerns of farms that are hesitant to share their operational data. A secure and efficient disease detection method of this kind could help revolutionize smart farming systems and solve local agricultural problems while preserving data confidentiality. In doing so, this paper bridges the gap between advanced machine learning techniques and the practical, privacy-sensitive needs of farmers in Minnesota and beyond, leveraging the benefits of federated learning.
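The abstract does not say how the central server combines client updates. A common choice is federated averaging (FedAvg), where each client's parameters are weighted by its local sample count. The sketch below shows that rule under the assumption that it applies here; `fed_avg` and the toy numbers are hypothetical.

```python
def fed_avg(client_weights, client_sizes):
    """Sample-size-weighted average of flat parameter vectors (FedAvg-style).

    client_weights: one list of floats per client, all the same length
    client_sizes:   number of local training samples per client
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(n_params)
    ]


# Two farms: the second holds three times as much data, so it dominates the average.
global_weights = fed_avg([[1.0, 1.0], [3.0, 3.0]], [1, 3])
```

Only the parameter vectors leave each farm in this scheme; the raw field data stays local, which is the privacy property the abstract emphasizes.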
[356] Explainable Unsupervised Multi-Anomaly Detection and Temporal Localization in Nuclear Times Series Data with a Dual Attention-Based Autoencoder
Konstantinos Vasili, Zachery T. Dahm, Stylianos Chatzidakis
Main category: cs.LG
TL;DR: Proposes an unsupervised LSTM autoencoder with dual attention mechanism for anomaly detection and localization in nuclear reactor radiation monitoring systems, evaluated on real-world PUR-1 research reactor data.
Details
Motivation: Next-generation nuclear reactors generate multivariate time-series data that could enable enhanced monitoring, but existing ML/DL approaches lack explainability, real-world data access, and abnormal event scarcity for proper benchmarking.
Method: Uses LSTM autoencoder with dual attention mechanism - feature attention weights abnormal radiation sensors, time attention highlights irregular timesteps, enabling both detection and localization of anomalies in a single unified network.
Result: Framework evaluated on real-world datasets from PUR-1 research reactor of increasing complexity, successfully detecting and localizing abnormal events by identifying affected sensors and anomaly duration.
Conclusion: The proposed unsupervised methodology with dual attention provides explainable anomaly detection and localization for nuclear reactor monitoring systems, addressing key challenges of interpretability in safety-critical domains.
Abstract: The nuclear industry is advancing toward more new reactor designs, with next-generation reactors expected to be smaller in scale and power output. These systems have the potential to produce large volumes of information in the form of multivariate time-series data, which could be used for enhanced real-time monitoring and control. In this context, the development of remote autonomous or semi-autonomous control systems for reactor operation has gained significant interest. A critical first step toward such systems is an accurate diagnostics module capable of detecting and localizing anomalies within the reactor system. Recent studies have proposed various ML and DL approaches for anomaly detection in the nuclear domain. Despite promising results, key challenges remain, including limited to no explainability, lack of access to real-world data, and scarcity of abnormal events, which impedes benchmarking and characterization. Most existing studies treat these methods as black boxes, while recent work highlights the need for greater interpretability of ML/DL outputs in safety-critical domains. Here, we propose an unsupervised methodology based on an LSTM autoencoder with a dual attention mechanism for characterization of abnormal events in a real-world reactor radiation area monitoring system. The framework includes not only detection but also localization of the event and was evaluated using real-world datasets of increasing complexity from the PUR-1 research reactor. The attention mechanisms operate in both the feature and temporal dimensions, where the feature attention assigns weights to radiation sensors exhibiting abnormal patterns, while time attention highlights the specific timesteps where irregularities occur, thus enabling localization. By combining the results, the framework can identify both the affected sensors and the duration of each anomaly within a single unified network.
[357] Diffusion-Based Generation and Imputation of Driving Scenarios from Limited Vehicle CAN Data
Julian Ripper, Ousama Esbel, Rafael Fietzek, Max Mühlhäuser, Thomas Kreutz
Main category: cs.LG
TL;DR: DDPMs effectively generate synthetic automotive time series data and correct corrupted samples, outperforming training data in physical correctness while enabling plausible driving behavior simulation and data quality improvement.
Details
Motivation: Training deep learning on small, corrupted time series datasets is challenging. Diffusion models show promise for generating realistic synthetic data and correcting corrupted samples through imputation in automotive applications.
Method: Proposed hybrid generative approach combining autoregressive and non-autoregressive techniques with improved DDPM architectures for time series generation. Introduced three metrics to evaluate physical correctness and test track adherence.
Result: Best model outperformed training data in physical correctness while maintaining plausible driving behavior. Successfully used for imputing physically implausible regions, improving overall data quality.
Conclusion: DDPMs are effective for generating realistic automotive time series data and correcting corrupted samples, demonstrating superior performance in physical correctness and practical utility for data quality enhancement.
Abstract: Training deep learning methods on small time series datasets that also include corrupted samples is challenging. Diffusion models have been shown to be effective at generating realistic synthetic data and at correcting corrupted samples through imputation. In this context, this paper focuses on generating synthetic yet realistic samples of automotive time series data. We show that denoising diffusion probabilistic models (DDPMs) can effectively solve this task by applying them to a challenging vehicle CAN-dataset with long-term data and a limited number of samples. To this end, we propose a hybrid generative approach that combines autoregressive and non-autoregressive techniques. We evaluate our approach with two recently proposed DDPM architectures for time series generation, for which we propose several improvements. To evaluate the generated samples, we propose three metrics that quantify physical correctness and test track adherence. Our best model is able to outperform even the training data in terms of physical correctness, while showing plausible driving behavior. Finally, we use our best model to successfully impute physically implausible regions in the training data, thereby improving the data quality.
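For readers unfamiliar with DDPMs, the forward (noising) process has the closed form x_t = sqrt(ᾱ_t) x_0 + sqrt(1 - ᾱ_t) ε, where ᾱ_t is the cumulative product of (1 - β_s). The sketch below shows only this standard step with a linear β schedule; the schedule endpoints, step count, and function names are common defaults chosen for illustration, not values from the paper, and the hybrid autoregressive sampling and imputation procedures are not reproduced.

```python
import math
import random


def linear_alpha_bar(n_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bar, running = [], 1.0
    for t in range(n_steps):
        beta = beta_start + (beta_end - beta_start) * t / (n_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def q_sample(x0, t, alpha_bar, noise):
    """Closed-form forward diffusion: sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    signal_scale = math.sqrt(alpha_bar[t])
    noise_scale = math.sqrt(1.0 - alpha_bar[t])
    return [signal_scale * x + noise_scale * e for x, e in zip(x0, noise)]


# Noise a short, made-up speed trace halfway through a 1000-step schedule.
alpha_bar = linear_alpha_bar(1000)
rng = random.Random(0)
x0 = [12.0, 12.5, 13.1, 13.0]
noise = [rng.gauss(0.0, 1.0) for _ in x0]
x_t = q_sample(x0, 500, alpha_bar, noise)
```

A denoising network is trained to predict `noise` from `x_t` and `t`; imputation then amounts to running the reverse process while keeping the trusted parts of a sample fixed.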
[358] Causal-Symbolic Meta-Learning (CSML): Inducing Causal World Models for Few-Shot Generalization
Mohamed Zayaan S
Main category: cs.LG
TL;DR: CSML is a causal-symbolic meta-learning framework that infers latent causal structures from task distributions, enabling robust few-shot learning and causal reasoning.
Details
Motivation: Deep learning models are limited by spurious correlations and poor generalization. Human-like intelligence requires understanding causal mechanisms for robust, sample-efficient learning.
Method: Three-module framework: perception module for disentangled symbolic representations, differentiable causal induction module for discovering causal graphs, and graph-based reasoning module for predictions. Meta-learns shared causal world model across task distributions.
Result: CSML dramatically outperforms state-of-the-art meta-learning and neuro-symbolic baselines, especially on tasks requiring true causal inference. Introduces CausalWorld benchmark for testing causal capabilities.
Conclusion: The framework enables rapid adaptation to novel tasks with few examples, including intervention and counterfactual reasoning, by learning causal structures rather than relying on spurious correlations.
Abstract: Modern deep learning models excel at pattern recognition but remain fundamentally limited by their reliance on spurious correlations, leading to poor generalization and a demand for massive datasets. We argue that a key ingredient for human-like intelligence, namely robust, sample-efficient learning, stems from an understanding of causal mechanisms. In this work, we introduce Causal-Symbolic Meta-Learning (CSML), a novel framework that learns to infer the latent causal structure of a task distribution. CSML comprises three key modules: a perception module that maps raw inputs to disentangled symbolic representations; a differentiable causal induction module that discovers the underlying causal graph governing these symbols; and a graph-based reasoning module that leverages this graph to make predictions. By meta-learning a shared causal world model across a distribution of tasks, CSML can rapidly adapt to novel tasks, including those requiring reasoning about interventions and counterfactuals, from only a handful of examples. We introduce CausalWorld, a new physics-based benchmark designed to test these capabilities. Our experiments show that CSML dramatically outperforms state-of-the-art meta-learning and neuro-symbolic baselines, particularly on tasks demanding true causal inference.
[359] Evaluating the printability of stl files with ML
Janik Henn, Adrian Hauptmannl, Hamza A. A. Gardi
Main category: cs.LG
TL;DR: AI-powered 3D model analysis to detect printability issues before printing begins.
Details
Motivation: Make 3D printing more accessible to non-experts by identifying problematic geometries that cause print failures, addressing the gap between professional tools and consumer-friendly solutions.
Method: Training an AI model to analyze 3D models and detect common issues and difficult-to-print geometries that typically lead to print failures.
Result: A novel layer of support in slicing software that can sanity check 3D models before generating gcode, helping prevent failed prints.
Conclusion: AI-assisted model analysis can significantly improve 3D printing success rates for less experienced users by catching issues early in the workflow.
Abstract: 3D printing has long been a technology for industry professionals and enthusiasts willing to tinker or even build their own machines. This stands in stark contrast to today’s market, where recent developments have prioritized ease of use to attract a broader audience. Slicing software nowadays has a few ways to sanity check the input file as well as the output gcode. Our approach introduces a novel layer of support by training an AI model to detect common issues in 3D models. The goal is to assist less experienced users by identifying features that are likely to cause print failures due to difficult-to-print geometries before printing even begins.
[360] Adaptive Spatial Goodness Encoding: Advancing and Scaling Forward-Forward Learning Without Backpropagation
Qingchun Gong, Robert Bogdan Staszewski, Kai Xu
Main category: cs.LG
TL;DR: ASGE is a new Forward-Forward training framework for CNNs that addresses channel explosion issues and achieves competitive BP-free performance across multiple datasets including ImageNet.
Details
Motivation: Existing Forward-Forward algorithms suffer from limited representational capacity and poor scalability to large datasets due to exploding channel dimensionality, creating a need for better BP-free training methods.
Method: Proposes adaptive spatial goodness encoding (ASGE) that leverages feature maps to compute spatially-aware goodness representations at each layer, enabling layer-wise supervision while decoupling classification complexity from channel dimensionality.
Result: Outperforms all other FF-based approaches with test accuracies of 99.65% on MNIST, 93.41% on FashionMNIST, 90.62% on CIFAR-10, 65.42% on CIFAR-100, and achieves first successful FF-based training on ImageNet with 26.21% Top-1 and 47.49% Top-5 accuracy.
Conclusion: ASGE establishes a viable foundation for scalable BP-free CNN training by entirely eliminating backpropagation and significantly narrowing the performance gap with BP-trained models.
Abstract: The Forward-Forward (FF) algorithm offers a promising alternative to backpropagation (BP). Despite advancements in recent FF-based extensions, which have enhanced the original algorithm and adapted it to convolutional neural networks (CNNs), they often suffer from limited representational capacity and poor scalability to large-scale datasets, primarily due to exploding channel dimensionality. In this work, we propose adaptive spatial goodness encoding (ASGE), a new FF-based training framework tailored for CNNs. ASGE leverages feature maps to compute spatially-aware goodness representations at each layer, enabling layer-wise supervision. Crucially, this approach decouples classification complexity from channel dimensionality, thereby addressing the issue of channel explosion and achieving competitive performance compared to other BP-free methods. ASGE outperforms all other FF-based approaches across multiple benchmarks, delivering test accuracies of 99.65% on MNIST, 93.41% on FashionMNIST, 90.62% on CIFAR-10, and 65.42% on CIFAR-100. Moreover, we present the first successful application of FF-based training to ImageNet, with Top-1 and Top-5 accuracies of 26.21% and 47.49%. By entirely eliminating BP and significantly narrowing the performance gap with BP-trained models, the ASGE framework establishes a viable foundation toward scalable BP-free CNN training.
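As background for the "goodness" terminology, the original Forward-Forward formulation scores a layer by the sum of its squared activations and trains each layer locally so that positive samples score above a threshold and negative samples below it. The sketch below shows that baseline objective only. It is not ASGE's spatially-aware encoding, and the threshold value and function names are illustrative assumptions.

```python
import math


def goodness(activations):
    """Baseline FF goodness: sum of squared activations of one layer."""
    return sum(a * a for a in activations)


def ff_layer_loss(activations, is_positive, theta):
    """Local logistic loss for one layer and one sample.

    Positive samples are pushed toward goodness > theta, negative samples
    toward goodness < theta. Equals -log(sigmoid(margin)).
    """
    margin = goodness(activations) - theta
    if not is_positive:
        margin = -margin
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# A strongly active layer is "good" for positive data and costly for negative data.
loss_pos_high = ff_layer_loss([2.0, 2.0], is_positive=True, theta=2.0)
loss_pos_low = ff_layer_loss([0.1, 0.1], is_positive=True, theta=2.0)
loss_neg_high = ff_layer_loss([2.0, 2.0], is_positive=False, theta=2.0)
```

Because the loss depends only on one layer's own activations, each layer can be updated without propagating gradients through the rest of the network.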
[361] Bayesian Parametric Matrix Models: Principled Uncertainty Quantification for Spectral Learning
Mohammad Nooraiepour
Main category: cs.LG
TL;DR: Bayesian parametric matrix models (B-PMMs) extend deterministic spectral learning methods to provide uncertainty quantification while preserving computational efficiency and spectral structure.
Details
Motivation: Current spectral learning approaches only provide point estimates without uncertainty quantification, limiting their use in safety-critical applications where prediction confidence is essential.
Method: B-PMMs use adaptive spectral decomposition with regularized matrix perturbation bounds, structured variational inference with manifold-aware matrix-variate Gaussian posteriors that respect Hermitian constraints, and provide finite-sample calibration guarantees.
Result: Experimental validation shows B-PMMs achieve exceptional uncertainty calibration (ECE < 0.05) across matrix dimensions from 5x5 to 500x500 with perfect convergence rates, maintaining favorable scaling and graceful degradation under spectral ill-conditioning.
Conclusion: The framework supports robust spectral learning in uncertainty-critical domains and lays the groundwork for broader Bayesian spectral machine learning.
Abstract: Scientific machine learning increasingly uses spectral methods to understand physical systems. Current spectral learning approaches provide only point estimates without uncertainty quantification, limiting their use in safety-critical applications where prediction confidence is essential. Parametric matrix models have emerged as powerful tools for scientific machine learning, achieving exceptional performance by learning governing equations. However, their deterministic nature limits deployment in uncertainty quantification applications. We introduce Bayesian parametric matrix models (B-PMMs), a principled framework that extends PMMs to provide uncertainty estimates while preserving their spectral structure and computational efficiency. B-PMM addresses the fundamental challenge of quantifying uncertainty in matrix eigenvalue problems where standard Bayesian methods fail due to the geometric constraints of spectral decomposition. The theoretical contributions include: (i) adaptive spectral decomposition with regularized matrix perturbation bounds that characterize eigenvalue uncertainty propagation, (ii) structured variational inference algorithms using manifold-aware matrix-variate Gaussian posteriors that respect Hermitian constraints, and (iii) finite-sample calibration guarantees with explicit dependence on spectral gaps and problem conditioning. Experimental validation across matrix dimensions from 5x5 to 500x500 with perfect convergence rates demonstrates that B-PMMs achieve exceptional uncertainty calibration (ECE < 0.05) while maintaining favorable scaling. The framework exhibits graceful degradation under spectral ill-conditioning and provides reliable uncertainty estimates even in near-degenerate regimes. The proposed framework supports robust spectral learning in uncertainty-critical domains and lays the groundwork for broader Bayesian spectral machine learning.
[362] Surrogate Representation Inference for Noisy Text and Image Annotations
Kentaro Nakamura
Main category: cs.LG
TL;DR: SRI is a new method that reduces standard errors by over 50% in ML-based data annotation by learning low-dimensional representations that satisfy surrogate assumptions, even with measurement errors in human annotations.
Details
Motivation: Existing methods for correcting bias in ML/LLM annotations yield large standard errors and require error-free human annotation, which limits their practical applicability.
Method: Proposes Surrogate Representation Inference (SRI) with neural network architecture that learns low-dimensional representations of unstructured data while maintaining surrogate assumptions. Uses semiparametric efficient estimation strategies and can correct non-differential measurement errors when multiple human annotations are available.
Result: Simulation studies and real-world applications show SRI reduces standard errors by over 50% when ML prediction accuracy is moderate, and provides valid inference even with measurement errors in human annotations.
Conclusion: SRI offers a significant improvement over existing methods by enabling more efficient and robust statistical inference in ML-based data annotation tasks, particularly in text-as-outcome settings.
Abstract: As researchers increasingly rely on machine learning models and LLMs to annotate unstructured data, such as texts or images, various approaches have been proposed to correct bias in downstream statistical analysis. However, existing methods tend to yield large standard errors and require some error-free human annotation. In this paper, I introduce Surrogate Representation Inference (SRI), which assumes that unstructured data fully mediate the relationship between human annotations and structured variables. The assumption is guaranteed by design provided that human coders rely only on unstructured data for annotation. Under this setting, I propose a neural network architecture that learns a low-dimensional representation of unstructured data such that the surrogate assumption remains satisfied. When multiple human annotations are available, SRI can further correct non-differential measurement errors that may exist in human annotations. Focusing on text-as-outcome settings, I formally establish the identification conditions and semiparametric efficient estimation strategies that enable learning and leveraging such a low-dimensional representation. Simulation studies and a real-world application demonstrate that SRI reduces standard errors by over 50% when machine learning prediction accuracy is moderate and provides valid inference even when human annotations contain non-differential measurement errors.
[363] On the Regularity and Fairness of Combinatorial Multi-Armed Bandit
Xiaoyi Wu, Bin Li
Main category: cs.LG
TL;DR: A novel combinatorial multi-armed bandit algorithm that simultaneously addresses cumulative reward maximization, fairness guarantees, and reward regularity using virtual queues, TSLR metrics, and UCB estimates.
Details
Motivation: Inspired by wireless network applications requiring not only cumulative reward maximization but also fairness among arms (minimum average rewards) and reward regularity (frequency of rewards).
Method: Proposes a parameterized algorithm combining virtual queue-lengths (tracking fairness violations), Time-Since-Last-Reward (TSLR) metrics (capturing regularity), and Upper Confidence Bound (UCB) estimates for exploration-exploitation tradeoff.
Result: Analytically achieves zero cumulative fairness violation, ensures reward regularity, and provides cumulative regret guarantees. Verified through simulations on two real-world datasets.
Conclusion: The proposed algorithm successfully addresses all three objectives (reward maximization, fairness, regularity) in combinatorial multi-armed bandit problems with theoretical guarantees and empirical validation.
Abstract: The combinatorial multi-armed bandit model is designed to maximize cumulative rewards in the presence of uncertainty by activating a subset of arms in each round. This paper is inspired by two critical applications in wireless networks, where it’s not only essential to maximize cumulative rewards but also to guarantee fairness among arms (i.e., the minimum average reward required by each arm) and ensure reward regularity (i.e., how often each arm receives the reward). In this paper, we propose a parameterized regular and fair learning algorithm to achieve these three objectives. In particular, the proposed algorithm linearly combines virtual queue-lengths (tracking the fairness violations), Time-Since-Last-Reward (TSLR) metrics, and Upper Confidence Bound (UCB) estimates in its weight measure. Here, TSLR is similar to age-of-information and measures the elapsed number of rounds since the last time an arm received a reward, capturing the reward regularity performance, and UCB estimates are utilized to balance the tradeoff between exploration and exploitation in online learning. By exploring a key relationship between virtual queue-lengths and TSLR metrics and utilizing several non-trivial Lyapunov functions, we analytically characterize zero cumulative fairness violation, reward regularity, and cumulative regret performance under our proposed algorithm. These theoretical outcomes are verified by simulations based on two real-world datasets.
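The linear weight measure described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the coefficients `eta_q`, `eta_t`, `eta_u`, the UCB1-style bonus, and the rule of activating the `m` highest-weight arms are all assumptions made for illustration, since the abstract only states that the three terms are combined linearly.

```python
import math

def ucb_estimate(mean_reward, pulls, t):
    """UCB1-style index (assumed form; the paper's exact bonus is not given here)."""
    if pulls == 0:
        return float("inf")  # force exploration of arms never played
    return mean_reward + math.sqrt(2.0 * math.log(t) / pulls)

def arm_weights(queues, tslr, means, pulls, t, eta_q=1.0, eta_t=0.1, eta_u=1.0):
    """Linear combination of virtual queue-length, TSLR, and UCB estimate per arm.

    eta_q, eta_t, eta_u are hypothetical tuning coefficients.
    """
    return [
        eta_q * q + eta_t * s + eta_u * ucb_estimate(mu, n, t)
        for q, s, mu, n in zip(queues, tslr, means, pulls)
    ]

def select_arms(weights, m):
    """Return the (sorted) indices of the m arms with the largest weights."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:m])
```

In this sketch an arm with a long virtual queue (a large accumulated fairness deficit) or a large TSLR can be activated even when its estimated reward is low, which is the intuition the abstract gives for balancing fairness and regularity against reward.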
[364] Nonlocal Neural Tangent Kernels via Parameter-Space Interactions
Sriram Nagaraj, Vishakh Hari
Main category: cs.LG
TL;DR: The paper proposes a Nonlocal Neural Tangent Kernel (NNTK) that extends NTK theory to nonsmooth functions and broader model families by replacing local gradients with nonlocal interaction-based approximations.
Details
Motivation: The standard Neural Tangent Kernel framework assumes differentiability, which breaks down for non-smooth target functions and models with non-differentiable behavior, limiting its applicability.
Method: Develops a Nonlocal Neural Tangent Kernel using nonlocal interaction-based approximations in parameter space instead of local gradients, exploring both fixed-kernel and attention-based formulations.
Result: The NNTK framework enables extension of NTK theory to nonsmooth functions, stochastic estimators, and broader model families, as demonstrated through numerical studies.
Conclusion: The proposed nonlocal approach successfully overcomes the differentiability limitations of traditional NTK, expanding its theoretical reach to previously inaccessible non-smooth and non-differentiable scenarios.
Abstract: The Neural Tangent Kernel (NTK) framework has provided deep insights into the training dynamics of neural networks under gradient flow. However, it relies on the assumption that the network is differentiable with respect to its parameters, an assumption that breaks down when considering non-smooth target functions or parameterized models exhibiting non-differentiable behavior. In this work, we propose a Nonlocal Neural Tangent Kernel (NNTK) that replaces the local gradient with a nonlocal interaction-based approximation in parameter space. Nonlocal gradients are known to exist for a wider class of functions than the standard gradient. This allows NTK theory to be extended to nonsmooth functions, stochastic estimators, and broader families of models. We explore both fixed-kernel and attention-based formulations of this nonlocal operator. We illustrate the new formulation with numerical studies.
[365] Comparative Analysis of Wave Scattering Numerical Modeling Using the Boundary Element Method and Physics-Informed Neural Networks
Oscar Rincón-Cardeno, Gregorio Pérez Bernal, Silvana Montoya Noguera, Nicolás Guarín-Zapata
Main category: cs.LG
TL;DR: Comparison study between Boundary Element Method (BEM) and Physics-Informed Neural Networks (PINNs) for solving 2D Helmholtz equation in wave scattering problems, showing PINNs have much longer training time but faster evaluation time than BEM, with BEM maintaining better generalization outside training domain.
Details
Motivation: To evaluate and compare the performance of traditional BEM and emerging PINNs methods for solving wave scattering problems under identical conditions, providing quantitative data to support method selection in wave propagation research.
Method: Solved Helmholtz equation using both BEM with boundary discretization and PINNs trained by minimizing residual of governing equations and boundary conditions. Conducted hyperparameter optimization for PINNs and varied integration points for BEM. Evaluated accuracy, computation time, and generalization capacity.
Result: At comparable accuracy, PINNs required training times ~42 times longer than BEM but achieved evaluation times up to 204 times faster. PINNs showed poor generalization outside training domain (error increased from 7.46×10⁻² to 8.22), while BEM maintained similar error levels in extended regions.
Conclusion: PINNs offer fast evaluation once trained but require extensive training time and show limited generalization capacity compared to BEM. The study provides quantitative performance data to guide method selection for wave propagation problems and identifies future research directions.
Abstract: Purpose - This study compares the Boundary Element Method (BEM) and Physics-Informed Neural Networks (PINNs) for solving the two-dimensional Helmholtz equation in wave scattering problems. The objective is to evaluate the performance of both methods under the same conditions. Design/methodology/approach - We solve the Helmholtz equation using BEM and PINNs for the same scattering problem. The PINNs are trained by minimizing the residual of the governing equations and boundary conditions, with their configuration determined through hyperparameter optimization, while the BEM is applied using boundary discretization. Both methods are evaluated in terms of solution accuracy, computation time, and generalization capacity. Findings - Numerical experiments were conducted by varying the number of integration points for BEM and the number of layers and neurons per layer for PINNs. Hyperparameter tuning provided further insight into suitable configurations for wave scattering problems. At comparable accuracy, PINNs produced consistent solutions but required training times approximately 42 times longer than BEM. However, once trained, PINNs achieved evaluation times up to 204 times faster. The generalization capacity was also assessed outside the PINN training domain, where the relative error increased from $7.46 \times 10^{-2}$ to 8.22, while BEM maintained a similar error level in the extended region. Originality/value - This work presents a direct comparison between PINNs and BEM for the Helmholtz equation. The analysis provides quantitative data on the performance of both methods, supporting their selection in future research on wave propagation problems and establishing future challenges and directions.
[366] Finite-Agent Stochastic Differential Games on Large Graphs: II. Graph-Based Architectures
Ruimeng Hu, Jihao Long, Haosheng Zhou
Main category: cs.LG
TL;DR: Proposes Non-Trainable Modification (NTM) neural architecture for computing Nash equilibria in stochastic differential games on graphs, using fixed graph-aligned components to reduce parameters while maintaining performance.
Details
Motivation: Address computational challenges in solving Nash equilibria for graph-structured multi-agent systems under uncertainty, which arise in finance, robotics, energy, and social dynamics applications.
Method: Imposes graph-guided sparsification on feedforward networks with fixed non-trainable components aligned with graph topology. Integrates NTM into Direct Parameterization and Deep BSDE solvers to create sparse variants (NTM-DP and NTM-DBSDE).
Result: Theoretical universal approximation property established for static games. Numerical experiments on three SDGs show comparable performance to fully trainable methods with improved computational efficiency.
Conclusion: NTM architecture enhances interpretability, stability, and computational efficiency while maintaining competitive performance in solving stochastic differential games on graphs.
Abstract: We propose a novel neural network architecture, called Non-Trainable Modification (NTM), for computing Nash equilibria in stochastic differential games (SDGs) on graphs. These games model a broad class of graph-structured multi-agent systems arising in finance, robotics, energy, and social dynamics, where agents interact locally under uncertainty. The NTM architecture imposes a graph-guided sparsification on feedforward neural networks, embedding fixed, non-trainable components aligned with the underlying graph topology. This design enhances interpretability and stability, while significantly reducing the number of trainable parameters in large-scale, sparse settings. We theoretically establish a universal approximation property for NTM in static games on graphs and numerically validate its expressivity and robustness through supervised learning tasks. Building on this foundation, we incorporate NTM into two state-of-the-art game solvers, Direct Parameterization and Deep BSDE, yielding their sparse variants (NTM-DP and NTM-DBSDE). Numerical experiments on three SDGs across various graph structures demonstrate that NTM-based methods achieve performance comparable to their fully trainable counterparts, while offering improved computational efficiency.
[367] Prediction and Causality of functional MRI and synthetic signal using a Zero-Shot Time-Series Foundation Model
Alessandro Crimi, Andrea Brovelli
Main category: cs.LG
TL;DR: Foundation models show competitive zero-shot forecasting of fMRI brain signals and more precise causal interaction detection compared to traditional Granger causality methods.
Details
Motivation: To evaluate how foundation models compare to traditional methods for brain signal forecasting and causality analysis, and whether they can be applied in zero-shot settings for neuroscience applications.
Method: Evaluated a foundation model against classical Wiener-Granger causality methods for inferring directional interactions from fMRI data. Tested forecasting ability in zero-shot and fine-tuned settings, and assessed causality by comparing Granger-like estimates with standard Granger causality. Validated using synthetic time series from ground-truth causal models.
Result: Foundation model achieved competitive zero-shot forecasting (MAPE 0.55 in controls, 0.27 in patients). While standard Granger causality showed no clear quantitative differences, the foundation model provided more precise detection of causal interactions.
Conclusion: Foundation models offer versatility, strong zero-shot performance, and potential utility for forecasting and causal discovery in time-series neuroscience data.
Abstract: Time-series forecasting and causal discovery are central in neuroscience, as predicting brain activity and identifying causal relationships between neural populations and circuits can shed light on the mechanisms underlying cognition and disease. With the rise of foundation models, an open question is how they compare to traditional methods for brain signal forecasting and causality analysis, and whether they can be applied in a zero-shot setting. In this work, we evaluate a foundation model against classical methods for inferring directional interactions from spontaneous brain activity measured with functional magnetic resonance imaging (fMRI) in humans. Traditional approaches often rely on Wiener-Granger causality. We tested the forecasting ability of the foundation model in both zero-shot and fine-tuned settings, and assessed causality by comparing Granger-like estimates from the model with standard Granger causality. We validated the approach using synthetic time series generated from ground-truth causal models, including logistic map coupling and Ornstein-Uhlenbeck processes. The foundation model achieved competitive zero-shot forecasting of fMRI time series (mean absolute percentage error of 0.55 in controls and 0.27 in patients). Although standard Granger causality did not show clear quantitative differences between models, the foundation model provided a more precise detection of causal interactions. Overall, these findings suggest that foundation models offer versatility, strong zero-shot performance, and potential utility for forecasting and causal discovery in time-series data.
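For readers unfamiliar with the two quantities mentioned in this abstract, the following is a minimal sketch of how they are conventionally computed. The `granger_index` function uses the classical log ratio of residual variances; how the paper derives its "Granger-like" estimates from the foundation model is not stated in the abstract, so treat that part as an assumption. Both function names are hypothetical.

```python
import math

def mape(actual, forecast):
    """Mean absolute percentage error, returned as a fraction (0.25 means 25%).

    Assumes no entry of `actual` is zero.
    """
    return sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

def granger_index(residuals_restricted, residuals_full):
    """Classical Granger-style index: log ratio of mean squared residuals.

    residuals_restricted: forecast errors for the target using only its own past.
    residuals_full: forecast errors when the candidate source's past is also used.
    A positive value means adding the source improved the forecast.
    """
    mse_r = sum(e * e for e in residuals_restricted) / len(residuals_restricted)
    mse_f = sum(e * e for e in residuals_full) / len(residuals_full)
    return math.log(mse_r / mse_f)
```

A value of `granger_index` near zero indicates that the candidate source adds little predictive information about the target.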
[368] Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time
Yifan Lan, Yuanpu Cao, Weitong Zhang, Lu Lin, Jinghui Chen
Main category: cs.LG
TL;DR: Preference Hijacking (Phi) manipulates MLLM output preferences at inference time using optimized images, without any model modifications, producing subtly biased responses that are hard to detect.
Details
Motivation: Multimodal Large Language Models (MLLMs) have serious safety concerns, particularly that their output preferences can be manipulated through carefully crafted images, creating subtle biases that evade detection.
Method: Preference Hijacking (Phi) uses preference hijacked images at inference time to manipulate MLLM responses. Includes universal transferable perturbations that can be embedded into different images to hijack responses toward attacker-specified preferences.
Result: Experimental results across various tasks demonstrate the effectiveness of the Phi approach in successfully hijacking MLLM response preferences.
Conclusion: The paper reveals a new safety vulnerability in MLLMs where output preferences can be arbitrarily manipulated through optimized images, presenting significant security risks that are difficult to detect due to the subtle nature of the biased responses.
Abstract: Recently, Multimodal Large Language Models (MLLMs) have gained significant attention across various domains. However, their widespread adoption has also raised serious safety concerns. In this paper, we uncover a new safety risk of MLLMs: the output preference of MLLMs can be arbitrarily manipulated by carefully optimized images. Such attacks often generate contextually relevant yet biased responses that are neither overtly harmful nor unethical, making them difficult to detect. Specifically, we introduce a novel method, Preference Hijacking (Phi), for manipulating the MLLM response preferences using a preference hijacked image. Our method works at inference time and requires no model modifications. Additionally, we introduce a universal hijacking perturbation – a transferable component that can be embedded into different images to hijack MLLM responses toward any attacker-specified preferences. Experimental results across various tasks demonstrate the effectiveness of our approach. The code for Phi is accessible at https://github.com/Yifan-Lan/Phi.
[369] Selective Risk Certification for LLM Outputs via Information-Lift Statistics: PAC-Bayes, Robustness, and Skeleton Design
Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma
Main category: cs.LG
TL;DR: First comprehensive theory of information-lift certificates for selective classification with PAC-Bayes analysis, sensitivity guarantees, and practical validation showing reduced abstention with low overhead.
Details
Motivation: Large language models often produce plausible but incorrect outputs, and existing heuristics like HallBayes lack formal guarantees, creating a need for theoretically sound certification methods.
Method: Developed PAC-Bayes sub-gamma analysis extending beyond Bernstein bounds, explicit skeleton sensitivity theorems, failure-mode guarantees, and principled variational method for skeleton construction.
Result: Validated assumptions empirically across six datasets and multiple model families, reduced abstention by 12-15% at the same risk level, and maintained runtime overhead below 20% (further reduced via batching).
Conclusion: The proposed information-lift certificates provide the first comprehensive theoretical framework for selective classification with formal guarantees, demonstrating practical effectiveness in reducing unnecessary abstention while maintaining computational efficiency.
Abstract: Large language models often produce plausible but incorrect outputs. Existing heuristics such as HallBayes lack formal guarantees. We develop the first comprehensive theory of information-lift certificates under selective classification. Our contributions are: (i) a PAC-Bayes sub-gamma analysis extending beyond standard Bernstein bounds; (ii) explicit skeleton sensitivity theorems quantifying robustness to misspecification; (iii) failure-mode guarantees under assumption violations; and (iv) a principled variational method for skeleton construction. Across six datasets and multiple model families, we validate assumptions empirically, reduce abstention by 12–15% at the same risk, and maintain runtime overhead below 20% (further reduced via batching).
[370] Graph Homophily Booster: Rethinking the Role of Discrete Features on Heterophilic Graphs
Ruizhong Qiu, Ting-Wei Li, Gaotang Li, Hanghang Tong
Main category: cs.LG
TL;DR: GRAPHITE is a novel framework that addresses graph heterophily by directly transforming graphs to increase homophily through feature nodes, outperforming 21 state-of-the-art GNNs on heterophilic datasets.
Details
Motivation: Existing GNNs struggle with heterophilic graphs where connected nodes have dissimilar features, often performing worse than simple MLPs. Current methods focus on architectural designs without addressing the root cause of heterophily.
Method: Proposes GRAPHITE framework that creates feature nodes to facilitate homophilic message passing between nodes with similar features, directly transforming the graph structure to increase homophily.
Result: GRAPHITE significantly increases homophily of heterophilic graphs with minimal graph size increase, outperforms 21 state-of-the-art GNNs on challenging heterophilic datasets like Actor, and achieves comparable performance on homophilic graphs.
Conclusion: GRAPHITE represents a new paradigm for addressing graph heterophily by directly transforming graph structure rather than focusing solely on architectural designs, demonstrating superior performance on heterophilic graphs while maintaining effectiveness on homophilic graphs.
Abstract: Graph neural networks (GNNs) have emerged as a powerful tool for modeling graph-structured data. However, existing GNNs often struggle with heterophilic graphs, where connected nodes tend to have dissimilar features or labels. While numerous methods have been proposed to address this challenge, they primarily focus on architectural designs without directly targeting the root cause of the heterophily problem. These approaches still perform even worse than the simplest MLPs on challenging heterophilic datasets. For instance, our experiments show that 21 latest GNNs still fall behind the MLP on the Actor dataset. This critical challenge calls for an innovative approach to addressing graph heterophily beyond architectural designs. To bridge this gap, we propose and study a new and unexplored paradigm: directly increasing the graph homophily via a carefully designed graph transformation. In this work, we present a simple yet effective framework called GRAPHITE to address graph heterophily. To the best of our knowledge, this work is the first method that explicitly transforms the graph to directly improve the graph homophily. Stemming from the exact definition of homophily, our proposed GRAPHITE creates feature nodes to facilitate homophilic message passing between nodes that share similar features. Furthermore, we both theoretically and empirically show that our proposed GRAPHITE significantly increases the homophily of originally heterophilic graphs, with only a slight increase in the graph size. Extensive experiments on challenging datasets demonstrate that our proposed GRAPHITE significantly outperforms state-of-the-art methods on heterophilic graphs while achieving comparable accuracy with state-of-the-art methods on homophilic graphs.
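The feature-node idea in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the GRAPHITE implementation: it assumes each node carries a set of discrete feature values, adds one new node per distinct value, keeps the original edges, and links every node to the feature nodes for the values it holds. The function name `add_feature_nodes` is hypothetical.

```python
def add_feature_nodes(num_nodes, edges, node_features):
    """Augment a graph with one extra node per distinct discrete feature value.

    num_nodes: number of original nodes, indexed 0..num_nodes-1.
    edges: list of (u, v) pairs for the original graph (kept unchanged here).
    node_features: node_features[i] is the set of feature values held by node i
                   (values must be mutually sortable, e.g. all strings).
    Returns (total_nodes, new_edges, feature_index), where feature_index maps
    each feature value to the index of its newly created node.
    """
    feature_index = {}
    new_edges = list(edges)
    for node in range(num_nodes):
        for feat in sorted(node_features[node]):
            if feat not in feature_index:
                # Feature nodes are numbered after the original nodes.
                feature_index[feat] = num_nodes + len(feature_index)
            new_edges.append((node, feature_index[feat]))
    return num_nodes + len(feature_index), new_edges, feature_index
```

After this transformation, two nodes that share a feature value are two hops apart through the shared feature node, even if they were not connected in the original graph, while the graph grows by only one node per distinct feature value.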
[371] Cross-Modal Deep Metric Learning for Time Series Anomaly Detection
Wei Li, Zheze Yang
Main category: cs.LG
TL;DR: A cross-modal deep metric learning method for time series anomaly detection that uses triplet networks, vMF distribution for directional analysis, and achieves high sensitivity and fast detection.
Details
Motivation: To address low sensitivity and high time consumption issues in traditional time series anomaly detection methods.
Method: Constructs a cross-modal deep metric learning model with triplet selection and loss computation layers. Uses squared Euclidean distances between cluster centers, stochastic gradient descent optimization, and vMF distribution for directional analysis of time series data.
Result: The method accurately classifies time series data with different attributes, shows high sensitivity to anomalies, and achieves high detection accuracy with fast speed and strong robustness.
Conclusion: The proposed cross-modal deep metric learning approach effectively solves sensitivity and efficiency problems in time series anomaly detection, demonstrating superior performance compared to traditional methods.
Abstract: To effectively address the issues of low sensitivity and high time consumption in time series anomaly detection, we propose an anomaly detection method based on cross-modal deep metric learning. A cross-modal deep metric learning feature clustering model is constructed, composed of an input layer, a triplet selection layer, and a loss function computation layer. The squared Euclidean distances between cluster centers are calculated, and a stochastic gradient descent strategy is employed to optimize the model and classify different time series features. The inner product of principal component direction vectors is used as a metric for anomaly measurement. The von Mises-Fisher (vMF) distribution is applied to describe the directional characteristics of time series data, and historical data is used to train and obtain evaluation parameters. By comparing the principal component direction vector of actual time series data with the threshold, anomaly detection is performed. Experimental results demonstrate that the proposed method accurately classifies time series data with different attributes, exhibits high sensitivity to anomalies, and achieves high detection accuracy, fast detection speed, and strong robustness.
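The directional anomaly measure mentioned in this abstract (an inner product of principal component direction vectors compared against a threshold) can be sketched as below. This is a simplified illustration only: it uses plain power iteration for the leading principal direction, takes the absolute inner product to ignore sign ambiguity, and uses a fixed hypothetical `threshold` in place of the vMF-based evaluation parameters, which the abstract does not specify.

```python
def principal_direction(window, iters=200):
    """Unit leading eigenvector of the covariance of `window` (list of d-dim points)."""
    n, d = len(window), len(window[0])
    mean = [sum(x[j] for x in window) / n for j in range(d)]
    centred = [[x[j] - mean[j] for j in range(d)] for x in window]
    cov = [[sum(r[i] * r[j] for r in centred) / n for j in range(d)] for i in range(d)]
    # Asymmetric start vector to reduce the chance of starting orthogonal
    # to the leading eigenvector.
    v = [1.0 / (j + 1) for j in range(d)]
    norm = sum(c * c for c in v) ** 0.5
    v = [c / norm for c in v]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(c * c for c in w) ** 0.5
        if norm == 0.0:
            break  # degenerate window (no variance along v)
        v = [c / norm for c in w]
    return v

def direction_similarity(u, v):
    """Absolute inner product of two unit vectors (1 = same axis, 0 = orthogonal)."""
    return abs(sum(a * b for a, b in zip(u, v)))

def is_anomalous(reference_window, current_window, threshold=0.9):
    """Flag the current window if its principal direction drifts from the reference."""
    ref = principal_direction(reference_window)
    cur = principal_direction(current_window)
    return direction_similarity(ref, cur) < threshold
```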
[372] iCD: A Implicit Clustering Distillation Mathod for Structural Information Mining
Xiang Xue, Yatu Ji, Qing-dao-er-ji Ren, Bao Shi, Min Lu, Nier Wu, Xufei Zhuang, Haiteng Xu, Gan-qi-qi-ge Cha
Main category: cs.LG
TL;DR: iCD is a novel knowledge distillation method that transfers interpretable structural knowledge from logits using Gram matrices, improving performance without requiring labels or feature alignment.
Details
Motivation: Logit Knowledge Distillation lacks interpretability in decision-making despite its simplicity and no need for intermediate feature alignment.
Method: Uses implicit Clustering Distillation (iCD) with Gram matrices over decoupled local logit representations to transfer latent semantic structural patterns.
Result: Achieves significant improvements across diverse teacher-student architectures, with +5.08% peak improvement in fine-grained classification tasks.
Conclusion: iCD provides an effective and interpretable approach to knowledge distillation that works well without ground-truth labels or feature-space alignment.
Abstract: Logit Knowledge Distillation has gained substantial research interest in recent years due to its simplicity and lack of requirement for intermediate feature alignment; however, it suffers from limited interpretability in its decision-making process. To address this, we propose implicit Clustering Distillation (iCD): a simple and effective method that mines and transfers interpretable structural knowledge from logits, without requiring ground-truth labels or feature-space alignment. iCD leverages Gram matrices over decoupled local logit representations to enable student models to learn latent semantic structural patterns. Extensive experiments on benchmark datasets demonstrate the effectiveness of iCD across diverse teacher-student architectures, with particularly strong performance in fine-grained classification tasks – achieving a peak improvement of +5.08% over the baseline. The code is available at: https://github.com/maomaochongaa/iCD.
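A Gram-matrix comparison between teacher and student logits can be sketched as follows. This is a deliberately simplified illustration, not the iCD method itself: it builds one Gram matrix over whole L2-normalised logit vectors in a batch and omits the "decoupled local" logit representations, whose construction the abstract does not describe. The mean-squared-error form of the loss is likewise an assumption.

```python
def gram_matrix(logits):
    """B x B matrix of inner products between L2-normalised logit rows."""
    normed = []
    for row in logits:
        norm = sum(v * v for v in row) ** 0.5 or 1.0  # guard against all-zero rows
        normed.append([v / norm for v in row])
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in normed] for r1 in normed]

def gram_distillation_loss(student_logits, teacher_logits):
    """Mean squared difference between student and teacher Gram matrices.

    Both inputs are lists of B logit vectors for the same batch of samples.
    No ground-truth labels are used.
    """
    gs = gram_matrix(student_logits)
    gt = gram_matrix(teacher_logits)
    n = len(gs)
    return sum((gs[i][j] - gt[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)
```

Because each Gram matrix has one row and column per sample in the batch, the loss compares how samples relate to one another under each model rather than matching individual logit values.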
[373] No Need for “Learning” to Defer? A Training Free Deferral Framework to Multiple Experts through Conformal Prediction
Tim Bary, Benoît Macq, Louis Petit
Main category: cs.LG
TL;DR: A training-free framework for expert deferral using conformal prediction that outperforms standalone models and strongest experts while reducing expert workload by up to 11x, without requiring retraining when experts change.
Details
Motivation: AI systems often fail to deliver reliable predictions across all inputs, and existing Learning to Defer approaches require retraining when expert composition changes, creating scalability issues.
Method: Uses conformal prediction to generate prediction sets, identifies label-specific uncertainty, and selects the most discriminative expert using a segregativity criterion that measures how well experts distinguish between plausible labels.
Result: Achieves 99.57% accuracy on CIFAR10-H and 99.40% on ImageNet16-H, reduces expert workload by up to 11x, remains robust under degraded expert performance, and shows gradual performance degradation in low-information settings.
Conclusion: Provides a scalable, retraining-free alternative to Learning to Defer for real-world human-AI collaboration that is model- and expert-agnostic.
Abstract: AI systems often fail to deliver reliable predictions across all inputs, prompting the need for hybrid human-AI decision-making. Existing Learning to Defer (L2D) approaches address this by training deferral models, but these are sensitive to changes in expert composition and require significant retraining if experts change. We propose a training-free, model- and expert-agnostic framework for expert deferral based on conformal prediction. Our method uses the prediction set generated by a conformal predictor to identify label-specific uncertainty and selects the most discriminative expert using a segregativity criterion, measuring how well an expert distinguishes between the remaining plausible labels. Experiments on CIFAR10-H and ImageNet16-H show that our method consistently outperforms both the standalone model and the strongest expert, with accuracies attaining $99.57 \pm 0.10\%$ and $99.40 \pm 0.52\%$, while reducing expert workload by up to a factor of $11$. The method remains robust under degraded expert performance and shows a gradual performance drop in low-information settings. These results suggest a scalable, retraining-free alternative to L2D for real-world human-AI collaboration.
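The routing idea in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the nonconformity score (one minus the model's probability for a label) is the standard split-conformal choice, and `pick_expert` uses mean per-label accuracy on the prediction set as a hypothetical stand-in for the paper's segregativity criterion, whose exact definition is not given here. All function names are illustrative.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest
    calibration score, capped at the largest score."""
    n = len(cal_scores)
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, qhat):
    """Labels whose nonconformity score 1 - p(y | x) does not exceed qhat."""
    return [y for y, p in enumerate(probs) if 1.0 - p <= qhat]


def pick_expert(label_set, expert_confusions):
    """Hypothetical stand-in for segregativity: choose the expert with the
    highest mean accuracy on the labels that remain plausible.
    expert_confusions[e][true][pred] is an estimated confusion matrix."""
    def score(conf):
        return sum(conf[y][y] for y in label_set) / len(label_set)

    return max(range(len(expert_confusions)),
               key=lambda e: score(expert_confusions[e]))


def route(probs, qhat, expert_confusions):
    """Answer with the model when the set is a singleton, otherwise defer."""
    labels = prediction_set(probs, qhat)
    if len(labels) == 1:
        return ("model", labels[0])
    if not labels:  # empty set: treat every label as plausible
        labels = list(range(len(probs)))
    return ("expert", pick_expert(labels, expert_confusions))
```

Because nothing here is trained, swapping an expert in or out only means changing the list of confusion matrices, which is the property the abstract emphasizes.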
[374] Exploring Training Data Attribution under Limited Access Constraints
Shiyuan Zhang, Junwei Deng, Juhan Bae, Jiaqi Ma
Main category: cs.LG
TL;DR: This paper systematically studies training data attribution (TDA) methods under various access and resource constraints, proposing solutions like proxy models to enable TDA when full model access is unavailable or computational resources are limited.
Details
Motivation: Existing gradient-based TDA methods require full model access and high computational costs, which limits their practical adoption in real-world scenarios where commercial models are not publicly accessible and resources are constrained.
Method: The authors investigate the feasibility of performing TDA under varying access constraints using appropriately designed solutions such as proxy models, and examine whether attribution scores from models not trained on target datasets remain informative.
Result: The study demonstrates that attribution scores obtained from models without prior training on the target dataset remain informative across various tasks, making TDA feasible in resource-constrained scenarios.
Conclusion: The findings provide practical guidance for deploying TDA in real-world environments, improving feasibility and efficiency under limited access conditions through proxy models and alternative approaches.
Abstract: Training data attribution (TDA) plays a critical role in understanding the influence of individual training data points on model predictions. Gradient-based TDA methods, popularized by influence functions for their superior performance, have been widely applied in data selection, data cleaning, data economics, and fact tracing. However, in real-world scenarios where commercial models are not publicly accessible and computational resources are limited, existing TDA methods are often constrained by their reliance on full model access and high computational costs. This poses significant challenges to the broader adoption of TDA in practical applications. In this work, we present a systematic study of TDA methods under various access and resource constraints. We investigate the feasibility of performing TDA under varying levels of access constraints by leveraging appropriately designed solutions such as proxy models. In addition, we demonstrate that attribution scores obtained from models without prior training on the target dataset remain informative across a range of tasks, which is useful for scenarios where computational resources are limited. Our findings provide practical guidance for deploying TDA in real-world environments, aiming to improve feasibility and efficiency under limited access.
[375] A Multimodal Foundation Model to Enhance Generalizability and Data Efficiency for Pan-cancer Prognosis Prediction
Huajun Zhou, Fengtao Zhou, Jiabo Ma, Yingxue Xu, Xi Wang, Xiuming Zhang, Li Liang, Zhenhui Li, Hao Chen
Main category: cs.LG
TL;DR: MICE is a multimodal foundation model that integrates pathology images, clinical reports, and genomics data for pan-cancer prognosis prediction, outperforming existing models with improved generalizability and data efficiency.
Details
Motivation: Existing AI models struggle to effectively harness the rich information in multimodal cancer data and extract generalizable representations for tumor microenvironment understanding.
Method: MICE employs multiple functionally diverse experts (instead of conventional multi-expert modules) to capture both cross-cancer and cancer-specific insights, using contrastive and supervised learning on data from 11,799 patients across 30 cancer types.
Result: MICE outperformed unimodal and state-of-the-art multimodal models with C-index improvements of 3.8-11.2% on internal cohorts and 5.8-8.8% on independent cohorts, while demonstrating remarkable data efficiency.
Conclusion: MICE establishes an effective and scalable foundation for pan-cancer prognosis prediction with strong potential for personalized therapy and improved treatment outcomes.
Abstract: Multimodal data provides heterogeneous information for a holistic understanding of the tumor microenvironment. However, existing AI models often struggle to harness the rich information within multimodal data and extract poorly generalizable representations. Here we present MICE (Multimodal data Integration via Collaborative Experts), a multimodal foundation model that effectively integrates pathology images, clinical reports, and genomics data for precise pan-cancer prognosis prediction. Instead of conventional multi-expert modules, MICE employs multiple functionally diverse experts to comprehensively capture both cross-cancer and cancer-specific insights. Leveraging data from 11,799 patients across 30 cancer types, we enhanced MICE’s generalizability by coupling contrastive and supervised learning. MICE outperformed both unimodal and state-of-the-art multi-expert-based multimodal models, demonstrating substantial improvements in C-index ranging from 3.8% to 11.2% on internal cohorts and 5.8% to 8.8% on independent cohorts, respectively. Moreover, it exhibited remarkable data efficiency across diverse clinical scenarios. With its enhanced generalizability and data efficiency, MICE establishes an effective and scalable foundation for pan-cancer prognosis prediction, holding strong potential to personalize tailored therapies and improve treatment outcomes.
[376] A Novel Recurrent Neural Network Framework for Prediction and Treatment of Oncogenic Mutation Progression
Rishab Parthasarathy, Achintya Bhowmik
Main category: cs.LG
TL;DR: AI-based pathway analysis framework using time-series ML models to predict cancer severity and mutation progression, achieving >60% accuracy without wet lab data.
Details
Motivation: Cancer remains a major cause of death, and current pathway analysis relies on time-consuming wet lab data. Efficient, cost-effective methods are needed to predict cancer progression and recommend treatments.
Method: Combines time-series ML models with pathway analysis. Uses TCGA mutation data, a novel preprocessing step that filters key mutations by frequency, and an RNN for cancer severity prediction, then integrates drug-target databases to predict future mutations and recommend treatments.
Result: Achieved robust results, with ROC curves showing >60% accuracy (similar to existing diagnostics). Identified on the order of a few hundred key driver mutations per cancer stage. Generated heatmaps highlighting key mutations.
Conclusion: First efficient end-to-end framework for projecting cancer progression and treatment recommendations without wet lab work, demonstrating cost-effective AI approach for cancer analysis.
Abstract: Despite significant medical advancements, cancer remains the second leading cause of death, with over 600,000 deaths per year in the US. One emerging field, pathway analysis, is promising but still relies on manually derived wet lab data, which is time-consuming to acquire. This work proposes an efficient, effective end-to-end framework for Artificial Intelligence (AI) based pathway analysis that predicts both cancer severity and mutation progression, thus recommending possible treatments. The proposed technique involves a novel combination of time-series machine learning models and pathway analysis. First, mutation sequences were isolated from The Cancer Genome Atlas (TCGA) Database. Then, a novel preprocessing algorithm was used to filter key mutations by mutation frequency. This data was fed into a Recurrent Neural Network (RNN) that predicted cancer severity. Then, the model probabilistically used the RNN predictions, information from the preprocessing algorithm, and multiple drug-target databases to predict future mutations and recommend possible treatments. This framework achieved robust results and Receiver Operating Characteristic (ROC) curves (a key statistical metric) with accuracies greater than 60%, similar to existing cancer diagnostics. In addition, preprocessing played an instrumental role in isolating important mutations, demonstrating that each cancer stage studied may contain on the order of a few-hundred key driver mutations, consistent with current research. Heatmaps based on predicted gene frequency were also generated, highlighting key mutations in each cancer. Overall, this work is the first to propose an efficient, cost-effective end-to-end framework for projecting cancer progression and providing possible treatments without relying on expensive, time-consuming wet lab work.
[377] High-Energy Concentration for Federated Learning in Frequency Domain
Haozhi Shi, Weiying Xie, Hangyu Ye, Daixun Li, Jitao Ma, Leyuan Fang
Main category: cs.LG
TL;DR: FedFD is a frequency-domain federated learning method that reduces communication costs by filtering high-frequency noise and preserving low-frequency components using binary masks and frequency-domain alignment.
Details
Motivation: Traditional FL methods using dataset distillation suffer from redundant information and noise in spatial-domain designs, increasing communication burden. The discovery that the discrete cosine transform concentrates energy in specific regions motivates filtering low-energy components to reduce costs.
Method: Proposes FedFD, which uses binary masks to preserve low-frequency components, implements frequency-domain distribution alignment, and incorporates a real data-driven synthetic classification loss to enhance low-frequency quality.
Result: Outperforms state-of-the-art methods on five image and speech datasets while reducing communication cost. On CIFAR-10 with Dirichlet coefficient α=0.01, communication cost falls by at least 37.78% alongside a 10.88% performance gain.
Conclusion: FedFD effectively addresses communication burden in FL by leveraging frequency-domain energy concentration properties, demonstrating both performance improvements and cost reductions across multiple datasets.
Abstract: Federated Learning (FL) presents significant potential for collaborative optimization without data sharing. Since synthetic data is sent to the server, leveraging the popular concept of dataset distillation, this FL framework protects real data privacy while alleviating data heterogeneity. However, such methods are still challenged by the redundant information and noise in entire spatial-domain designs, which inevitably increases the communication burden. In this paper, we propose a novel Frequency-Domain aware FL method with high-energy concentration (FedFD) to address this problem. Our FedFD is inspired by the discovery that the discrete cosine transform predominantly distributes energy to specific regions, referred to as high-energy concentration. The principle behind FedFD is that low-energy like high-frequency components usually contain redundant information and noise, thus filtering them helps reduce communication costs and optimize performance. Our FedFD is mathematically formulated to preserve the low-frequency components using a binary mask, facilitating an optimal solution through frequency-domain distribution alignment. In particular, real data-driven synthetic classification is imposed into the loss to enhance the quality of the low-frequency components. On five image and speech datasets, FedFD achieves superior performance than state-of-the-art methods while reducing communication costs. For example, on the CIFAR-10 dataset with Dirichlet coefficient $\alpha = 0.01$, FedFD achieves a minimum reduction of 37.78% in the communication cost, while attaining a 10.88% performance gain.
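The binary low-frequency mask described above can be illustrated with a small pure-Python sketch. It is not the FedFD implementation: it only shows an orthonormal 2D DCT, a square top-left mask, and how much energy the kept coefficients retain. The square mask shape, the `keep` parameter, and all function names are assumptions made for illustration; distribution alignment and the classification loss are omitted.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix D: D @ x is the DCT of x, D^T inverts it."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n)
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def dct2(img):
    """2D DCT of a square image: D @ img @ D^T."""
    d = dct_matrix(len(img))
    return matmul(matmul(d, img), transpose(d))


def idct2(coef):
    """Inverse 2D DCT: D^T @ coef @ D."""
    d = dct_matrix(len(coef))
    return matmul(matmul(transpose(d), coef), d)


def low_frequency_mask(n, keep):
    """Binary mask that keeps the top-left keep x keep block of coefficients."""
    return [[1 if (u < keep and v < keep) else 0 for v in range(n)]
            for u in range(n)]


def compress(img, keep):
    """Return the masked coefficients and the mask that produced them."""
    coef = dct2(img)
    mask = low_frequency_mask(len(img), keep)
    kept = [[c * m for c, m in zip(c_row, m_row)]
            for c_row, m_row in zip(coef, mask)]
    return kept, mask


def energy(m):
    return sum(v * v for row in m for v in row)
```

For a smooth 8x8 image, `compress(img, 4)` transmits a quarter of the coefficients, and comparing `energy(kept)` with `energy(dct2(img))` shows how little energy the discarded high-frequency block carried.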
[378] Similarity-Distance-Magnitude Activations
Allen Schmaltz
Main category: cs.LG
TL;DR: A new activation function called Similarity-Distance-Magnitude (SDM) is introduced that enhances softmax with similarity awareness and distance-to-training-distribution awareness, making it more robust to distribution shifts and providing interpretability via exemplar matching.
Details
Motivation: Standard softmax activation lacks robustness to covariate shifts and out-of-distribution inputs, and provides limited interpretability. The authors aim to create a more robust and interpretable activation function for neural networks.
Method: The SDM activation function extends softmax by incorporating three components: Similarity awareness (correctly predicted depth-matches), Distance-to-training-distribution awareness, and existing output Magnitude awareness. It is designed as a final-layer activation for language models.
Result: SDM activation demonstrates improved robustness to covariate shifts and out-of-distribution inputs in high-probability regions. It provides interpretability-by-exemplar through dense matching and enables partitioning of class-wise empirical CDFs to guard against low class-wise recall.
Conclusion: The SDM activation function is preferable over standard softmax for selective classification tasks, offering better robustness, interpretability, and performance even when compared to post-hoc calibration methods applied to softmax.
Abstract: We introduce a more robust and interpretable formulation of the standard softmax activation function commonly used with neural networks by adding Similarity (i.e., correctly predicted depth-matches into training) awareness and Distance-to-training-distribution awareness to the existing output Magnitude (i.e., decision-boundary) awareness. When used as the final-layer activation with language models, the resulting Similarity-Distance-Magnitude (SDM) activation function is more robust than the softmax function to co-variate shifts and out-of-distribution inputs in high-probability regions, and provides interpretability-by-exemplar via dense matching. Complementing the prediction-conditional estimates, the SDM activation enables a partitioning of the class-wise empirical CDFs to guard against low class-wise recall among selective classifications. These properties make it preferable for selective classification, even when considering post-hoc calibration methods over the softmax.
[379] Leveraging Intermediate Representations of Time Series Foundation Models for Anomaly Detection
Chan Sik Han, Keon Myung Lee
Main category: cs.LG
TL;DR: TimeRep is a novel anomaly detection method that leverages intermediate layer representations from time series foundation models instead of final layer outputs, using distance-based scoring and adaptive core-set strategies to outperform existing approaches.
Details
Motivation: Existing anomaly detection methods using time series foundation models rely only on final layer representations, potentially missing valuable information from intermediate layers that could improve detection accuracy.
Method: TimeRep selects optimal intermediate layer and patch-token positions from pre-trained TSFMs, creates a core-set reference collection from training data, and computes anomaly scores as distances to this collection. It includes adaptation mechanisms for concept drift by adding non-redundant representations during inference.
Result: Extensive experiments on UCR Anomaly Archive (250 univariate time series) show TimeRep consistently outperforms state-of-the-art baselines including non-DL, DL, and foundation model-based methods.
Conclusion: Leveraging intermediate representations from time series foundation models with adaptive core-set strategies provides superior anomaly detection performance compared to existing methods that only use final layer outputs.
Abstract: Detecting anomalies in time series data is essential for the reliable operation of many real-world systems. Recently, time series foundation models (TSFMs) have emerged as a powerful tool for anomaly detection. However, existing methods typically rely on the final layer’s representations of TSFMs, computing the anomaly score as a reconstruction or forecasting error via a task-specific head. Instead, we propose TimeRep, a novel anomaly detection approach that leverages the intermediate layer’s representations of TSFMs, computing the anomaly score as the distance between these representations. Given a pre-trained TSFM, TimeRep selects the intermediate layer and patch-token position that yield the most informative representation. TimeRep forms a reference collection of intermediate representations from the training data and applies a core-set strategy to reduce its size while maintaining distributional coverage. During inference, TimeRep computes the anomaly score for incoming data by measuring the distance between its intermediate representations and those of the collection. To address concept drift, TimeRep integrates an adaptation mechanism that, at inference time, augments the collection exclusively with non-redundant intermediate representations from incoming data. We conducted extensive experiments on the UCR Anomaly Archive, which contains 250 univariate time series. TimeRep consistently outperforms a broad spectrum of state-of-the-art baselines, including non-DL, DL, and foundation model-based methods.
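The reference-collection idea can be sketched as follows. The abstract does not say which core-set algorithm or distance TimeRep uses, so this sketch assumes greedy farthest-point (k-center) selection and Euclidean nearest-neighbour distance. Representations are plain tuples of floats standing in for intermediate-layer features, and the drift adaptation step is left out.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_coreset(points, k):
    """Farthest-point (k-center) selection, starting from points[0].
    Each step adds the point that is farthest from the current selection."""
    chosen = [points[0]]
    min_d = [dist(p, chosen[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        idx = max(range(len(points)), key=lambda i: min_d[i])
        chosen.append(points[idx])
        min_d = [min(d, dist(p, points[idx])) for d, p in zip(min_d, points)]
    return chosen


def anomaly_score(rep, collection):
    """Distance from a representation to its nearest reference entry."""
    return min(dist(rep, c) for c in collection)
```

A representation close to some reference entry gets a small score, while one far from every entry gets a large score. The core-set only bounds how many reference entries have to be stored and compared.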
[380] Instance-level Randomization: Toward More Stable LLM Evaluations
Yiyang Li, Yonghuang Wu, Ying Luo, Liangtai Sun, Zishu Qin, Lin Qiu, Xuezhi Cao, Xunliang Cai
Main category: cs.LG
TL;DR: ILR method randomizes evaluation factors per instance across multiple experiments to reduce variance and improve fairness in LLM comparisons, achieving better robustness with lower computational cost.
Details
Motivation: Current LLM evaluations suffer from instability where small changes in random factors (like few-shot examples) cause significant score fluctuations and unfair model rankings, making fixed evaluation settings problematic.
Method: Proposes instance-level randomization (ILR): randomizes all factors affecting evaluation scores for every single instance, runs multiple experiments, and reports averaged scores instead of using fixed settings across benchmarks.
Result: Theoretical and empirical results show ILR reduces variance and unfair comparisons caused by random factors, achieving a similar level of robustness at less than half the computational cost of previous methods.
Conclusion: ILR effectively mitigates evaluation volatility in LLM assessments by addressing specific variance sources through per-instance randomization and averaging across multiple experiments, providing more fair and stable model comparisons.
Abstract: Evaluations of large language models (LLMs) suffer from instability, where small changes of random factors such as few-shot examples can lead to drastic fluctuations of scores and even model rankings. Moreover, different LLMs can have different preferences for a certain setting of random factors. As a result, using a fixed setting of random factors, which is often adopted as the paradigm of current evaluations, can lead to potential unfair comparisons between LLMs. To mitigate the volatility of evaluations, we first theoretically analyze the sources of variance induced by changes in random factors. Targeting these specific sources, we then propose the instance-level randomization (ILR) method to reduce variance and enhance fairness in model comparisons. Instead of using a fixed setting across the whole benchmark in a single experiment, we randomize all factors that affect evaluation scores for every single instance, run multiple experiments and report the averaged score. Theoretical analyses and empirical results demonstrate that ILR can reduce the variance and unfair comparisons caused by random factors, as well as achieve similar robustness level with less than half computational cost compared with previous methods.
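A toy simulation makes the contrast between a fixed setting and instance-level randomization concrete. This sketch is not the paper's evaluation harness. The factor names, the `toy_score` model (which answers odd-numbered instances correctly only under template "t0"), and all function names are hypothetical, chosen only so that the score depends on a random factor.

```python
import random
import statistics


def sample_setting(rng, factors):
    """Draw one value for every random factor (template, shots, ...)."""
    return {name: rng.choice(options) for name, options in factors.items()}


def run_fixed(instances, score_fn, factors, rng):
    """Conventional run: one setting drawn once and shared by all instances."""
    setting = sample_setting(rng, factors)
    return statistics.mean(score_fn(x, setting) for x in instances)


def run_ilr(instances, score_fn, factors, rng):
    """ILR-style run: a fresh setting is drawn for every single instance."""
    return statistics.mean(
        score_fn(x, sample_setting(rng, factors)) for x in instances)


def evaluate(run, instances, score_fn, factors, n_runs, seed):
    """Repeat a run n_runs times; return the averaged score and its spread."""
    rng = random.Random(seed)
    scores = [run(instances, score_fn, factors, rng) for _ in range(n_runs)]
    return statistics.mean(scores), statistics.pstdev(scores)


def toy_score(x, setting):
    """Hypothetical model: always right on even instances, right on odd
    instances only when the prompt template happens to be 't0'."""
    return 1.0 if setting["template"] == "t0" or x % 2 == 0 else 0.0
```

With `factors = {"template": ["t0", "t1"], "shots": [0, 1, 2]}`, a fixed-setting run scores either 1.0 or 0.5 depending on the single draw, whereas each ILR run averages over many draws and lands near 0.75, so the spread across runs is much smaller.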
[381] Rethinking the Evaluation of Alignment Methods: Insights into Diversity, Generalisation, and Safety
Denis Janiak, Julia Moska, Dawid Motyka, Karolina Seweryn, Paweł Walkowiak, Bartosz Żuk, Arkadiusz Janz
Main category: cs.LG
TL;DR: A unified evaluation framework compares LLM alignment methods (PPO, DPO, ORPO, KTO) across five dimensions: factuality, safety, conciseness, proactivity, and diversity, revealing method-specific strengths and trade-offs.
Details
Motivation: Existing studies focus on individual techniques or specific dimensions of LLM alignment, lacking a holistic assessment of the inherent trade-offs between competing objectives like factuality, safety, conciseness, proactivity, and diversity.
Method: Proposed a unified evaluation framework using both in-distribution and out-of-distribution datasets, leveraging a specialized LLM-as-Judge prompt validated through human studies to compare PPO, DPO, ORPO, and KTO alignment methods.
Result: DPO and KTO excel in factual accuracy, PPO and DPO lead in safety, and PPO best balances conciseness with proactivity. The study reveals method-specific strengths across different alignment dimensions.
Conclusion: The findings provide insights into trade-offs of common alignment methods, guiding the development of more balanced and reliable LLMs by understanding which methods excel in specific alignment objectives.
Abstract: Large language models (LLMs) require careful alignment to balance competing objectives - factuality, safety, conciseness, proactivity, and diversity. Existing studies focus on individual techniques or specific dimensions, lacking a holistic assessment of the inherent trade-offs. We propose a unified evaluation framework that compares LLM alignment methods (PPO, DPO, ORPO, KTO) across these five axes, using both in-distribution and out-of-distribution datasets. Leveraging a specialized LLM-as-Judge prompt, validated through human studies, we reveal that DPO and KTO excel in factual accuracy, PPO and DPO lead in safety, and PPO best balances conciseness with proactivity. Our findings provide insights into trade-offs of common alignment methods, guiding the development of more balanced and reliable LLMs.
[382] Large Language Model Scaling Laws for Neural Quantum States in Quantum Chemistry
Oliver Knitter, Dan Zhao, Stefan Leichenauer, Shravan Veerapaneni
Main category: cs.LG
TL;DR: The paper investigates scaling laws for transformer-based neural quantum states (NQS) in quantum chemistry applications, finding that unlike language models, the relationship between model size and training time is highly dependent on loss metric and ansatz type.
Details
Motivation: As neural quantum states increasingly incorporate LLM-based components, the authors seek to understand NQS scaling laws to reveal scalability and optimal performance-resource trade-offs for quantum chemistry applications.
Method: The study identifies scaling laws that predict performance (measured by absolute error and V-score) for transformer-based NQS as a function of problem size in second-quantized quantum chemistry. They perform compute-constrained optimization of the obtained parametric curves.
Result: The relationship between model size and training time for NQS is highly dependent on loss metric and ansatz type, and does not follow the approximately linear relationship found for language models.
Conclusion: NQS scaling behavior differs significantly from language models, with performance-resource trade-offs being more complex and dependent on specific metrics and ansatz choices in quantum chemistry applications.
Abstract: Scaling laws have been used to describe how large language model (LLM) performance scales with model size, training data size, or amount of computational resources. Motivated by the fact that neural quantum states (NQS) has increasingly adopted LLM-based components, we seek to understand NQS scaling laws, thereby shedding light on the scalability and optimal performance–resource trade-offs of NQS ansatze. In particular, we identify scaling laws that predict the performance, as measured by absolute error and V-score, for transformer-based NQS as a function of problem size in second-quantized quantum chemistry applications. By performing analogous compute-constrained optimization of the obtained parametric curves, we find that the relationship between model size and training time is highly dependent on loss metric and ansatz, and does not follow the approximately linear relationship found for language models.
[383] ZTree: A Subgroup Identification Based Decision Tree Learning Framework
Eric Cheng, Jie Cheng
Main category: cs.LG
TL;DR: ZTree is a novel decision tree framework that replaces CART’s purity-based splitting with statistical hypothesis testing for subgroup identification, using cross-validation for multiple testing correction and making tree complexity controlled by a single z-threshold parameter.
Details
Motivation: Traditional decision trees like CART use purity-based splitting which lacks statistical rigor. The authors aim to create a more statistically principled approach that provides better interpretability and performance, especially in low-data regimes.
Method: ZTree uses hypothesis tests (z-tests, t-tests, Mann-Whitney U, log-rank) at each node to identify meaningful subgroups. It employs cross-validation for multiple testing correction and uses a z-threshold parameter to control tree complexity without post-pruning.
Result: Empirical evaluation on five large-scale UCI datasets shows ZTree delivers strong performance, particularly in low data regimes, and tends to grow simpler trees than CART without sacrificing performance.
Conclusion: ZTree provides a statistically grounded alternative to traditional decision trees that is efficient, flexible, and intuitive to tune, with the z-threshold parameter serving as both a complexity controller and essentially a p-value for easy statistical test integration.
Abstract: Decision trees are a commonly used class of machine learning models valued for their interpretability and versatility, capable of both classification and regression. We propose ZTree, a novel decision tree learning framework that replaces CART’s traditional purity based splitting with statistically principled subgroup identification. At each node, ZTree applies hypothesis testing (e.g., z-tests, t-tests, Mann-Whitney U, log-rank) to assess whether a candidate subgroup differs meaningfully from the complement. To adjust for the complication of multiple testing, we employ a cross-validation-based approach to determine if further node splitting is needed. This robust stopping criterion eliminates the need for post-pruning and makes the test threshold (z-threshold) the only parameter for controlling tree complexity. Because of the simplicity of the tree growing procedure, once a detailed tree is learned using the most lenient z-threshold, all simpler trees can be derived by simply removing nodes that do not meet the larger z-thresholds. This makes parameter tuning intuitive and efficient. Furthermore, this z-threshold is essentially a p-value, allowing users to easily plug in appropriate statistical tests into our framework without adjusting the range of parameter search. Empirical evaluation on five large-scale UCI datasets demonstrates that ZTree consistently delivers strong performance, especially at low data regimes. Compared to CART, ZTree also tends to grow simpler trees without sacrificing performance. ZTree introduces a statistically grounded alternative to traditional decision tree splitting by leveraging hypothesis testing and a cross-validation approach to multiple testing correction, resulting in an efficient and flexible framework.
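The test-based splitting rule can be sketched for a single numeric feature and a regression target. This is a simplified illustration rather than ZTree itself: it uses a two-sample z-statistic with unequal variances, picks the candidate threshold with the largest absolute statistic, and stops when that value falls below `z_threshold`. The cross-validation-based correction for multiple testing is omitted, and the function names and `min_leaf` parameter are assumptions.

```python
import math
import statistics


def z_statistic(a, b):
    """Two-sample z-statistic for a difference in means (unequal variances)."""
    diff = statistics.mean(a) - statistics.mean(b)
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def best_split(xs, ys, min_leaf=2):
    """Return (|z|, threshold) for the subgroup x <= threshold that differs
    most from its complement; threshold is None when no split is allowed."""
    best = (0.0, None)
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        z = abs(z_statistic(left, right))
        if z > best[0]:
            best = (z, t)
    return best


def grow(xs, ys, z_threshold, min_leaf=2):
    """Split recursively while the best test statistic clears z_threshold."""
    z, t = best_split(xs, ys, min_leaf)
    if t is None or z < z_threshold:
        return {"leaf": statistics.mean(ys), "n": len(ys)}
    left = [(x, y) for x, y in zip(xs, ys) if x <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x > t]
    return {
        "split": t,
        "z": z,
        "left": grow([x for x, _ in left], [y for _, y in left],
                     z_threshold, min_leaf),
        "right": grow([x for x, _ in right], [y for _, y in right],
                      z_threshold, min_leaf),
    }
```

Raising `z_threshold` can only remove splits, which mirrors the abstract's remark that a single threshold controls tree complexity.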
[384] When Inverse Data Outperforms: Exploring the Pitfalls of Mixed Data in Multi-Stage Fine-Tuning
Mengyi Deng, Xin Li, Tingyu Zhu, Zhicheng Yang, Zhijiang Guo, Wei Wang
Main category: cs.LG
TL;DR: This paper shows that creating reverse reasoning data (r1k) from forward examples improves model accuracy by 1.6%-6.8% over forward-only training, but mixing both data types causes conflicts that require direction-aware alignment strategies.
Details
Motivation: Existing methods focus on unidirectional supervised fine-tuning and overlook the interplay between diverse reasoning patterns, creating a need to understand how bidirectional reasoning affects model alignment.
Method: Constructed r1k dataset by inverting 1,000 forward examples from s1k, then examined SFT and DPO effects on alignment under bidirectional reasoning objectives through comparative experiments.
Result: SFT on r1k yielded 1.6%-6.8% accuracy improvement over s1k, but mixing forward and reverse data weakened directional distinction. DPO partially recovered distinction but suppressed less preferred reasoning paths by shifting probability mass toward irrelevant outputs.
Conclusion: Mixed reasoning data introduces conflicting supervision signals, highlighting the need for robust and direction-aware alignment strategies rather than naive data mixing.
Abstract: Existing work has shown that o1-level performance can be achieved with limited data distillation, but most existing methods focus on unidirectional supervised fine-tuning (SFT), overlooking the intricate interplay between diverse reasoning patterns. In this paper, we construct r1k, a high-quality reverse reasoning dataset derived by inverting 1,000 forward examples from s1k, and examine how SFT and Direct Preference Optimization (DPO) affect alignment under bidirectional reasoning objectives. SFT on r1k yields a 1.6%–6.8% accuracy improvement over s1k across evaluated benchmarks. However, naively mixing forward and reverse data during SFT weakens the directional distinction. Although DPO can partially recover this distinction, it also suppresses less preferred reasoning paths by shifting the probability mass toward irrelevant outputs. These findings suggest that mixed reasoning data introduce conflicting supervision signals, underscoring the need for robust and direction-aware alignment strategies.
[385] Soft Graph Transformer for MIMO Detection
Jiadong Hong, Lei Liu, Xinyu Bian, Wenjie Wang, Zhaoyang Zhang
Main category: cs.LG
TL;DR: Soft Graph Transformer (SGT) is a neural architecture for MIMO detection that combines message passing with graph-aware attention, achieving near-ML performance with computational efficiency.
Details
Motivation: Existing MIMO detectors have limitations: ML detection is computationally prohibitive, message passing algorithms rely on impractical assumptions, and prior Transformer-based detectors fail to incorporate MIMO factor graph structure or utilize decoder-side soft information.
Method: SGT integrates message passing directly into a graph-aware attention mechanism and supports decoder-informed updates through soft-input embeddings, enabling effective soft-output generation.
Result: SGT closely approaches Maximum Likelihood performance as a standalone detector and surpasses prior Transformer-based approaches.
Conclusion: SGT overcomes limitations of existing MIMO detectors by effectively combining graph structure awareness with attention mechanisms, making it suitable for practical implementations including iterative detection-decoding systems.
Abstract: We propose the Soft Graph Transformer (SGT), a Soft-Input-Soft-Output neural architecture tailored for MIMO detection. While Maximum Likelihood (ML) detection achieves optimal accuracy, its prohibitive exponential complexity renders it impractical for real-world systems. Conventional message passing algorithms offer tractable alternatives but rely on large-system asymptotics and random matrix assumptions, both of which break down under practical implementations. Prior Transformer-based detectors, on the other hand, fail to incorporate the MIMO factor graph structure and cannot utilize decoder-side soft information, limiting their standalone performance and their applicability in iterative detection-decoding (IDD). To overcome these limitations, SGT integrates message passing directly into a graph-aware attention mechanism and supports decoder-informed updates through soft-input embeddings. This design enables effective soft-output generation while preserving computational efficiency. As a standalone detector, SGT closely approaches ML performance and surpasses prior Transformer-based approaches.
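To make the complexity argument concrete, the following is a minimal sketch of brute-force ML detection, assuming a real-valued BPSK system and a known channel matrix. The function name `ml_detect` and the toy channel values are illustrative assumptions, not taken from the paper. The loop visits every one of the `len(constellation) ** n_tx` candidate vectors, which is the exponential cost the abstract refers to.

```python
from itertools import product

def ml_detect(H, y, constellation=(-1.0, 1.0)):
    """Exhaustive maximum-likelihood detection: argmin_x ||y - Hx||^2."""
    n_tx = len(H[0])
    best_x, best_cost = None, float("inf")
    # The candidate set has len(constellation) ** n_tx entries.
    for x in product(constellation, repeat=n_tx):
        cost = 0.0
        for row, y_i in zip(H, y):
            residual = y_i - sum(h * s for h, s in zip(row, x))
            cost += residual * residual
        if cost < best_cost:
            best_x, best_cost = x, cost
    return best_x, best_cost

# Toy 3x2 channel (hypothetical values) carrying the symbols (1, -1).
H = [[1.0, 0.5], [0.2, 1.0], [0.7, -0.3]]
x_true = (1.0, -1.0)
noise = [0.05, -0.02, 0.01]
y = [sum(h * s for h, s in zip(row, x_true)) + n for row, n in zip(H, noise)]
x_hat, cost = ml_detect(H, y)
```

With two transmit antennas the search covers only four candidates, but each added antenna doubles the count, which is why learned detectors such as SGT aim to approach this accuracy without the enumeration.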
[386] Bi-level Personalization for Federated Foundation Models: A Task-vector Aggregation Approach
Yiyuan Yang, Guodong Long, Qinghua Lu, Liming Zhu, Jing Jiang
Main category: cs.LG
TL;DR: A bi-level personalization framework for federated fine-tuning of foundation models that combines client-level personalized fine-tuning with server-level personalized aggregation using similar users identified by task vectors.
Details
Motivation: Fine-tuning foundation models for small user groups or specialized scenarios with limited data presents challenges in balancing personalization and federation, especially with non-IID data distributions.
Method: Proposes a bi-level framework: 1) Client-level personalized fine-tuning using private data, 2) Server-level personalized aggregation using similar users identified through client-specific task vectors to mitigate disturbance from irrelevant or conflicting clients.
Result: The effectiveness of the proposed algorithm has been demonstrated through extensive experimental analysis on benchmark datasets.
Conclusion: The bi-level personalization framework successfully addresses the trade-off between personalization and federation in fine-tuning foundation models for specialized scenarios with limited data.
Abstract: Federated foundation models represent a new paradigm to jointly fine-tune pre-trained foundation models across clients. It is still a challenge to fine-tune foundation models for a small group of new users or specialized scenarios, which typically involve limited data compared to the large-scale data used in pre-training. In this context, the trade-off between personalization and federation becomes more sensitive. To tackle these challenges, we propose a bi-level personalization framework for federated fine-tuning on foundation models. Specifically, we conduct personalized fine-tuning at the client level using each client's private data, and then conduct a personalized aggregation at the server level using similar users measured by client-specific task vectors. Given the personalization information gained from client-level fine-tuning, the server-level personalized aggregation can gain group-wise personalization information while mitigating the disturbance of irrelevant or interest-conflicting clients with non-IID data. The effectiveness of the proposed algorithm has been demonstrated by extensive experimental analysis on benchmark datasets.
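The server-level step can be sketched as follows, under stated assumptions: a task vector is taken to be the difference between fine-tuned and pre-trained weights, similarity is cosine similarity, and each client is averaged with its `top_k` most similar peers. The abstract does not specify the similarity measure or the aggregation rule, so `personalized_aggregate` and the toy weights are hypothetical.

```python
import math

def task_vector(finetuned, pretrained):
    """Task vector: fine-tuned weights minus pre-trained weights."""
    return [f - p for f, p in zip(finetuned, pretrained)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def personalized_aggregate(task_vectors, client, top_k=1):
    """Average a client's task vector with its top_k most similar peers."""
    target = task_vectors[client]
    peers = sorted(
        (c for c in task_vectors if c != client),
        key=lambda c: cosine(target, task_vectors[c]),
        reverse=True,
    )[:top_k]
    group = [target] + [task_vectors[c] for c in peers]
    return [sum(vals) / len(group) for vals in zip(*group)], peers

pretrained = [0.0, 0.0, 0.0]
finetuned = {
    "a": [1.0, 0.1, 0.0],
    "b": [0.9, 0.2, 0.0],
    "c": [-1.0, 0.0, 0.5],  # a client whose update conflicts with "a" and "b"
}
vectors = {c: task_vector(w, pretrained) for c, w in finetuned.items()}
agg_a, peers_a = personalized_aggregate(vectors, "a", top_k=1)
```

Client "a" is grouped with "b" and the conflicting client "c" is left out of its aggregate, which is the kind of filtering the abstract attributes to task-vector similarity.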
[387] NORA: A Nephrology-Oriented Representation Learning Approach Towards Chronic Kidney Disease Classification
Mohammad Abdul Hafeez Khan, Twisha Bhattacharyya, Omar Khan, Noorah Khan, Alina Aziz Fatima Khan, Mohammed Qutub Khan, Sujoy Ghosh Hajra
Main category: cs.LG
TL;DR: NORA framework uses contrastive learning and Random Forest to predict CKD from non-renal clinical variables, showing improved early-stage detection and cross-dataset generalization.
Details
Motivation: Early CKD detection is challenging in outpatient settings where renal biomarkers are often unavailable, creating a need for predictive models using routinely collected non-renal clinical data.
Method: NORA (Nephrology-Oriented Representation leArning) combines supervised contrastive learning with nonlinear Random Forest classifier to derive discriminative patient representations from tabular EHR data for CKD classification.
Result: NORA improves class separability and overall classification performance, particularly enhancing F1-score for early-stage CKD, and demonstrates effectiveness across distinct patient cohorts on both clinic-based and UCI CKD datasets.
Conclusion: The NORA approach effectively leverages non-renal clinical variables for CKD risk stratification, offering a promising solution for early detection in resource-constrained outpatient settings with good generalizability across different patient populations.
Abstract: Chronic Kidney Disease (CKD) affects millions of people worldwide, yet its early detection remains challenging, especially in outpatient settings where laboratory-based renal biomarkers are often unavailable. In this work, we investigate the predictive potential of routinely collected non-renal clinical variables for CKD classification, including sociodemographic factors, comorbid conditions, and urinalysis findings. We introduce the Nephrology-Oriented Representation leArning (NORA) approach, which combines supervised contrastive learning with a nonlinear Random Forest classifier. NORA first derives discriminative patient representations from tabular EHR data, which are then used for downstream CKD classification. We evaluated NORA on a clinic-based EHR dataset from Riverside Nephrology Physicians. Our results demonstrated that NORA improves class separability and overall classification performance, particularly enhancing the F1-score for early-stage CKD. Additionally, we assessed the generalizability of NORA on the UCI CKD dataset, demonstrating its effectiveness for CKD risk stratification across distinct patient cohorts.
[388] Spatio-temporal DeepKriging in PyTorch: A Supplementary Application to Precipitation Data for Interpolation and Probabilistic Forecasting
Pratik Nag
Main category: cs.LG
TL;DR: A spatio-temporal DeepKriging framework using PyTorch for high-resolution interpolation and multi-step forecasting of precipitation data over Europe, with reproducible code implementations.
Details
Motivation: To address the challenges of handling spatio-temporal irregularities in precipitation data while providing accurate interpolations and forecasts for climate applications.
Method: Implemented a Spatio-temporal DeepKriging (STDK) framework using PyTorch platform, capable of handling irregular spatio-temporal data patterns and generating both interpolations and multi-step forecasts.
Result: The approach demonstrated effectiveness through extensive evaluation on daily precipitation measurements, showing strong predictive performance and robustness in handling European climate data.
Conclusion: The developed STDK framework provides a powerful tool for precipitation analysis with reproducible PyTorch implementations that can be broadly applied to similar climate datasets.
Abstract: A detailed analysis of precipitation data over Europe is presented, with a focus on interpolation and forecasting applications. A Spatio-temporal DeepKriging (STDK) framework has been implemented using the PyTorch platform to achieve these objectives. The proposed model is capable of handling spatio-temporal irregularities while generating high-resolution interpolations and multi-step forecasts. Reproducible code modules have been developed as standalone PyTorch implementations for the interpolation (https://github.com/pratiknag/Spatio-temporalDeepKriging-Pytorch.git) and forecasting (https://github.com/pratiknag/pytorch-convlstm.git), facilitating broader application to similar climate datasets. The effectiveness of this approach is demonstrated through extensive evaluation on daily precipitation measurements, highlighting predictive performance and robustness.
[389] WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning
Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, Xinyu Wang, Zile Qiao, Zhen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Main category: cs.LG
TL;DR: WebSailor is a post-training methodology that enables open-source LLMs to match proprietary agents’ performance in complex information-seeking tasks by teaching systematic uncertainty reduction through novel task generation and agentic RL training.
Details
Motivation: Proprietary agentic systems like DeepResearch demonstrate superhuman capabilities on complex information-seeking benchmarks that open-source models cannot achieve, due to their inability to systematically reduce extreme uncertainty when navigating vast information landscapes.
Method: WebSailor uses structured sampling and information obfuscation to generate novel high-uncertainty tasks, RFT cold start, and an efficient agentic RL training algorithm called Duplicating Sampling Policy Optimization (DUPO) to instill systematic uncertainty reduction capabilities.
Result: WebSailor significantly outperforms all open-source agents in complex information-seeking tasks and matches the performance of proprietary agents, effectively closing the capability gap between open-source and proprietary systems.
Conclusion: The integrated WebSailor pipeline successfully instills the crucial capability of systematic uncertainty reduction in open-source models, enabling them to achieve performance levels previously only attainable by proprietary agentic systems on complex information-seeking benchmarks.
Abstract: Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents’ performance and closing the capability gap.
[390] Unbiased Online Curvature Approximation for Regularized Graph Continual Learning
Jie Yin, Ke Sun, Han Wu
Main category: cs.LG
TL;DR: A new online curvature approximation method for graph continual learning that outperforms existing regularization approaches by better capturing the loss landscape without explicit FIM storage.
Details
Motivation: Existing regularization methods like EWC use diagonal approximations of Fisher information matrix from previous tasks, which have limitations in capturing the full curvature of the loss landscape during continual learning.
Method: Proposes an unbiased online curvature approximation of the full Fisher information matrix based on the model’s current learning state, directly estimating regularization terms without explicitly storing FIM.
Result: Significantly outperforms existing regularization-based methods on three graph datasets, achieving superior trade-off between stability (retaining old knowledge) and plasticity (acquiring new knowledge).
Conclusion: The proposed online curvature approximation framework provides a more effective approach for graph continual learning by better modeling the loss landscape curvature during sequential task learning.
Abstract: Graph continual learning (GCL) aims to learn from a continuous sequence of graph-based tasks. Regularization methods are vital for preventing catastrophic forgetting in GCL, particularly in the challenging replay-free, class-incremental setting, where each task consists of a set of unique classes. In this work, we first establish a general regularization framework for GCL based on the curved parameter space induced by the Fisher information matrix (FIM). We show that the dominant Elastic Weight Consolidation (EWC) and its variants are a special case within this framework, using a diagonal approximation of the empirical FIM based on parameters from previous tasks. To overcome their limitations, we propose a new unbiased online curvature approximation of the full FIM based on the model’s current learning state. Our method directly estimates the regularization term in an online manner without explicitly evaluating and storing the FIM itself. This enables the model to better capture the loss landscape during learning new tasks while retaining the knowledge learned from previous tasks. Extensive experiments on three graph datasets demonstrate that our method significantly outperforms existing regularization-based methods, achieving a superior trade-off between stability (retaining old knowledge) and plasticity (acquiring new knowledge).
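For context, the EWC baseline that the abstract describes as a special case can be sketched as below. This illustrates only the diagonal empirical-FIM penalty, not the paper's proposed online approximation of the full FIM. The function names and the toy gradients are assumptions made for the example.

```python
def empirical_fisher_diag(per_sample_grads):
    """Diagonal of the empirical FIM: mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]

def ewc_penalty(theta, theta_old, fisher_diag, lam=1.0):
    """Quadratic penalty (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2."""
    return 0.5 * lam * sum(
        f * (t - t0) ** 2 for t, t0, f in zip(theta, theta_old, fisher_diag)
    )

# Hypothetical per-sample gradients recorded at the end of the previous task.
grads = [[1.0, 0.0], [3.0, 0.2]]
fisher = empirical_fisher_diag(grads)
penalty = ewc_penalty([1.5, 2.0], [1.0, 1.0], fisher, lam=2.0)
```

The first parameter has a much larger Fisher value, so moving it by 0.5 costs more than moving the second parameter by 1.0. The diagonal form ignores interactions between parameters, which is the limitation a full-FIM approximation is meant to address.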
[391] A Graph Machine Learning Approach for Detecting Topological Patterns in Transactional Graphs
Francesco Zola, Jon Ander Medina, Andrea Venturi, Amaia Gil, Raul Orduna
Main category: cs.LG
TL;DR: Proposes graph machine learning approach with autoencoders to detect financial crime patterns in transactional networks, overcoming sparse data limitations through novel preprocessing and weak labeling.
Details
Motivation: Traditional rule-based systems lack adaptability to detect sophisticated coordinated criminal behaviors in digital financial ecosystems, requiring analysis of actors' interactions to uncover suspicious activities.
Method: Four-step preprocessing framework: (i) extract graph structures, (ii) manage data temporality for large node sets, (iii) detect communities, (iv) apply automatic labeling for weak ground-truth. Then implement and compare three Graph Autoencoder variants for pattern detection.
Result: Preliminary results show the pattern-focused, topology-driven method is effective for detecting complex financial crime schemes.
Conclusion: The approach offers a promising alternative to conventional rule-based detection systems for financial crime detection in transactional networks.
Abstract: The rise of digital ecosystems has exposed the financial sector to evolving abuse and criminal tactics that share operational knowledge and techniques both within and across different environments (fiat-based, crypto-assets, etc.). Traditional rule-based systems lack the adaptability needed to detect sophisticated or coordinated criminal behaviors (patterns), highlighting the need for strategies that analyze actors’ interactions to uncover suspicious activities and extract their modus operandi. For this reason, in this work, we propose an approach that integrates graph machine learning and network analysis to improve the detection of well-known topological patterns within transactional graphs. However, a key challenge lies in the limitations of traditional financial datasets, which often provide sparse, unlabeled information that is difficult to use for graph-based pattern analysis. Therefore, we firstly propose a four-step preprocessing framework that involves (i) extracting graph structures, (ii) considering data temporality to manage large node sets, (iii) detecting communities within, and (iv) applying automatic labeling strategies to generate weak ground-truth labels. Then, once the data is processed, Graph Autoencoders are implemented to distinguish among the well-known topological patterns. Specifically, three different GAE variants are implemented and compared in this analysis. Preliminary results show that this pattern-focused, topology-driven method is effective for detecting complex financial crime schemes, offering a promising alternative to conventional rule-based detection systems.
[392] EmbeddedML: A New Optimized and Fast Machine Learning Library
Halil Hüseyin Çalışkan, Talha Koruk
Main category: cs.LG
TL;DR: EmbeddedML is a mathematically enhanced machine learning library that significantly reduces training time while maintaining or improving accuracy compared to scikit-learn, with speed improvements up to 800x for large datasets.
Details
Motivation: Machine learning models and libraries suffer from slow and long training times on large datasets, creating a need for optimized implementations that can handle large-scale data more efficiently.
Method: The authors mathematically rewrote and optimized machine learning algorithms including Multiple Linear Regression, Logistic Regression, and Support Vector Machines (SVM) to reduce computational complexity and improve training efficiency.
Result: The library achieved approximately 2x speedup for SVM on small datasets and 800x speedup on large datasets, 4x speedup for Logistic Regression, and maintained equivalent accuracy to scikit-learn without any loss in regression models.
Conclusion: EmbeddedML provides a comprehensive set of mathematically optimized machine learning algorithms for regression, classification, clustering, and dimensionality reduction that significantly reduce training time while preserving accuracy.
Abstract: Machine learning models and libraries can train datasets of different sizes and perform prediction and classification operations, but machine learning models and libraries cause slow and long training times on large datasets. This article introduces EmbeddedML, a training-time-optimized and mathematically enhanced machine learning library. The speed was increased by approximately times compared to scikit-learn without any loss in terms of accuracy in regression models such as Multiple Linear Regression. Logistic Regression and Support Vector Machines (SVM) algorithms have been mathematically rewritten to reduce training time and increase accuracy in classification models. With the applied mathematical improvements, training time has been reduced by approximately 2 times for SVM on small datasets and by around 800 times on large datasets, and by approximately 4 times for Logistic Regression, compared to the scikit-learn implementation. In summary, the EmbeddedML library offers regression, classification, clustering, and dimensionality reduction algorithms that are mathematically rewritten and optimized to reduce training time.
[393] Energy-Efficient Quantized Federated Learning for Resource-constrained IoT devices
Wilfrid Sougrinoma Compaoré, Yaya Etiabi, El Mehdi Amhoud, Mohamad Assaad
Main category: cs.LG
TL;DR: A federated learning framework for IoT networks that combines finite blocklength transmission, model quantization, and error-aware aggregation to significantly reduce energy consumption while maintaining model accuracy.
Details
Motivation: Resource-constrained IoT devices face challenges with limited energy, unreliable communication, and the impracticality of infinite blocklength transmission in federated learning scenarios.
Method: Proposes a framework integrating finite blocklength transmission, model quantization, error-aware aggregation mechanism, and uplink transmission power optimization to enhance energy efficiency and communication reliability.
Result: Simulation results show up to 75% reduction in energy consumption compared to standard FL models while maintaining robust model accuracy.
Conclusion: The proposed approach provides a viable solution for efficient and reliable federated learning implementations in real-world IoT deployments with constrained resources.
Abstract: Federated Learning (FL) has emerged as a promising paradigm for enabling collaborative machine learning while preserving data privacy, making it particularly suitable for Internet of Things (IoT) environments. However, resource-constrained IoT devices face significant challenges due to limited energy, unreliable communication channels, and the impracticality of assuming infinite blocklength transmission. This paper proposes a federated learning framework for IoT networks that integrates finite blocklength transmission, model quantization, and an error-aware aggregation mechanism to enhance energy efficiency and communication reliability. The framework also optimizes uplink transmission power to balance energy savings and model performance. Simulation results demonstrate that the proposed approach significantly reduces energy consumption by up to 75% compared to a standard FL model, while maintaining robust model accuracy, making it a viable solution for FL in real-world IoT scenarios with constrained resources. This work paves the way for efficient and reliable FL implementations in practical IoT deployments. Index Terms: Federated learning, IoT, finite blocklength, quantization, energy efficiency.
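The model quantization component can be illustrated with a minimal sketch, assuming a uniform round-to-nearest quantizer over a symmetric range. The abstract does not state which quantizer the framework uses, so `quantize`, the bit width, and the sample update are hypothetical.

```python
def quantize(weights, bits, w_max):
    """Uniform round-to-nearest quantizer on [-w_max, w_max] with 2**bits levels."""
    levels = 2 ** bits - 1          # number of steps between the extreme levels
    step = 2 * w_max / levels
    out = []
    for w in weights:
        clipped = max(-w_max, min(w_max, w))
        index = round((clipped + w_max) / step)
        out.append(-w_max + index * step)
    return out

update = [0.31, -0.74, 0.05, 1.2]   # hypothetical local model update
q = quantize(update, bits=3, w_max=1.0)
payload_bits = 3 * len(update)      # compared with 32 * len(update) for float32
```

Fewer bits per weight shorten the uplink payload and hence the transmission energy, at the price of a rounding error of at most half a quantization step per clipped weight.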
[394] Safe Reinforcement Learning using Action Projection: Safeguard the Policy or the Environment?
Hannah Markgraf, Shamburaj Sawant, Hanna Krasowski, Lukas Schäfer, Sebastien Gros, Matthias Althoff
Main category: cs.LG
TL;DR: Theoretical comparison of two projection-based safety filter integration strategies in RL (SE-RL vs SP-RL), highlighting how action aliasing affects policy gradients differently in each approach and providing mitigation strategies.
Details
Motivation: Despite widespread use of projection-based safety filters in safety-critical RL applications, there's a lack of formal understanding of the differences between treating safeguards as part of the environment (SE-RL) versus embedding them within the policy through differentiable optimization (SP-RL).
Method: The authors provide a unified formalization of both approaches in actor-critic algorithms, analyze their policy gradient estimates theoretically, study the role of action aliasing, and compare mitigation strategies including a novel penalty-based improvement for SP-RL.
Result: Empirical results show action aliasing is more detrimental for SP-RL than SE-RL, but with appropriate improvement strategies, SP-RL can match or outperform improved SE-RL across various environments.
Conclusion: The findings provide actionable insights for choosing and refining projection-based safe RL methods based on task characteristics, with SP-RL being viable when proper mitigation strategies are employed.
Abstract: Projection-based safety filters, which modify unsafe actions by mapping them to the closest safe alternative, are widely used to enforce safety constraints in reinforcement learning (RL). Two integration strategies are commonly considered: Safe environment RL (SE-RL), where the safeguard is treated as part of the environment, and safe policy RL (SP-RL), where it is embedded within the policy through differentiable optimization layers. Despite their practical relevance in safety-critical settings, a formal understanding of their differences is lacking. In this work, we present a theoretical comparison of SE-RL and SP-RL. We identify a key distinction in how each approach is affected by action aliasing, a phenomenon in which multiple unsafe actions are projected to the same safe action, causing information loss in the policy gradients. In SE-RL, this effect is implicitly approximated by the critic, while in SP-RL, it manifests directly as rank-deficient Jacobians during backpropagation through the safeguard. Our contributions are threefold: (i) a unified formalization of SE-RL and SP-RL in the context of actor-critic algorithms, (ii) a theoretical analysis of their respective policy gradient estimates, highlighting the role of action aliasing, and (iii) a comparative study of mitigation strategies, including a novel penalty-based improvement for SP-RL that aligns with established SE-RL practices. Empirical results support our theoretical predictions, showing that action aliasing is more detrimental for SP-RL than for SE-RL. However, with appropriate improvement strategies, SP-RL can match or outperform improved SE-RL across a range of environments. These findings provide actionable insights for choosing and refining projection-based safe RL methods based on task characteristics.
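Action aliasing is easy to see in a small sketch, assuming the safe set is an axis-aligned box so that Euclidean projection reduces to element-wise clipping. General safe sets would require solving an optimization problem instead. The helper names and bounds are illustrative assumptions.

```python
def project_to_box(action, low, high):
    """Euclidean projection onto a box-shaped safe set (element-wise clipping)."""
    return [max(l, min(h, a)) for a, l, h in zip(action, low, high)]

def projection_jacobian_diag(action, low, high):
    """Diagonal of the projection Jacobian: 1 strictly inside the box, 0 where clipped."""
    return [1.0 if l < a < h else 0.0 for a, l, h in zip(action, low, high)]

low, high = [-1.0, -1.0], [1.0, 1.0]
unsafe_a, unsafe_b = [1.5, 0.3], [4.0, 0.3]

safe_a = project_to_box(unsafe_a, low, high)
safe_b = project_to_box(unsafe_b, low, high)
jac = projection_jacobian_diag(unsafe_a, low, high)
```

Both unsafe actions map to the same safe action, and the Jacobian has a zero in the clipped dimension. When gradients are backpropagated through the safeguard, as in SP-RL, that zero removes the learning signal for the first action dimension, which is the rank deficiency the abstract describes.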
[395] Tool-R1: Sample-Efficient Reinforcement Learning for Agentic Tool Use
Yabo Zhang, Yihan Zeng, Qingyun Li, Zhen Hu, Kavin Han, Wangmeng Zuo
Main category: cs.LG
TL;DR: Tool-R1 is a reinforcement learning framework that enables LLMs to perform multi-step tool use by generating executable Python code, achieving 10% accuracy improvement on GAIA benchmark.
Details
Motivation: LLMs have strong language capabilities but struggle with real-world tasks requiring up-to-date knowledge, precise operations, or specialized tool use.
Method: Reinforcement learning framework with executable Python code generation, user-defined tool integration, variable sharing across steps, outcome-based rewards, and dynamic sample queue for efficient training.
Result: Substantially improves accuracy and robustness with ~10% gain over baselines on GAIA benchmark, with larger improvements on complex multi-step tasks.
Conclusion: Tool-R1 shows potential for enabling reliable and efficient tool-augmented reasoning in real-world applications through reinforcement learning and code generation.
Abstract: Large language models (LLMs) have demonstrated strong capabilities in language understanding and reasoning, yet they remain limited when tackling real-world tasks that require up-to-date knowledge, precise operations, or specialized tool use. To address this, we propose Tool-R1, a reinforcement learning framework that enables LLMs to perform general, compositional, and multi-step tool use by generating executable Python code. Tool-R1 supports integration of user-defined tools and standard libraries, with variable sharing across steps to construct coherent workflows. An outcome-based reward function, combining LLM-based answer judgment and code execution success, guides policy optimization. To improve training efficiency, we maintain a dynamic sample queue to cache and reuse high-quality trajectories, reducing the overhead of costly online sampling. Experiments on the GAIA benchmark show that Tool-R1 substantially improves both accuracy and robustness, achieving about 10% gain over strong baselines, with larger improvements on complex multi-step tasks. These results highlight the potential of Tool-R1 for enabling reliable and efficient tool-augmented reasoning in real-world applications. Our code will be available at https://github.com/YBYBZhang/Tool-R1.
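Variable sharing across code steps can be sketched with a single namespace that persists between executions. This is an illustrative assumption about how such a workflow might be wired, not the Tool-R1 implementation; `run_steps` and `word_count` are hypothetical names, and a real system would run generated code in a sandbox rather than calling `exec` directly.

```python
def run_steps(code_steps, tools):
    """Execute code snippets in order, sharing one namespace across steps."""
    namespace = dict(tools)  # user-defined tools are exposed as plain callables
    for step in code_steps:
        try:
            exec(step, namespace)
        except Exception as err:
            # Execution failure is reported so that a reward could penalize it.
            return namespace, False, repr(err)
    return namespace, True, None

def word_count(text):
    """Hypothetical user-defined tool."""
    return len(text.split())

steps = [
    "n = word_count('tool use with shared state')",
    "answer = n * 2",  # the second step reads the variable set by the first
]
state, ok, error = run_steps(steps, {"word_count": word_count})
```

The success flag returned here corresponds to the code-execution part of an outcome-based reward; the answer-judgment part described in the abstract would be computed separately from `state["answer"]`.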
[396] TimeCluster with PCA is Equivalent to Subspace Identification of Linear Dynamical Systems
Christian L. Hines, Samuel Spillard, Daniel P. Martin
Main category: cs.LG
TL;DR: TimeCluster visual analytics technique is mathematically equivalent to classical linear subspace identification when using PCA, as both extract the same low-dimensional linear subspace from time series data through Hankel matrix SVD decomposition.
Details
Motivation: To establish the mathematical equivalence between TimeCluster (a visual analytics technique for time series structure discovery) and classical subspace system identification methods, revealing that they produce identical embeddings.
Method: The paper shows that forming sliding-window matrices from time series creates Hankel matrices, and applying PCA via SVD to these matrices recovers the same principal directions as subspace identification methods, making TimeCluster coordinates coincide with subspace identification embeddings.
Result: Experiments on synthetic and real dynamical signals confirm that the two embeddings (TimeCluster and subspace identification) are identical, demonstrating mathematical equivalence.
Conclusion: The equivalence opens future opportunities including forecasting from identified state spaces, streaming/online extensions, incorporating external inputs, and robust techniques for visualizing trends in corrupted data.
Abstract: TimeCluster is a visual analytics technique for discovering structure in long multivariate time series by projecting overlapping windows of data into a low-dimensional space. We show that, when Principal Component Analysis (PCA) is chosen as the dimensionality reduction technique, this procedure is mathematically equivalent to classical linear subspace identification (block-Hankel matrix plus Singular Value Decomposition (SVD)). In both approaches, the same low-dimensional linear subspace is extracted from the time series data. We first review the TimeCluster method and the theory of subspace system identification. Then we show that forming the sliding-window matrix of a time series yields a Hankel matrix, so applying PCA (via SVD) to this matrix recovers the same principal directions as subspace identification. Thus the cluster coordinates from TimeCluster coincide with those produced by subspace identification methods. We present experiments on synthetic and real dynamical signals confirming that the two embeddings coincide. Finally, we explore and discuss future opportunities enabled by this equivalence, including forecasting from the identified state space, streaming/online extensions, incorporating and visualising external inputs and robust techniques for displaying underlying trends in corrupted data.
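The first step of the argument, that a sliding-window matrix is a Hankel matrix, can be checked with a short sketch. It assumes a univariate series and a stride of one; a multivariate series would give a block-Hankel matrix instead. The helper names are illustrative.

```python
def sliding_window_matrix(series, window):
    """Stack overlapping windows as rows: row i is series[i : i + window]."""
    return [series[i:i + window] for i in range(len(series) - window + 1)]

def is_hankel(matrix):
    """A matrix is Hankel when entry (i, j) depends only on i + j."""
    seen = {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if seen.setdefault(i + j, value) != value:
                return False
    return True

series = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]
X = sliding_window_matrix(series, window=3)
hankel = is_hankel(X)
```

Because every anti-diagonal of `X` holds a single value of the series, applying an SVD to it is the same matrix factorization that subspace identification starts from.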
[397] Reversible Deep Equilibrium Models
Sam McCallum, Kamran Arora, James Foster
Main category: cs.LG
TL;DR: RevDEQs introduce reversible deep equilibrium models that enable exact gradient calculation, eliminating the need for regularization and reducing function evaluations compared to standard DEQs, achieving SOTA performance on language modeling and image classification.
Details
Motivation: Deep Equilibrium Models (DEQs) suffer from approximate gradient calculations that lead to unstable training dynamics, requiring regularization and many function evaluations to stabilize.
Method: The authors propose Reversible Deep Equilibrium Models (RevDEQs) that allow for exact gradient calculation through reversible transformations, eliminating the need for regularization and reducing computational overhead.
Result: RevDEQs achieve state-of-the-art performance on language modeling and image classification tasks, outperforming both comparable implicit and explicit models.
Conclusion: Reversible DEQs provide a more stable and efficient alternative to standard DEQs by enabling exact gradient computation while maintaining competitive performance on large-scale tasks.
Abstract: Deep Equilibrium Models (DEQs) are an interesting class of implicit model where the model output is implicitly defined as the fixed point of a learned function. These models have been shown to outperform explicit (fixed-depth) models in large-scale tasks by trading many deep layers for a single layer that is iterated many times. However, gradient calculation through DEQs is approximate. This often leads to unstable training dynamics and requires regularisation or many function evaluations to fix. Here, we introduce Reversible Deep Equilibrium Models (RevDEQs) that allow for exact gradient calculation, no regularisation and far fewer function evaluations than DEQs. We show that RevDEQs achieve state-of-the-art performance on language modelling and image classification tasks against comparable implicit and explicit models.
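The idea of an output defined as a fixed point can be sketched with a scalar toy layer. This shows only the standard DEQ forward pass by plain fixed-point iteration, not the reversible scheme that RevDEQs introduce. The layer `tanh(w * z + x)` with `w = 0.5` is an assumption chosen so that the iteration is a contraction and is guaranteed to converge.

```python
import math

def layer(z, x, w=0.5):
    """Toy contractive layer f(z, x) = tanh(w * z + x)."""
    return math.tanh(w * z + x)

def solve_fixed_point(x, tol=1e-10, max_iter=1000):
    """Iterate z <- f(z, x) until successive iterates differ by less than tol."""
    z = 0.0
    for n_evals in range(1, max_iter + 1):
        z_next = layer(z, x)
        if abs(z_next - z) < tol:
            return z_next, n_evals
        z = z_next
    return z, max_iter

z_star, n_evals = solve_fixed_point(x=0.3)
```

The returned `n_evals` counts how many times the single layer was applied, which is the function-evaluation cost the abstract says RevDEQs reduce.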
[398] Soft Gradient Boosting with Learnable Feature Transforms for Sequential Regression
Huseyin Karaca, Suleyman Serdar Kozat
Main category: cs.LG
TL;DR: A soft gradient boosting framework that learns linear feature transforms during boosting iterations, particularly effective for high-dimensional, data-scarce scenarios by optimizing feature selection and boosting end-to-end.
Details
Motivation: To address the challenge of high-dimensional, data-scarce regression problems by discovering relevant input representations during the boosting process, avoiding overfitting while improving performance.
Method: At each boosting iteration, train a soft decision tree while simultaneously learning a linear input feature transform Q. The approach combines feature selection/transformation with boosting in an end-to-end optimization framework, with extension to differentiable non-linear transforms when appropriate.
Result: The method effectively and efficiently increases performance on both synthetic and real-world datasets, demonstrating successful end-to-end optimization of feature selection/transform and boosting while preventing overfitting.
Conclusion: The proposed soft gradient boosting framework with learnable feature transforms provides an effective solution for sequential regression in challenging high-dimensional, low-data scenarios, with publicly shared code for reproducibility and future research.
Abstract: We propose a soft gradient boosting framework for sequential regression that embeds a learnable linear feature transform within the boosting procedure. At each boosting iteration, we train a soft decision tree and learn a linear input feature transform Q together. This approach is particularly advantageous in high-dimensional, data-scarce scenarios, as it discovers the most relevant input representations while boosting. We demonstrate, using both synthetic and real-world datasets, that our method effectively and efficiently increases the performance by an end-to-end optimization of feature selection/transform and boosting while avoiding overfitting. We also extend our algorithm to differentiable non-linear transforms if overfitting is not a problem. To support reproducibility and future work, we share our code publicly.
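To make the idea of a soft tree acting on a linearly transformed input concrete, here is a minimal sketch of a depth-1 soft tree evaluated on Qx. It is not the authors' implementation: the sigmoid gate, the parameter names (`w`, `b`, `leaf_left`, `leaf_right`), and the example values are assumptions for illustration.

```python
import math

def matvec(Q, x):
    """Multiply matrix Q (list of rows) by vector x."""
    return [sum(q * xi for q, xi in zip(row, x)) for row in Q]

def soft_stump(x, Q, w, b, leaf_left, leaf_right):
    """Depth-1 soft decision tree on the transformed input Qx.

    The gate is a sigmoid, so the output is a smooth blend of the two leaves
    rather than a hard left/right split.
    """
    z = matvec(Q, x)
    score = sum(wi * zi for wi, zi in zip(w, z)) + b
    gate = 1.0 / (1.0 + math.exp(-score))
    return gate * leaf_left + (1.0 - gate) * leaf_right

# Hypothetical Q that keeps only the first of three input features.
Q = [[1.0, 0.0, 0.0]]
pred_boundary = soft_stump([0.0, 5.0, -3.0], Q, w=[2.0], b=0.0,
                           leaf_left=1.0, leaf_right=-1.0)
pred_left = soft_stump([10.0, 0.0, 0.0], Q, w=[2.0], b=0.0,
                       leaf_left=1.0, leaf_right=-1.0)
```

Because every operation is differentiable, Q, the gate parameters, and the leaf values could in principle be fitted jointly by gradient descent, which is what makes end-to-end training of transform and tree possible.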
[399] Sy-FAR: Symmetry-based Fair Adversarial Robustness
Haneen Najjar, Eyal Ronen, Mahmood Sharif
Main category: cs.LG
TL;DR: Sy-FAR introduces symmetry-based fairness for adversarial robustness in ML systems, focusing on making attacks equally successful in both directions between classes rather than perfect parity, which proves more tractable and effective.
Details
Motivation: Existing adversarial robustness methods often create unfair robustness where some classes/groups are easier to attack than others. Perfect fairness is infeasible in realistic tasks like face recognition due to inherent class similarities.
Method: Developed Sy-FAR technique that encourages symmetry between classes - making attacks from class i to j as successful as from j to i. This approach leverages the natural symmetric nature of class resemblance relationships.
Result: Extensive evaluation across 5 datasets and 3 model architectures shows Sy-FAR significantly improves fair adversarial robustness compared to state-of-the-art methods. It’s faster, more consistent, and reduces vulnerability of target classes that adversarial examples tend to be classified into.
Conclusion: Symmetry is a more tractable and effective fairness notion for adversarial robustness than perfect parity, particularly in security-critical applications like face recognition. Sy-FAR successfully achieves symmetric robustness while improving overall performance and addressing discovered unfairness patterns.
Abstract: Security-critical machine-learning (ML) systems, such as face-recognition systems, are susceptible to adversarial examples, including real-world physically realizable attacks. Various means to boost ML’s adversarial robustness have been proposed; however, they typically induce unfair robustness: It is often easier to attack from certain classes or groups than from others. Several techniques have been developed to improve adversarial robustness while seeking perfect fairness between classes. Yet, prior work has focused on settings where security and fairness are less critical. Our insight is that achieving perfect parity in realistic fairness-critical tasks, such as face recognition, is often infeasible – some classes may be highly similar, leading to more misclassifications between them. Instead, we suggest that seeking symmetry – i.e., attacks from class $i$ to $j$ would be as successful as from $j$ to $i$ – is more tractable. Intuitively, symmetry is desirable because class resemblance is a symmetric relation in most domains. Additionally, as we prove theoretically, symmetry between individuals induces symmetry between any set of sub-groups, in contrast to other fairness notions where group-fairness is often elusive. We develop Sy-FAR, a technique to encourage symmetry while also optimizing adversarial robustness and extensively evaluate it using five datasets, with three model architectures, including against targeted and untargeted realistic attacks. The results show Sy-FAR significantly improves fair adversarial robustness compared to state-of-the-art methods. Moreover, we find that Sy-FAR is faster and more consistent across runs. Notably, Sy-FAR also ameliorates another type of unfairness we discover in this work – target classes that adversarial examples are likely to be classified into become significantly less vulnerable after inducing symmetry.
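The difference between symmetry and perfect parity can be seen with a small, hypothetical measurement: given a matrix of attack success rates S[i][j] (attacks from class i to class j), one can average the gap |S[i][j] - S[j][i]| over class pairs. The function below is an illustrative metric under that assumption, not the Sy-FAR training objective, and the matrices are made-up numbers.

```python
def asymmetry(success):
    """Mean |S[i][j] - S[j][i]| over unordered class pairs i < j."""
    n = len(success)
    gaps = [abs(success[i][j] - success[j][i])
            for i in range(n) for j in range(i + 1, n)]
    return sum(gaps) / len(gaps)

# Symmetric but not at parity: pair (0, 1) is easier to attack than pair (0, 2),
# yet each direction within a pair succeeds equally often.
symmetric = [[0.0, 0.4, 0.1],
             [0.4, 0.0, 0.2],
             [0.1, 0.2, 0.0]]

# Skewed: attacks from class 0 to class 1 succeed far more often than the reverse.
skewed = [[0.0, 0.7, 0.1],
          [0.1, 0.0, 0.2],
          [0.1, 0.2, 0.0]]

sym_gap = asymmetry(symmetric)
skew_gap = asymmetry(skewed)
```

The first matrix has zero asymmetry even though its pairwise success rates differ, which is the sense in which symmetry is a weaker and more attainable target than parity.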
[400] Spatiotemporal graph neural process for reconstruction, extrapolation, and classification of cardiac trajectories
Jaume Banus, Augustin C. Ogier, Roger Hullin, Philippe Meyer, Ruud B. van Heeswijk, Jonas Richiardi
Main category: cs.LG
TL;DR: A probabilistic framework combining neural ODEs, graph neural networks, and neural processes to model structured spatiotemporal dynamics from sparse observations, with applications in cardiac motion analysis.
Details
Motivation: To develop a unified model that can capture uncertainty, temporal continuity, and anatomical structure in dynamic systems, particularly for cardiac motion analysis from sparse observations.
Method: Integrates neural ODEs, graph neural networks, and neural processes to represent dynamic systems as spatiotemporal multiplex graphs. Uses GNN-parameterized vector field to model latent trajectories and infers distributions over latent initial states and control variables from sparse context observations.
Result: Validated on three synthetic systems and two real-world cardiac datasets (ACDC with N=150 and UK Biobank with N=526). Achieves 99% accuracy on ACDC classification and 67% accuracy for atrial fibrillation detection in UK Biobank. Accurately reconstructs trajectories and extrapolates future cardiac cycles from single observed cycle.
Conclusion: The framework provides a flexible approach for analyzing cardiac motion and serves as a foundation for graph-based learning in structured biomedical spatiotemporal time-series data.
Abstract: We present a probabilistic framework for modeling structured spatiotemporal dynamics from sparse observations, focusing on cardiac motion. Our approach integrates neural ordinary differential equations (NODEs), graph neural networks (GNNs), and neural processes into a unified model that captures uncertainty, temporal continuity, and anatomical structure. We represent dynamic systems as spatiotemporal multiplex graphs and model their latent trajectories using a GNN-parameterized vector field. Given the sparse context observations at node and edge levels, the model infers a distribution over latent initial states and control variables, enabling both interpolation and extrapolation of trajectories. We validate the method on three synthetic dynamical systems (coupled pendulum, Lorenz attractor, and Kuramoto oscillators) and two real-world cardiac imaging datasets - ACDC (N=150) and UK Biobank (N=526) - demonstrating accurate reconstruction, extrapolation, and disease classification capabilities. The model accurately reconstructs trajectories and extrapolates future cardiac cycles from a single observed cycle. It achieves state-of-the-art results on the ACDC classification task (up to 99% accuracy), and detects atrial fibrillation in UK Biobank subjects with competitive performance (up to 67% accuracy). This work introduces a flexible approach for analyzing cardiac motion and offers a foundation for graph-based learning in structured biomedical spatiotemporal time-series data.
[401] BAPFL: Exploring Backdoor Attacks Against Prototype-based Federated Learning
Honghong Zeng, Jiong Lou, Zhe Wang, Hefeng Zhou, Chentao Wu, Wei Zhao, Jie Li
Main category: cs.LG
TL;DR: BAPFL is a novel backdoor attack method specifically designed for prototype-based federated learning (PFL) that achieves 35%-75% higher attack success rates than traditional attacks while maintaining main task accuracy.
Details
Motivation: PFL has shown inherent resistance to existing backdoor attacks due to its prototype learning mechanism and data heterogeneity, but its security vulnerabilities remain unexplored.
Method: BAPFL combines prototype poisoning strategy (manipulating global prototype trajectories) with trigger optimization mechanism (learning unique stealthy triggers per target label) to misalign clean and trigger-embedded sample prototypes.
Result: Experimental results across multiple datasets and PFL variants show BAPFL achieves 35%-75% improvement in attack success rate compared to traditional backdoor attacks while preserving main task accuracy.
Conclusion: BAPFL demonstrates effectiveness, stealthiness, and adaptability in attacking PFL frameworks, highlighting the need for robust defense mechanisms in prototype-based federated learning systems.
Abstract: Prototype-based federated learning (PFL) has emerged as a promising paradigm to address data heterogeneity problems in federated learning, as it leverages mean feature vectors as prototypes to enhance model generalization. However, its robustness against backdoor attacks remains largely unexplored. In this paper, we identify that PFL is inherently resistant to existing backdoor attacks due to its unique prototype learning mechanism and local data heterogeneity. To further explore the security of PFL, we propose BAPFL, the first backdoor attack method specifically designed for PFL frameworks. BAPFL integrates a prototype poisoning strategy with a trigger optimization mechanism. The prototype poisoning strategy manipulates the trajectories of global prototypes to mislead the prototype training of benign clients, pushing their local prototypes of clean samples away from the prototypes of trigger-embedded samples. Meanwhile, the trigger optimization mechanism learns a unique and stealthy trigger for each potential target label, and guides the prototypes of trigger-embedded samples to align closely with the global prototype of the target label. Experimental results across multiple datasets and PFL variants demonstrate that BAPFL achieves a 35%-75% improvement in attack success rate compared to traditional backdoor attacks, while preserving main task accuracy. These results highlight the effectiveness, stealthiness, and adaptability of BAPFL in PFL.
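The abstract describes prototypes as mean feature vectors. A minimal sketch of that one step, computing per-class prototypes from a client's local features, might look as follows. The function name and the toy data are assumptions, and nothing here reproduces the attack itself.

```python
from collections import defaultdict

def class_prototypes(features, labels):
    """Return {label: mean feature vector} for one client's local data."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        # Accumulate the per-class sum of feature vectors.
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}

# Toy 2-D features for two classes.
protos = class_prototypes(
    features=[[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]],
    labels=[0, 0, 1],
)
```

In PFL these per-class means, rather than full model weights, are what clients and server exchange and align, which is why the attack described above targets prototype trajectories.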
[402] Causal Discovery via Quantile Partial Effect
Yikang Chen, Xingzhe Sun, Dehui Du
Main category: cs.LG
TL;DR: The paper introduces Quantile Partial Effect (QPE) as a statistic for causal discovery, showing that when QPE lies in a finite linear span, cause and effect become identifiable from observational data without requiring noise or Markov assumptions.
Details
Motivation: To develop a causal discovery method that directly utilizes asymmetry in observational distribution shape characteristics rather than relying on functional causal models with specific noise assumptions or Markov conditions.
Method: Proposes using QPE statistics from conditional quantile regression, performs basis function tests on estimated QPE to distinguish causal directions for bivariate cases, and uses Fisher Information for multivariate causal order determination based on QPE’s second moment assumptions.
Result: Empirically effective on numerous bivariate causal discovery datasets, and validates feasibility of using Fisher Information for causal order identification on both synthetic and real-world multivariate datasets.
Conclusion: QPE provides a novel approach to causal discovery that operates purely at the observational level, generalizing previous identifiability results and offering effective methods for both bivariate and multivariate causal discovery.
Abstract: Quantile Partial Effect (QPE) is a statistic associated with conditional quantile regression, measuring the effect of covariates at different levels. Our theory demonstrates that when the QPE of cause on effect is assumed to lie in a finite linear span, cause and effect are identifiable from their observational distribution. This generalizes previous identifiability results based on Functional Causal Models (FCMs) with additive, heteroscedastic noise, etc. Meanwhile, since QPE resides entirely at the observational level, this parametric assumption does not require considering mechanisms, noise, or even the Markov assumption, but rather directly utilizes the asymmetry of shape characteristics in the observational distribution. By performing basis function tests on the estimated QPE, causal directions can be distinguished, which is empirically shown to be effective in experiments on a large number of bivariate causal discovery datasets. For multivariate causal discovery, leveraging the close connection between QPE and score functions, we find that Fisher Information is sufficient as a statistical measure to determine causal order when assumptions are made about the second moment of QPE. We validate the feasibility of using Fisher Information to identify causal order on multiple synthetic and real-world multivariate causal discovery datasets.
[403] Bridging Performance Gaps for Foundation Models: A Post-Training Strategy for ECGFounder
Ya Zhou, Yujie Yang, Xiaohan Fan, Wei Zhao
Main category: cs.LG
TL;DR: A simple post-training approach enhances ECG foundation model performance, improving AUROC by 1.2%-3.3% and AUPRC by 5.3%-20.9% on PTB-XL benchmark, with better stability and sample efficiency.
Details
Motivation: ECG foundation models show performance gaps compared to task-specific models despite large pre-training, indicating a need for effective post-training strategies to improve clinical applicability.
Method: Proposes a post-training approach for ECGFounder (pre-trained on 7M+ ECG recordings) with key components including stochastic depth and preview linear probing, evaluated on PTB-XL benchmark.
Result: Improves the baseline fine-tuning strategy by 1.2%-3.3% in macro AUROC and 5.3%-20.9% in macro AUPRC. With only 10% of the training data, it improves macro AUROC by 9.1% and macro AUPRC by 34.9% over the baseline. Outperforms several recent state-of-the-art approaches, including task-specific and advanced architectures.
Conclusion: Post-training strategies effectively enhance ECG foundation models, with identified key components contributing to improved performance, stability, and sample efficiency, supporting continued development of ECG foundation models.
Abstract: ECG foundation models are increasingly popular due to their adaptability across various tasks. However, their clinical applicability is often limited by performance gaps compared to task-specific models, even after pre-training on large ECG datasets and fine-tuning on target data. This limitation is likely due to the lack of an effective post-training strategy. In this paper, we propose a simple yet effective post-training approach to enhance ECGFounder, a state-of-the-art ECG foundation model pre-trained on over 7 million ECG recordings. Experiments on the PTB-XL benchmark show that our approach improves the baseline fine-tuning strategy by 1.2%-3.3% in macro AUROC and 5.3%-20.9% in macro AUPRC. Additionally, our method outperforms several recent state-of-the-art approaches, including task-specific and advanced architectures. Further evaluation reveals that our method is more stable and sample-efficient compared to the baseline, achieving a 9.1% improvement in macro AUROC and a 34.9% improvement in macro AUPRC using just 10% of the training data. Ablation studies identify key components, such as stochastic depth and preview linear probing, that contribute to the enhanced performance. These findings underscore the potential of post-training strategies to improve ECG foundation models, and we hope this work will contribute to the continued development of foundation models in the ECG domain.
[404] Ensemble Visualization With Variational Autoencoder
Cenyang Wu, Qinhan Yu, Liang Zhou
Main category: cs.LG
TL;DR: A new method for visualizing data ensembles using structured probabilistic representations in latent spaces via variational autoencoders, enabling analytical computation of confidence intervals and density estimation.
Details
Motivation: To create effective visualization of data ensembles by transforming spatial features into structured latent spaces that follow multivariate Gaussian distributions for probabilistic analysis.
Method: Transform spatial ensemble features into latent space through feature space conversion and unsupervised learning using variational autoencoder (VAE), resulting in multivariate standard Gaussian distributions.
Result: Preliminary results on weather forecasting ensembles demonstrate the method’s effectiveness and versatility in enabling analytical computation of confidence intervals and density estimation.
Conclusion: The approach successfully creates structured probabilistic representations in latent spaces that facilitate analytical ensemble visualization and probabilistic distribution analysis.
Abstract: We present a new method to visualize data ensembles by constructing structured probabilistic representations in latent spaces, i.e., lower-dimensional representations of spatial data features. Our approach transforms the spatial features of an ensemble into a latent space through feature space conversion and unsupervised learning using a variational autoencoder (VAE). The resulting latent spaces follow multivariate standard Gaussian distributions, enabling analytical computation of confidence intervals and density estimation of the probabilistic distribution that generates the data ensemble. Preliminary results on a weather forecasting ensemble demonstrate the effectiveness and versatility of our method.
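Why a standard Gaussian latent space permits analytical confidence regions can be shown in the two-dimensional case. For a 2-D standard Gaussian, the squared radius follows a chi-squared distribution with two degrees of freedom, so P(|z| <= r) = 1 - exp(-r^2 / 2) and the radius for a given confidence level has a closed form. The sketch below assumes a 2-D latent space purely for illustration; the paper's latent dimensionality is not stated here, and the function names are hypothetical.

```python
import math

def confidence_radius_2d(p):
    """Radius r such that a 2-D standard Gaussian has probability p inside |z| <= r."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    return math.sqrt(-2.0 * math.log(1.0 - p))

def standard_normal_density_2d(z1, z2):
    """Density of the 2-D standard Gaussian at latent point (z1, z2)."""
    return math.exp(-0.5 * (z1 * z1 + z2 * z2)) / (2.0 * math.pi)

r95 = confidence_radius_2d(0.95)           # circle enclosing 95% of latent mass
peak = standard_normal_density_2d(0.0, 0.0)  # density at the latent origin
```

No sampling or kernel density estimation is needed: both the confidence circle and the density come directly from the Gaussian form.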
[405] ReTrack: Data Unlearning in Diffusion Models through Redirecting the Denoising Trajectory
Qitan Shi, Cheng Jin, Jiawei Zhang, Yuantao Gu
Main category: cs.LG
TL;DR: ReTrack is a fast data unlearning method for diffusion models that uses importance sampling and k-nearest neighbors redirection to efficiently remove specific training data influence while maintaining generation quality.
Details
Motivation: Diffusion models suffer from training data memorization which creates privacy and safety concerns, requiring methods to remove specific data influence without full retraining.
Method: Uses importance sampling to construct an efficient fine-tuning loss, approximating by retaining only dominant terms to redirect denoising trajectories toward k-nearest neighbors.
Result: Achieves state-of-the-art performance on MNIST T-Shirt, CelebA-HQ, CIFAR-10, and Stable Diffusion, providing the best trade-off between unlearning strength and generation quality preservation.
Conclusion: ReTrack offers an effective and efficient solution for data unlearning in diffusion models, addressing privacy concerns while maintaining high generative performance.
Abstract: Diffusion models excel at generating high-quality, diverse images but suffer from training data memorization, raising critical privacy and safety concerns. Data unlearning has emerged to mitigate this issue by removing the influence of specific data without retraining from scratch. We propose ReTrack, a fast and effective data unlearning method for diffusion models. ReTrack employs importance sampling to construct a more efficient fine-tuning loss, which we approximate by retaining only dominant terms. This yields an interpretable objective that redirects denoising trajectories toward the $k$-nearest neighbors, enabling efficient unlearning while preserving generative quality. Experiments on MNIST T-Shirt, CelebA-HQ, CIFAR-10, and Stable Diffusion show that ReTrack achieves state-of-the-art performance, striking the best trade-off between unlearning strength and generation quality preservation.
[406] Spiking Vocos: An Energy-Efficient Neural Vocoder
Yukun Chen, Zhaoxi Mu, Andong Li, Peilin Li, Xinyu Yang
Main category: cs.LG
TL;DR: Spiking Vocos is an ultra-low energy neural vocoder using spiking neural networks that achieves comparable performance to ANN counterparts while consuming only 14.7% of the energy.
Details
Motivation: High energy consumption of neural vocoders prevents practical deployment on edge devices. SNNs offer energy efficiency but suffer from information bottlenecks and performance gaps compared to ANNs.
Method: Built on Vocos framework with Spiking ConvNeXt module to reduce MAC operations, amplitude shortcut path to preserve signal dynamics, self-architectural distillation for knowledge transfer, and Temporal Shift Module for temporal information fusion.
Result: Achieves UTMOS score of 3.74 and PESQ score of 3.45, comparable to ANN performance, while consuming only 14.7% of the energy.
Conclusion: Spiking Vocos successfully bridges the performance gap between SNNs and ANNs for neural vocoding while maintaining ultra-low energy consumption, making it suitable for edge device deployment.
Abstract: Despite the remarkable progress in the synthesis speed and fidelity of neural vocoders, their high energy consumption remains a critical barrier to practical deployment on computationally restricted edge devices. Spiking Neural Networks (SNNs), widely recognized for their high energy efficiency due to their event-driven nature, offer a promising solution for low-resource scenarios. In this paper, we propose Spiking Vocos, a novel spiking neural vocoder with ultra-low energy consumption, built upon the efficient Vocos framework. To mitigate the inherent information bottleneck in SNNs, we design a Spiking ConvNeXt module to reduce Multiply-Accumulate (MAC) operations and incorporate an amplitude shortcut path to preserve crucial signal dynamics. Furthermore, to bridge the performance gap with its Artificial Neural Network (ANN) counterpart, we introduce a self-architectural distillation strategy to effectively transfer knowledge. A lightweight Temporal Shift Module is also integrated to enhance the model’s ability to fuse information across the temporal dimension with negligible computational overhead. Experiments demonstrate that our model achieves performance comparable to its ANN counterpart, with UTMOS and PESQ scores of 3.74 and 3.45 respectively, while consuming only 14.7% of the energy. The source code is available at https://github.com/pymaster17/Spiking-Vocos.
[407] Traces Propagation: Memory-Efficient and Scalable Forward-Only Learning in Spiking Neural Networks
Lorenzo Pes, Bojian Yin, Sander Stuijk, Federico Corradi
Main category: cs.LG
TL;DR: Proposes Traces Propagation (TP), a fully local and memory-efficient learning rule for Spiking Neural Networks that combines eligibility traces with layer-wise contrastive loss, eliminating the need for auxiliary matrices while maintaining competitive performance.
Details
Motivation: Existing SNN training methods like BPTT are biologically implausible and computationally expensive, while local learning rules fail to address spatial credit assignment without memory-intensive auxiliary matrices, limiting on-device learning capabilities.
Method: Traces Propagation (TP) combines eligibility traces for temporal credit assignment with layer-wise contrastive loss for spatial credit assignment, operating in a forward-only manner without requiring auxiliary layer-wise matrices.
Result: TP outperforms other fully local learning rules on NMNIST and SHD datasets, shows competitive performance on DVS-GESTURE and DVS-CIFAR10, scales effectively to deeper architectures like VGG-9, and demonstrates practical utility for fine-tuning tasks like keyword spotting.
Conclusion: TP provides a memory-efficient, scalable, and fully local learning solution for SNNs that enables efficient edge learning while maintaining competitive performance across various datasets and architectures.
Abstract: Spiking Neural Networks (SNNs) provide an efficient framework for processing dynamic spatio-temporal signals and for investigating the learning principles underlying biological neural systems. A key challenge in training SNNs is to solve both spatial and temporal credit assignment. The dominant approach for training SNNs is Backpropagation Through Time (BPTT) with surrogate gradients. However, BPTT is in stark contrast with the spatial and temporal locality observed in biological neural systems and leads to high computational and memory demands, limiting efficient training strategies and on-device learning. Although existing local learning rules achieve local temporal credit assignment by leveraging eligibility traces, they fail to address the spatial credit assignment without resorting to auxiliary layer-wise matrices, which increase memory overhead and hinder scalability, especially on embedded devices. In this work, we propose Traces Propagation (TP), a forward-only, memory-efficient, scalable, and fully local learning rule that combines eligibility traces with a layer-wise contrastive loss without requiring auxiliary layer-wise matrices. TP outperforms other fully local learning rules on NMNIST and SHD datasets. On more complex datasets such as DVS-GESTURE and DVS-CIFAR10, TP showcases competitive performance and scales effectively to deeper SNN architectures such as VGG-9, while providing favorable memory scaling compared to prior fully local scalable rules, for datasets with a significant number of classes. Finally, we show that TP is well suited for practical fine-tuning tasks, such as keyword spotting on the Google Speech Commands dataset, thus paving the way for efficient learning at the edge.
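An eligibility trace is commonly modelled as an exponentially decaying running sum of past spikes. The sketch below shows that generic form only; the exact trace definition used by TP and its layer-wise contrastive loss are not reproduced, and the `decay` parameter and function name are assumptions.

```python
def eligibility_trace(spikes, decay=0.9):
    """Exponentially filtered spike train: e_t = decay * e_{t-1} + s_t.

    Only the current trace value is needed to compute the next one, so the
    update runs forward in time; the history list is kept for inspection only.
    """
    trace = 0.0
    history = []
    for s in spikes:
        trace = decay * trace + s
        history.append(trace)
    return history

trace_values = eligibility_trace([1, 0, 0, 1], decay=0.5)
```

Because the trace summarises the past in a single running value per synapse, no unrolled computation graph has to be stored, in contrast to BPTT.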
[408] Discovering Mathematical Equations with Diffusion Language Model
Xiaoxu Han, Chengzhen Ning, Jinghui Zhong, Fubiao Yang, Yu Wang, Xin Mu
Main category: cs.LG
TL;DR: DiffuSR is a pre-training framework for symbolic regression that uses a continuous-state diffusion language model to discover mathematical equations from data, achieving competitive performance with state-of-the-art methods while generating more interpretable expressions.
Details
Motivation: Symbolic regression remains challenging due to vast search space and accuracy-complexity trade-offs. There's a need for effective methods to discover valid mathematical equations from observed data for scientific discovery.
Method: Uses a continuous-state diffusion language model with trainable embedding layer to map discrete symbols to continuous latent space. Employs iterative denoising to convert noisy sequences into equations, guided by numerical data via cross-attention. Includes inference strategy with logit priors injected into genetic programming.
Result: Achieves competitive performance with state-of-the-art autoregressive methods on standard benchmarks. Generates more interpretable and diverse mathematical expressions compared to existing approaches.
Conclusion: DiffuSR provides an effective diffusion-based framework for symbolic regression that successfully models equation distributions and produces meaningful mathematical discoveries with improved interpretability.
Abstract: Discovering valid and meaningful mathematical equations from observed data plays a crucial role in scientific discovery. However, this task, known as symbolic regression, remains challenging due to the vast search space and the trade-off between accuracy and complexity. In this paper, we introduce DiffuSR, a pre-training framework for symbolic regression built upon a continuous-state diffusion language model. DiffuSR employs a trainable embedding layer within the diffusion process to map discrete mathematical symbols into a continuous latent space, modeling equation distributions effectively. Through iterative denoising, DiffuSR converts an initial noisy sequence into a symbolic equation, guided by numerical data injected via a cross-attention mechanism. We also design an effective inference strategy to enhance the accuracy of the diffusion-based equation generator, which injects logit priors into genetic programming. Experimental results on standard symbolic regression benchmarks demonstrate that DiffuSR achieves competitive performance with state-of-the-art autoregressive methods and generates more interpretable and diverse mathematical expressions.
[409] Curriculum Learning for Mesh-based simulations
Paul Garnier, Vincent Lannelongue, Elie Hachem
Main category: cs.LG
TL;DR: Coarse-to-fine curriculum learning accelerates GNN training for CFD by starting with coarse meshes and progressively increasing resolution, reducing training time by 50% while maintaining accuracy.
Details
Motivation: Training graph neural networks on high-resolution unstructured meshes for computational fluid dynamics is computationally expensive and time-consuming, requiring more efficient training methods.
Method: A curriculum learning approach where the model is first trained on very coarse meshes, then progressively introduced to medium and high-resolution meshes (up to 300,000 nodes), without changing the model architecture itself.
Result: Achieved comparable generalization accuracy while reducing total wall-clock training time by up to 50%. The method also helped models break through performance plateaus when they lacked capacity to learn underlying physics.
Conclusion: Coarse-to-fine curriculum learning is an effective strategy for accelerating GNN training in CFD applications, providing significant time savings without compromising accuracy and helping overcome model capacity limitations.
Abstract: Graph neural networks (GNNs) have emerged as powerful surrogates for mesh-based computational fluid dynamics (CFD), but training them on high-resolution unstructured meshes with hundreds of thousands of nodes remains prohibitively expensive. We study a coarse-to-fine curriculum that accelerates convergence by first training on very coarse meshes and then progressively introducing medium and high resolutions (up to $3\times10^5$ nodes). Unlike multiscale GNN architectures, the model itself is unchanged; only the fidelity of the training data varies over time. We achieve comparable generalization accuracy while reducing total wall-clock time by up to 50%. Furthermore, on datasets where our model lacks the capacity to learn the underlying physics, using curriculum learning enables it to break through plateaus.
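A coarse-to-fine curriculum of this kind can be expressed as a schedule that maps the training epoch to the mesh resolution drawn from the dataset. The sketch below is hypothetical: only the 300,000-node upper bound comes from the abstract, while the stage boundaries, the intermediate node counts, and the function name are invented for illustration.

```python
def mesh_resolution(epoch, stages):
    """Return the maximum node count of meshes used at a given epoch.

    `stages` is a list of (first_epoch, max_nodes) pairs sorted by first_epoch;
    the latest stage whose first_epoch has been reached applies.
    """
    current = stages[0][1]
    for first_epoch, max_nodes in stages:
        if epoch >= first_epoch:
            current = max_nodes
    return current

# Hypothetical three-stage schedule: coarse, medium, then full resolution.
STAGES = [(0, 3_000), (20, 30_000), (40, 300_000)]

resolutions = [mesh_resolution(e, STAGES) for e in (0, 25, 100)]
```

The model and optimizer are untouched by the schedule; only the data loader would consult it when choosing which mesh fidelity to sample, matching the statement that the architecture itself is unchanged.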
[410] Learning from Heterophilic Graphs: A Spectral Theory Perspective on the Impact of Self-Loops and Parallel Edges
Kushal Bose, Swagatam Das
Main category: cs.LG
TL;DR: Analysis of how adding self-loops and parallel edges affects graph spectra and GCN performance on heterophilic graphs, showing connections between graph properties and filter behavior without costly eigenvalue decomposition.
Details
Motivation: Graph heterophily challenges MP-GNN performance, particularly low-pass filters like GCNs, which suffer from blending messages from dissimilar nodes. The relationship between graph spectra and filter performance on heterophilic graphs needs deeper investigation.
Method: Updated heterophilic graphs by adding self-loops and parallel edges, observed changes in graph Laplacian eigenvalues, conducted studies on GCN performance across benchmark heterophilic networks with these modifications, and established connections between graph spectra and performance trends.
Result: GCN performance showed varying trends (increasing or decreasing) when adding self-loops versus parallel edges. Graph spectra changes reflected intrinsic graph properties like connected components, sparsity, average degree, and cluster structures.
Conclusion: The work provides a method to evaluate graph spectrum and properties through observing low-pass filter performance trends, avoiding expensive eigenvalue decomposition. Theoretical foundations validate the impact of self-loops and parallel edges on graph spectrum.
Abstract: Graph heterophily poses a formidable challenge to the performance of Message-passing Graph Neural Networks (MP-GNNs). The familiar low-pass filters like Graph Convolutional Networks (GCNs) face performance degradation, which can be attributed to the blending of the messages from dissimilar neighboring nodes. The performance of the low-pass filters on heterophilic graphs still requires an in-depth analysis. In this context, we update the heterophilic graphs by adding a number of self-loops and parallel edges. We observe that eigenvalues of the graph Laplacian decrease and increase respectively by increasing the number of self-loops and parallel edges. We conduct several studies regarding the performance of GCN on various benchmark heterophilic networks by adding either self-loops or parallel edges. The studies reveal that the GCN exhibited either increasing or decreasing performance trends on adding self-loops and parallel edges. In light of the studies, we established connections between the graph spectra and the performance trends of the low-pass filters on the heterophilic graphs. The graph spectra characterize the essential intrinsic properties of the input graph like the presence of connected components, sparsity, average degree, cluster structures, etc. Our work is adept at seamlessly evaluating graph spectrum and properties by observing the performance trends of the low-pass filters without pursuing the costly eigenvalue decomposition. The theoretical foundations are also discussed to validate the impact of adding self-loops and parallel edges on the graph spectrum.
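The effect of self-loops on the spectrum can be checked numerically on a toy graph. The sketch below assumes the symmetric normalized Laplacian and the GCN-style convention that each self-loop adds 1 to a node's degree; the abstract does not state which Laplacian the authors analyze, so this illustrates the general phenomenon rather than reproducing their setup. It covers self-loops only.

```python
import numpy as np


def normalized_laplacian_eigs(adj: np.ndarray) -> np.ndarray:
    """Eigenvalues (ascending) of L = I - D^{-1/2} A D^{-1/2}."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    return np.linalg.eigvalsh(lap)


# Path graph 0-1-2-3 (bipartite, so its largest eigenvalue is 2).
A = np.zeros((4, 4))
for i in range(3):
    A[i, i + 1] = A[i + 1, i] = 1.0

eigs_plain = normalized_laplacian_eigs(A)
eigs_loops = normalized_laplacian_eigs(A + np.eye(4))  # one self-loop per node

print("largest eigenvalue without self-loops:", eigs_plain[-1])
print("largest eigenvalue with self-loops:   ", eigs_loops[-1])
```

On this example the largest eigenvalue drops below 2 once self-loops are added, which is the direction of change the abstract reports for self-loops.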
[411] FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning
Liang Hu, Jianpeng Jiao, Jiashuo Liu, Yanle Ren, Zhoufutu Wen, Kaiyuan Zhang, Xuanliang Zhang, Xiang Gao, Tianci He, Fei Hu, Yali Liao, Zaiyuan Wang, Chenghao Yang, Qianyu Yang, Mingren Yin, Zhiyuan Zeng, Ge Zhang, Xinyi Zhang, Xiying Zhao, Zhenwei Zhu, Hongseok Namkoong, Wenhao Huang, Yuwen Tang
Main category: cs.LG
TL;DR: FinSearchComp is the first open-source benchmark for evaluating LLM agents on realistic financial search and reasoning tasks, featuring 635 expert-annotated questions across global and Greater China markets.
Details
Motivation: Existing financial datasets don't evaluate end-to-end agent search capabilities, and constructing realistic financial tasks requires deep expertise and time-sensitive data that's hard to evaluate.
Method: Created three task types (Time-Sensitive Data Fetching, Simple Historical Lookup, Complex Historical Investigation) with 70 financial experts for annotation and multi-stage QA pipeline. Evaluated 21 models on 635 questions.
Result: Grok 4 (web) performed best on global subset, DouBao (web) led on Greater China subset. Web search and financial plugins significantly improved performance. Model/tool country origin impacted results.
Conclusion: FinSearchComp provides a professional, high-difficulty testbed for complex financial search and reasoning that aligns with real-world analyst workflows.
Abstract: Search has emerged as core infrastructure for LLM-based agents and is widely viewed as critical on the path toward more general intelligence. Finance is a particularly demanding proving ground: analysts routinely conduct complex, multi-step searches over time-sensitive, domain-specific data, making it ideal for assessing both search proficiency and knowledge-grounded reasoning. Yet no existing open financial datasets evaluate the data searching capability of end-to-end agents, largely because constructing realistic, complicated tasks requires deep financial expertise and time-sensitive data is hard to evaluate. We present FinSearchComp, the first fully open-source agent benchmark for realistic, open-domain financial search and reasoning. FinSearchComp comprises three tasks (Time-Sensitive Data Fetching, Simple Historical Lookup, and Complex Historical Investigation) that closely reproduce real-world financial analyst workflows. To ensure difficulty and reliability, we engage 70 professional financial experts for annotation and implement a rigorous multi-stage quality-assurance pipeline. The benchmark includes 635 questions spanning global and Greater China markets, and we evaluate 21 models (products) on it. Grok 4 (web) tops the global subset, approaching expert-level accuracy. DouBao (web) leads on the Greater China subset. Experimental analyses show that equipping agents with web search and financial plugins substantially improves results on FinSearchComp, and the country of origin of models and tools impacts performance significantly. By aligning with realistic analyst tasks and providing end-to-end evaluation, FinSearchComp offers a professional, high-difficulty testbed for complex financial search and reasoning.
[412] On the Correlation between Individual Fairness and Predictive Accuracy in Probabilistic Models
Alessandro Antonucci, Eric Rossetto, Ivan Duvnjak
Main category: cs.LG
TL;DR: Analysis of individual fairness in generative classifiers through robustness to private feature perturbations, showing correlation between robustness and predictive accuracy using Bayesian networks and Markov random fields.
Details
Motivation: To investigate individual fairness in probabilistic classifiers by analyzing how robust posterior inferences are to perturbations in private features, addressing the fairness-accuracy trade-off.
Method: Used Bayesian networks as generative models on 14 benchmark datasets with fairness concerns. Reformulated robustness analysis as a most probable explanation task in an auxiliary Markov random field to handle computational complexity.
Result: Empirical experiments confirmed the hypothesis that instances with greater robustness to private feature perturbations are more likely to be classified accurately.
Conclusion: The correlation between robustness and accuracy suggests novel approaches to mitigate the traditional trade-off between fairness and predictive performance in machine learning systems.
Abstract: We investigate individual fairness in generative probabilistic classifiers by analysing the robustness of posterior inferences to perturbations in private features. Building on established results in robustness analysis, we hypothesise a correlation between robustness and predictive accuracy, specifically, instances exhibiting greater robustness are more likely to be classified accurately. We empirically assess this hypothesis using a benchmark of fourteen datasets with fairness concerns, employing Bayesian networks as the underlying generative models. To address the computational complexity associated with robustness analysis over multiple private features with Bayesian networks, we reformulate the problem as a most probable explanation task in an auxiliary Markov random field. Our experiments confirm the hypothesis about the correlation, suggesting novel directions to mitigate the traditional trade-off between fairness and accuracy.
[413] CoVariance Filters and Neural Networks over Hilbert Spaces
Claudio Battiloro, Andrea Cavallo, Elvin Isufi
Main category: cs.LG
TL;DR: CoVariance Neural Networks extended to infinite-dimensional Hilbert spaces using covariance operators, with theoretical guarantees and practical validation on time-series classification.
Details
Motivation: Extend the robustness and transferability properties of CoVariance Neural Networks from finite-dimensional to infinite-dimensional Hilbert spaces, addressing the gap in existing literature.
Method: Introduce Hilbert coVariance Filters (HVFs) and Networks (HVNs) centered on covariance operators, with principled discretization and theoretical analysis showing recovery of Functional PCA.
Result: HVNs demonstrate robust performance on synthetic and real-world time-series classification tasks, outperforming MLP and FPCA-based classifiers.
Conclusion: The proposed framework successfully extends covariance-based neural networks to infinite-dimensional settings with theoretical foundations and practical effectiveness.
Abstract: CoVariance Neural Networks (VNNs) perform graph convolutions on the empirical covariance matrix of signals defined over finite-dimensional Hilbert spaces, motivated by robustness and transferability properties. Yet, little is known about how these arguments extend to infinite-dimensional Hilbert spaces. In this work, we take a first step by introducing a novel convolutional learning framework for signals defined over infinite-dimensional Hilbert spaces, centered on the (empirical) covariance operator. We constructively define Hilbert coVariance Filters (HVFs) and design Hilbert coVariance Networks (HVNs) as stacks of HVF filterbanks with nonlinear activations. We propose a principled discretization procedure, and we prove that empirical HVFs can recover the Functional PCA (FPCA) of the filtered signals. We then describe the versatility of our framework with examples ranging from multivariate real-valued functions to reproducing kernel Hilbert spaces. Finally, we validate HVNs on both synthetic and real-world time-series classification tasks, showing robust performance compared to MLP and FPCA-based classifiers.
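For intuition, the finite-dimensional case that the abstract starts from can be sketched as a polynomial filter in the empirical covariance matrix, $z = \sum_{k=0}^{K} h_k C^k x$. The polynomial form follows the usual graph-convolution template and is an assumption here; the function name and the tap values are hypothetical, and the infinite-dimensional HVF construction and its discretization are not reproduced.

```python
import numpy as np


def covariance_filter(samples: np.ndarray, x: np.ndarray, taps) -> np.ndarray:
    """Apply z = sum_k taps[k] * C^k x, with C the empirical covariance of `samples`.

    samples: (n_samples, dim) data used to estimate C
    x:       (dim,) signal to filter
    taps:    filter coefficients h_0, ..., h_K (learned in a real network)
    """
    centered = samples - samples.mean(axis=0)
    cov = centered.T @ centered / len(samples)
    out = np.zeros(x.shape, dtype=float)
    power = x.astype(float)  # holds C^k x, starting at k = 0
    for h in taps:
        out += h * power
        power = cov @ power
    return out


rng = np.random.default_rng(0)
data = rng.normal(size=(50, 3))
signal = np.array([1.0, 2.0, 3.0])
filtered = covariance_filter(data, signal, taps=[0.5, 0.3, 0.2])
```

A network in this style would stack several such filters with pointwise nonlinearities between them, as the abstract describes for HVNs.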
[414] Is Meta-Learning Out? Rethinking Unsupervised Few-Shot Classification with Limited Entropy
Yunchuan Guan, Yu Liu, Ke Zhou, Zhiqi Shen, Jenq-Neng Hwang, Serge Belongie, Lei Li
Main category: cs.LG
TL;DR: Meta-learning shows tighter generalization bounds than whole-class training in entropy-limited settings, proving more efficient with limited entropy and robust to noise. MINO framework enhances unsupervised performance through adaptive clustering and stability-based scaling.
Details
Motivation: To demonstrate the value of meta-learning over whole-class training strategies in few-shot tasks by establishing fair comparison settings and showing meta-learning's advantages in limited entropy scenarios.
Method: Established entropy-limited supervised setting for fair comparisons, conducted theoretical analysis and experimental validation. Proposed MINO framework with DBSCAN clustering algorithm, dynamic head for task construction, and stability-based meta-scaler for noise robustness.
Result: Meta-learning has tighter generalization bound than whole-class training, is more efficient with limited entropy, and more robust to label noise and heterogeneous tasks. MINO framework shows effectiveness in multiple unsupervised few-shot and zero-shot tasks.
Conclusion: Meta-learning provides significant advantages in entropy-limited settings and unsupervised tasks. The proposed MINO framework successfully leverages these advantages for improved performance in unsupervised few-shot and zero-shot learning scenarios.
Abstract: Meta-learning is a powerful paradigm for tackling few-shot tasks. However, recent studies indicate that models trained with the whole-class training strategy can achieve comparable performance to those trained with meta-learning in few-shot classification tasks. To demonstrate the value of meta-learning, we establish an entropy-limited supervised setting for fair comparisons. Through both theoretical analysis and experimental validation, we establish that meta-learning has a tighter generalization bound compared to whole-class training. We show that meta-learning is more efficient with limited entropy and is more robust to label noise and heterogeneous tasks, making it well-suited for unsupervised tasks. Based on these insights, we propose MINO, a meta-learning framework designed to enhance unsupervised performance. MINO utilizes the adaptive clustering algorithm DBSCAN with a dynamic head for unsupervised task construction and a stability-based meta-scaler for robustness against label noise. Extensive experiments confirm its effectiveness in multiple unsupervised few-shot and zero-shot tasks.
[415] TRUST-FS: Tensorized Reliable Unsupervised Multi-View Feature Selection for Incomplete Data
Minghui Lu, Yanyong Huang, Minbo Ma, Dongjie Wang, Xiuwen Yi, Tianrui Li
Main category: cs.LG
TL;DR: TRUST-FS is a novel tensor-based method for multi-view unsupervised feature selection that handles missing variables, integrates feature selection with imputation, and learns reliable similarity graphs using Subjective Logic.
Details
Motivation: Existing MUFS methods have limitations: they can't handle missing variables (only missing views), treat imputation and feature selection separately, and suffer from inaccurate similarity graphs due to missing data.
Method: Proposes TRUST-FS with adaptive-weighted CP decomposition that simultaneously performs feature selection, missing-variable imputation, and view weight learning within a unified tensor factorization framework. Uses Subjective Logic to acquire trustworthy cross-view similarity information.
Result: Comprehensive experimental results demonstrate the effectiveness and superiority of TRUST-FS over state-of-the-art methods.
Conclusion: TRUST-FS provides a unified solution that addresses multiple challenges in incomplete multi-view feature selection through tensor factorization and reliable similarity learning.
Abstract: Multi-view unsupervised feature selection (MUFS), which selects informative features from multi-view unlabeled data, has attracted increasing research interest in recent years. Although great efforts have been devoted to MUFS, several challenges remain: 1) existing methods for incomplete multi-view data are limited to handling missing views and are unable to address the more general scenario of missing variables, where some features have missing values in certain views; 2) most methods address incomplete data by first imputing missing values and then performing feature selection, treating these two processes independently and overlooking their interactions; 3) missing data can result in an inaccurate similarity graph, which reduces the performance of feature selection. To solve this dilemma, we propose a novel MUFS method for incomplete multi-view data with missing variables, termed Tensorized Reliable UnSupervised mulTi-view Feature Selection (TRUST-FS). TRUST-FS introduces a new adaptive-weighted CP decomposition that simultaneously performs feature selection, missing-variable imputation, and view weight learning within a unified tensor factorization framework. By utilizing Subjective Logic to acquire trustworthy cross-view similarity information, TRUST-FS facilitates learning a reliable similarity graph, which subsequently guides feature selection and imputation. Comprehensive experimental results demonstrate the effectiveness and superiority of our method over state-of-the-art methods.
[416] B-TGAT: A Bi-directional Temporal Graph Attention Transformer for Clustering Multivariate Spatiotemporal Data
Francis Ndikum Nji, Vandana Janaja, Jianwu Wang
Main category: cs.LG
TL;DR: A hybrid U-Net autoencoder with Bi-directional Temporal Graph Attention Transformer (B-TGAT) and ConvLSTM2D modules for superior spatiotemporal climate data clustering.
Details
Motivation: Conventional clustering methods struggle with high-dimensional spatiotemporal climate data due to complex temporal dependencies, evolving spatial interactions, and non-stationary dynamics that require capturing both local/global temporal relationships while preserving spatial context.
Method: Time-distributed hybrid U-Net autoencoder with ConvLSTM2D modules for joint spatial-temporal feature extraction, skip connections for multiscale spatial detail preservation, and B-TGAT bottleneck for graph-based spatial modeling with attention-driven temporal encoding and adaptive temporal neighbor weighting.
Result: Superior cluster separability, temporal stability, and alignment with known climate transitions on three distinct spatiotemporal climate datasets compared to state-of-the-art baselines.
Conclusion: The integration of ConvLSTM2D, U-Net skip connections, and B-TGAT enhances temporal clustering performance while providing interpretable insights into complex spatiotemporal variability, advancing both methodological development and climate science applications.
Abstract: Clustering high-dimensional multivariate spatiotemporal climate data is challenging due to complex temporal dependencies, evolving spatial interactions, and non-stationary dynamics. Conventional clustering methods, including recurrent and convolutional models, often struggle to capture both local and global temporal relationships while preserving spatial context. We present a time-distributed hybrid U-Net autoencoder that integrates a Bi-directional Temporal Graph Attention Transformer (B-TGAT) to guide efficient temporal clustering of multidimensional spatiotemporal climate datasets. The encoder and decoder are equipped with ConvLSTM2D modules that extract joint spatial–temporal features by modeling localized dynamics and spatial correlations over time, and skip connections that preserve multiscale spatial details during feature compression and reconstruction. At the bottleneck, B-TGAT integrates graph-based spatial modeling with attention-driven temporal encoding, enabling adaptive weighting of temporal neighbors and capturing both short and long-range dependencies across regions. This architecture produces discriminative latent embeddings optimized for clustering. Experiments on three distinct spatiotemporal climate datasets demonstrate superior cluster separability, temporal stability, and alignment with known climate transitions compared to state-of-the-art baselines. The integration of ConvLSTM2D, U-Net skip connections, and B-TGAT enhances temporal clustering performance while providing interpretable insights into complex spatiotemporal variability, advancing both methodological development and climate science applications.
[417] HAM: Hierarchical Adapter Merging for Scalable Continual Learning
Eric Nuertey Coleman, Luigi Quarantiello, Samrat Mukherjee, Julio Hurtado, Vincenzo Lomonaco
Main category: cs.LG
TL;DR: HAM is a novel hierarchical adapter merging framework that dynamically combines adapters from different tasks to address catastrophic forgetting in continual learning, outperforming state-of-the-art methods especially with increasing task numbers.
Details
Motivation: Current PEFT methods like LoRA struggle with scaling to dynamic learning scenarios and long task sequences, as maintaining separate adapters per task introduces complexity and potential interference between tasks.
Method: HAM maintains a fixed set of groups that hierarchically consolidate new adapters. For each task, it trains a low-rank adapter with importance scalar, dynamically groups tasks based on adapter similarity, then prunes, scales and merges adapters within groups to facilitate transfer learning.
Result: Extensive experiments on three vision benchmarks show HAM significantly outperforms state-of-the-art methods, particularly as the number of tasks increases.
Conclusion: HAM provides an effective framework for scalable continual learning by dynamically merging adapters in a hierarchical structure, enabling better knowledge retention and transfer between related tasks while managing complexity.
Abstract: Continual learning is an essential capability of human cognition, yet it poses significant challenges for current deep learning models. The primary issue is that new knowledge can interfere with previously learned information, causing the model to forget earlier knowledge in favor of the new, a phenomenon known as catastrophic forgetting. Although large pre-trained models can partially mitigate forgetting by leveraging their existing knowledge and over-parameterization, they often struggle when confronted with novel data distributions. Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA, enable efficient adaptation to new knowledge. However, they still face challenges in scaling to dynamic learning scenarios and long sequences of tasks, as maintaining one adapter per task introduces complexity and increases the potential for interference. In this paper, we introduce Hierarchical Adapters Merging (HAM), a novel framework that dynamically combines adapters from different tasks during training. This approach enables HAM to scale effectively, allowing it to manage more tasks than competing baselines with improved efficiency. To achieve this, HAM maintains a fixed set of groups that hierarchically consolidate new adapters. For each task, HAM trains a low-rank adapter along with an importance scalar, then dynamically groups tasks based on adapter similarity. Within each group, adapters are pruned, scaled and merged, facilitating transfer learning between related tasks. Extensive experiments on three vision benchmarks show that HAM significantly outperforms state-of-the-art methods, particularly as the number of tasks increases.
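The prune, scale, and merge step within one group might look roughly like the following. This is a sketch under strong assumptions: the abstract does not give the pruning criterion, the scaling rule, or the merge operator, so magnitude pruning and an importance-weighted sum are used as stand-ins, and the function names and `keep_ratio` parameter are hypothetical. Each adapter is represented by its dense weight update (the product of its low-rank factors).

```python
import numpy as np


def prune_by_magnitude(delta: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Zero out all but the largest-magnitude entries (assumed pruning rule)."""
    k = max(1, int(round(keep_ratio * delta.size)))
    threshold = np.sort(np.abs(delta), axis=None)[-k]
    return np.where(np.abs(delta) >= threshold, delta, 0.0)


def merge_group(deltas, importances, keep_ratio: float = 0.5) -> np.ndarray:
    """Prune each adapter update, scale it by its importance scalar, and sum."""
    merged = np.zeros_like(deltas[0], dtype=float)
    for delta, alpha in zip(deltas, importances):
        merged += alpha * prune_by_magnitude(delta, keep_ratio)
    return merged


# Two toy adapter updates assigned to the same group.
d1 = np.array([[1.0, -4.0], [0.5, 3.0]])
d2 = np.array([[2.0, 0.1], [0.2, -1.0]])
group_update = merge_group([d1, d2], importances=[1.0, 0.5], keep_ratio=0.5)
```

Because the number of groups is fixed, the number of stored merged updates stays constant as tasks arrive, which is what lets this kind of scheme scale to long task sequences.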
[418] Density-Aware Farthest Point Sampling
Paolo Climaco, Jochen Garcke
Main category: cs.LG
TL;DR: DA-FPS sampling method reduces prediction error in regression with limited labeled data by minimizing weighted fill distance.
Details
Motivation: Limited labeled training data due to computational constraints or high labeling costs requires efficient training set selection methods.
Method: Proposed Density-Aware Farthest Point Sampling (DA-FPS) that minimizes weighted fill distance to reduce prediction error bound for Lipschitz continuous regression models.
Result: DA-FPS significantly reduces mean absolute prediction error compared to other sampling strategies across three datasets with two regression models.
Conclusion: DA-FPS is an effective passive, model-agnostic sampling method that improves regression performance when labeled data is scarce.
Abstract: We focus on training machine learning regression models in scenarios where the availability of labeled training data is limited due to computational constraints or high labeling costs. Thus, selecting suitable training sets from unlabeled data is essential for balancing performance and efficiency. For the selection of the training data, we focus on passive and model-agnostic sampling methods that only consider the data feature representations. We derive an upper bound for the expected prediction error of Lipschitz continuous regression models that linearly depends on the weighted fill distance of the training set, a quantity we can estimate simply by considering the data features. We introduce “Density-Aware Farthest Point Sampling” (DA-FPS), a novel sampling method. We prove that DA-FPS provides approximate minimizers for a data-driven estimation of the weighted fill distance, thereby aiming at minimizing our derived bound. We conduct experiments using two regression models across three datasets. The results demonstrate that DA-FPS significantly reduces the mean absolute prediction error compared to other sampling strategies.
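As background, the classic unweighted farthest point sampling that DA-FPS builds on can be sketched as below, together with the (unweighted) fill distance it greedily reduces. The density-aware weights are not specified in the abstract and are deliberately omitted, so this is the baseline and not the authors' method.

```python
import numpy as np


def farthest_point_sampling(points: np.ndarray, n_select: int, start: int = 0) -> list:
    """Greedily pick the point farthest from everything selected so far."""
    selected = [start]
    # Distance from every point to its nearest selected point.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(n_select - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected


def fill_distance(points: np.ndarray, selected: list) -> float:
    """Largest distance from any point to its nearest selected point."""
    d = np.linalg.norm(points[:, None, :] - points[selected][None, :, :], axis=2)
    return float(d.min(axis=1).max())


pts = np.array([[0.0], [1.0], [2.0], [10.0]])
chosen = farthest_point_sampling(pts, n_select=3)  # → [0, 3, 2]
radius = fill_distance(pts, chosen)                # → 1.0
```

Only feature vectors are used, which matches the passive, model-agnostic setting the abstract describes: no labels or model predictions enter the selection.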
[419] FOSSIL: Regret-minimizing weighting for robust learning under imbalance and small data
J. Cha, J. Lee, J. Cho, J. Shin
Main category: cs.LG
TL;DR: FOSSIL is a unified weighting framework that addresses imbalanced and small data problems by integrating class imbalance correction, difficulty-aware curricula, augmentation penalties, and warmup dynamics into a single interpretable formula with theoretical guarantees.
Details
Motivation: Imbalanced and small data regimes are common in domains like rare disease imaging and genomics, where labeled samples are scarce and naive augmentation introduces artifacts. Existing solutions address isolated aspects but remain fragile or complex.
Method: FOSSIL (Flexible Optimization via Sample Sensitive Importance Learning) - a unified weighting framework that integrates class imbalance correction, difficulty-aware curricula, augmentation penalties, and warmup dynamics into a single interpretable formula with regret-based theoretical guarantees.
Result: Achieves consistent empirical gains over ERM, curriculum, and meta-weighting baselines on synthetic and real-world datasets, while requiring no architectural changes.
Conclusion: FOSSIL provides a robust and theoretically grounded solution for imbalanced and small data problems, outperforming existing methods without requiring architectural modifications.
Abstract: Imbalanced and small data regimes are pervasive in domains such as rare disease imaging, genomics, and disaster response, where labeled samples are scarce and naive augmentation often introduces artifacts. Existing solutions such as oversampling, focal loss, or meta-weighting address isolated aspects of this challenge but remain fragile or complex. We introduce FOSSIL (Flexible Optimization via Sample Sensitive Importance Learning), a unified weighting framework that seamlessly integrates class imbalance correction, difficulty-aware curricula, augmentation penalties, and warmup dynamics into a single interpretable formula. Unlike prior heuristics, the proposed framework provides regret-based theoretical guarantees and achieves consistent empirical gains over ERM, curriculum, and meta-weighting baselines on synthetic and real-world datasets, while requiring no architectural changes.
[420] On the Out-of-Distribution Backdoor Attack for Federated Learning
Jiahao Xu, Zikai Zhang, Rui Hu
Main category: cs.LG
TL;DR: A novel out-of-distribution backdoor attack (OBA) for federated learning uses OOD data as both poisoned samples and triggers, with SoDa enhancing stealthiness through model regularization. BNGuard defense detects malicious updates via batch normalization statistics deviations.
Details
Motivation: Traditional backdoor attacks in FL have limitations due to visible triggers and physical modifications, requiring more practical and stealthy attack methods that can evade existing defenses.
Method: OBA uses out-of-distribution data as both poisoned samples and triggers. SoDa regularizes malicious local models’ magnitude and direction to align with benign versions for stealth. BNGuard defense monitors batch normalization layer statistics deviations to detect malicious updates.
Result: OBA effectively circumvents state-of-the-art defenses while maintaining high main task accuracy. BNGuard successfully identifies and excludes malicious model updates, enhancing FL backdoor robustness across various settings.
Conclusion: The paper presents both an advanced stealthy backdoor attack (OBA/SoDa) and corresponding defense (BNGuard) for federated learning, demonstrating the ongoing security arms race in FL systems and providing practical solutions for both attack and defense scenarios.
Abstract: Traditional backdoor attacks in federated learning (FL) operate within constrained attack scenarios, as they depend on visible triggers and require physical modifications to the target object, which limits their practicality. To address this limitation, we introduce a novel backdoor attack prototype for FL called the out-of-distribution (OOD) backdoor attack ($\mathtt{OBA}$), which uses OOD data as both poisoned samples and triggers simultaneously. Our approach significantly broadens the scope of backdoor attack scenarios in FL. To improve the stealthiness of $\mathtt{OBA}$, we propose $\mathtt{SoDa}$, which regularizes both the magnitude and direction of malicious local models during local training, aligning them closely with their benign versions to evade detection. Empirical results demonstrate that $\mathtt{OBA}$ effectively circumvents state-of-the-art defenses while maintaining high accuracy on the main task. To address this security vulnerability in the FL system, we introduce $\mathtt{BNGuard}$, a new server-side defense method tailored against $\mathtt{SoDa}$. $\mathtt{BNGuard}$ leverages the observation that OOD data causes significant deviations in the running statistics of batch normalization layers. This allows $\mathtt{BNGuard}$ to identify malicious model updates and exclude them from aggregation, thereby enhancing the backdoor robustness of FL. Extensive experiments across various settings show the effectiveness of $\mathtt{BNGuard}$ on defending against $\mathtt{SoDa}$. The code is available at https://github.com/JiiahaoXU/SoDa-BNGuard.
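The defensive idea of screening client updates by their batch normalization running statistics can be illustrated with a generic outlier test. The sketch below is not the BNGuard algorithm: the median-plus-MAD rule, the threshold `k`, and the function name are assumptions chosen for illustration, and the authors' actual procedure is in the linked repository.

```python
import numpy as np


def flag_suspicious(bn_stats: np.ndarray, k: float = 3.0) -> list:
    """Flag clients whose BN running statistics sit far from the robust center.

    bn_stats: (n_clients, n_features) array, one row of flattened BN running
              means (or variances) per client update.
    k:        hypothetical threshold in units of median absolute deviation.
    """
    center = np.median(bn_stats, axis=0)
    dist = np.linalg.norm(bn_stats - center, axis=1)
    med = np.median(dist)
    mad = np.median(np.abs(dist - med))
    return np.where(dist > med + k * max(mad, 1e-12))[0].tolist()


# Four benign-looking clients and one whose statistics are shifted,
# as might happen after local training on out-of-distribution data.
stats = np.array([
    [0.0, 1.0],
    [0.1, 1.0],
    [0.0, 1.1],
    [0.1, 0.9],
    [5.0, 6.0],
])
excluded = flag_suspicious(stats)  # → [4]
```

A server would then aggregate only the updates that were not flagged.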
[421] Single-stream Policy Optimization
Zhongwen Xu, Zihan Ding
Main category: cs.LG
TL;DR: SPO introduces a single-stream policy optimization method that eliminates group-based issues in LLM training, providing more stable learning signals and better scalability than GRPO, resulting in significant accuracy improvements on math benchmarks.
Details
Motivation: Prevailing group-based methods like GRPO suffer from degenerate groups erasing learning signals and synchronization barriers hindering scalability, requiring a more robust approach.
Method: SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, enabling group-free operation with higher throughput.
Result: SPO improves average maj@32 by +3.4 percentage points over GRPO on five hard math benchmarks, with substantial gains on challenging datasets (+7.3pp on BRUMO 25, +4.4pp on AIME 25, +3.3pp on HMMT 25).
Conclusion: SPO’s success demonstrates that fundamental principles rather than architectural workarounds can drive progress in LLM reasoning, offering a more robust and efficient path forward.
Abstract: We revisit policy-gradient optimization for Large Language Models (LLMs) from a single-stream perspective. Prevailing group-based methods like GRPO reduce variance with on-the-fly baselines but suffer from critical flaws: frequent degenerate groups erase learning signals, and synchronization barriers hinder scalability. We introduce Single-stream Policy Optimization (SPO), which eliminates these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings where generation times vary. Furthermore, the persistent value tracker naturally enables an adaptive curriculum via prioritized sampling. Experiments using Qwen3-8B show that SPO converges more smoothly and attains higher accuracy than GRPO, while eliminating computation wasted on degenerate groups. Ablation studies confirm that SPO’s gains stem from its principled approach to baseline estimation and advantage normalization, offering a more robust and efficient path for LLM reasoning. Across five hard math benchmarks with Qwen3-8B, SPO improves the average maj@32 by +3.4 percentage points (pp) over GRPO, driven by substantial absolute point gains on challenging datasets, including +7.3 pp on BRUMO 25, +4.4 pp on AIME 25, and +3.3 pp on HMMT 25, and achieves consistent relative gains in pass@$k$ across the evaluated $k$ values. SPO’s success challenges the prevailing trend of adding incidental complexity to RL algorithms, highlighting a path where fundamental principles, not architectural workarounds, drive the next wave of progress in LLM reasoning.
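The contrast between per-group and single-stream baselines can be sketched in a few lines. The group normalization below is the commonly described GRPO form. The tracker is a plain exponential moving average used as a stand-in for the paper's KL-adaptive tracker, and centering before dividing by the batch standard deviation is an assumed reading of "normalizes advantages globally"; the class and function names, `init`, and `rate` are hypothetical.

```python
import numpy as np


def grpo_advantages(group_rewards, eps: float = 1e-8) -> np.ndarray:
    """Per-group baseline: standardize rewards within one prompt's group."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


class ValueTracker:
    """Persistent per-prompt baseline (simple EMA; the paper's is KL-adaptive)."""

    def __init__(self, init: float = 0.5, rate: float = 0.1):
        self.values = {}
        self.init = init
        self.rate = rate

    def get(self, prompt_id) -> float:
        return self.values.get(prompt_id, self.init)

    def update(self, prompt_id, reward: float) -> None:
        v = self.get(prompt_id)
        self.values[prompt_id] = v + self.rate * (reward - v)


def single_stream_advantages(prompt_ids, rewards, tracker, eps: float = 1e-8) -> np.ndarray:
    """One sample per prompt: subtract the tracked baseline, normalize over the batch."""
    r = np.asarray(rewards, dtype=float)
    baselines = np.array([tracker.get(p) for p in prompt_ids])
    adv = r - baselines
    adv = (adv - adv.mean()) / (adv.std() + eps)
    for p, reward in zip(prompt_ids, r):
        tracker.update(p, float(reward))
    return adv


degenerate = grpo_advantages([1.0, 1.0, 1.0, 1.0])  # all zeros: no learning signal
tracker = ValueTracker()
adv = single_stream_advantages(["a", "b"], [1.0, 0.0], tracker)
```

In the degenerate group every response receives the same reward, so the group-relative advantage vanishes, whereas the tracked baseline still yields a nonzero signal whenever a reward differs from the prompt's running estimate.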
[422] Metacognitive Reuse: Turning Recurring LLM Reasoning Into Concise Behaviors
Aniket Didolkar, Nicolas Ballas, Sanjeev Arora, Anirudh Goyal
Main category: cs.LG
TL;DR: LLMs can reduce redundant reasoning by converting recurring reasoning fragments into reusable behaviors stored in a handbook, achieving up to 46% token reduction and 10% accuracy improvement.
Details
Motivation: LLMs often re-derive the same intermediate steps across problems, inflating token usage and latency while reducing context window capacity for exploration.
Method: Convert recurring reasoning fragments into concise reusable behaviors via metacognitive analysis, store them in a behavior handbook, and use them in-context during inference or distill via supervised fine-tuning.
Result: 1) Up to 46% reduction in reasoning tokens while matching or improving baseline accuracy; 2) up to 10% higher accuracy than a naive critique-and-revise baseline, without parameter updates; 3) more effective conversion of non-reasoning models into reasoning models via behavior-conditioned SFT than with vanilla SFT.
Conclusion: Turning slow derivations into fast procedural hints enables LLMs to remember how to reason, not just what to conclude, improving efficiency and effectiveness.
Abstract: Large language models (LLMs) now solve multi-step problems by emitting extended chains of thought. During the process, they often re-derive the same intermediate steps across problems, inflating token usage and latency. This saturation of the context window leaves less capacity for exploration. We study a simple mechanism that converts recurring reasoning fragments into concise, reusable “behaviors” (name + instruction) via the model’s own metacognitive analysis of prior traces. These behaviors are stored in a “behavior handbook” which supplies them to the model in-context at inference or distills them into parameters via supervised fine-tuning. This approach achieves improved test-time reasoning across three different settings: 1) Behavior-conditioned inference: Providing the LLM relevant behaviors in-context during reasoning reduces the number of reasoning tokens by up to 46% while matching or improving baseline accuracy; 2) Behavior-guided self-improvement: Without any parameter updates, the model improves its own future reasoning by leveraging behaviors from its own past problem-solving attempts. This yields up to 10% higher accuracy than a naive critique-and-revise baseline; and 3) Behavior-conditioned SFT: SFT on behavior-conditioned reasoning traces is more effective at converting non-reasoning models into reasoning models than vanilla SFT. Together, these results indicate that turning slow derivations into fast procedural hints enables LLMs to remember how to reason, not just what to conclude.
[423] Don’t Forget the Nonlinearity: Unlocking Activation Functions in Efficient Fine-Tuning
Bo Yin, Xingyi Yang, Xinchao Wang
Main category: cs.LG
TL;DR: NoRA is a novel parameter-efficient fine-tuning method that adapts activation functions instead of weight matrices, achieving comparable or better performance than full fine-tuning with only 0.4% parameter updates, and can be combined with LoRA for further improvements.
Details
Motivation: Existing PEFT methods focus on adapting weight matrices while keeping activation functions fixed, leaving untapped potential in activation-space adaptation for more efficient fine-tuning.
Method: NoRA replaces fixed activation functions with learnable rational functions and applies structured low-rank updates to numerator and denominator coefficients using a group-wise design for localized adaptation and stability.
Result: On vision transformers, NoRA matches/exceeds full fine-tuning with only 0.4% parameters (0.02M), achieving +0.17-0.27% accuracy gains. Combined with LoRA (NoRA++), it outperforms LoRA and DoRA. On LLaMA3-8B, NoRA++ improves generation quality with +0.3-0.8% MMLU gains.
Conclusion: Activation-space tuning is a complementary and highly parameter-efficient alternative to weight-based PEFT, establishing activation functions as first-class objects for model adaptation with implicit regularization benefits.
Abstract: Existing parameter-efficient fine-tuning (PEFT) methods primarily adapt weight matrices while keeping activation functions fixed. We introduce NoRA, the first PEFT framework that directly adapts nonlinear activation functions in pretrained transformer-based models. NoRA replaces fixed activations with learnable rational functions and applies structured low-rank updates to numerator and denominator coefficients, with a group-wise design that localizes adaptation and improves stability at minimal cost. On vision transformers trained on CIFAR-10 and CIFAR-100, NoRA matches or exceeds full fine-tuning while updating only 0.4% of parameters (0.02M), achieving accuracy gains of +0.17% and +0.27%. When combined with LoRA (NoRA++), it outperforms LoRA and DoRA under matched training budgets by adding fewer trainable parameters. On LLaMA3-8B instruction tuning, NoRA++ consistently improves generation quality, yielding average MMLU gains of +0.3%–0.8%, including +1.6% on STEM (Alpaca) and +1.3% on OpenOrca. We further show that NoRA constrains adaptation to a low-dimensional functional subspace, implicitly regularizing update magnitude and direction. These results establish activation-space tuning as a complementary and highly parameter-efficient alternative to weight-based PEFT, positioning activation functions as first-class objects for model adaptation.
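To make the idea of a low-rank update on rational-activation coefficients concrete, here is a rough sketch. It is not NoRA's actual parameterization, which the abstract does not specify: the "1 + |Q(x)|" denominator is a common safeguard borrowed from prior work on rational activations, only the numerator is adapted to keep the example short, and all names (`rational`, `low_rank_delta`, `adapt`) are hypothetical.

```python
def rational(x, num, den):
    """R(x) = P(x) / (1 + |Q(x)|); the absolute value keeps the denominator >= 1.

    num holds coefficients of x^0, x^1, ...; den holds coefficients of x^1, x^2, ...
    """
    p = sum(a * x ** k for k, a in enumerate(num))
    q = sum(b * x ** (k + 1) for k, b in enumerate(den))
    return p / (1.0 + abs(q))


def low_rank_delta(left, right):
    """Rank-r coefficient update: left is groups x r, right is n_coeffs x r."""
    return [[sum(l * r for l, r in zip(lrow, rrow)) for rrow in right] for lrow in left]


def adapt(base, left, right):
    """Add the low-rank update to the frozen per-group base coefficients."""
    delta = low_rank_delta(left, right)
    return [[b + d for b, d in zip(brow, drow)] for brow, drow in zip(base, delta)]


# Two groups that both start from P(x) = x with a trivial denominator.
base_num = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
den = [0.0, 0.0]
right = [[0.0], [0.0], [1.0]]  # rank 1: the update only touches the x^2 coefficient

frozen = adapt(base_num, [[0.0], [0.0]], right)  # zero-initialized factor: no change
tuned = adapt(base_num, [[0.5], [0.0]], right)   # only group 0 is adapted
```

With a zero-initialized factor the adapted activation equals the base one, which mirrors how LoRA-style methods start from the pretrained behavior. The per-group factor is what localizes the change to a single group.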
[424] JANUS: A Dual-Constraint Generative Framework for Stealthy Node Injection Attacks
Jiahao Zhang, Xiaobing Pei, Zhaokun Zhong, Wenqiang Hao, Zhenghao Tang
Main category: cs.LG
TL;DR: JANUS is a dual-constraint framework for stealthy node injection attacks on GNNs that addresses local myopia by aligning both local feature manifolds and global semantic patterns through reinforcement learning.
Details
Motivation: Existing node injection attacks on GNNs rely on indirect proxy metrics for stealthiness, lack consideration of fundamental injected content characteristics, and suffer from local myopia by focusing only on local structure imitation.
Method: Proposes JANUS framework with local feature manifold alignment for geometric consistency and global structured latent variables with mutual information maximization. Models injection as sequential decision process optimized by reinforcement learning.
Result: Experiments on multiple standard datasets show JANUS significantly outperforms existing methods in both attack effectiveness and stealthiness.
Conclusion: The dual-constraint approach addressing both local and global structural consistency enables more effective and stealthy node injection attacks on GNNs.
Abstract: Graph Neural Networks (GNNs) have demonstrated remarkable performance across various applications, yet they are vulnerable to sophisticated adversarial attacks, particularly node injection attacks. The success of such attacks heavily relies on their stealthiness, the ability to blend in with the original graph and evade detection. However, existing methods often achieve stealthiness by relying on indirect proxy metrics, lacking consideration for the fundamental characteristics of the injected content, or focusing only on imitating local structures, which leads to the problem of local myopia. To overcome these limitations, we propose a dual-constraint stealthy node injection framework, called Joint Alignment of Nodal and Universal Structures (JANUS). At the local level, we introduce a local feature manifold alignment strategy to achieve geometric consistency in the feature space. At the global level, we incorporate structured latent variables and maximize the mutual information with the generated structures, ensuring the injected structures are consistent with the semantic patterns of the original graph. We model the injection attack as a sequential decision process, which is optimized by a reinforcement learning agent. Experiments on multiple standard datasets demonstrate that the JANUS framework significantly outperforms existing methods in terms of both attack effectiveness and stealthiness.
[425] Post-Hoc Split-Point Self-Consistency Verification for Efficient, Unified Quantification of Aleatoric and Epistemic Uncertainty in Deep Learning
Zhizhong Zhao, Ke Chen
Main category: cs.LG
TL;DR: A post-hoc single-forward-pass framework for joint aleatoric and epistemic uncertainty quantification using Split-Point Analysis and Mean Absolute Residuals, with applications to both regression and classification tasks.
Details
Motivation: Existing uncertainty quantification methods are either computationally intensive (Bayesian/ensemble) or provide only partial, task-specific estimates. There's a need for efficient, comprehensive uncertainty estimation without model retraining.
Method: Proposes Split-Point Analysis (SPA) to decompose predictive residuals into upper/lower subsets, computes Mean Absolute Residuals (MARs), and uses Self-consistency Discrepancy Score (SDS) for epistemic uncertainty. For regression: side-specific quantile regression with SDS calibration. For classification: SPA-based calibration of softmax outputs.
Result: Extensive experiments show the framework matches or exceeds state-of-the-art UQ methods with minimal computational overhead across diverse regression and classification benchmarks.
Conclusion: The proposed method provides efficient, comprehensive uncertainty quantification without model modification or retraining, offering improved empirical coverage and calibration while maintaining computational efficiency.
Abstract: Uncertainty quantification (UQ) is vital for trustworthy deep learning, yet existing methods are either computationally intensive, such as Bayesian or ensemble methods, or provide only partial, task-specific estimates, such as single-forward-pass techniques. In this paper, we propose a post-hoc single-forward-pass framework that jointly captures aleatoric and epistemic uncertainty without modifying or retraining pretrained models. Our method applies Split-Point Analysis (SPA) to decompose predictive residuals into upper and lower subsets, computing Mean Absolute Residuals (MARs) on each side. We prove that, under ideal conditions, the total MAR equals the harmonic mean of subset MARs; deviations define a novel Self-consistency Discrepancy Score (SDS) for fine-grained epistemic estimation across regression and classification. For regression, side-specific quantile regression yields prediction intervals with improved empirical coverage, which are further calibrated via SDS. For classification, when calibration data are available, we apply SPA-based calibration identities to adjust the softmax outputs and then compute predictive entropy on these calibrated probabilities. Extensive experiments on diverse regression and classification benchmarks demonstrate that our framework matches or exceeds several state-of-the-art UQ methods while incurring minimal overhead. Our source code is available at https://github.com/zzz0527/SPC-UQ.
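The harmonic-mean identity can be checked on a toy example. Assume (our reading of "ideal conditions") that the residuals average to zero around the split point and none sits exactly on it. If p+ and p- are the fractions of residuals above and below the split, then total MAR = p+ * MAR+ + p- * MAR-, and the zero-mean condition gives p+ * MAR+ = p- * MAR-. Together with p+ + p- = 1 this yields total MAR = 2 * MAR+ * MAR- / (MAR+ + MAR-). The sketch below verifies this numerically; the function names are ours, and the absolute gap it reports is only a stand-in, since the abstract does not give the exact SDS formula.

```python
def split_point_mars(residuals, split=0.0):
    """Return (total MAR, upper-side MAR, lower-side MAR) around a split point."""
    upper = [r - split for r in residuals if r > split]
    lower = [split - r for r in residuals if r < split]
    mar_total = sum(abs(r - split) for r in residuals) / len(residuals)
    return mar_total, sum(upper) / len(upper), sum(lower) / len(lower)


def harmonic_mean(a, b):
    return 2 * a * b / (a + b)


# Residuals that sum to zero: the identity holds.
balanced = [3.0, 1.0, -0.5, -1.5, -2.0]
total_b, upper_b, lower_b = split_point_mars(balanced)
gap_balanced = abs(total_b - harmonic_mean(upper_b, lower_b))

# The same residuals shifted by +1 (a biased predictor): the identity breaks.
shifted = [r + 1.0 for r in balanced]
total_s, upper_s, lower_s = split_point_mars(shifted)
gap_shifted = abs(total_s - harmonic_mean(upper_s, lower_s))
```

For the balanced residuals the gap is zero up to rounding, while the shifted residuals produce a clearly nonzero gap, which is the kind of deviation the abstract says the discrepancy score is built on.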
[426] LLMs for energy and macronutrients estimation using only text data from 24-hour dietary recalls: a parameter-efficient fine-tuning experiment using a 10-shot prompt
Rodrigo M Carrillo-Larco
Main category: cs.LG
TL;DR: Open-source LLMs can accurately predict nutritional values from text-based food descriptions when fine-tuned with PEFT, achieving high accuracy (MAE ~180 for energy, CCC >0.89) compared to poor vanilla model performance.
Details
Motivation: Most AI nutrition tools require image input, but text-based estimation could enable simpler dietary monitoring without photographs. The study explores whether LLMs can accurately predict nutritional values from text descriptions alone.
Method: Used 24-hour dietary recalls from NHANES adolescents. Applied open-source quantized LLM with 10-shot chain-of-thought prompting to estimate energy and macronutrients. Evaluated parameter-efficient fine-tuning (PEFT) to improve accuracy, using NHANES-calculated values as ground truth.
Result: Vanilla LLM performed poorly (MAE 652.08 for energy, Lin’s CCC <0.46). Fine-tuned model showed substantial improvement with energy MAEs 171.34-190.90 and Lin’s CCC >0.89 for all outcomes across 11,281 adolescents.
Conclusion: Fine-tuned LLMs with chain-of-thought prompting can accurately predict nutritional values from text-based dietary recalls, offering promise for low-burden text-based dietary monitoring tools.
Abstract: BACKGROUND: Most artificial intelligence tools used to estimate nutritional content rely on image input. However, whether large language models (LLMs) can accurately predict nutritional values based solely on text descriptions of foods consumed remains unknown. If effective, this approach could enable simpler dietary monitoring without the need for photographs. METHODS: We used 24-hour dietary recalls from adolescents aged 12-19 years in the National Health and Nutrition Examination Survey (NHANES). An open-source quantized LLM was prompted using a 10-shot, chain-of-thought approach to estimate energy and five macronutrients based solely on text strings listing foods and their quantities. We then applied parameter-efficient fine-tuning (PEFT) to evaluate whether predictive accuracy improved. NHANES-calculated values served as the ground truth for energy, proteins, carbohydrates, total sugar, dietary fiber and total fat. RESULTS: In a pooled dataset of 11,281 adolescents (49.9% male, mean age 15.4 years), the vanilla LLM yielded poor predictions. The mean absolute error (MAE) was 652.08 for energy, and Lin’s concordance correlation coefficient (CCC) was below 0.46 across endpoints. In contrast, the fine-tuned model performed substantially better, with energy MAEs ranging from 171.34 to 190.90 across subsets, and Lin’s CCC exceeding 0.89 for all outcomes. CONCLUSIONS: When prompted using a chain-of-thought approach and fine-tuned with PEFT, open-source LLMs exposed solely to text input can accurately predict energy and macronutrient values from 24-hour dietary recalls. This approach holds promise for low-burden, text-based dietary monitoring tools.
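For readers unfamiliar with the two reported metrics, the sketch below computes MAE and Lin's concordance correlation coefficient using the standard definition CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2) with 1/n moments. The numbers are made-up illustrative values, not NHANES data, and the function names are ours.

```python
def mean_absolute_error(truth, pred):
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)


def lins_ccc(truth, pred):
    """Lin's concordance correlation coefficient with population (1/n) moments."""
    n = len(truth)
    mean_t = sum(truth) / n
    mean_p = sum(pred) / n
    var_t = sum((t - mean_t) ** 2 for t in truth) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(truth, pred)) / n
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


# Hypothetical energy values for four recalls (illustrative only).
truth = [1800.0, 2200.0, 2500.0, 1500.0]
perfect = list(truth)
biased = [t + 200.0 for t in truth]  # perfectly correlated but consistently too high

mae_biased = mean_absolute_error(truth, biased)
ccc_perfect = lins_ccc(truth, perfect)
ccc_biased = lins_ccc(truth, biased)
```

Unlike Pearson correlation, which would stay at 1 for the biased predictions, the CCC drops below 1 because the squared mean difference in the denominator penalizes systematic over- or under-estimation. That makes it a reasonable agreement metric to report alongside MAE.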
[427] The Belief State Transformer
Edward S. Hu, Kwangjun Ahn, Qinghua Liu, Haoran Xu, Manan Tomar, Ada Langford, Jayden Teoh, Bryon Xu, David Yan, Dinesh Jayaraman, Alex Lamb, John Langford
Main category: cs.LG
TL;DR: The Belief State Transformer is a novel architecture that predicts both next and previous tokens simultaneously, learning compact belief states to solve problems where standard transformers fail, particularly in goal-conditioned tasks like story writing.
Details
Motivation: Conventional forward-only transformers struggle with challenging problems that require understanding both context and future goals. The authors aim to create a model that can effectively handle tasks where both prefix and suffix information is available.
Method: A transformer-based model that takes both prefix and suffix as inputs, with a dual objective of predicting the next token for the prefix and the previous token for the suffix. This forces the model to learn compact belief states containing all relevant information.
Result: Outperforms Fill-in-the-Middle method for story writing with known goals, shows improved performance even with unknown goals, enables more efficient goal-conditioned decoding and better test-time inference.
Conclusion: The Belief State Transformer successfully learns meaningful representations and solves problems that challenge standard transformers, demonstrating the value of bidirectional prediction objectives for learning compact belief states.
Abstract: We introduce the “Belief State Transformer”, a next-token predictor that takes both a prefix and suffix as inputs, with a novel objective of predicting both the next token for the prefix and the previous token for the suffix. The Belief State Transformer effectively learns to solve challenging problems that conventional forward-only transformers struggle with, in a domain-independent fashion. Key to this success is learning a compact belief state that captures all relevant information necessary for accurate predictions. Empirical ablations show that each component of the model is essential in difficult scenarios where standard Transformers fall short. For the task of story writing with known prefixes and suffixes, our approach outperforms the Fill-in-the-Middle method for reaching known goals and demonstrates improved performance even when the goals are unknown. Altogether, the Belief State Transformer enables more efficient goal-conditioned decoding, better test-time inference, and high-quality text representations on small scale problems. Website: https://edwhu.github.io/bst-website
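The dual objective can be pictured by listing the training examples one sequence gives rise to. The sketch below is only an illustration of the prefix/suffix pairing described in the abstract: the exact index ranges used by the authors (for example, whether empty prefixes or suffixes are allowed) are an assumption here, and `belief_state_examples` is a hypothetical helper, not part of any released code.

```python
def belief_state_examples(tokens):
    """Enumerate (prefix, suffix, next_token, previous_token) training tuples.

    For a prefix tokens[:i] and a suffix tokens[j:] with i < j, the forward
    target is the token right after the prefix and the backward target is
    the token right before the suffix.
    """
    examples = []
    n = len(tokens)
    for i in range(n):                 # prefix = tokens[:i], may be empty
        for j in range(i + 1, n + 1):  # suffix = tokens[j:], may be empty
            examples.append((tokens[:i], tokens[j:], tokens[i], tokens[j - 1]))
    return examples


examples = belief_state_examples(list("ABCD"))
```

For the prefix ["A"] and suffix ["D"], the forward target is "B" and the backward target is "C". When exactly one token separates prefix and suffix, both targets coincide. A sequence of length n yields n(n+1)/2 such pairs under these assumptions, so a real training loop would presumably batch or subsample them.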
[428] Empowering Time Series Analysis with Foundation Models: A Comprehensive Survey
Jiexia Ye, Yongzi Yu, Weiqi Zhang, Le Wang, Jia Li, Fugee Tsung
Main category: cs.LG
TL;DR: Survey paper on foundation models for time series analysis, covering modality-specific challenges and solutions across time series, language, and vision pre-training approaches.
Details
Motivation: Traditional time series analysis methods are task-specific with limited transferability, while foundation models have shown remarkable success in NLP and CV, motivating their adaptation to time series modeling challenges.
Method: Introduces a modality-aware, challenge-oriented perspective to analyze how foundation models pre-trained on different modalities (time series, language, vision) face distinct hurdles when adapted to time series tasks, with a taxonomy organized by pre-training modality.
Result: Comprehensive synthesis of latest advances in foundation models for time series, including analysis of modality-specific challenges, corresponding solutions, real-world applications, and open-source codes.
Conclusion: Provides a structured overview of the rapidly evolving field of foundation models for time series analysis and identifies potential future research directions to address remaining challenges.
Abstract: Time series data are ubiquitous across diverse real-world applications, making time series analysis critically important. Traditional approaches are largely task-specific, offering limited functionality and poor transferability. In recent years, foundation models have revolutionized NLP and CV with their remarkable cross-task transferability, zero-/few-shot learning capabilities, and multimodal integration capacity. This success has motivated increasing efforts to explore foundation models for addressing time series modeling challenges. Although some tutorials and surveys were published in the early stages of this field, the rapid pace of recent developments necessitates a more comprehensive and in-depth synthesis to cover the latest advances. Our survey aims to fill this gap by introducing a modality-aware, challenge-oriented perspective, which reveals how foundation models pre-trained on different modalities face distinct hurdles when adapted to time series tasks. Building on this perspective, we propose a taxonomy of existing works organized by pre-training modality (time series, language, and vision), analyze modality-specific challenges and categorize corresponding solutions, discussing their advantages and limitations. Beyond this, we review real-world applications to illustrate domain-specific advancements, provide open-source codes, and conclude with potential future research directions in this rapidly evolving field.
[429] LoRA-PAR: A Flexible Dual-System LoRA Partitioning Approach to Efficient LLM Fine-Tuning
Yining Huang, Bin Li, Keke Tang, Meilian Chen
Main category: cs.LG
TL;DR: LoRA-PAR is a dual-system LoRA framework that partitions data and parameters by System 1 (intuitive) vs System 2 (analytical) demands, using fewer but more focused parameters for each task type through a two-stage SFT+RL fine-tuning approach.
Details
Motivation: Current PEFT methods don't explicitly tailor data and parameters to different response demands (quick intuitive vs multi-step reasoning tasks), inspired by the "Thinking, Fast and Slow" dual-system theory.
Method: Classify task data via multi-model role-playing and voting, partition parameters based on importance scoring, then use two-stage fine-tuning: SFT for System 1 tasks and RL for System 2 tasks.
Result: Extensive experiments show the approach lowers active parameter usage while matching or surpassing state-of-the-art PEFT baselines.
Conclusion: The dual-system framework effectively specializes parameters for different cognitive demands, achieving efficient performance with focused parameter usage.
Abstract: Large-scale generative models like DeepSeek-R1 and OpenAI-O1 benefit substantially from chain-of-thought (CoT) reasoning, yet pushing their performance typically requires vast data, large model sizes, and full-parameter fine-tuning. While parameter-efficient fine-tuning (PEFT) helps reduce cost, most existing approaches primarily address domain adaptation or layer-wise allocation rather than explicitly tailoring data and parameters to different response demands. Inspired by “Thinking, Fast and Slow,” which characterizes two distinct modes of thought, System 1 (fast, intuitive, often automatic) and System 2 (slower, more deliberative and analytic), we draw an analogy that different “subregions” of an LLM’s parameters might similarly specialize for tasks that demand quick, intuitive responses versus those requiring multi-step logical reasoning. Therefore, we propose LoRA-PAR, a dual-system LoRA framework that partitions both data and parameters by System 1 or System 2 demands, using fewer yet more focused parameters for each task. Specifically, we classify task data via multi-model role-playing and voting, and partition parameters based on importance scoring. We then adopt a two-stage fine-tuning strategy: first training on System 1 tasks with supervised fine-tuning (SFT) to enhance knowledge and intuition, then refining on System 2 tasks with reinforcement learning (RL) to reinforce deeper logical deliberation. Extensive experiments show that the two-stage fine-tuning strategy, SFT and RL, lowers active parameter usage while matching or surpassing SOTA PEFT baselines.
[430] EMOE: A Framework for Out-of-distribution Uncertainty Based Rejection via Model-Agnostic Expansive Matching of Experts
Yunni Qu, James Wellnitz, Dzung Dinh, Bhargav Vaduri, Alexander Tropsha, Junier Oliva
Main category: cs.LG
TL;DR: EMOE is a novel framework that uses support-expanding pseudo-labeling with multiple experts to improve out-of-distribution (OOD) prediction and uncertainty-based rejection without requiring OOD data or modality-specific augmentations.
Details
Motivation: To address limitations of prior OOD generalization methods that rely on modality-specific augmentations or assume access to OOD data, and to create a more flexible approach that works with various model types.
Method: Uses multiple base experts as pseudo-labelers on augmented data, trains multiple MLP heads (one per expert) with shared embedding using a novel per-head matching loss, and employs extrapolatory pseudo-labeling on latent-space augmentations.
Result: EMOE achieves superior performance compared to state-of-the-art methods on diverse datasets in single-source domain generalization settings.
Conclusion: EMOE provides a model-agnostic framework for robust OOD generalization that works effectively with various model types from simple tree-based models to complex OOD generalization models.
Abstract: Expansive Matching of Experts (EMOE) is a novel framework that utilizes support-expanding, extrapolatory pseudo-labeling to improve prediction and uncertainty-based rejection on out-of-distribution (OOD) points. EMOE utilizes a diverse set of multiple base experts as pseudo-labelers on the augmented data to improve OOD performance through multiple MLP heads (one per expert) with a shared embedding, trained with a novel per-head matching loss. Unlike prior methods that rely on modality-specific augmentations or assume access to OOD data, EMOE introduces extrapolatory pseudo-labeling on latent-space augmentations, enabling robust OOD generalization with any real-valued vector data. In contrast to prior modality-agnostic methods with neural backbones, EMOE is model-agnostic, working effectively with methods from simple tree-based models to complex OOD generalization models. We demonstrate that EMOE achieves superior performance compared to state-of-the-art methods on diverse datasets in the single-source domain generalization setting.
[431] Informed Correctors for Discrete Diffusion Models
Yixiu Zhao, Jiaxin Shi, Feng Chen, Shaul Druckmann, Lester Mackey, Scott Linderman
Main category: cs.LG
TL;DR: Proposes informed corrector sampling scheme for discrete diffusion models to improve sampling efficiency and quality with fewer steps, using architectural modifications and tailored training objectives.
Details
Motivation: Existing discrete diffusion sampling strategies struggle to balance computation and sample quality when reducing sampling steps, even with well-trained models.
Method: Predictor-corrector sampling scheme with diffusion model-informed corrector, hollow transformers architecture, and tailored training objective that leverages more training signals.
Result: Superior samples with fewer errors on text8 and improved FID scores on tokenized ImageNet 256x256 datasets compared to existing samplers.
Conclusion: Informed correctors enable fast and high-fidelity generation for discrete diffusion models, demonstrating significant potential for efficient sampling.
Abstract: Discrete diffusion has emerged as a powerful framework for generative modeling in discrete domains, yet efficiently sampling from these models remains challenging. Existing sampling strategies often struggle to balance computation and sample quality when the number of sampling steps is reduced, even when the model has learned the data distribution well. To address these limitations, we propose a predictor-corrector sampling scheme where the corrector is informed by the diffusion model to more reliably counter the accumulating approximation errors. To further enhance the effectiveness of our informed corrector, we introduce complementary architectural modifications based on hollow transformers and a simple tailored training objective that leverages more training signal. We use a synthetic example to illustrate the failure modes of existing samplers and show how informed correctors alleviate these problems. On the text8 and tokenized ImageNet 256x256 datasets, our informed corrector consistently produces superior samples with fewer errors or improved FID scores for discrete diffusion models. These results underscore the potential of informed correctors for fast and high-fidelity generation using discrete diffusion. Our code is available at https://github.com/lindermanlab/informed-correctors.
[432] Understanding Boolean Function Learnability on Deep Neural Networks: PAC Learning Meets Neurosymbolic Models
Marcio Nicolau, Anderson R. Tavares, Zhiwei Zhang, Pedro Avelar, João M. Flach, Luis C. Lamb, Moshe Y. Vardi
Main category: cs.LG
TL;DR: Neural networks outperform rule-based and symbolic systems in learning boolean formulas from various domains including model-sampling benchmarks, combinatorial optimization problems, and random 3-CNFs with different constraint levels.
Details
Motivation: To bridge the gap between theoretical computational learning theory and practical implementation by investigating how deep neural networks can effectively learn boolean formulas in real-world scenarios.
Method: Experimental analysis of boolean formulas from model-sampling benchmarks, combinatorial optimization problems, and random 3-CNFs with varying constrainedness, comparing neural learning against rule-based and symbolic approaches.
Result: Neural networks generalize better than pure rule-based/symbolic systems, small shallow networks effectively approximate combinatorial optimization formulas, smaller formulas are harder to learn due to fewer positive examples, and underconstrained 3-CNFs are more challenging than overconstrained ones.
Conclusion: These findings provide insights for developing better interpretable neurosymbolic AI methods by understanding the practical learning capabilities of neural networks on boolean formulas.
Abstract: Computational learning theory states that many classes of boolean formulas are learnable in polynomial time. This paper addresses the understudied subject of how, in practice, such formulas can be learned by deep neural networks. Specifically, we analyze boolean formulas associated with model-sampling benchmarks, combinatorial optimization problems, and random 3-CNFs with varying degrees of constrainedness. Our experiments indicate that: (i) neural learning generalizes better than pure rule-based systems and pure symbolic approaches; (ii) relatively small and shallow neural networks are very good approximators of formulas associated with combinatorial optimization problems; (iii) smaller formulas seem harder to learn, possibly due to the fewer positive (satisfying) examples available; and (iv) interestingly, underconstrained 3-CNF formulas are more challenging to learn than overconstrained ones. Such findings pave the way for a better understanding, construction, and use of interpretable neurosymbolic AI methods.
[433] Solving Truly Massive Budgeted Monotonic POMDPs with Oracle-Guided Meta-Reinforcement Learning
Manav Vora, Jonas Liang, Michael N. Grussing, Melkior Ornik
Main category: cs.LG
TL;DR: A two-step approach using budget allocation approximation and oracle-guided PPO to solve computationally intractable multi-component monotonic POMDPs with budget constraints.
Details
Motivation: Current methods cannot solve large-scale multi-component monotonic POMDPs due to exponential state space growth with increasing components, making them computationally intractable.
Method: 1) Approximate optimal budget allocation using random forest models of component POMDP value functions; 2) Use oracle-guided meta-trained PPO algorithm to solve individual budget-constrained single-component POMDPs with oracle policies from MDP value iteration.
Result: The approach provides scalability for solving massive multi-component monotonic POMDPs, demonstrated through a real-world building maintenance scenario with computational complexity analysis.
Conclusion: The proposed two-step method effectively addresses the computational challenges of budget-constrained multi-component monotonic POMDPs, enabling practical application to large-scale sequential repair problems.
Abstract: Monotonic Partially Observable Markov Decision Processes (POMDPs), where the system state progressively decreases until a restorative action is performed, can be used to model sequential repair problems effectively. This paper considers the problem of solving budget-constrained multi-component monotonic POMDPs, where a finite budget limits the maximal number of restorative actions. For a large number of components, solving such a POMDP using current methods is computationally intractable due to the exponential growth in the state space with an increasing number of components. To address this challenge, we propose a two-step approach. Since the individual components of a budget-constrained multi-component monotonic POMDP are only connected via the shared budget, we first approximate the optimal budget allocation among these components using an approximation of each component POMDP’s optimal value function which is obtained through a random forest model. Subsequently, we introduce an oracle-guided meta-trained Proximal Policy Optimization (PPO) algorithm to solve each of the independent budget-constrained single-component monotonic POMDPs. The oracle policy is obtained by performing value iteration on the corresponding monotonic Markov Decision Process (MDP). This two-step method provides scalability in solving truly massive multi-component monotonic POMDPs. To demonstrate the efficacy of our approach, we consider a real-world maintenance scenario that involves inspection and repair of an administrative building by a team of agents within a maintenance budget. Finally, we perform a computational complexity analysis for a varying number of components to show the scalability of the proposed approach.
[434] On the equivalence of Occam algorithms
Zaman Keinath-Esmail
Main category: cs.LG
TL;DR: The paper extends Board and Pitt’s partial converse theorem to show that PAC learnable concept classes closed under exception lists are learnable by Occam algorithms with δ-independent complexities, not just δ-dependent ones.
Details
Motivation: Previous work by Board and Pitt showed that PAC learnable concept classes closed under exception lists are learnable by Occam algorithms, but their algorithm produced hypotheses with δ-dependent complexity, which is a significant limitation. This paper aims to address this limitation.
Method: The authors extend the theoretical framework to demonstrate that the partial converse theorem applies to Occam algorithms with δ-independent complexities as well, building upon the existing work by Blumer et al. and Board and Pitt.
Result: The paper successfully shows that PAC learnable concept classes closed under exception lists can be learned by Occam algorithms that output hypotheses with δ-independent complexities, overcoming the previous limitation.
Conclusion: This work provides a posteriori justification for various theoretical results and algorithm design methods that relied on the partial converse theorem, as it now applies to the more practical case of δ-independent complexities.
Abstract: Blumer et al. (1987, 1989) showed that any concept class that is learnable by Occam algorithms is PAC learnable. Board and Pitt (1990) showed a partial converse of this theorem: for concept classes that are closed under exception lists, any class that is PAC learnable is learnable by an Occam algorithm. However, their Occam algorithm outputs a hypothesis whose complexity is $\delta$-dependent, which is an important limitation. In this paper, we show that their partial converse applies to Occam algorithms with $\delta$-independent complexities as well. Thus, we provide a posteriori justification of various theoretical results and algorithm design methods which use the partial converse as a basis for their work.
[435] Finite Neural Networks as Mixtures of Gaussian Processes: From Provable Error Bounds to Prior Selection
Steven Adams, Andrea Patanè, Morteza Lahijanian, Luca Laurenti
Main category: cs.LG
TL;DR: A framework to approximate finite neural networks with mixtures of Gaussian processes using Wasserstein distance and optimal transport, providing error bounds and enabling NN-GP functional matching.
Details
Motivation: Neural networks are only equivalent to Gaussian processes in the infinite limit, but finite NNs lack methods for Gaussian approximation with error bounds, limiting uncertainty quantification and analysis.
Method: Iterative layer-wise approximation using optimal transport and Gaussian processes to represent each layer’s output distribution as a mixture of GPs, with Wasserstein distance for error quantification.
Result: For any finite NN and ε>0, the method returns a mixture of GPs that is ε-close at finite input points, and enables tuning NN parameters to mimic given GP behavior.
Conclusion: This represents a significant step towards understanding NN predictions and formally quantifying their uncertainty, with empirical validation on regression and classification tasks.
Abstract: Infinitely wide or deep neural networks (NNs) with independent and identically distributed (i.i.d.) parameters have been shown to be equivalent to Gaussian processes. Because of the favorable properties of Gaussian processes, this equivalence is commonly employed to analyze neural networks and has led to various breakthroughs over the years. However, neural networks and Gaussian processes are equivalent only in the limit; in the finite case there are currently no methods available to approximate a trained neural network with a Gaussian model with bounds on the approximation error. In this work, we present an algorithmic framework to approximate a neural network of finite width and depth, and with not necessarily i.i.d. parameters, with a mixture of Gaussian processes with error bounds on the approximation error. In particular, we consider the Wasserstein distance to quantify the closeness between probabilistic models and, by relying on tools from optimal transport and Gaussian processes, we iteratively approximate the output distribution of each layer of the neural network as a mixture of Gaussian processes. Crucially, for any NN and $\epsilon >0$ our approach is able to return a mixture of Gaussian processes that is $\epsilon$-close to the NN at a finite set of input points. Furthermore, we rely on the differentiability of the resulting error bound to show how our approach can be employed to tune the parameters of a NN to mimic the functional behavior of a given Gaussian process, e.g., for prior selection in the context of Bayesian inference. We empirically investigate the effectiveness of our results on both regression and classification problems with various neural network architectures. Our experiments highlight how our results can represent an important step towards understanding neural network predictions and formally quantifying their uncertainty.
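For intuition about the distance used in the abstract above, the 2-Wasserstein distance between two univariate Gaussians has a well-known closed form, W2² = (m_a - m_b)² + (s_a - s_b)². The sketch below implements only that textbook formula; the function name is hypothetical, and it does not reproduce the paper's layer-wise bound for mixtures of Gaussian processes.

```python
import math


def w2_gaussian_1d(mean_a, std_a, mean_b, std_b):
    """2-Wasserstein distance between N(mean_a, std_a^2) and N(mean_b, std_b^2)."""
    if std_a < 0 or std_b < 0:
        raise ValueError("standard deviations must be non-negative")
    # Closed form for univariate Gaussians: sqrt((m_a - m_b)^2 + (s_a - s_b)^2)
    return math.sqrt((mean_a - mean_b) ** 2 + (std_a - std_b) ** 2)


# Identical Gaussians are at distance zero; shifting the mean by 3 and the
# standard deviation by 4 gives sqrt(9 + 16).
same = w2_gaussian_1d(0.0, 1.0, 0.0, 1.0)
shifted = w2_gaussian_1d(0.0, 1.0, 3.0, 5.0)
```

An approximation being "ε-close" in this metric means the returned value is at most ε.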
[436] An Adaptive Tensor-Train Decomposition Approach for Efficient Deep Neural Network Compression
Shiyi Luo, Mingshuo Liu, Yifeng Yu, Shangping Ren, Yu Bai
Main category: cs.LG
TL;DR: A novel automatic budget-aware rank selection method called Layer-Wise Imprinting Quantitation (LWIQ) that improves tensor decomposition efficiency for model compression without repetitive rank recalculations.
Details
Motivation: Current rank selection methods for tensor decomposition in model compression are inefficient - manual selection requires trial-and-error while optimization-based methods increase computational complexity. There's a need for an automatic method that balances compression rate and efficiency without heavy computation.
Method: Uses Layer-Wise Imprinting Quantitation (LWIQ) with a proxy classifier to quantify each layer’s significance and impact on model performance. Includes a scaling factor to handle different computational budget constraints, eliminating the need for repeated rank recalculations.
Result: On CIFAR-10 with ResNet-56: 63.2% improvement in rank search efficiency, only 0.86% accuracy drop, and 3.2x smaller model size compared to state-of-the-art proxy-based automatic tensor rank selection methods.
Conclusion: LWIQ provides an efficient, automatic, and budget-aware solution for tensor rank selection in model compression, significantly reducing computational overhead while maintaining model performance and achieving substantial compression.
Abstract: In the field of model compression, choosing an appropriate rank for tensor decomposition is pivotal for balancing model compression rate and efficiency. However, this selection, whether done manually or through optimization-based automatic methods, often increases computational complexity. Manual rank selection lacks efficiency and scalability, often requiring extensive trial-and-error, while optimization-based automatic methods significantly increase the computational burden. To address this, we introduce a novel, automatic, and budget-aware rank selection method for efficient model compression, which employs Layer-Wise Imprinting Quantitation (LWIQ). LWIQ quantifies each layer’s significance within a neural network by integrating a proxy classifier. This classifier assesses the layer’s impact on overall model performance, allowing for a more informed adjustment of tensor rank. Furthermore, our approach includes a scaling factor to cater to varying computational budget constraints. This budget awareness eliminates the need for repetitive rank recalculations for different budget scenarios. Experimental results on the CIFAR-10 dataset with the ResNet-56 model show that, compared to the state-of-the-art proxy-based automatic tensor rank selection method, LWIQ improves rank search efficiency by 63.2%, with an accuracy drop of only 0.86% and a 3.2x smaller model size.
[437] MillStone: How Open-Minded Are LLMs?
Harold Triedman, Vitaly Shmatikov
Main category: cs.LG
TL;DR: MillStone benchmark measures how LLM stances on controversial issues are influenced by external arguments, finding LLMs are generally open-minded but easily swayed by authoritative sources.
Details
Motivation: As users increasingly rely on LLMs for information on controversial topics, it's crucial to understand how their outputs are influenced by external documents and sources they retrieve.
Method: Developed MillStone benchmark to systematically test how nine leading LLMs respond to arguments supporting opposite sides of controversial issues, measuring persuasiveness and stance changes.
Result: LLMs are generally open-minded on most issues but can be easily swayed by authoritative sources, highlighting manipulation risks in LLM-based information systems.
Conclusion: Source selection is critical for LLM-based information systems due to their susceptibility to manipulation through authoritative arguments, requiring careful consideration of information retrieval mechanisms.
Abstract: Large language models equipped with Web search, information retrieval tools, and other agentic capabilities are beginning to supplant traditional search engines. As users start to rely on LLMs for information on many topics, including controversial and debatable issues, it is important to understand how the stances and opinions expressed in LLM outputs are influenced by the documents they use as their information sources. In this paper, we present MillStone, the first benchmark that aims to systematically measure the effect of external arguments on the stances that LLMs take on controversial issues (not all of them political). We apply MillStone to nine leading LLMs and measure how “open-minded” they are to arguments supporting opposite sides of these issues, whether different LLMs agree with each other, which arguments LLMs find most persuasive, and whether these arguments are the same for different LLMs. In general, we find that LLMs are open-minded on most issues. An authoritative source of information can easily sway an LLM’s stance, highlighting the importance of source selection and the risk that LLM-based information retrieval and search systems can be manipulated.
[438] Efficient Estimation of Unique Components in Independent Component Analysis by Matrix Representation
Yoshitatsu Matsuda, Kazunori Yamaguchi
Main category: cs.LG
TL;DR: Accelerated ICA algorithm through matrix reformulation and redundant calculation reduction
Details
Motivation: ICA has uniqueness issues due to many local optima, and previous global optimum estimation methods were computationally expensive through manual thread computation.
Method: Reformulated ICA algorithm using matrix representation and reduced redundant calculations to accelerate the unique global optimum estimation.
Result: Experimental verification on artificial datasets and EEG data showed improved efficiency
Conclusion: The proposed method successfully accelerates ICA’s unique solution estimation while maintaining accuracy
Abstract: Independent component analysis (ICA) is a widely used method in various applications of signal processing and feature extraction. It extends principal component analysis (PCA) and can extract important and complicated components with small variances. One of the major problems of ICA is that the uniqueness of the solution is not guaranteed, unlike PCA. That is because there are many local optima in optimizing the objective function of ICA. It has been shown previously that the unique global optimum of ICA can be estimated from many random initializations by handcrafted thread computation. In this paper, the unique estimation of ICA is highly accelerated by reformulating the algorithm in matrix representation and reducing redundant calculations. Experimental results on artificial datasets and EEG data verified the efficiency of the proposed method.
[439] Convex Regularization and Convergence of Policy Gradient Flows under Safety Constraints
Pekka Malo, Lauri Viitasaari, Antti Suominen, Eeva Vilkkumaa, Olli Tahvonen
Main category: cs.LG
TL;DR: A doubly-regularized RL framework combining reward and parameter regularization for infinite-horizon decision processes with safety constraints in continuous spaces, using mean-field theory and Wasserstein gradient flows.
Details
Motivation: Address safety constraints in reinforcement learning applications like autonomous systems, finance, and resource management where almost-sure safety is crucial.
Method: Formulate as convex regularized objective with parametrized policies in mean-field regime, model policies on infinite-dimensional statistical manifold, use Wasserstein gradient flows for policy updates with smooth bounded approximations.
Result: Provides solvability conditions for safety-constrained problems, exponential convergence guarantees under sufficient regularization, and general regularization conditions supporting practical particle method implementations.
Conclusion: The framework offers robust theoretical insights and guarantees for safe reinforcement learning in complex, high-dimensional continuous settings.
Abstract: This paper examines reinforcement learning (RL) in infinite-horizon decision processes with almost-sure safety constraints, crucial for applications like autonomous systems, finance, and resource management. We propose a doubly-regularized RL framework combining reward and parameter regularization to address safety constraints in continuous state-action spaces. The problem is formulated as a convex regularized objective with parametrized policies in the mean-field regime. Leveraging mean-field theory and Wasserstein gradient flows, policies are modeled on an infinite-dimensional statistical manifold, with updates governed by parameter distribution gradient flows. Key contributions include solvability conditions for safety-constrained problems, smooth bounded approximations for gradient flows, and exponential convergence guarantees under sufficient regularization. General regularization conditions, including entropy regularization, support practical particle method implementations. This framework provides robust theoretical insights and guarantees for safe RL in complex, high-dimensional settings.
[440] Optimization of GNN Training Through Half-precision
Arnab Kanti Tarafder, Yidong Gong, Pradeep Kumar
Main category: cs.LG
TL;DR: HalfGNN enables efficient half-precision training for Graph Neural Networks, achieving 2.3x speedup and 2.67x memory savings while maintaining accuracy through novel vector operations and discretized SpMM techniques.
Details
Motivation: Current GNN systems underperform with half-precision training due to value overflow issues, poor hardware utilization, and abnormal accuracy, preventing the benefits seen in other deep learning domains.
Method: Proposes HalfGNN with novel half-precision vector operations for improved data loading/reduction, and discretized SpMM to overcome value overflow and provide native workload balancing while eliminating atomic writes.
Result: Achieves average 2.30x training speedup over DGL (float-based) for GAT, GCN, and GIN models, maintains similar accuracy, and reduces memory usage by 2.67x.
Conclusion: HalfGNN successfully enables half-precision training for GNNs, overcoming previous limitations and delivering significant performance and memory efficiency gains without sacrificing accuracy.
Abstract: Recent trends in lower-precision training, e.g. half-precision floating point, have shown improved system performance and reduced memory usage for Deep Learning while maintaining accuracy. However, current GNN systems cannot achieve such goals for GNN, as our analyses show that they massively underperform while showing abnormal accuracy when using half-precision. These systems suffer from value overflow issues due to lowered precision, under-utilization of hardware resources, and poor training performance. To mitigate this, we introduce HalfGNN, a half-precision based GNN system. HalfGNN proposes novel techniques: new vector operations for half-precision data types that improve data load and reduction performance, and discretized SpMM that overcomes the value overflow and natively provides workload balancing. Such techniques improve hardware utilization, reduce memory usage, and remove atomic writes. Evaluations show that HalfGNN achieves an average 2.30X speedup in training time over DGL (float-based) across GAT, GCN, and GIN while achieving similar accuracy and saving 2.67X memory.
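The value overflow mentioned in the abstract above can be reproduced in a few lines. The following NumPy sketch uses made-up neighbor features (an assumption for illustration) to show a half-precision sum aggregation exceeding the float16 range, followed by one generic mitigation: scaling each term by the node degree before reducing. It is not HalfGNN's discretized SpMM kernel, whose details the abstract does not give.

```python
import numpy as np

# Largest finite float16 value.
FP16_MAX = float(np.finfo(np.float16).max)

# Hypothetical high-degree node: 1000 neighbors, each contributing a feature of 100.
neighbor_feats = np.full(1000, 100.0, dtype=np.float16)

# Naive sum aggregation with a half-precision result: 100 * 1000 exceeds FP16_MAX.
with np.errstate(over="ignore"):
    naive_sum = neighbor_feats.sum(dtype=np.float16)

# Generic mitigation (an assumption, not the paper's method): divide by the
# degree first so that every partial sum stays inside the float16 range.
degree = np.float16(neighbor_feats.size)
scaled_mean = (neighbor_feats / degree).sum(dtype=np.float16)

print(FP16_MAX, naive_sum, scaled_mean)
```

The naive sum overflows to infinity, while the pre-scaled reduction stays close to the true mean of 100, up to float16 rounding.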
[441] Hybrid Two-Stage Reconstruction of Multiscale Subsurface Flow with Physics-informed Residual Connected Neural Operator
Peiqi Li, Jie Chen
Main category: cs.LG
TL;DR: A hybrid two-stage neural framework combining multiscale basis functions and physics-guided deep learning for accurate Darcy flow simulation in high-contrast fractured porous media.
Details
Motivation: To develop neural operators that can accurately solve single-phase flow problems in subsurface porous media with high-contrast coefficients while strictly adhering to physical laws.
Method: Two-stage framework: 1) Data-driven model reconstructs multiscale basis functions from permeability field for dimensionality reduction, 2) Physics-informed neural network with Transformer-based global information extractor reconstructs pressure field while enforcing Darcy equation constraints.
Result: Achieves $R^2$ values above 0.9 for both basis function fitting and pressure reconstruction, with residual indicator on the order of $1\times 10^{-4}$, demonstrating high accuracy and physical consistency.
Conclusion: The proposed framework successfully achieves accurate reconstruction of Darcy flow in complex porous media while maintaining strict physical consistency, validating its effectiveness for high-contrast subsurface flow problems.
Abstract: Novel neural networks show great potential in solving partial differential equations. For single-phase flow problems in subsurface porous media with high-contrast coefficients, the key is to develop neural operators with accurate reconstruction capability and strict adherence to physical laws. In this study, we propose a hybrid two-stage framework that uses multiscale basis functions and physics-guided deep learning to solve the Darcy flow problem in high-contrast fractured porous media. In the first stage, a data-driven model is used to reconstruct the multiscale basis function based on the permeability field to achieve effective dimensionality reduction while preserving the necessary multiscale features. In the second stage, a physics-informed neural network, together with a Transformer-based global information extractor, is used to reconstruct the pressure field by integrating the physical constraints derived from the Darcy equation, ensuring consistency with the physical laws of the real world. The model was evaluated on datasets with different combinations of permeability and basis functions and performed well in terms of reconstruction accuracy. Specifically, the framework achieves $R^2$ values above 0.9 in terms of basis function fitting and pressure reconstruction, and the residual indicator is on the order of $1\times 10^{-4}$. These results validate the ability of the proposed framework to achieve accurate reconstruction while maintaining physical consistency.
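To make the idea of a Darcy-equation constraint concrete, here is a minimal sketch of a physics residual. It assumes the steady single-phase form -(κ p')' = f on a uniform 1D grid; the abstract does not specify the paper's residual indicator, discretization, or dimensionality, so the function name, the harmonic face averaging, and the test problem are all illustrative assumptions.

```python
import numpy as np


def darcy_residual_1d(pressure, permeability, source, h):
    """Residual of -(kappa * p')' = f at the interior nodes of a uniform grid."""
    pressure = np.asarray(pressure, dtype=float)
    permeability = np.asarray(permeability, dtype=float)
    source = np.asarray(source, dtype=float)
    # Harmonic mean of permeability at cell faces, a common choice for
    # high-contrast media.
    k_face = (2.0 * permeability[:-1] * permeability[1:]
              / (permeability[:-1] + permeability[1:]))
    flux = -k_face * np.diff(pressure) / h    # Darcy flux at each face
    return np.diff(flux) / h - source[1:-1]   # flux divergence minus source


# Manufactured solution: kappa = 1, f = 1, p(x) = x (1 - x) / 2 satisfies -p'' = 1.
x = np.linspace(0.0, 1.0, 101)
h = x[1] - x[0]
exact_pressure = 0.5 * x * (1.0 - x)
residual = darcy_residual_1d(exact_pressure, np.ones_like(x), np.ones_like(x), h)

# A physics-informed loss term penalizes the mean squared residual.
physics_loss = float(np.mean(residual ** 2))
```

In a physics-informed network, a term like `physics_loss` would be evaluated on the predicted pressure field and added to the data-fitting loss.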
[442] Understanding Generalization in Physics Informed Models through Affine Variety Dimensions
Takeshi Koshizuka, Issei Sato
Main category: cs.LG
TL;DR: Physics-informed machine learning analysis extended to hybrid settings with incomplete physical knowledge and nonlinear systems, showing generalization depends on constraint dimension rather than parameter count.
Details
Motivation: Current theoretical analyses assume complete prior knowledge and linear systems, failing to address real-world hybrid learning with observational data and nonlinear applications.
Method: Introduced unified residual form combining collocation and variational methods, enabling analysis of incomplete physical constraints in hybrid settings. Established generalization bound based on dimension of affine variety from physical constraints.
Result: Generalization performance governed by constraint dimension rather than parameter count, enabling unified analysis for both linear and nonlinear equations. Developed method to approximate this dimension with experimental validation.
Conclusion: The framework provides theoretical foundation for physics-informed regression in practical hybrid settings with incomplete knowledge and nonlinear systems, demonstrating improved sample efficiency through physical constraint integration.
Abstract: Physics-informed machine learning is gaining significant traction for enhancing statistical performance and sample efficiency through the integration of physical knowledge. However, current theoretical analyses often presume complete prior knowledge in non-hybrid settings, overlooking the crucial integration of observational data, and are frequently limited to linear systems, unlike the prevalent nonlinear nature of many real-world applications. To address these limitations, we introduce a unified residual form that unifies collocation and variational methods, enabling the incorporation of incomplete and complex physical constraints in hybrid learning settings. Within this formulation, we establish that the generalization performance of physics-informed regression in such hybrid settings is governed by the dimension of the affine variety associated with the physical constraint, rather than by the number of parameters. This enables a unified analysis that is applicable to both linear and nonlinear equations. We also present a method to approximate this dimension and provide experimental validation of our theoretical findings.
[443] Safe Learning Under Irreversible Dynamics via Asking for Help
Benjamin Plaut, Juan Liévano-Karim, Hanlin Zhu, Stuart Russell
Main category: cs.LG
TL;DR: An algorithm that enables safe and effective learning in MDPs with irreversible dynamics by allowing agents to ask for mentor help and transfer knowledge between states, achieving sublinear regret and mentor queries.
Details
Motivation: Traditional learning algorithms require trying all possible behaviors, which is problematic in environments with irreversible errors. There's a need for safe learning approaches that can handle high-stakes environments without resets.
Method: Combines mentor assistance (asking for help) with knowledge transfer between similar states. Uses a sequence of three reductions under standard online learning assumptions for any MDP, including those with irreversible dynamics.
Result: Developed an algorithm with both regret and number of mentor queries that are sublinear in the time horizon. The approach works for any MDP, including those with irreversible dynamics.
Conclusion: This represents the first formal proof that an agent can achieve high reward while becoming self-sufficient in unknown, unbounded, high-stakes environments without requiring resets, through mentor assistance and knowledge transfer.
Abstract: Most learning algorithms with formal regret guarantees essentially rely on trying all possible behaviors, which is problematic when some errors cannot be recovered from. Instead, we allow the learning agent to ask for help from a mentor and to transfer knowledge between similar states. We show that this combination enables the agent to learn both safely and effectively. Under standard online learning assumptions, we provide an algorithm whose regret and number of mentor queries are both sublinear in the time horizon for any Markov Decision Process (MDP), including MDPs with irreversible dynamics. Our proof involves a sequence of three reductions which may be of independent interest. Conceptually, our result may be the first formal proof that it is possible for an agent to obtain high reward while becoming self-sufficient in an unknown, unbounded, and high-stakes environment without resets.
[444] InfoGain Wavelets: Furthering the Design of Graph Diffusion Wavelets
David R. Johnson, Smita Krishnaswamy, Michael Perlmutter
Main category: cs.LG
TL;DR: Unsupervised method for selecting diffusion scales in diffusion wavelets using information theory, applied to wavelet-based GNNs for improved graph classification.
Details
Motivation: Traditional diffusion wavelets use dyadic integer scales (2^j), which may not be optimal. The paper aims to develop a better, data-driven approach for scale selection.
Method: Proposes an unsupervised method based on information theory principles to select diffusion scales, then incorporates this into wavelet-based graph neural networks modeled after geometric scattering transform.
Result: The method is validated through graph classification experiments, demonstrating its effectiveness compared to traditional dyadic scale selection.
Conclusion: Information theory-based scale selection provides a superior alternative to traditional dyadic scales for diffusion wavelets in graph signal processing and GNN applications.
Abstract: Diffusion wavelets extract information from graph signals at different scales of resolution by utilizing graph diffusion operators raised to various powers, known as diffusion scales. Traditionally, these scales are chosen to be dyadic integers, $2^j$. Here, we propose a novel, unsupervised method for selecting the diffusion scales based on ideas from information theory. We then show that our method can be incorporated into wavelet-based GNNs, which are modeled after the geometric scattering transform, via graph classification experiments.
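The construction described in the abstract above can be sketched as differences of powers of a diffusion operator. The code below assumes the lazy random-walk operator P = (I + A D⁻¹)/2 used in the geometric scattering literature and takes the list of scales as an input; it shows the traditional dyadic choice only and does not implement the paper's information-theoretic scale selection. The function name and the toy graph are hypothetical.

```python
import numpy as np


def diffusion_wavelets(adjacency, scales):
    """Return wavelets P^{t_j} - P^{t_{j+1}} for consecutive scales and the low-pass P^{t_last}."""
    adjacency = np.asarray(adjacency, dtype=float)
    n = adjacency.shape[0]
    degrees = adjacency.sum(axis=0)
    # Lazy random-walk diffusion operator P = (I + A D^{-1}) / 2 (assumed form).
    lazy_walk = 0.5 * (np.eye(n) + adjacency / degrees)
    powers = [np.linalg.matrix_power(lazy_walk, t) for t in scales]
    wavelets = [powers[j] - powers[j + 1] for j in range(len(powers) - 1)]
    return wavelets, powers[-1]


# Path graph on 4 nodes with the traditional dyadic scales 0, 1, 2, 4, 8.
path_graph = np.array([[0, 1, 0, 0],
                       [1, 0, 1, 0],
                       [0, 1, 0, 1],
                       [0, 0, 1, 0]])
dyadic_scales = [0, 1, 2, 4, 8]
wavelets, low_pass = diffusion_wavelets(path_graph, dyadic_scales)

# The filters telescope: all wavelets plus the low-pass sum to the identity.
reconstruction = sum(wavelets) + low_pass
```

Because the scales are just an argument, a data-driven selection rule would only need to supply a different increasing list of integers in place of `dyadic_scales`.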
[445] Improved Impossible Tuning and Lipschitz-Adaptive Universal Online Learning with Gradient Variations
Kei Takemura, Ryuta Matsuno, Keita Sakuma
Main category: cs.LG
TL;DR: Proposes a novel optimistic online mirror descent algorithm that solves the impossible tuning issue in online learning, enabling simultaneous achievement of optimal gradient variation bounds and Lipschitz adaptivity with only loglog T factors.
Details
Motivation: Existing online learning algorithms suffer from suboptimal performance due to limitations in prediction with expert advice algorithms, particularly the impossible tuning issue that causes an excess sqrt(log T) factor in regret bounds compared to the lower bound.
Method: Developed an optimistic online mirror descent algorithm with an auxiliary initial round using large learning rates, which enables refined analysis where a generated negative term cancels the gap-related factor.
Result: The algorithm resolves the impossible tuning issue up to loglog T factors and serves as a meta-algorithm to develop the first universal online learning algorithm that simultaneously achieves state-of-the-art gradient variation bounds and Lipschitz adaptivity under standard assumptions.
Conclusion: The proposed approach overcomes key limitations of prior works and resolves the open problem of conflict between Lipschitz adaptivity mechanisms and regret analysis for gradient variation bounds.
Abstract: A central goal in online learning is to achieve adaptivity to unknown problem characteristics, such as environmental changes captured by gradient variation (GV), function curvature (universal online learning, UOL), and gradient scales (Lipschitz adaptivity, LA). Simultaneously achieving these with optimal performance is a major challenge, partly due to limitations in algorithms for prediction with expert advice. These algorithms often serve as meta-algorithms in online ensemble frameworks, and their sub-optimality hinders overall UOL performance. Specifically, existing algorithms addressing the “impossible tuning” issue incur an excess $\sqrt{\log T}$ factor in their regret bound compared to the lower bound. To solve this problem, we propose a novel optimistic online mirror descent algorithm with an auxiliary initial round using large learning rates. This design enables a refined analysis where a generated negative term cancels the gap-related factor, resolving the impossible tuning issue up to $\log\log T$ factors. Leveraging our improved algorithm as a meta-algorithm, we develop the first UOL algorithm that simultaneously achieves state-of-the-art GV bounds and LA under standard assumptions. Our UOL result overcomes key limitations of prior works, notably resolving the conflict between LA mechanisms and regret analysis for GV bounds, an open problem highlighted by Xie et al.
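As background for the expert-advice setting in the abstract above, the sketch below shows plain optimistic Hedge, the entropic instance of optimistic online mirror descent, with a single fixed learning rate and the previous loss as the hint. It deliberately omits the paper's auxiliary initial round, its large learning rates, and any tuning across learning rates; the function name and the toy loss sequence are assumptions for illustration.

```python
import numpy as np


def optimistic_hedge(losses, learning_rate):
    """Run optimistic Hedge on a (T, K) loss matrix; return the (T, K) played weights."""
    losses = np.asarray(losses, dtype=float)
    num_rounds, num_experts = losses.shape
    cumulative = np.zeros(num_experts)
    hint = np.zeros(num_experts)   # optimism: guess the next loss equals the last one
    played = np.empty_like(losses)
    for t in range(num_rounds):
        scores = -learning_rate * (cumulative + hint)
        weights = np.exp(scores - scores.max())   # subtract max for numerical stability
        played[t] = weights / weights.sum()
        cumulative += losses[t]                   # loss is revealed after playing
        hint = losses[t]
    return played


# Toy problem: expert 0 always incurs loss 0, expert 1 always incurs loss 1.
toy_losses = np.tile([0.0, 1.0], (50, 1))
played = optimistic_hedge(toy_losses, learning_rate=0.5)
regret = float((played * toy_losses).sum() - toy_losses.sum(axis=0).min())
```

On this sequence the weights concentrate on the better expert within a few rounds, so the regret stays bounded by a small constant.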
[446] Memorization Sinks: Isolating Memorization during LLM Training
Gaurav R. Ghosal, Pratyush Maini, Aditi Raghunathan
Main category: cs.LG
TL;DR: MemSinks introduces a new paradigm to isolate memorized content in LLMs by design using sequence identifiers, enabling easier removal of memorized information without compromising general language capabilities.
Details
Motivation: Large language models tend to memorize repeated sequences, creating privacy and copyright concerns. Existing post-hoc removal approaches have limited success because memorized natural sequences become mechanistically entangled with general language abilities.
Method: Uses sequence identifiers that activate unique memorization neurons for each repeated sequence. Analyzes learning and forgetting dynamics to promote isolation of memorized content by design rather than post-hoc removal.
Result: Implemented at billion-parameter and billion-token scale, MemSinks demonstrates effective isolation of memorized content while maintaining strong generalization capabilities - the first proof-of-concept showing simultaneous generalization and isolation is achievable on real data.
Conclusion: MemSinks provides a viable approach to address LLM memorization concerns by designing isolation mechanisms upfront, making memorized content easier to remove without affecting general language performance.
Abstract: Large language models are susceptible to memorizing repeated sequences, posing privacy and copyright concerns. A popular mitigation strategy is to remove memorized information from specific neurons post-hoc. However, such approaches have shown limited success so far. In a controlled setting, we show that the memorization of natural sequences (those that resemble linguistically plausible text) becomes mechanistically entangled with general language abilities, and is therefore challenging to remove post-hoc. In this work, we put forward a new paradigm of MemSinks that promotes isolation of memorization by design. We leverage a sequence identifier that activates a unique set of memorization neurons for each sequence across repetitions. By analyzing the dynamics of learning and forgetting, we argue that MemSinks facilitates isolation of memorized content, making it easier to remove without compromising general language capabilities. We implement MemSinks at the billion-parameter and billion-token scale, and observe both effective isolation and strong generalization. To our knowledge, this is the first proof-of-concept on real data demonstrating that simultaneous generalization and isolation is achievable. We open-source our code at http://github.com/grghosal/MemSinks.
[447] Analysis of Fourier Neural Operators via Effective Field Theory
Taeyoung Kim
Main category: cs.LG
TL;DR: Systematic effective field theory analysis of Fourier Neural Operators (FNOs) reveals how nonlinear activations enable frequency transfer, provides criticality conditions for stable initialization, and explains why scale-invariant activations and residual connections improve performance.
Details
Motivation: FNOs are powerful surrogates for solver operators but lack principled understanding of their stability, generalization, and frequency behavior. The paper aims to provide a systematic theoretical framework to explain these properties.
Method: Used effective field theory analysis in infinite dimensional function space, deriving closed recursion relations for layer kernel and four point vertex. Examined three settings: analytic activations, scale invariant cases, and architectures with residual connections. Conducted experiments to validate theoretical predictions.
Result: Nonlinear activations couple frequency inputs to high frequency modes (frequency transfer). Derived criticality conditions for weight initialization that ensure uniform perturbation scaling. Experiments confirmed theoretical predictions about kernel perturbation ratios and frequency transfer behavior.
Conclusion: The analysis quantifies how nonlinearity enables FNOs to capture non-trivial features, provides criteria for hyperparameter selection via criticality analysis, and explains why scale-invariant activations and residual connections enhance feature learning in FNOs.
Abstract: Fourier Neural Operators (FNOs) have emerged as leading surrogates for solver operators for various functional problems, yet their stability, generalization and frequency behavior lack a principled explanation. We present a systematic effective field theory analysis of FNOs in an infinite dimensional function space, deriving closed recursion relations for the layer kernel and four point vertex and then examining three practically important settings: analytic activations, scale invariant cases and architectures with residual connections. The theory shows that nonlinear activations inevitably couple frequency inputs to high frequency modes that are otherwise discarded by spectral truncation, and experiments confirm this frequency transfer. For wide networks, we derive explicit criticality conditions on the weight initialization ensemble that ensure small input perturbations maintain a uniform scale across depth, and we confirm experimentally that the theoretically predicted ratio of kernel perturbations matches the measurements. Taken together, our results quantify how nonlinearity enables neural operators to capture non-trivial features, supply criteria for hyperparameter selection via criticality analysis, and explain why scale invariant activations and residual connections enhance feature learning in FNOs.
[448] EB-gMCR: Energy-Based Generative Modeling for Signal Unmixing and Multivariate Curve Resolution
Yu-Tang Chang, Shih-Fang Chen
Main category: cs.LG
TL;DR: EB-gMCR reformulates multivariate curve resolution as a generative process with an energy-based solver that automatically discovers the smallest component set and their concentrations for signal unmixing, outperforming traditional matrix factorization approaches.
Details
Motivation: Classical MCR methods require user-specified component numbers (often unknown) and face scalability challenges with increasing data or components. There's a need for automated component discovery and better scalability.
Method: Reformulates MCR as a data generative process (gMCR) and introduces an Energy-Based solver (EB-gMCR) that automatically discovers the smallest component set and their concentrations. Domain priors enter as plug-in modules without altering core selection learning.
Result: On synthetic benchmarks with up to 256 components, EB-gMCR achieves high reconstruction fidelity, recovers component count within 5% at 20dB noise and near-exact at 30dB. On public spectral datasets, it identifies correct component count and improves separation over MF-based approaches.
Conclusion: EB-gMCR is a general solver for fixed-pattern signal unmixing that enables adaptation to new instruments/domains through plug-in modules while maintaining core selection learning, providing automated component discovery with better scalability.
Abstract: Signal unmixing analysis decomposes data into basic patterns and is widely applied in chemical and biological research. Multivariate curve resolution (MCR), a branch of signal unmixing, separates mixed signals into components (base patterns) and their concentrations (intensity), playing a key role in understanding composition. Classical MCR is typically framed as matrix factorization (MF) and requires a user-specified number of components, usually unknown in real data. Once the data size or component number increases, the scalability of these MCR approaches faces significant challenges. This study reformulates MCR as a data generative process (gMCR), and introduces an Energy-Based solver, EB-gMCR, that automatically discovers the smallest component set and their concentrations for reconstructing the mixed signals faithfully. On synthetic benchmarks with up to 256 components, EB-gMCR attains high reconstruction fidelity and recovers the component count within 5% at 20dB noise and near-exact at 30dB. On two public spectral datasets, it identifies the correct component count and improves component separation over MF-based MCR approaches (NMF variants, ICA, MCR-ALS). EB-gMCR is a general solver for fixed-pattern signal unmixing (components remain invariant across mixtures). Domain priors (non-negativity, nonlinear mixing) enter as plug-in modules, enabling adaptation to new instruments or domains without altering the core selection learning step. The source code is available at https://github.com/b05611038/ebgmcr_solver.
[449] Stochastic Optimal Control via Measure Relaxations
Etienne Buehrle, Christoph Stiller
Main category: cs.LG
TL;DR: Proposes a convex optimization approach over occupation measures for stochastic optimal control, avoiding scalability issues of robust/scenario methods, with cost function learning via Christoffel polynomials.
Details
Motivation: Existing robust and scenario-based optimization methods for stochastic optimal control are challenging to scale to long optimization horizons, creating a need for more scalable approaches.
Method: Casts the optimal control problem as a convex optimization problem over occupation measures and learns cost functions from data using Christoffel polynomials.
Result: The method is demonstrated on both synthetic and real-world scenarios, showing practical applicability and scalability.
Conclusion: The occupation measure approach provides a scalable convex optimization framework for stochastic optimal control problems, with successful implementation demonstrated through experiments.
Abstract: The optimal control problem of stochastic systems is commonly solved via robust or scenario-based optimization methods, which are both challenging to scale to long optimization horizons. We cast the optimal control problem of a stochastic system as a convex optimization problem over occupation measures. We demonstrate our method on a set of synthetic and real-world scenarios, learning cost functions from data via Christoffel polynomials. The code for our experiments is available at https://github.com/ebuehrle/dpoc.
[450] Quantifying The Limits of AI Reasoning: Systematic Neural Network Representations of Algorithms
Anastasis Kratsios, Dennis Zvigelsky, Bradd Hart
Main category: cs.LG
TL;DR: Neural networks can exactly emulate any circuit computation using ReLU activations, demonstrating they can perform any form of reasoning defined by circuit gates without approximation.
Details
Motivation: To quantify what forms of reasoning neural networks can perform when perfectly trained, by interpreting reasoning tasks as circuit emulation problems.
Method: A meta-algorithm that converts any circuit into a feedforward neural network with ReLU activations by iteratively replacing each gate with a canonical ReLU MLP emulator.
Result: Exact emulation of circuits without approximation, including Boolean gates, tropical circuits, arithmetic gates, and hybrids. Neural networks can emulate shortest-path algorithms, Turing machines, and randomized Boolean circuits.
Conclusion: No reasoning task lies beyond neural networks’ reach: they trade algorithmic runtime for space complexity (number of neurons), and this result is strictly more powerful than classical universal approximation theorems.
Abstract: A main open question in contemporary AI research is quantifying the forms of reasoning neural networks can perform when perfectly trained. This paper answers this by interpreting reasoning tasks as circuit emulation, where the gates define the type of reasoning; e.g. Boolean gates for predicate logic, tropical circuits for dynamic programming, arithmetic and analytic gates for symbolic mathematical representation, and hybrids thereof for deeper reasoning; e.g. higher-order logic. We present a systematic meta-algorithm that converts essentially any circuit into a feedforward neural network (NN) with ReLU activations by iteratively replacing each gate with a canonical ReLU MLP emulator. We show that, on any digital computer, our construction emulates the circuit exactly (no approximation, no rounding, modular overflow included), demonstrating that no reasoning task lies beyond the reach of neural networks. The number of neurons in the resulting network (parametric complexity) scales with the circuit’s complexity, and the network’s computational graph (structure) mirrors that of the emulated circuit. This formalizes the folklore that NNs trade algorithmic run-time (circuit runtime) for space complexity (number of neurons). We derive a range of applications of our main result, from emulating shortest-path algorithms on graphs with cubic-size NNs, to simulating stopped Turing machines with roughly quadratically large NNs, and even the emulation of randomized Boolean circuits. Lastly, we demonstrate that our result is strictly more powerful than a classical universal approximation theorem: any universal function approximator can be encoded as a circuit and directly emulated by an NN.
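To make the gate-replacement idea concrete, here is a minimal sketch of exact ReLU expressions for a few gates on 0/1 inputs, plus a min gate of the kind a tropical circuit would use. These are standard identities chosen for illustration; the paper's canonical emulators are not given in the summary and may differ, and the function names are hypothetical.

```python
from itertools import product


def relu(x):
    return max(0, x)


# Exact ReLU expressions for Boolean gates, valid for inputs in {0, 1}.
def not_gate(a):
    return 1 - a  # affine map, no ReLU needed


def and_gate(a, b):
    return relu(a + b - 1)


def or_gate(a, b):
    return 1 - relu(1 - a - b)


def xor_gate(a, b):
    return relu(a + b) - 2 * relu(a + b - 1)


# Min gate (min-plus / tropical arithmetic): min(a, b) = a - relu(a - b).
def min_gate(a, b):
    return a - relu(a - b)


# Composing gate emulators mirrors the circuit's graph: a one-bit full adder.
def full_adder(a, b, carry_in):
    partial = xor_gate(a, b)
    total = xor_gate(partial, carry_in)
    carry_out = or_gate(and_gate(a, b), and_gate(partial, carry_in))
    return total, carry_out


# Exhaustive check over all 0/1 inputs: the adder output encodes a + b + carry_in.
for a, b, c in product((0, 1), repeat=3):
    total, carry_out = full_adder(a, b, c)
    assert total + 2 * carry_out == a + b + c
```

Each gate costs a fixed number of ReLU units, so the network size grows with the number of gates, which is the runtime-for-space trade the abstract describes.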
[451] PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning
Wenfeng Feng, Penghong Zhao, Guochao Jiang, Chuzhan Hao, Yuewei Zhang, Hao Wang, Guohua Liu
Main category: cs.LG
TL;DR: PVPO is a critic-free reinforcement learning method that uses an advantage reference anchor and data pre-sampling to address local optimum and computational cost issues in group policy methods.
Details
Motivation: Existing critic-free RL methods rely heavily on multiple sampling and intra-group comparisons for advantage estimation, which can cause policies to fall into local optima and increase computational costs.
Method: Uses a reference model to roll out in advance and employs the calculated reward score as a reference anchor. Also implements data pre-sampling where the reference model assesses sample difficulty to select high-gain data.
Result: Achieves state-of-the-art performance on nine datasets across two domains, demonstrating robust generalization across multiple tasks and scalable performance across models of varying scales.
Conclusion: PVPO effectively corrects cumulative bias from intra-group comparisons, reduces reliance on rollouts during training, and is orthogonal to other advanced critic-free RL algorithms, making it compatible and complementary to existing methods.
Abstract: Critic-free reinforcement learning methods, particularly group policies, have attracted considerable attention for their efficiency in complex tasks. However, these methods rely heavily on multiple sampling and comparisons within the policy to estimate advantage, which may cause the policy to fall into a local optimum and increase computational cost. To address these issues, we propose PVPO, an efficient reinforcement learning method enhanced by an advantage reference anchor and data pre-sampling. Specifically, we use the reference model to roll out in advance and employ the calculated reward score as a reference anchor. Our approach effectively corrects the cumulative bias introduced by intra-group comparisons and significantly reduces reliance on the number of rollouts during training. Meanwhile, the reference model can assess sample difficulty during data pre-sampling, enabling effective selection of high-gain data to improve training efficiency. Moreover, PVPO is orthogonal to other advanced critic-free RL algorithms, making it compatible with and complementary to these methods. Experiments conducted on nine datasets across two domains demonstrate that PVPO achieves state-of-the-art (SOTA) performance. Our approach not only demonstrates robust generalization across multiple tasks, but also exhibits scalable performance across models of varying scales.
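The difference between an intra-group baseline and a pre-computed anchor can be sketched as follows. The summary does not give PVPO's exact advantage formula, so the anchored form below (reward minus the mean reward of reference-model rollouts) is an assumption made for illustration, and both function names are hypothetical.

```python
from statistics import mean


def group_relative_advantages(rewards):
    """Baseline is the mean reward of the current policy's own rollout group."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def anchored_advantages(rewards, reference_rewards):
    """Baseline is the mean reward of rollouts collected in advance from a
    fixed reference model (assumed form, not taken from the paper)."""
    anchor = mean(reference_rewards)
    return [r - anchor for r in rewards]


# A prompt where every current rollout succeeds.
policy_rewards = [1.0, 1.0, 1.0, 1.0]
# Rewards the reference model obtained on the same prompt before training.
reference_rewards = [0.0, 0.0, 1.0, 0.0]

group_adv = group_relative_advantages(policy_rewards)
anchor_adv = anchored_advantages(policy_rewards, reference_rewards)
```

When all rollouts in a group receive the same reward, the group-relative advantages are all zero and the prompt contributes no learning signal, while a fixed anchor still yields a non-zero advantage. The anchor also does not depend on how many rollouts are drawn during training.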
[452] Second-Order Tensorial Partial Differential Equations on Graphs
Aref Einizade, Fragkiskos D. Malliaros, Jhony H. Giraldo
Main category: cs.LG
TL;DR: Second-order tensorial PDEs on graphs for continuous GNNs that preserve high frequencies and enable fast information propagation
Details
Motivation: Existing graph processing methods rely on discrete filtering or first-order continuous models, which dampen high frequencies and slow information propagation.
Method: Second-order tensorial partial differential equations on graphs (SoTPDEG) that exploit separability of cosine kernels in Cartesian product graphs for efficient spectral decomposition.
Result: Experimental results on spatiotemporal traffic forecasting show superiority over compared methods
Conclusion: The proposed framework provides the first theoretically grounded second-order continuous product graph neural networks with rigorous over-smoothing and stability analysis
Abstract: Processing data on multiple interacting graphs is crucial for many applications, but existing approaches rely mostly on discrete filtering or first-order continuous models, dampening high frequencies and slowing information propagation. In this paper, we introduce second-order tensorial partial differential equations on graphs (SoTPDEG) and propose the first theoretically grounded framework for second-order continuous product graph neural networks (GNNs). Our method exploits the separability of cosine kernels in Cartesian product graphs to enable efficient spectral decomposition while preserving high-frequency components. We further provide rigorous over-smoothing and stability analysis under graph perturbations, establishing a solid theoretical foundation. Experimental results on spatiotemporal traffic forecasting illustrate the superiority over the compared methods.
[453] Abex-rat: Synergizing Abstractive Augmentation and Adversarial Training for Classification of Occupational Accident Reports
Jian Chen, Jiabao Dou, Jinbao Tian, Yunqi Yang, Zhou Li
Main category: cs.LG
TL;DR: ABEX-RAT is a novel framework that combines generative data augmentation with adversarial training to address class imbalance in occupational accident report classification, achieving state-of-the-art performance with 90.32% macro-F1 score.
Details
Motivation: Severe class imbalance in occupational accident datasets compromises model performance, especially for rare but severe incident types, hindering reliable automated classification systems for workplace safety.
Method: Two-step approach: 1) ABEX pipeline uses LLM to distill core incident semantics and generative model to create diverse synthetic samples for underrepresented classes, 2) Lightweight classifier trained with random adversarial training (RAT) that applies stochastic perturbations for enhanced generalization.
Result: Achieved new SOTA performance on OSHA dataset with 90.32% macro-F1 score, significantly outperforming previous SOTA and fine-tuned large model baselines.
Conclusion: The synergistic strategy of generative data augmentation with adversarial training is a highly effective and efficient alternative to brute-force fine-tuning for specialized, imbalanced classification tasks.
Abstract: The automatic classification of occupational accident reports is a critical research area for enhancing workplace safety and enabling large-scale risk analysis. However, the severe class imbalance inherent in these real-world datasets often compromises the performance of analytical models, particularly for rare but severe incident types, hindering the development of reliable automated systems. To address this challenge, we propose ABEX-RAT, a novel and efficient framework that synergizes generative data augmentation with robust adversarial training. Our approach first employs a two-step abstractive-expansive (ABEX) pipeline, which leverages a large language model to distill core incident semantics and then uses a generative model to create diverse, high-quality synthetic samples for underrepresented classes. Subsequently, a lightweight classifier is trained on the augmented data using a computationally efficient random adversarial training (RAT) protocol, which stochastically applies perturbations to enhance model generalization and robustness without significant overhead. Experimental results on the public OSHA dataset demonstrate that our method achieves new state-of-the-art performance, reaching a macro-F1 score of 90.32% and significantly outperforming previous SOTA and fine-tuned large model baselines. Our work validates that this synergistic strategy is a highly effective and efficient alternative to brute-force fine-tuning for specialized, imbalanced classification tasks. The code is publicly available at: https://github.com/nxcc-lab/ABEX-RAT.
[454] Synthetic Survival Data Generation for Heart Failure Prognosis Using Deep Generative Models
Chanon Puttanawarut, Natcha Fongsrisin, Porntep Amornritvanich, Panu Looareesuwan, Cholatid Ratanatharathorn
Main category: cs.LG
TL;DR: Deep learning models successfully generated high-fidelity synthetic heart failure datasets that preserve privacy while maintaining statistical utility for research.
Details
Motivation: Heart failure research faces limitations due to privacy regulations and institutional barriers that restrict access to large, shareable datasets. Synthetic data generation offers a solution to overcome these challenges while maintaining patient confidentiality.
Method: Generated synthetic HF datasets from 12,552 patients using five deep learning models: tabular variational autoencoder (TVAE), normalizing flow, ADSGAN, SurvivalGAN, and tabular denoising diffusion probabilistic models (TabDDPM). Evaluated through statistical similarity metrics, survival prediction using machine learning, and privacy assessments.
Result: SurvivalGAN and TabDDPM showed high fidelity with similar variable distributions and survival curves. SurvivalGAN (C-indices: 0.71-0.76) and TVAE (C-indices: 0.73-0.76) achieved strong survival prediction performance matching real data (C-indices: 0.73-0.76). Privacy evaluation confirmed protection against re-identification attacks.
Conclusion: Deep learning-based synthetic data generation can produce high-fidelity, privacy-preserving HF datasets suitable for research applications, addressing critical data sharing barriers and providing valuable resources for advancing HF research and predictive modeling.
Abstract: Background: Heart failure (HF) research is constrained by limited access to large, shareable datasets due to privacy regulations and institutional barriers. Synthetic data generation offers a promising solution to overcome these challenges while preserving patient confidentiality. Methods: We generated synthetic HF datasets from institutional data comprising 12,552 unique patients using five deep learning models: tabular variational autoencoder (TVAE), normalizing flow, ADSGAN, SurvivalGAN, and tabular denoising diffusion probabilistic models (TabDDPM). We comprehensively evaluated synthetic data utility through statistical similarity metrics, survival prediction using machine learning, and privacy assessments. Results: SurvivalGAN and TabDDPM demonstrated high fidelity to the original dataset, exhibiting similar variable distributions and survival curves after applying histogram equalization. SurvivalGAN (C-indices: 0.71-0.76) and TVAE (C-indices: 0.73-0.76) achieved the strongest performance in survival prediction evaluation, closely matching real data performance (C-indices: 0.73-0.76). Privacy evaluation confirmed protection against re-identification attacks. Conclusions: Deep learning-based synthetic data generation can produce high-fidelity, privacy-preserving HF datasets suitable for research applications. This publicly available synthetic dataset addresses critical data sharing barriers and provides a valuable resource for advancing HF research and predictive modeling.
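For readers unfamiliar with the C-index reported above, the following is a minimal sketch of Harrell's concordance index for right-censored data. It is the textbook pairwise definition, not the evaluation code used in the paper, and the toy patient data are invented for illustration.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: fraction of comparable pairs in which the subject who
    failed earlier was assigned the higher predicted risk (ties count 0.5)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event
            # strictly before subject j's event or censoring time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (invented): follow-up time, event flag (1 = event, 0 = censored),
# and a model's predicted risk for four patients.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk_scores = [0.9, 0.3, 0.2, 0.4]

c_index = concordance_index(times, events, risk_scores)
```

Here five pairs are comparable and four are ranked correctly, giving a C-index of 0.8. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported 0.71 to 0.76 range in context.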
[455] MetaLLMix: An XAI Aided LLM-Meta-learning Based Approach for Hyper-parameters Optimization
Mohammed Tiouti, Mohamed Bal-Ghaoui
Main category: cs.LG
TL;DR: MetaLLMiX is a zero-shot hyperparameter optimization framework that combines meta-learning, explainable AI, and LLM reasoning to recommend optimal hyperparameters and pretrained models without additional trials, achieving competitive performance with drastically reduced computational costs.
Details
Motivation: Current AutoML and LLM-based approaches for model/hyperparameter selection rely on expensive trial-and-error methods with limited interpretability and generalizability, requiring extensive expertise and computation.
Method: Leverages historical experiment outcomes with SHAP explanations for meta-learning, uses efficient LLM reasoning for zero-shot recommendations, and employs LLM-as-judge evaluation to control output quality and format.
Result: Achieves competitive/superior performance to traditional HPO methods, optimal results on 5/8 medical imaging tasks, 99.6-99.9% response time reduction, 2.4-15.7x faster training on 6 datasets, with accuracy within 1-5% of best baselines.
Conclusion: MetaLLMiX provides an efficient, interpretable, and cost-effective alternative to traditional hyperparameter optimization methods, enabling automated model selection without expensive trial runs while maintaining competitive performance.
Abstract: Effective model and hyperparameter selection remains a major challenge in deep learning, often requiring extensive expertise and computation. While AutoML and large language models (LLMs) promise automation, current LLM-based approaches rely on trial and error and expensive APIs, which provide limited interpretability and generalizability. We propose MetaLLMiX, a zero-shot hyperparameter optimization framework combining meta-learning, explainable AI, and efficient LLM reasoning. By leveraging historical experiment outcomes with SHAP explanations, MetaLLMiX recommends optimal hyperparameters and pretrained models without additional trials. We further employ an LLM-as-judge evaluation to control output format, accuracy, and completeness. Experiments on eight medical imaging datasets using nine open-source lightweight LLMs show that MetaLLMiX achieves competitive or superior performance to traditional HPO methods while drastically reducing computational cost. Our local deployment outperforms prior API-based approaches, achieving optimal results on 5 of 8 tasks, response time reductions of 99.6-99.9%, and the fastest training times on 6 datasets (2.4-15.7x faster), maintaining accuracy within 1-5% of best-performing baselines.
[456] Vendi Information Gain for Active Learning and its Application to Ecology
Quan Nguyen, Adji Bousso Dieng
Main category: cs.LG
TL;DR: Vendi information gain (VIG) is a new active learning method that selects images based on dataset-wide prediction uncertainty, achieving 75% accuracy with only 3% of data and 88% accuracy with 10% of data on biodiversity monitoring tasks.
Details
Motivation: Camera trap biodiversity monitoring faces a major bottleneck in species identification due to limited labeling resources. Traditional active learning methods focus on individual prediction uncertainty without considering dataset-wide uncertainty.
Method: Vendi information gain (VIG) policy selects images based on their impact on dataset-wide prediction uncertainty, capturing both informativeness and diversity. Applied to Snapshot Serengeti dataset and compared against common active learning methods.
Result: VIG needs only 3% of data to reach 75% accuracy (baselines require >10%). With 10% data, VIG attains 88% accuracy (12% higher than best baseline). Consistent improvement across metrics and batch sizes, collects more diverse data.
Conclusion: VIG offers significant improvements for biodiversity monitoring in data-limited environments and has broad applicability beyond ecology.
Abstract: While monitoring biodiversity through camera traps has become an important endeavor for ecological research, identifying species in the captured image data remains a major bottleneck due to limited labeling resources. Active learning – a machine learning paradigm that selects the most informative data to label and train a predictive model – offers a promising solution, but typically focuses on uncertainty in the individual predictions without considering uncertainty across the entire dataset. We introduce a new active learning policy, Vendi information gain (VIG), that selects images based on their impact on dataset-wide prediction uncertainty, capturing both informativeness and diversity. We applied VIG to the Snapshot Serengeti dataset and compared it against common active learning methods. VIG needs only 3% of the available data to reach 75% accuracy, a level that baselines require more than 10% of the data to achieve. With 10% of the data, VIG attains 88% predictive accuracy, 12% higher than the best of the baselines. This improvement in performance is consistent across metrics and batch sizes, and we show that VIG also collects more diverse data in the feature space. VIG has broad applicability beyond ecology, and our results highlight its value for biodiversity monitoring in data-limited environments.
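As background for the "Vendi" in VIG, here is a minimal sketch of the Vendi Score, the diversity measure the method's name presumably refers to: the exponential of the Shannon entropy of the eigenvalues of a normalized similarity matrix. This is not the VIG acquisition policy itself, whose definition is not given in the summary. The sketch assumes NumPy is available.

```python
import numpy as np


def vendi_score(similarity):
    """Vendi Score of n items given an n x n positive semidefinite
    similarity matrix with ones on the diagonal."""
    k = np.asarray(similarity, dtype=float)
    n = k.shape[0]
    eigenvalues = np.linalg.eigvalsh(k / n)  # eigenvalues sum to 1
    # Drop (numerically) zero eigenvalues, using the convention 0 * log 0 = 0.
    eigenvalues = eigenvalues[eigenvalues > 1e-12]
    entropy = -np.sum(eigenvalues * np.log(eigenvalues))
    return float(np.exp(entropy))


# Four mutually dissimilar items behave like four distinct items.
all_different = vendi_score(np.eye(4))
# Four identical items behave like a single item.
all_identical = vendi_score(np.ones((4, 4)))
```

The score can be read as an effective number of distinct items, ranging from 1 (all identical) to n (all dissimilar).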
cs.MA
[457] PromptSculptor: Multi-Agent Based Text-to-Image Prompt Optimization
Dawei Xiang, Wenyan Xu, Kexin Chu, Zixu Shen, Tianqi Ding, Wei Zhang
Main category: cs.MA
TL;DR: PromptSculptor is a multi-agent framework that automates iterative prompt optimization for text-to-image models, transforming vague user prompts into detailed, high-quality prompts through collaborative specialized agents.
Details
Motivation: Current text-to-image models require users to craft detailed prompts through multiple refinement rounds, which is time-consuming and requires expertise. There's a need to automate this iterative optimization process to democratize access to high-quality image generation.
Method: A multi-agent framework with four specialized agents: decomposition agents that transform vague prompts into comprehensive ones using Chain-of-Thought reasoning, a self-evaluation agent that aligns modified prompts with original input, and a feedback-tuning agent that incorporates user feedback for iterative refinement.
Result: Experimental results show PromptSculptor significantly enhances output quality and reduces the number of iterations needed for user satisfaction. The framework is model-agnostic and can be integrated with various text-to-image models.
Conclusion: PromptSculptor provides an effective automated solution for prompt optimization, making high-quality image generation more accessible while being adaptable for industrial applications across different text-to-image models.
Abstract: The rapid advancement of generative AI has democratized access to powerful tools such as Text-to-Image models. However, to generate high-quality images, users must still craft detailed prompts specifying scene, style, and context, often through multiple rounds of refinement. We propose PromptSculptor, a novel multi-agent framework that automates this iterative prompt optimization process. Our system decomposes the task into four specialized agents that work collaboratively to transform a short, vague user prompt into a comprehensive, refined prompt. By leveraging Chain-of-Thought reasoning, our framework effectively infers hidden context and enriches scene and background details. To iteratively refine the prompt, a self-evaluation agent aligns the modified prompt with the original input, while a feedback-tuning agent incorporates user feedback for further refinement. Experimental results demonstrate that PromptSculptor significantly enhances output quality and reduces the number of iterations needed for user satisfaction. Moreover, its model-agnostic design allows seamless integration with various T2I models, paving the way for industrial applications.
[458] Between proportionality and envy-freeness: k-proportionality
Guillaume Chèze
Main category: cs.MA
TL;DR: Introduces k-proportionality as a scale between proportional and envy-free cake division, showing impossibility results for k ≤ n-1 in connected pieces scenarios.
Details
Motivation: To bridge the gap between proportional and envy-free division concepts and understand where difficulties in fair division lie by introducing a scalable intermediate concept.
Method: Develops the concept of k-proportionality where k ranges from 2 (envy-free) to n (proportional), then proves impossibility theorems for k ≤ n-1 in connected pieces scenarios.
Result: Shows no k-proportional and equitable division exists for k ≤ n-1 with connected pieces, and no Pareto-optimal k-proportional division exists for k ≤ n-1 - extending known results from just envy-free (k=2) to a broader range.
Conclusion: k-proportionality provides a useful framework for understanding the transition between proportional and envy-free division, revealing that impossibility results occur even with weaker fairness notions than envy-freeness.
Abstract: This article deals with the cake cutting problem. In this setting, there exist two notions of fair division: proportional division (when there are n players, each player believes they receive at least 1/n of the cake) and envy-free division (each player wants to keep his or her share because he or she does not envy the portion given to another player). Some results are valid for proportional division but not for envy-free division. Here, we introduce and study a scale between the proportional division and the envy-free division. The goal is to understand where the gap lies between statements about proportional division and envy-free division. This scale comes from the notion introduced in this article: k-proportionality. When k = n this notion corresponds to the proportional division and when k = 2 it corresponds to envy-free division. With k-proportionality we can understand where some difficulties in fair division lie. First, we show that there are situations in which there is no k-proportional and equitable division of a pie with connected pieces when k ≤ n-1. This result was known only for envy-free division, i.e., k = 2. Next, we prove that there are situations in which there is no Pareto-optimal k-proportional division of a cake with connected pieces when k ≤ n-1. This result was known only for k = 2. These theorems say that we can get an impossibility result even if we do not consider an envy-free division but a weaker notion. Finally, k-proportionality allows us to give a generalization with a uniform statement of theorems about strong envy-free and strong proportional divisions.
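The two endpoints of the scale can be checked mechanically. The sketch below tests proportionality and envy-freeness for a division described by a valuation matrix; it does not implement k-proportionality, whose formal definition is not stated in the summary. The matrix layout, function names, and example numbers are assumptions made for illustration.

```python
from fractions import Fraction as F


def is_proportional(values):
    """values[i][j] is player i's value for piece j; player i receives piece i.
    Each row sums to 1 (every player values the whole cake at 1)."""
    n = len(values)
    return all(values[i][i] >= F(1, n) for i in range(n))


def is_envy_free(values):
    """No player values another player's piece above their own."""
    n = len(values)
    return all(values[i][i] >= values[i][j] for i in range(n) for j in range(n))


# Player 0 gets 4/10 >= 1/3 of the cake by their own measure,
# but values piece 1 at 5/10 and therefore envies player 1.
proportional_only = [
    [F(4, 10), F(5, 10), F(1, 10)],
    [F(2, 10), F(6, 10), F(2, 10)],
    [F(1, 10), F(2, 10), F(7, 10)],
]

# Every player values their own piece at least as much as any other piece.
envy_free = [
    [F(5, 10), F(3, 10), F(2, 10)],
    [F(2, 10), F(6, 10), F(2, 10)],
    [F(1, 10), F(2, 10), F(7, 10)],
]
```

When the whole cake is allocated, an envy-free division is always proportional but not the reverse, which is why the first example passes only the proportionality check.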
[459] Strategic Concealment of Environment Representations in Competitive Games
Yue Guan, Dipankar Maity, Panagiotis Tsiotras
Main category: cs.MA
TL;DR: Strategic concealment of environment representations in competitive games where Defender infers Attacker’s representation from trajectory to place barriers, while Attacker obfuscates to mislead Defender.
Details
Motivation: To investigate how players strategically conceal environment representations in competitive scenarios and how this concealment affects decision-making and outcomes.
Method: Model the interaction as a Bayesian game, solve for Perfect Bayesian Nash Equilibrium via bilinear program integrating Bayesian inference, strategic planning, and belief manipulation.
Result: Purposeful concealment naturally emerges - Attacker randomizes trajectory to manipulate Defender’s belief, inducing suboptimal barrier selections and gaining strategic advantage.
Conclusion: Strategic concealment and obfuscation of environment representations provides competitive advantages in adversarial settings through belief manipulation.
Abstract: This paper investigates the strategic concealment of environment representations used by players in competitive games. We consider a defense scenario in which one player (the Defender) seeks to infer and exploit the representation used by the other player (the Attacker). The interaction between the two players is modeled as a Bayesian game: the Defender infers the Attacker’s representation from its trajectory and places barriers to obstruct the Attacker’s path towards its goal, while the Attacker obfuscates its representation type to mislead the Defender. We solve for the Perfect Bayesian Nash Equilibrium via a bilinear program that integrates Bayesian inference, strategic planning, and belief manipulation. Simulations show that purposeful concealment naturally emerges: the Attacker randomizes its trajectory to manipulate the Defender’s belief, inducing suboptimal barrier selections and thereby gaining a strategic advantage.
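The Defender's inference step can be illustrated with a plain Bayes update over the Attacker's representation type. This is only the belief-update ingredient, not the paper's bilinear program or equilibrium computation; the type names and probabilities are hypothetical.

```python
def update_belief(prior, likelihoods):
    """prior[t] = P(type t); likelihoods[t] = P(observed move | type t).
    Returns the posterior P(type t | observed move)."""
    unnormalized = {t: prior[t] * likelihoods[t] for t in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observed move has zero probability under every type")
    return {t: weight / total for t, weight in unnormalized.items()}


prior = {"coarse_map": 0.5, "fine_map": 0.5}

# A revealing Attacker: the observed move is far more likely under one type.
revealing = update_belief(prior, {"coarse_map": 0.9, "fine_map": 0.1})

# An obfuscating Attacker: both types make the observed move equally often,
# so the Defender's belief does not change.
obfuscated = update_belief(prior, {"coarse_map": 0.5, "fine_map": 0.5})
```

Randomizing so that moves are similarly likely under every type keeps the Defender's posterior close to the prior, which is the mechanism behind the suboptimal barrier selections described in the abstract.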
[460] Teamwork as Linear Interpersonal Dynamics
Andrew Jun Lee, Grace Qiyuan Miao, Rick Dale, Alexia Galati, Hongjing Lu
Main category: cs.MA
TL;DR: The paper introduces the context matrix as a unified representation for interpersonal dynamics that captures both synchrony and directional influence between individuals, validated through simulations and eye-tracking data.
Details
Motivation: Existing measures of interpersonal dynamics (CRQA, correlation, Granger causality, transfer entropy) only capture single dimensions (synchrony or influence) but lack a psychologically meaningful unified representation that varies systematically with behavior.
Method: Proposes the context matrix within a linear dynamical system framework, with psychologically interpretable entries showing how individuals’ current behavior relates to their own and others’ past behaviors. Developed a sequential Bayesian model to infer context matrices from timeseries data and validated it through noisy simulations and human eye-tracking experiments.
Result: The model accurately recovered context matrices in simulations. When applied to eye-tracking data, summary features of inferred context matrices captured expected task-based differences in interpersonal dynamics, predicted task accuracy in psychologically reasonable ways, and showed correspondence with existing measures (CRQA and Granger causality).
Conclusion: The context matrix provides a psychologically meaningful unified representation of interpersonal dynamics that captures both synchrony and directional influence, offering a foundation for broader modeling of interpersonal coordination and influence patterns.
Abstract: Successful teamwork depends on interpersonal dynamics, the ways in which individuals coordinate, influence, and adapt to one another over time. Existing measures of interpersonal dynamics, such as CRQA, correlation, Granger causality, and transfer entropy, typically capture only a single dimension: either the synchrony/coordination or the direction of influence between individuals. What is missing is a psychologically meaningful representation that unifies these dimensions and varies systematically with behavior. We propose the context matrix as one such representation. The context matrix, modeled within a linear dynamical system, has psychologically interpretable entries specifying how much each individual’s current behavior is attributable to their own versus every other group member’s past behaviors. Critically, these entries can be distilled into summary features that represent synchrony and directional influence. Evidence for the context matrix as psychologically meaningful is provided in two steps. First, we develop a sequential Bayesian model that infers context matrices from timeseries data and show that it accurately recovers them in noisy simulations. Second, applying the model to human eyetracking data, we show that summary features of the inferred context matrices capture expected task-based differences in interpersonal dynamics (or lack thereof), predict task accuracy in psychologically reasonable ways, and show some correspondence with existing measures (CRQA and Granger causality). We conclude by situating the context matrix within a broader agenda for modeling interpersonal dynamics.
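To make the linear-dynamics idea concrete, the sketch below assumes the simplest possible form, x_t = C x_{t-1}, with the noise term omitted and a hypothetical two-person matrix. The exact model, the summary features, and the sequential Bayesian inference used in the paper are not reproduced; `step` and the numbers are illustrative only.

```python
def step(context, previous):
    """One noise-free update: x_t[i] = sum_j context[i][j] * x_{t-1}[j]."""
    return [sum(c * x for c, x in zip(row, previous)) for row in context]


# Hypothetical two-person context matrix. Row i describes how person i's
# current behavior depends on each person's past behavior.
context = [
    [0.7, 0.3],  # person 0: mostly self-driven, partly follows person 1
    [0.0, 1.0],  # person 1: ignores person 0 entirely
]

# Start with activity in person 0 only and roll the system forward.
state = [1.0, 0.0]
trajectory = [state]
for _ in range(3):
    state = step(context, state)
    trajectory.append(state)

# A change in person 1 reaches person 0, but not the other way round.
influence_of_1 = step(context, [0.0, 1.0])
```

In this toy matrix the off-diagonal entries are asymmetric (0.3 versus 0.0), which is the kind of directional influence the abstract says the context matrix is meant to expose.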
cs.MM
[461] Evaluation of Objective Image Quality Metrics for High-Fidelity Image Compression
Shima Mohammadi, Mohsen Jenadeleh, Jon Sneyers, Dietmar Saupe, João Ascenso
Main category: cs.MM
TL;DR: Evaluation of image quality assessment metrics for high-fidelity compression where subtle artifacts matter, using JPEG AIC-3 dataset and novel statistical methods.
Details
Motivation: Current objective image quality metrics haven't been thoroughly tested in high-fidelity ranges where preserving subtle details is critical for professional applications, and their reliability near Just Noticeable Difference thresholds is unknown.
Method: Proposed evaluation methodologies using JPEG AIC-3 dataset, introduced Z-RMSE to account for subjective score uncertainty, applied novel statistical tests, analyzed full range and subsets, examined cropping impact in subjective tests.
Result: Comprehensive evaluation performed across different fidelity ranges, with public dataset, benchmarks and evaluation tools released to support further research.
Conclusion: The study addresses the critical need for reliable quality assessment in high-fidelity compression and provides methodologies and tools for evaluating metrics’ performance in detecting subtle artifacts.
Abstract: Nowadays, image compression solutions are increasingly designed to operate within high-fidelity quality ranges, where preserving even the most subtle details of the original image is essential. In this context, the ability to detect and quantify subtle compression artifacts becomes critically important, as even slight degradations can impact perceptual quality in professional or quality-sensitive applications, such as digital archiving, professional editing and web delivery. However, the performance of current objective image quality assessment metrics in this range has not been thoroughly investigated. In particular, it is not well understood how reliably these metrics estimate distortions at or below the threshold of Just Noticeable Difference (JND). This study directly addresses this issue by proposing evaluation methodologies for assessing the performance of objective quality metrics and performing a comprehensive evaluation using the JPEG AIC-3 dataset, which is designed for high-fidelity image compression. Beyond conventional criteria, the study introduces Z-RMSE to incorporate subjective score uncertainty and applies novel statistical tests to assess significant differences between metrics. The analysis spans the full JPEG AIC-3 range and its high- and medium-fidelity subsets and examines the impact of cropping in subjective tests. A public dataset with benchmarks and evaluation tools is released to support further research.
eess.AS
[462] Multi-Modal Embedding-based Target Speaker Enhancement
Zhan Jin
Main category: eess.AS
TL;DR: This paper studies multimodal fusion strategies for target speaker extraction, showing that training with high modality dropout (80%) dramatically improves robustness to missing modalities at test time, while voice embeddings are consistently robust and expression embeddings provide complementary information.
Details
Motivation: Real-world applications of multimodal target speaker extraction often suffer from intermittent modality dropout, but current systems are typically trained under ideal conditions without accounting for this practical challenge.
Method: Built on state-of-the-art audio-visual speech enhancement system with four speaker identity cues: lip embeddings, voice speaker embedding via cross-attention, static face embedding, and novel dynamic expression embedding. Systematically evaluated under zero dropout vs 80% modality dropout training regimes.
Result: Full multimodal ensemble achieves optimal performance under zero dropout but degrades significantly with test-time dropout. Training with 80% dropout dramatically enhances robustness, maintaining superior performance even with severe missing modalities. Voice embeddings show consistent robustness, expression embeddings provide valuable complementary information.
Conclusion: Training strategies must account for real-world imperfections rather than pure performance maximization. High dropout training enables practical reliability in multimodal speech enhancement systems, with voice and expression embeddings being particularly valuable components.
Abstract: Target Speaker Extraction (TSE) is a critical challenge in cocktail party scenarios. While leveraging multiple modalities, such as voice, lip, face, and expression embeddings, can enhance performance, real-world applications often suffer from intermittent modality dropout. This paper presents a comprehensive study on the interactions and robustness of various multimodal fusion strategies under varying degrees of modality dropout. We build upon a state-of-the-art audio-visual speech enhancement system and integrate four distinct speaker identity cues: lip embeddings for synchronized contextual information, a voice speaker embedding extracted via cross-attention for acoustic consistency, a static face embedding for speaker identity, and a novel dynamic expression embedding for frame-wise emotional features. We systematically evaluate different combinations of these modalities under two key training regimes: zero dropout and 80% modality dropout. Extensive experiments demonstrate that while a full multimodal ensemble achieves optimal performance under ideal (zero dropout) conditions, its effectiveness diminishes significantly when test-time dropout occurs without prior exposure during training. Crucially, we show that training with a high (80%) modality dropout rate dramatically enhances model robustness, enabling the system to maintain superior performance even under severe test-time missing modalities. Our findings highlight that voice embeddings exhibit consistent robustness, while the proposed expression embedding provides valuable complementary information. This work underscores the importance of training strategies that account for real-world imperfection, moving beyond pure performance maximization to achieve practical reliability in multimodal speech enhancement systems.
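Modality dropout during training can be pictured as randomly blanking whole cue embeddings. The sketch below is an assumption-laden illustration, not the paper's implementation: it drops each modality independently, represents a missing modality as an all-zero vector, and uses tiny made-up embeddings. The function name `apply_modality_dropout` is hypothetical.

```python
import random


def apply_modality_dropout(embeddings, drop_prob, rng):
    """Zero each modality's embedding independently with probability drop_prob."""
    dropped = {}
    for name, vector in embeddings.items():
        if rng.random() < drop_prob:
            dropped[name] = [0.0] * len(vector)  # modality treated as missing
        else:
            dropped[name] = list(vector)  # modality passed through unchanged
    return dropped


# Hypothetical 2-dimensional embeddings for the four cues named in the abstract.
cues = {
    "lip": [0.1, 0.2],
    "voice": [0.3, 0.4],
    "face": [0.5, 0.6],
    "expression": [0.7, 0.8],
}

rng = random.Random(0)  # seeded for repeatability
train_sample = apply_modality_dropout(cues, drop_prob=0.8, rng=rng)
ideal_sample = apply_modality_dropout(cues, drop_prob=0.0, rng=rng)
```

With `drop_prob=0.8` most training samples see only a subset of cues, which matches the intuition in the abstract that the model must learn not to rely on any single modality being present.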
[463] Investigating the Potential of Multi-Stage Score Fusion in Spoofing-Aware Speaker Verification
Oguzhan Kurnaz, Tomi Kinnunen, Cemal Hanilci
Main category: eess.AS
TL;DR: A multi-stage spoof-aware speaker verification framework combining ASV and countermeasure systems achieves 24% relative improvement over baseline with 1.30% EER.
Details
Motivation: Current automatic speaker verification systems remain vulnerable to spoofing attacks, necessitating better integration of verification and countermeasure subsystems.
Method: Multi-stage approach using ECAPA-TDNN (ASV) and AASIST (CM) subsystems with SVM and logistic regression classifiers, integrating outputs with original scores and adding RawGAT CM auxiliary score.
Result: Achieved 1.30% equal error rate on SASV2022 evaluation dataset, representing 24% relative improvement over baseline system.
Conclusion: The multi-stage integration approach effectively enhances spoof-aware speaker verification performance compared to conventional single-stage fusion methods.
Abstract: Despite improvements in automatic speaker verification (ASV), vulnerability against spoofing attacks remains a major concern. In this study, we investigate the integration of ASV and countermeasure (CM) subsystems into a modular spoof-aware speaker verification (SASV) framework. Unlike conventional single-stage score-level fusion methods, we explore the potential of a multi-stage approach that utilizes the ASV and CM systems in multiple stages. By leveraging ECAPA-TDNN (ASV) and AASIST (CM) subsystems, we consider support vector machine and logistic regression classifiers to achieve SASV. In the second stage, we integrate their outputs with the original score to revise fusion back-end classifiers. Additionally, we incorporate another auxiliary score from RawGAT (CM) to further enhance our SASV framework. Our approach yields an equal error rate (EER) of 1.30% on the evaluation dataset of the SASV2022 challenge, representing a 24% relative improvement over the baseline system.
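For readers unfamiliar with the two numbers quoted above, here is a rough sketch of how an equal error rate and a relative improvement can be computed from fused scores. It is a simplified threshold sweep over toy scores, not the SASV2022 evaluation tooling; in a spoof-aware setting the non-target list would be assumed to contain both impostor and spoofed trials. Function names and scores are illustrative.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Sweep thresholds over observed scores; return the rate where FAR and FRR meet."""
    best_gap, best_eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: a genuine target trial scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scored at or above the threshold.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


def relative_improvement(baseline, new):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - new) / baseline


# Toy fused scores, higher means "accept".
targets = [0.9, 0.8, 0.4]
nontargets = [0.5, 0.3, 0.1]
eer = equal_error_rate(targets, nontargets)
```

In this toy case one target and one non-target are misclassified at the crossing threshold of 0.5, so the EER is one third. Production toolkits interpolate between thresholds rather than picking the nearest one.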
[464] MSR-Codec: A Low-Bitrate Multi-Stream Residual Codec for High-Fidelity Speech Generation with Information Disentanglement
Jingyu Li, Guangyan Zhang, Zhen Ye, Yiwen Guo
Main category: eess.AS
TL;DR: A low-bitrate multi-scale residual audio codec that encodes speech into four disentangled streams (semantic, timbre, prosody, residual) for high-fidelity reconstruction and effective TTS/voice conversion applications.
Details
Motivation: Audio codecs are critical for modern speech generation systems, and there's a need for low-bitrate solutions that can achieve high-fidelity reconstruction while enabling information disentanglement for various speech processing tasks.
Method: Introduces a multi-scale residual codec architecture that encodes speech into four distinct streams. Uses this codec to construct a two-stage language model for text-to-speech synthesis with lightweight design and minimal data requirements.
Result: Achieves state-of-the-art Word Error Rate (WER) and superior speaker similarity compared to larger models. The codec design proves highly effective for voice conversion, enabling independent manipulation of speaker timbre and prosody.
Conclusion: The proposed codec architecture successfully achieves competitive low-bitrate speech encoding with inherent information disentanglement capabilities, making it effective for both TTS synthesis and voice conversion applications with minimal computational requirements.
Abstract: Audio codecs are a critical component of modern speech generation systems. This paper introduces a low-bitrate, multi-scale residual codec that encodes speech into four distinct streams: semantic, timbre, prosody, and residual. This architecture achieves high-fidelity speech reconstruction at competitive low bitrates while demonstrating an inherent ability for information disentanglement. We construct a two-stage language model for text-to-speech (TTS) synthesis using this codec, which, despite its lightweight design and minimal data requirements, achieves a state-of-the-art Word Error Rate (WER) and superior speaker similarity compared to several larger models. Furthermore, the codec’s design proves highly effective for voice conversion, enabling independent manipulation of speaker timbre and prosody.
[465] Token-based Attractors and Cross-attention in Spoof Diarization
Kyo-Won Koo, Chan-yeong Lim, Jee-weon Jung, Hye-jin Shim, Ha-Jin Yu
Main category: eess.AS
TL;DR: The paper introduces learnable tokens to improve spoof diarization by better distinguishing between bona fide and spoofed speech regions through discriminative representation learning.
Details
Motivation: Prior two-branch models for spoof diarization have limited ability to capture complex spoofing patterns and lack explicit reference points for distinguishing between bona fide and various spoofing types.
Method: Proposes using learnable tokens that represent acoustic features of bona fide and spoofed speech. These attractors interact with frame-level embeddings to extract discriminative representations for improved separation.
Result: The approach consistently outperforms existing methods on the PartialSpoof dataset in both bona fide detection and spoofing method clustering tasks.
Conclusion: Learnable tokens provide an effective mechanism for improving spoof diarization by enhancing the discrimination between genuine and generated speech through better feature representation.
Abstract: Spoof diarization identifies "what spoofed when" in a given speech by temporally locating spoofed regions and determining their manipulation techniques. As a first step toward this task, prior work proposed a two-branch model for localization and spoof type clustering, which laid the foundation for spoof diarization. However, its simple structure limits the ability to capture complex spoofing patterns and lacks explicit reference points for distinguishing between bona fide and various spoofing types. To address these limitations, our approach introduces learnable tokens where each token represents acoustic features of bona fide and spoofed speech. These attractors interact with frame-level embeddings to extract discriminative representations, improving separation between genuine and generated speech. Extensive experiments on the PartialSpoof dataset consistently demonstrate that our approach outperforms existing methods in bona fide detection and spoofing method clustering.
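The interaction between attractor tokens and frame-level embeddings can be sketched as scaled dot-product attention. Everything below is an assumption for illustration: the tokens are fixed rather than learned, each frame attends over the tokens (the paper may attend in the other direction or both), and the two-dimensional vectors are made up.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attend_to_tokens(frame, tokens):
    """Attention weights of one frame embedding over the attractor tokens."""
    scale = math.sqrt(len(frame))
    scores = [sum(f * t for f, t in zip(frame, token)) / scale for token in tokens]
    return softmax(scores)


# Hypothetical fixed tokens: index 0 stands for bona fide, index 1 for spoofed.
tokens = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical frame-level embeddings for two frames of an utterance.
frames = [[2.0, 0.1], [0.2, 3.0]]

weights = [attend_to_tokens(frame, tokens) for frame in frames]
# Assign each frame to the token it attends to most strongly.
labels = [w.index(max(w)) for w in weights]
```

Here the first frame aligns with the bona fide token and the second with the spoof token, giving the tokens the role of explicit reference points that the abstract says earlier two-branch models lacked.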
[466] Importance-Weighted Domain Adaptation for Sound Source Tracking
Bingxiang Zhong, Thomas Dietzen
Main category: eess.AS
TL;DR: Proposes unsupervised domain adaptation for sound source tracking to bridge synthetic-to-real domain gap using fixed-dimensional RNN features and importance-weighted adversarial training.
Details
Motivation: Deep learning for sound source localization requires large labeled datasets, but real recordings are costly to annotate. Synthetic data causes domain shift, and existing UDA approaches don't address the specific challenges of sound source tracking.
Method: Uses final hidden state of RNN for fixed-dimensional feature representation to handle variable-length sequences, and importance-weighted adversarial training to address directional diversity mismatch by prioritizing synthetic samples similar to real domain.
Result: Experimental results show successful adaptation of synthetic-trained models to real environments, improving sound source tracking performance.
Conclusion: The proposed UDA approach effectively addresses the specific challenges of sound source tracking, enabling better performance when adapting from synthetic to real data domains.
Abstract: In recent years, deep learning has significantly advanced sound source localization (SSL). However, training such models requires large labeled datasets, and real recordings are costly to annotate, in particular if sources move. While synthetic data using simulated room impulse responses (RIRs) and noise offers a practical alternative, models trained on synthetic data suffer from domain shift in real environments. Unsupervised domain adaptation (UDA) can address this by aligning synthetic and real domains without relying on labels from the latter. The few existing UDA approaches, however, focus on static SSL and do not account for the problem of sound source tracking (SST), which presents two specific domain adaptation challenges. First, variable-length input sequences create mismatches in feature dimensionality across domains. Second, the angular coverages of the synthetic and the real data may not be well aligned, either due to partial domain overlap or due to batch size constraints, which we refer to as directional diversity mismatch. To address these, we propose a novel UDA approach tailored for SST based on two key features. We employ the final hidden state of a recurrent neural network as a fixed-dimensional feature representation to handle variable-length sequences. Further, we use importance-weighted adversarial training to tackle directional diversity mismatch by prioritizing synthetic samples similar to the real domain. Experimental results demonstrate that our approach successfully adapts synthetic-trained models to real environments, improving SST performance.
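The first of the two ideas, using the final hidden state as a fixed-dimensional feature, can be shown with a tiny recurrent network. This is a sketch under stated assumptions: a plain tanh Elman RNN with hand-picked weights stands in for whatever recurrent architecture the paper uses, and the importance-weighted adversarial training is not shown.

```python
import math


def final_hidden_state(sequence, w_in, w_rec):
    """Run a tanh Elman RNN over the sequence and return only the last hidden state."""
    hidden = [0.0] * len(w_rec)
    for frame in sequence:
        hidden = [
            math.tanh(
                sum(w * x for w, x in zip(w_in[i], frame))
                + sum(w * h for w, h in zip(w_rec[i], hidden))
            )
            for i in range(len(w_rec))
        ]
    return hidden


# Hypothetical fixed weights: 2 input features per frame, 3 hidden units.
w_in = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
w_rec = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]

# Two clips of very different lengths.
short_clip = [[1.0, 0.0]] * 4
long_clip = [[0.0, 1.0]] * 50

features = [final_hidden_state(clip, w_in, w_rec) for clip in (short_clip, long_clip)]
```

Both clips map to a 3-dimensional vector regardless of length, so a domain discriminator can compare synthetic and real sequences without any padding or truncation.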
[467] Quality Assessment of Noisy and Enhanced Speech with Limited Data: UWB-NTIS System for VoiceMOS 2024
Marie Kunešová, Aleš Pražák, Jan Lehečka
Main category: eess.AS
TL;DR: A system using wav2vec 2.0 with two-stage transfer learning for non-intrusive speech quality prediction, achieving top results in VoiceMOS 2024 Challenge Track 3 with limited training data.
Details
Motivation: To develop a system that can predict ITU-T P.835 speech quality metrics (SIG, BAK, OVRL) without reference signals and with only 100 labeled training samples, addressing the challenge of limited subjective data.
Method: Uses wav2vec 2.0 with two-stage transfer learning: initial fine-tuning on automatically labeled noisy data, followed by adaptation to the challenge data. Post-challenge experiments added artificially degraded data to improve performance.
Result: Achieved best performance on BAK prediction (LCC=0.867) and close second in OVRL (LCC=0.711). Post-challenge improvements raised SIG prediction correlation from 0.207 to 0.516.
Conclusion: Transfer learning with targeted data generation is effective for predicting P.835 scores under severe data constraints, demonstrating the value of multi-stage training approaches with synthetic data augmentation.
Abstract: We present a system for non-intrusive prediction of speech quality in noisy and enhanced speech, developed for Track 3 of the VoiceMOS 2024 Challenge. The task required estimating the ITU-T P.835 metrics SIG, BAK, and OVRL without reference signals and with only 100 subjectively labeled utterances for training. Our approach uses wav2vec 2.0 with a two-stage transfer learning strategy: initial fine-tuning on automatically labeled noisy data, followed by adaptation to the challenge data. The system achieved the best performance on BAK prediction (LCC=0.867) and a very close second place in OVRL (LCC=0.711) in the official evaluation. Post-challenge experiments show that adding artificially degraded data to the first fine-tuning stage substantially improves SIG prediction, raising correlation with ground truth scores from 0.207 to 0.516. These results demonstrate that transfer learning with targeted data generation is effective for predicting P.835 scores under severe data constraints.
eess.IV
[468] Physics-Informed Neural Networks vs. Physics Models for Non-Invasive Glucose Monitoring: A Comparative Study Under Realistic Synthetic Conditions
Riyaadh Gani
Main category: eess.IV
TL;DR: First ultra-realistic NIR glucose simulator with comprehensive noise modeling shows traditional Beer-Lambert method outperforms complex neural networks for non-invasive glucose monitoring.
Details
Motivation: Existing non-invasive glucose monitoring datasets fail to account for real-world hardware noise, environmental factors, and physiological variations, limiting practical deployment outside lab conditions.
Method: Developed an ultra-realistic near-infrared simulator incorporating 12-bit ADC quantization, LED aging, photodiode dark noise, temperature/humidity variations, contact pressure changes, skin melanin types, and diurnal glucose patterns. Benchmarked six methods including Enhanced Beer-Lambert, three physics-informed neural networks, selective radiative-transfer PINN, and shallow DNN.
Result: Enhanced Beer-Lambert achieved best performance with 13.6 mg/dL RMSE, 95.8% Clarke-A accuracy, and 93.8% accuracy within +/-15% using only 56 parameters and 0.01 ms inference time, outperforming the best PINN (14.6 mg/dL) and the SDNN baseline (35.1 mg/dL).
Conclusion: Traditional physics-based methods outperform complex neural networks for glucose monitoring, challenging the assumption that deeper PINNs are superior. Provides open reference stack for rapid prototyping of embedded optical glucose sensors.
Abstract: Non-invasive glucose monitors often fail outside the lab because existing datasets ignore hardware noise, environmental drift, and person-to-person physiology. We introduce the first ultra-realistic near-infrared (NIR) simulator that injects 12-bit ADC quantisation, +/-0.1% LED ageing, photodiode dark noise, 15-45 C temperature, 30-90% relative humidity, contact-pressure variation, Fitzpatrick I-VI melanin, and diurnal glucose excursions (dawn phenomenon). Using this platform (rho glucose-NIR = 0.21), we benchmark six methods: Enhanced Beer-Lambert (physics-engineered ridge regression), three physics-informed neural networks (PINNs), a selective radiative-transfer PINN, and a shallow DNN. Beer-Lambert achieves 13.6 mg/dL RMSE, 95.8% Clarke-A and 93.8% +/-15% accuracy with only 56 parameters and 0.01 ms inference, outperforming the best PINN (14.6 mg/dL) and the SDNN baseline (35.1 mg/dL). Results overturn the assumption that deeper PINNs dominate and supply an open, end-to-end reference stack for rapid prototyping of embedded optical glucose sensors.
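As background for the Beer-Lambert baseline, the sketch below shows the textbook relation A = epsilon * c * l together with a simple 12-bit quantizer of the kind the simulator is said to inject. It is not the paper's Enhanced Beer-Lambert model (a physics-engineered ridge regression) and not its simulator; the function names, the absorptivity, and the path length are hypothetical values chosen only so the arithmetic is easy to follow.

```python
import math


def quantize_adc(fraction_of_full_scale, bits=12):
    """Round a normalized reading in [0, 1] to the nearest ADC code, then rescale."""
    levels = (1 << bits) - 1  # 4095 steps for a 12-bit converter
    clipped = min(max(fraction_of_full_scale, 0.0), 1.0)
    return round(clipped * levels) / levels


def absorbance(transmitted, incident):
    """Beer-Lambert absorbance: A = -log10(I / I0)."""
    return -math.log10(transmitted / incident)


def concentration(absorbance_value, molar_absorptivity, path_length_cm):
    """Invert A = epsilon * c * l for the concentration c."""
    return absorbance_value / (molar_absorptivity * path_length_cm)


# Hypothetical reading: a quarter of the incident light reaches the photodiode.
reading = quantize_adc(0.25)
a = absorbance(reading, 1.0)
estimate = concentration(a, molar_absorptivity=2.0, path_length_cm=0.5)
```

The quantizer limits the reading to steps of 1/4095 of full scale, which is one concrete reason a weak glucose signal can be hard to recover from raw NIR intensities.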
[469] Enhancing Radiographic Disease Detection with MetaCheX, a Context-Aware Multimodal Model
Nathan He, Cody Chen
Main category: eess.IV
TL;DR: MetaCheX is a multimodal framework that combines chest X-ray images with patient metadata to improve diagnostic accuracy and reduce bias in radiology AI systems.
Details
Motivation: Existing deep learning models for chest radiology often neglect patient metadata, which limits diagnostic accuracy and fairness in clinical decision-making.
Method: Combines a CNN backbone for image processing with a multilayer perceptron for metadata processing, integrated through a shared classifier. Evaluated on CheXpert Plus dataset.
Result: Outperformed radiograph-only baseline models across multiple CNN architectures, with significant improvement in AUROC. Reduced algorithmic bias and enhanced generalizability across diverse patient populations.
Conclusion: MetaCheX advances clinical AI toward robust, context-aware radiographic disease detection by effectively integrating patient metadata with imaging data.
Abstract: Existing deep learning models for chest radiology often neglect patient metadata, limiting diagnostic accuracy and fairness. To bridge this gap, we introduce MetaCheX, a novel multimodal framework that integrates chest X-ray images with structured patient metadata to replicate clinical decision-making. Our approach combines a convolutional neural network (CNN) backbone with metadata processed by a multilayer perceptron through a shared classifier. Evaluated on the CheXpert Plus dataset, MetaCheX consistently outperformed radiograph-only baseline models across multiple CNN architectures. By integrating metadata, the overall diagnostic accuracy was significantly improved, measured by an increase in AUROC. The results of this study demonstrate that metadata reduces algorithmic bias and enhances model generalizability across diverse patient populations. MetaCheX advances clinical artificial intelligence toward robust, context-aware radiographic disease detection.
[470] DinoAtten3D: Slice-Level Attention Aggregation of DinoV2 for 3D Brain MRI Anomaly Classification
Fazle Rafsani, Jay Shah, Catherine D. Chong, Todd J. Schwedt, Teresa Wu
Main category: eess.IV
TL;DR: Attention-based framework using DINOv2 for 3D medical image anomaly classification with adaptive slice weighting and composite loss to handle data scarcity and class imbalance.
Details
Motivation: Address challenges in medical imaging anomaly detection including limited annotated data, class imbalance, and high labeling costs by leveraging pretrained vision foundation models.
Method: Uses DINOv2 as pretrained feature extractor on 2D axial MRI slices with soft attention mechanism for adaptive slice-level importance weights. Employs composite loss combining supervised contrastive learning with class-variance regularization.
Result: Demonstrated strong anomaly classification performance on ADNI dataset and institutional multi-class headache cohort despite limited data and significant class imbalance.
Conclusion: Pretrained 2D foundation models combined with attention-based slice aggregation provide effective solution for robust volumetric anomaly detection in medical imaging.
Abstract: Anomaly detection and classification in medical imaging are critical for early diagnosis but remain challenging due to limited annotated data, class imbalance, and the high cost of expert labeling. Emerging vision foundation models such as DINOv2, pretrained on extensive, unlabeled datasets, offer generalized representations that can potentially alleviate these limitations. In this study, we propose an attention-based global aggregation framework tailored specifically for 3D medical image anomaly classification. Leveraging the self-supervised DINOv2 model as a pretrained feature extractor, our method processes individual 2D axial slices of brain MRIs, assigning adaptive slice-level importance weights through a soft attention mechanism. To further address data scarcity, we employ a composite loss function combining supervised contrastive learning with class-variance regularization, enhancing inter-class separability and intra-class consistency. We validate our framework on the ADNI dataset and an institutional multi-class headache cohort, demonstrating strong anomaly classification performance despite limited data availability and significant class imbalance. Our results highlight the efficacy of utilizing pretrained 2D foundation models combined with attention-based slice aggregation for robust volumetric anomaly detection in medical imaging. Our implementation is publicly available at https://github.com/Rafsani/DinoAtten3D.git.
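Slice-level soft attention amounts to a softmax-weighted average of per-slice feature vectors. The sketch below assumes the per-slice scores are already available (in the paper they would come from a learned scoring layer on top of DINOv2 features) and uses tiny made-up vectors; `aggregate_slices` is a hypothetical name.

```python
import math


def aggregate_slices(slice_features, slice_scores):
    """Softmax the per-slice scores and return the weighted sum of slice features."""
    peak = max(slice_scores)
    exps = [math.exp(s - peak) for s in slice_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(slice_features[0])
    pooled = [
        sum(w * feature[d] for w, feature in zip(weights, slice_features))
        for d in range(dim)
    ]
    return pooled, weights


# Hypothetical 2-dimensional features for two axial slices.
slice_features = [[1.0, 0.0], [0.0, 1.0]]

# Equal scores reduce to plain averaging over slices.
uniform_embedding, uniform_weights = aggregate_slices(slice_features, [0.0, 0.0])

# A much higher score lets one slice dominate the volume-level embedding.
focused_embedding, focused_weights = aggregate_slices(slice_features, [10.0, 0.0])
```

The resulting volume-level vector has the same dimension as a single slice feature, so a standard classifier head can be applied regardless of how many slices a scan contains.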
[471] DeepEyeNet: Generating Medical Report for Retinal Images
Jia-Hong Huang
Main category: eess.IV
TL;DR: AI-based automated medical report generation for retinal images using multi-modal deep learning to address ophthalmologist shortages and improve diagnostic efficiency.
Details
Motivation: Address the growing imbalance between retinal disease prevalence and limited ophthalmologist workforce, reducing time-consuming manual report generation and enabling faster diagnosis.
Method: Multi-modal deep learning approach capturing text-image interactions, improved medical keyword representation, strategies to overcome RNN limitations for long-range dependencies, and interpretability enhancement techniques.
Result: Achieves state-of-the-art performance with rigorous evaluation using various metrics, demonstrating superior automated report generation capabilities.
Conclusion: AI has revolutionary potential to automate retinal disease diagnosis through medical report generation, improving clinical efficiency, diagnostic accuracy, and patient care outcomes.
Abstract: The increasing prevalence of retinal diseases poses a significant challenge to the healthcare system, as the demand for ophthalmologists surpasses the available workforce. This imbalance creates a bottleneck in diagnosis and treatment, potentially delaying critical care. Traditional methods of generating medical reports from retinal images rely on manual interpretation, which is time-consuming and prone to errors, further straining ophthalmologists’ limited resources. This thesis investigates the potential of Artificial Intelligence (AI) to automate medical report generation for retinal images. AI can quickly analyze large volumes of image data, identifying subtle patterns essential for accurate diagnosis. By automating this process, AI systems can greatly enhance the efficiency of retinal disease diagnosis, reducing doctors’ workloads and enabling them to focus on more complex cases. The proposed AI-based methods address key challenges in automated report generation: (1) A multi-modal deep learning approach captures interactions between textual keywords and retinal images, resulting in more comprehensive medical reports; (2) Improved methods for medical keyword representation enhance the system’s ability to capture nuances in medical terminology; (3) Strategies to overcome RNN-based models’ limitations, particularly in capturing long-range dependencies within medical descriptions; (4) Techniques to enhance the interpretability of the AI-based report generation system, fostering trust and acceptance in clinical practice. These methods are rigorously evaluated using various metrics and achieve state-of-the-art performance. This thesis demonstrates AI’s potential to revolutionize retinal disease diagnosis by automating medical report generation, ultimately improving clinical efficiency, diagnostic accuracy, and patient care.
[472] A Computational Pipeline for Patient-Specific Modeling of Thoracic Aortic Aneurysm: From Medical Image to Finite Element Analysis
Jiasong Chen, Linchen Qian, Ruonan Gong, Christina Sun, Tongran Qin, Thuy Pham, Caitlin Martin, Mohammad Zafar, John Elefteriades, Wei Sun, Liang Liang
Main category: eess.IV
TL;DR: This paper presents a framework for predicting thoracic aortic aneurysm (TAA) rupture risk using deep learning segmentation and finite element analysis to create patient-specific biomechanical models from 3D CT scans.
Details
Motivation: Thoracic aortic aneurysms are a leading cause of death in adults, and current diagnostic methods need improved predictive capabilities for rupture risk assessment. Patient-specific modeling can provide more accurate risk evaluation.
Method: Combines deep learning-based image segmentation of 3D CT scans with finite element analysis. Anatomical structures are segmented, converted to hexahedral meshes, and used for biomechanical simulations to assess wall stresses.
Result: The approach enables detailed patient-specific assessment of aortic wall stresses and biomechanical behaviors, supporting rupture risk assessment that considers wall stress alongside standard geometric measurements.
Conclusion: Patient-specific finite element modeling based on deep learning segmentation provides a promising biomechanical framework for improved TAA rupture risk prediction and personalized treatment planning.
Abstract: The aorta is the body’s largest arterial vessel, serving as the primary pathway for oxygenated blood within the systemic circulation. Aortic aneurysms consistently rank among the top twenty causes of mortality in the United States. Thoracic aortic aneurysm (TAA) arises from abnormal dilation of the thoracic aorta and remains a clinically significant disease, ranking as one of the leading causes of death in adults. A thoracic aortic aneurysm ruptures when the integrity of all aortic wall layers is compromised due to elevated blood pressure. Currently, three-dimensional computed tomography (3D CT) is considered the gold standard for diagnosing TAA. The geometric characteristics of the aorta, which can be quantified from medical imaging, and stresses on the aortic wall, which can be obtained by finite element analysis (FEA), are critical in evaluating the risk of rupture and dissection. Deep learning-based image segmentation has emerged as a reliable method for extracting anatomical regions of interest from medical images. Voxel-based segmentation masks of anatomical structures are typically converted into structured mesh representations to enable accurate simulation. Hexahedral meshes are commonly used in finite element simulations of the aorta due to their computational efficiency and superior simulation accuracy. Due to anatomical variability, patient-specific modeling enables detailed assessment of individual anatomical and biomechanical behaviors, supporting precise simulations, accurate diagnoses, and personalized treatment strategies. Finite element (FE) simulations provide valuable insights into the biomechanical behaviors of tissues and organs in clinical studies. Developing accurate FE models represents a crucial initial step in establishing a patient-specific, biomechanically based framework for predicting the risk of TAA.
[473] MEGAN: Mixture of Experts for Robust Uncertainty Estimation in Endoscopy Videos
Damola Agbelese, Krishna Chaitanya, Pushpak Pati, Chaitanya Parmar, Pooya Mobadersany, Shreyas Fadnavis, Lindsey Surace, Shadi Yarandi, Louis R. Ghanem, Molly Lucas, Tommaso Mansi, Oana Gabriela Cula, Pablo F. Damasceno, Kristopher Standish
Main category: eess.IV
TL;DR: MEGAN is a Multi-Expert Gating Network that aggregates uncertainty estimates and predictions from multiple AI experts trained with diverse ground truths, improving prediction confidence and calibration in medical AI applications.
Details
Motivation: Traditional uncertainty quantification methods in medical AI rely on single expert annotations, ignoring inter-rater variability which is prevalent in healthcare settings like ulcerative colitis severity assessment.
Method: MEGAN uses a gating network to optimally combine predictions and uncertainties from multiple Evidential Deep Learning models trained with diverse ground truths and modeling strategies.
Result: In a large-scale prospective UC clinical trial, MEGAN achieved a 3.5% F1-score improvement and a 30.5% reduction in Expected Calibration Error compared to existing methods, enabling uncertainty-guided sample stratification.
Conclusion: MEGAN effectively addresses inter-rater variability in medical AI, enhances prediction reliability, reduces annotation burden, and increases efficiency in clinical trials through improved uncertainty quantification.
Abstract: Reliable uncertainty quantification (UQ) is essential in medical AI. Evidential Deep Learning (EDL) offers a computationally efficient way to quantify model uncertainty alongside predictions, unlike traditional methods such as Monte Carlo (MC) Dropout and Deep Ensembles (DE). However, all these methods often rely on a single expert’s annotations as ground truth for model training, overlooking the inter-rater variability in healthcare. To address this issue, we propose MEGAN, a Multi-Expert Gating Network that aggregates uncertainty estimates and predictions from multiple AI experts via EDL models trained with diverse ground truths and modeling strategies. MEGAN’s gating network optimally combines predictions and uncertainties from each EDL model, enhancing overall prediction confidence and calibration. We extensively benchmark MEGAN on endoscopy videos for ulcerative colitis (UC) disease severity estimation, assessed by visual labeling of the Mayo Endoscopic Subscore (MES), where inter-rater variability is prevalent. In a large-scale prospective UC clinical trial, MEGAN achieved a 3.5% improvement in F1-score and a 30.5% reduction in Expected Calibration Error (ECE) compared to existing methods. Furthermore, MEGAN facilitated uncertainty-guided sample stratification, reducing the annotation burden and potentially increasing efficiency and consistency in UC trials.
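The Expected Calibration Error cited in this abstract can be made concrete with a small sketch. It assumes the common equal-width binning definition, in which ECE is the sample-weighted mean of |accuracy - confidence| over confidence bins. The abstract does not state the binning the authors used, and the function name and the default of 10 bins are illustrative choices, not taken from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    Hypothetical helper for illustration; not the MEGAN authors' code.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin ids 0..n_bins-1; a confidence of 1.0 lands in the last bin.
    bin_ids = np.clip(np.digitize(confidences, edges[1:-1], right=True), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the fraction of samples in this bin
    return ece

# Four predictions at 90% confidence, three of them correct: the gap is about 0.15.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A lower value means predicted confidence tracks observed accuracy more closely, which is the sense in which a reduction in ECE indicates better calibration.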
[474] Pitfalls of defacing whole-head MRI: re-identification risk with diffusion models and compromised research potential
Chenyu Gao, Kaiwen Xu, Michael E. Kim, Lianrui Zuo, Zhiyuan Li, Derek B. Archer, Timothy J. Hohman, Ann Zenobia Moore, Luigi Ferrucci, Lori L. Beason-Held, Susan M. Resnick, Christos Davatzikos, Jerry L. Prince, Bennett A. Landman
Main category: eess.IV
TL;DR: Defacing of head MRIs fails to protect privacy as diffusion models can accurately reconstruct original faces, and also removes valuable anatomical information useful for downstream tasks like muscle radiodensity prediction.
Details
Motivation: To evaluate whether MRI defacing techniques truly protect privacy given advances in deep generative models, and to assess if the altered facial voxels contain universally useful information beyond privacy concerns.
Method: Developed a refacing pipeline using cascaded diffusion probabilistic models (DPMs) trained on 180 subjects and tested on 484 unseen subjects. Also predicted CT-derived skeletal muscle radiodensity from facial voxels in both defaced and original MRIs.
Result: DPMs generated high-fidelity faces resembling originals with significantly smaller surface distances than population average. Defaced images showed significantly weaker correlation for muscle radiodensity predictions, with some correlations becoming statistically insignificant after defacing.
Conclusion: Defacing fails to protect privacy against modern generative models and eliminates valuable anatomical information, suggesting current defacing methods may be inadequate for both privacy protection and research utility.
Abstract: Defacing is often applied to head magnetic resonance image (MRI) datasets prior to public release to address privacy concerns. The alteration of facial and nearby voxels has provoked discussions about the true capability of these techniques to ensure privacy as well as their impact on downstream tasks. With advancements in deep generative models, the extent to which defacing can protect privacy is uncertain. Additionally, while the altered voxels are known to contain valuable anatomical information, their potential to support research beyond the anatomical regions directly affected by defacing remains uncertain. To evaluate these considerations, we develop a refacing pipeline that recovers faces in defaced head MRIs using cascaded diffusion probabilistic models (DPMs). The DPMs are trained on images from 180 subjects and tested on images from 484 unseen subjects, 469 of whom are from a different dataset. To assess whether the altered voxels in defacing contain universally useful information, we also predict computed tomography (CT)-derived skeletal muscle radiodensity from facial voxels in both defaced and original MRIs. The results show that DPMs can generate high-fidelity faces that resemble the original faces from defaced images, with surface distances to the original faces significantly smaller than those of a population average face (p < 0.05). This performance also generalizes well to previously unseen datasets. For skeletal muscle radiodensity predictions, using defaced images results in significantly weaker Spearman’s rank correlation coefficients compared to using original images (p < $10^{-4}$). For shin muscle, the correlation is statistically significant (p < 0.05) when using original images but not statistically significant (p > 0.05) when any defacing method is applied, suggesting that defacing might not only fail to protect privacy but also eliminate valuable information.
[475] WaterFlow: Learning Fast & Robust Watermarks using Stable Diffusion
Vinay Shukla, Prachee Sharma, Ryan Rossi, Sungchul Kim, Tong Yu, Aditya Grover
Main category: eess.IV
TL;DR: WaterFlow (WF) is a fast, robust visual watermarking method that uses a pretrained latent diffusion model and invertible flow layers to embed watermarks in Fourier domain of latents, achieving state-of-the-art performance against combination attacks.
Details
Motivation: Current watermarking techniques suffer from slow execution speeds and trade-offs between speed and robustness/quality. The rise of generated imagery exacerbates the need for efficient, high-fidelity watermarking solutions.
Method: Utilizes pretrained latent diffusion model to encode images into latent space, plants learned watermarks in Fourier domain of latents using invertible flow layers to enhance expressivity while preserving quality.
Result: State-of-the-art performance on general robustness, first method to effectively defend against difficult combination attacks. Validated on MS-COCO, DiffusionDB, and WikiArt datasets.
Conclusion: WaterFlow provides a fast, extremely robust approach for high fidelity visual watermarking that addresses computational and statistical challenges of current methods while maintaining excellent perceptual quality.
Abstract: The ability to embed watermarks in images is a fundamental problem of interest for computer vision, and is exacerbated by the rapid rise of generated imagery in recent times. Current state-of-the-art techniques suffer from computational and statistical challenges such as the slow execution speed for practical deployments. In addition, other works trade off fast watermarking speeds but suffer greatly in their robustness or perceptual quality. In this work, we propose WaterFlow (WF), a fast and extremely robust approach for high fidelity visual watermarking based on a learned latent-dependent watermark. Our approach utilizes a pretrained latent diffusion model to encode an arbitrary image into a latent space and produces a learned watermark that is then planted into the Fourier Domain of the latent. The transformation is specified via invertible flow layers that enhance the expressivity of the latent space of the pre-trained model to better preserve image quality while permitting robust and tractable detection. Most notably, WaterFlow demonstrates state-of-the-art performance on general robustness and is the first method capable of effectively defending against difficult combination attacks. We validate our findings on three widely used real and generated datasets: MS-COCO, DiffusionDB, and WikiArt.
[476] Diagnosis for Less-Prevalent Thyroid Carcinoma Subtype Using a Dual-Branch Attention Deep Network with Ultrasound Images
Peiqi Li, Yincheng Gao, Renxing Li, Haojie Yang, Yunyun Liu, Boji Liu, Jiahui Ni, Ying Zhang, Yulu Wu, Xiaowei Fang, Lehang Guo, Liping Sun, Jiangang Chen
Main category: eess.IV
TL;DR: A novel multitask learning framework called CSASN combines EfficientNet and ViT with channel-spatial attention to improve rare thyroid carcinoma classification from ultrasound images, achieving superior performance on imbalanced data.
Details
Motivation: Heterogeneous morphological features and data imbalance pose significant challenges in rare thyroid carcinoma classification using ultrasound imaging, requiring advanced methods to handle these issues.
Method: Proposed Channel-Spatial Attention Synergy Network (CSASN) with dual-branch feature extractor (EfficientNet for local spatial encoding + ViT for global semantic modeling), cascaded channel-spatial attention refinement, residual multiscale classifier, and dynamically weighted loss function.
Result: CSASN outperforms existing single-stream CNN or Transformer-based models, achieving superior balance between precision and recall under class-imbalanced conditions, particularly for rare subtypes like FTC and MTC carcinomas.
Conclusion: The framework provides a promising strategy for AI-assisted thyroid cancer diagnosis by effectively addressing data imbalance and heterogeneous features through multitask learning and attention mechanisms.
Abstract: Heterogeneous morphological features and data imbalance pose significant challenges in rare thyroid carcinoma classification using ultrasound imaging. To address this issue, we propose a novel multitask learning framework, Channel-Spatial Attention Synergy Network (CSASN), which integrates a dual-branch feature extractor, combining EfficientNet for local spatial encoding and ViT for global semantic modeling, with a cascaded channel-spatial attention refinement module. A residual multiscale classifier and dynamically weighted loss function further enhance classification stability and accuracy. The framework is trained on a multicenter dataset comprising more than 2000 patients from four clinical institutions. Extensive ablation studies demonstrate that each module contributes significantly to model performance, particularly in recognizing rare subtypes such as FTC and MTC carcinomas. Experimental results show that CSASN outperforms existing single-stream CNN or Transformer-based models, achieving a superior balance between precision and recall under class-imbalanced conditions. This framework provides a promising strategy for AI-assisted thyroid cancer diagnosis.
[477] Robust Recursive Fusion of Multiresolution Multispectral Images with Location-Aware Neural Networks
Haoqing Li, Ricardo Borsoi, Tales Imbiriba, Pau Closas
Main category: eess.IV
TL;DR: Robust recursive image fusion framework using location-aware neural networks to handle outliers like clouds in satellite imagery, improving accuracy and robustness while maintaining performance in clear conditions.
Details
Motivation: Address the trade-off between temporal and spatial resolution in satellite imaging, particularly the performance degradation caused by outliers such as clouds, and the lack of strategies integrating robustness, recursive operation, and learned models.
Method: Proposes a robust recursive image fusion framework using location-aware neural networks to model image dynamics, representing outlier contamination probability, and employing Bayesian variational inference for high-resolution image estimation.
Result: Experiments with Landsat 8 and MODIS instruments show significantly improved robustness against cloud cover while maintaining performance in cloud-free conditions.
Conclusion: The proposed approach effectively handles outliers like clouds in satellite image fusion, providing both accuracy and robustness through neural network modeling and recursive Bayesian estimation.
Abstract: Multiresolution image fusion is a key problem for real-time satellite imaging and plays a central role in detecting and monitoring natural phenomena such as floods. It aims to solve the trade-off between temporal and spatial resolution in remote sensing instruments. Although several algorithms have been proposed for this problem, the presence of outliers such as clouds downgrades their performance. Moreover, strategies that integrate robustness, recursive operation and learned models are missing. In this paper, a robust recursive image fusion framework leveraging location-aware neural networks (NN) to model the image dynamics is proposed. Outliers are modeled by representing the probability of contamination of a given pixel and band. An NN model trained on a small dataset provides accurate predictions of the stochastic image time evolution, which improves both the accuracy and robustness of the method. A recursive solution is proposed to estimate the high-resolution images using a Bayesian variational inference framework. Experiments fusing images from the Landsat 8 and MODIS instruments show that the proposed approach is significantly more robust against cloud cover, without losing performance when no clouds are present.
[478] Can Generalist Vision Language Models (VLMs) Rival Specialist Medical VLMs? Benchmarking and Strategic Insights
Yuan Zhong, Ruinan Jin, Qi Dou, Xiaoxiao Li
Main category: eess.IV
TL;DR: Generalist VLMs can match or outperform specialist medical VLMs in most clinical tasks, especially for unseen modalities, offering a more scalable AI development approach.
Details
Motivation: To determine when generalist vs specialist medical VLMs perform best in clinical settings, as developing specialist models requires substantial resources and curated data.
Method: Comparative analysis of efficiently fine-tuned generalist VLMs versus specialist medical VLMs across various clinical tasks and modalities.
Result: Fine-tuned generalist VLMs achieve comparable or superior performance to specialists in most tasks, particularly for out-of-distribution and rare medical modalities.
Conclusion: Generalist VLMs provide a scalable, cost-effective alternative to specialist medical models for clinical AI development, with strong performance transfer capabilities.
Abstract: Vision Language Models (VLMs) have shown promise in automating image diagnosis and interpretation in clinical settings. However, developing specialist medical VLMs requires substantial computational resources and carefully curated datasets, and it remains unclear under which conditions generalist and specialist medical VLMs each perform best. This study highlights the complementary strengths of specialist medical and generalist VLMs. Specialists remain valuable in modality-aligned use cases, but we find that efficiently fine-tuned generalist VLMs can achieve comparable or even superior performance in most tasks, particularly when transferring to unseen or rare OOD medical modalities. These results suggest that generalist VLMs, rather than being constrained by their lack of specialist medical pretraining, may offer a scalable and cost-effective pathway for advancing clinical AI development.
[479] Energy-based models for inverse imaging problems
Andreas Habring, Martin Holler, Thomas Pock, Martin Zach
Main category: eess.IV
TL;DR: This paper provides a comprehensive overview of energy-based models (EBMs) for inverse imaging problems, covering theoretical foundations, learning techniques, sampling algorithms, and numerical validation.
Details
Motivation: To establish a rigorous theoretical framework for using energy-based models in Bayesian inverse imaging problems and provide practical methods for implementation and validation.
Method: Presents theoretical introduction to Bayesian inverse problems with well-posedness and stability analysis, discusses EBM learning techniques, and covers sampling algorithms including Metropolis-Hastings, Gibbs sampling, Langevin Monte Carlo, and Hamiltonian Monte Carlo.
Result: Numerical results demonstrate successful resolution of several inverse imaging problems using EBMs with explicit verification of required modeling properties.
Conclusion: EBMs provide a powerful framework for Bayesian inverse imaging problems when properly implemented with rigorous theoretical foundations and validated through appropriate sampling techniques.
Abstract: In this chapter we provide a thorough overview of the use of energy-based models (EBMs) in the context of inverse imaging problems. EBMs are probability distributions modeled via Gibbs densities $p(x) \propto \exp(-E(x))$ with an appropriate energy functional $E$. Within this chapter we present a rigorous theoretical introduction to Bayesian inverse problems that includes results on well-posedness and stability in the finite-dimensional and infinite-dimensional setting. Afterwards we discuss the use of EBMs for Bayesian inverse problems and explain the most relevant techniques for learning EBMs from data. As a crucial part of Bayesian inverse problems, we cover several popular algorithms for sampling from EBMs, namely the Metropolis-Hastings algorithm, Gibbs sampling, Langevin Monte Carlo, and Hamiltonian Monte Carlo. Moreover, we present numerical results for the resolution of several inverse imaging problems obtained by leveraging an EBM that allows for the explicit verification of those properties that are needed for valid energy-based modeling.
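To illustrate one of the samplers named in this abstract, here is a minimal sketch of unadjusted Langevin Monte Carlo, which iterates $x_{k+1} = x_k - \tau \nabla E(x_k) + \sqrt{2\tau}\,\xi_k$ with $\xi_k \sim \mathcal{N}(0, I)$. The toy energy $E(x) = x^2/2$, the step size, the number of steps, and the function name are all assumptions chosen for illustration; the chapter's own energies and settings are not given in the abstract.

```python
import numpy as np

def langevin_sample(grad_energy, x0, step=0.01, n_steps=1000, seed=0):
    """Unadjusted Langevin iteration for sampling from p(x) ∝ exp(-E(x)).

    Hypothetical helper: grad_energy returns the gradient of E at x.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        # Gradient step on the energy plus Gaussian noise scaled by sqrt(2 * step).
        x = x - step * grad_energy(x) + np.sqrt(2.0 * step) * noise
    return x

# Toy energy E(x) = x^2 / 2, so grad E(x) = x and the target is a standard normal.
# Each of the 2000 entries is an independent chain started at zero.
samples = langevin_sample(lambda x: x, x0=np.zeros(2000))
print(samples.mean(), samples.var())
```

Without a Metropolis-Hastings correction the iteration carries a small discretization bias that shrinks with the step size, so the sample mean and variance should land close to, but not exactly at, 0 and 1.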
[480] CryoSplat: Gaussian Splatting for Cryo-EM Homogeneous Reconstruction
Suyi Chen, Haibin Ling
Main category: eess.IV
TL;DR: cryoGS is a novel GMM-based method that enables stable 3D molecular reconstruction from cryo-EM particle images using random initialization, eliminating the need for external consensus maps or atomic models.
Details
Motivation: Existing GMM-based cryo-EM reconstruction methods require external consensus maps or atomic models for initialization, limiting their use in self-contained pipelines. There’s a need for methods that can work directly from raw particle images without external references.
Method: Developed cryoGS which integrates Gaussian splatting with cryo-EM physics, featuring orthogonal projection-aware Gaussian splatting, a normalization term, and FFT-aligned coordinate system specifically designed for cryo-EM imaging.
Result: Experimental results on real datasets demonstrate cryoGS’s effectiveness and robustness over representative baselines, enabling stable and efficient homogeneous reconstruction directly from raw particle images.
Conclusion: cryoGS provides a self-contained solution for cryo-EM reconstruction that works with random initialization, making GMM-based approaches more accessible and practical for structural biology applications.
Abstract: As a critical modality for structural biology, cryogenic electron microscopy (cryo-EM) facilitates the determination of macromolecular structures at near-atomic resolution. The core computational task in single-particle cryo-EM is to reconstruct the 3D electrostatic potential of a molecule from a large collection of noisy 2D projections acquired at unknown orientations. Gaussian mixture models (GMMs) provide a continuous, compact, and physically interpretable representation for molecular density and have recently gained interest in cryo-EM reconstruction. However, existing methods rely on external consensus maps or atomic models for initialization, limiting their use in self-contained pipelines. Addressing this issue, we introduce cryoGS, a GMM-based method that integrates Gaussian splatting with the physics of cryo-EM image formation. In particular, we develop an orthogonal projection-aware Gaussian splatting, with adaptations such as a normalization term and FFT-aligned coordinate system tailored for cryo-EM imaging. All these innovations enable stable and efficient homogeneous reconstruction directly from raw cryo-EM particle images using random initialization. Experimental results on real datasets validate the effectiveness and robustness of cryoGS over representative baselines. The code will be released upon publication.
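The idea of orthogonally projecting a Gaussian mixture can be sketched in a few lines. This is not the cryoGS implementation: it assumes isotropic, unnormalized Gaussians viewed along the z axis, and it omits rotations, the contrast transfer function, the normalization term, and the FFT-aligned coordinates mentioned in the abstract. The function name and arguments are hypothetical. The sketch relies only on the fact that integrating $\exp(-z^2/(2\sigma^2))$ over $z$ gives $\sqrt{2\pi}\,\sigma$, so each 3D Gaussian projects to a 2D Gaussian.

```python
import numpy as np

def project_gaussians(means, sigmas, weights, grid):
    """Orthographic projection along z of isotropic 3D Gaussians onto a 2D image.

    Each component is weights[i] * exp(-|r - means[i]|^2 / (2 * sigmas[i]^2)).
    Illustrative only; not the cryoGS rendering code.
    """
    xs, ys = np.meshgrid(grid, grid, indexing="xy")
    image = np.zeros_like(xs)
    for (mx, my, _mz), s, w in zip(means, sigmas, weights):
        r2 = (xs - mx) ** 2 + (ys - my) ** 2
        # The line integral over z contributes the factor sqrt(2 * pi) * s.
        image += w * np.sqrt(2.0 * np.pi) * s * np.exp(-r2 / (2.0 * s ** 2))
    return image

grid = np.linspace(-4.0, 4.0, 9)
image = project_gaussians([(0.0, 0.0, 0.0)], [1.0], [1.0], grid)
print(image.shape)
```

Because the projection is available in closed form, the rendered image is differentiable with respect to the means, widths, and weights, which is what makes gradient-based fitting of such a mixture to particle images possible in principle.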