Daily arXiv Papers - 2025-09-11

AI-enhanced summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Bilingual Word Level Language Identification for Omotic Languages

Mesay Gemeda Yigezu, Girma Yohannis Bade, Atnafu Lambebo Tonja, Olga Kolesnikova, Grigori Sidorov, Alexander Gelbukh

Main category: cs.CL

TL;DR: BLID for Wolaita and Gofa languages using BERT+LSTM achieves 0.72 F1 score, addressing language identification challenges in multilingual Ethiopian texts.

Motivation: Language identification is crucial for multilingual communities where texts often contain multiple languages. The similarity between Wolaita and Gofa languages in southern Ethiopia makes this task particularly challenging and important for real-world applications like social media monitoring.

Method: Employed various experimental approaches, with the best performance coming from a combination of BERT-based pretrained language model and LSTM architecture.

Result: The BERT+LSTM approach achieved the best performance with an F1 score of 0.72 on the test set, effectively distinguishing between the two similar languages.

Conclusion: This work provides an effective solution for bilingual language identification in similar language pairs and serves as a foundation for further research, particularly in addressing social media issues in multilingual communities.

Abstract: Language identification is the task of determining the languages for a given text. In many real-world scenarios, text may contain more than one language, particularly in multilingual communities. Bilingual Language Identification (BLID) is the task of identifying and distinguishing between two languages in a given text. This paper presents BLID for languages spoken in the southern part of Ethiopia, namely Wolaita and Gofa. The presence of word similarities and differences between the two languages makes the language identification task challenging. To overcome this challenge, we experimented with various approaches. Among these, the combination of a BERT-based pretrained language model and an LSTM performed best, with an F1 score of 0.72 on the test set. As a result, the work will be effective in tackling unwanted social media issues and providing a foundation for further research in this area.
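
To make the BERT-plus-LSTM pairing concrete, a minimal PyTorch sketch is given below; the multilingual checkpoint name, hidden sizes, and the two-way Wolaita/Gofa label set are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a BERT encoder feeding an LSTM classifier for
# bilingual language identification (Wolaita vs. Gofa). Checkpoint name,
# hidden sizes, and the 2-way label set are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertLstmLangId(nn.Module):
    def __init__(self, encoder_name="bert-base-multilingual-cased",
                 lstm_hidden=256, num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.lstm = nn.LSTM(self.encoder.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        # Contextual token embeddings from the pretrained encoder.
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Sequence model on top of the BERT representations.
        lstm_out, _ = self.lstm(hidden)
        # Mean-pool over non-padding tokens, then classify the language.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (lstm_out * mask).sum(1) / mask.sum(1)
        return self.classifier(pooled)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertLstmLangId()
batch = tokenizer(["example sentence"], return_tensors="pt", padding=True)
logits = model(batch["input_ids"], batch["attention_mask"])  # shape (1, 2)
```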

[2] AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs

Debdeep Sanyal, Manodeep Ray, Murari Mandal

Main category: cs.CL

TL;DR: AntiDote is a bi-level optimization method that trains LLMs to resist malicious fine-tuning attacks while preserving utility, achieving up to 27.4% better robustness with minimal performance degradation.

Motivation: Open-weight LLMs face tension between research accessibility and misuse prevention, as current safety measures fail against adversaries with full model access who can erase safeguards through fine-tuning.

Method: Uses bi-level optimization with an auxiliary adversary hypernetwork that generates malicious LoRA weights conditioned on defender model’s activations. Defender LLM is trained to nullify these adversarial weight additions.

Result: Up to 27.4% more robust against 52 diverse red-teaming attacks (jailbreak prompting, latent space manipulation, weight-space attacks) with less than 0.5% performance degradation on MMLU, HellaSwag, and GSM8K benchmarks.

Conclusion: Provides a practical and compute-efficient methodology for building open-weight models where safety becomes a more integral and resilient property against tampering attacks.

Abstract: The release of open-weight large language models (LLMs) creates a tension between advancing accessible research and preventing misuse, such as malicious fine-tuning to elicit harmful content. Current safety measures struggle to preserve the general capabilities of the LLM while resisting a determined adversary with full access to the model’s weights and architecture, who can use full-parameter fine-tuning to erase existing safeguards. To address this, we introduce AntiDote, a bi-level optimization procedure for training LLMs to be resistant to such tampering. AntiDote involves an auxiliary adversary hypernetwork that learns to generate malicious Low-Rank Adaptation (LoRA) weights conditioned on the defender model’s internal activations. The defender LLM is then trained with an objective to nullify the effect of these adversarial weight additions, forcing it to maintain its safety alignment. We validate this approach against a diverse suite of 52 red-teaming attacks, including jailbreak prompting, latent space manipulation, and direct weight-space attacks. AntiDote is up to 27.4% more robust against adversarial attacks compared to both tamper-resistance and unlearning baselines. Crucially, this robustness is achieved with a minimal trade-off in utility, incurring a performance degradation of less than 0.5% across capability benchmarks including MMLU, HellaSwag, and GSM8K. Our work offers a practical and compute-efficient methodology for building open-weight models where safety is a more integral and resilient property.
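
The bi-level recipe (an adversary hypernetwork proposes low-rank weight deltas conditioned on the defender's activations, and the defender learns to stay aligned even with those deltas applied) can be illustrated on a toy model. Everything below, including the stand-in losses, layer sizes, and the single adapted weight matrix, is an assumption for illustration rather than the paper's implementation.

```python
# Toy illustration of bi-level tamper-resistance training: an adversary
# hypernetwork emits a low-rank weight delta conditioned on the defender's
# activations; the defender is trained to stay "safe" even when that delta
# is added. Dimensions, losses, and data are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d, r = 16, 2
core = nn.Linear(d, d)     # layer whose weights the adversary perturbs
head = nn.Linear(d, 1)     # maps features to a "comply" logit (1 = comply)
adversary = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 2 * d * r))

opt_def = torch.optim.Adam(list(core.parameters()) + list(head.parameters()), lr=1e-2)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-2)

def logits(x, delta=None):
    W = core.weight if delta is None else core.weight + delta
    h = torch.relu(F.linear(x, W, core.bias))
    return head(h).squeeze(-1)

def lora_delta(cond):
    out = adversary(cond)
    A, B = out[: d * r].view(d, r), out[d * r:].view(r, d)
    return A @ B                                   # rank-r weight update

for step in range(200):
    x_harm = torch.randn(32, d) + 1.0              # stand-in "harmful" inputs
    x_benign = torch.randn(32, d) - 1.0            # stand-in "benign" inputs
    cond = torch.relu(core(x_harm)).mean(0).detach()  # defender activations

    # Inner step: the adversary tries to flip refusals into compliance.
    delta = lora_delta(cond)
    adv_loss = F.binary_cross_entropy_with_logits(
        logits(x_harm, delta), torch.ones(32))
    opt_adv.zero_grad(); adv_loss.backward(); opt_adv.step()

    # Outer step: the defender must refuse harmful inputs even under the
    # adversarial delta, while staying helpful on benign inputs without it.
    delta = lora_delta(cond).detach()
    def_loss = (F.binary_cross_entropy_with_logits(
                    logits(x_harm, delta), torch.zeros(32))
                + F.binary_cross_entropy_with_logits(
                    logits(x_benign), torch.ones(32)))
    opt_def.zero_grad(); def_loss.backward(); opt_def.step()

with torch.no_grad():
    attacked = logits(torch.randn(256, d) + 1.0, lora_delta(cond))
    print("refusal rate under attack:", (attacked < 0).float().mean().item())
```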

[3] MVPBench: A Benchmark and Fine-Tuning Framework for Aligning Large Language Models with Diverse Human Values

Yao Liang, Dongcheng Zhao, Feifei Zhao, Guobin Shen, Yuwei Wang, Dongqi Liang, Yi Zeng

Main category: cs.CL

TL;DR: MVPBench is a comprehensive benchmark for evaluating LLM value alignment across 75 countries with 24,020 annotated instances, revealing geographic/demographic disparities and showing lightweight fine-tuning methods can improve alignment.

Motivation: Existing benchmarks neglect cultural and demographic diversity, limiting understanding of how value alignment generalizes globally across different populations.

Method: Developed MVPBench with 24,020 high-quality instances annotated with value labels, personalized questions, and demographic metadata. Evaluated state-of-the-art LLMs and tested lightweight fine-tuning methods like LoRA and DPO.

Result: Revealed substantial disparities in alignment performance across geographic and demographic lines. Lightweight fine-tuning significantly enhanced value alignment in both in-domain and out-of-domain settings.

Conclusion: Highlights the necessity for population-aware alignment evaluation and provides actionable insights for building culturally adaptive, value-sensitive LLMs. MVPBench serves as a foundation for global alignment research and equitable AI development.

Abstract: The alignment of large language models (LLMs) with human values is critical for their safe and effective deployment across diverse user populations. However, existing benchmarks often neglect cultural and demographic diversity, leading to limited understanding of how value alignment generalizes globally. In this work, we introduce MVPBench, a novel benchmark that systematically evaluates LLMs’ alignment with multi-dimensional human value preferences across 75 countries. MVPBench contains 24,020 high-quality instances annotated with fine-grained value labels, personalized questions, and rich demographic metadata, making it the most comprehensive resource of its kind to date. Using MVPBench, we conduct an in-depth analysis of several state-of-the-art LLMs, revealing substantial disparities in alignment performance across geographic and demographic lines. We further demonstrate that lightweight fine-tuning methods, such as Low-Rank Adaptation (LoRA) and Direct Preference Optimization (DPO), can significantly enhance value alignment in both in-domain and out-of-domain settings. Our findings underscore the necessity for population-aware alignment evaluation and provide actionable insights for building culturally adaptive and value-sensitive LLMs. MVPBench serves as a practical foundation for future research on global alignment, personalized value modeling, and equitable AI development.

[4] NOWJ at COLIEE 2025: Hybrid Retrieval and Large Language Models Across Five Legal Tasks

Hoang-Trung Nguyen, Tan-Minh Nguyen, Xuan-Bach Le, Tuan-Kiet Le, Khanh-Huyen Nguyen, Ha-Thanh Nguyen, Thi-Hai-Yen Vuong, Le-Minh Nguyen

Main category: cs.CL

TL;DR: NOWJ team’s COLIEE 2025 participation achieved 1st place in Legal Case Entailment (Task 2) with F1=0.3195 using hybrid retrieval systems combining traditional IR methods with advanced LLMs across all five competition tasks.

Motivation: To advance legal information processing by demonstrating the effectiveness of integrating traditional information retrieval techniques with modern large language models in legal AI competitions.

Method: Two-stage retrieval system combining pre-ranking models (BM25, BERT, monoT5), embedding-based semantic representations (BGE-m3, LLM2Vec), and advanced LLMs (Qwen-2, QwQ-32B, DeepSeek-V3) for summarization, relevance scoring, and contextual re-ranking across all five COLIEE tasks.

Result: Achieved first place in Task 2 (Legal Case Entailment) with F1 score of 0.3195, and demonstrated robust performance in other tasks including Legal Case Retrieval, Statute Law Retrieval, Legal Textual Entailment, and Legal Judgment Prediction.

Conclusion: Hybrid models integrating traditional IR techniques with contemporary generative models show significant potential for legal information processing, providing a valuable framework for future advancements in the field.

Abstract: This paper presents the methodologies and results of the NOWJ team’s participation across all five tasks at the COLIEE 2025 competition, emphasizing advancements in the Legal Case Entailment task (Task 2). Our comprehensive approach systematically integrates pre-ranking models (BM25, BERT, monoT5), embedding-based semantic representations (BGE-m3, LLM2Vec), and advanced Large Language Models (Qwen-2, QwQ-32B, DeepSeek-V3) for summarization, relevance scoring, and contextual re-ranking. Specifically, in Task 2, our two-stage retrieval system combined lexical-semantic filtering with contextualized LLM analysis, achieving first place with an F1 score of 0.3195. Additionally, in other tasks–including Legal Case Retrieval, Statute Law Retrieval, Legal Textual Entailment, and Legal Judgment Prediction–we demonstrated robust performance through carefully engineered ensembles and effective prompt-based reasoning strategies. Our findings highlight the potential of hybrid models integrating traditional IR techniques with contemporary generative models, providing a valuable reference for future advancements in legal information processing.
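
A minimal sketch of the two-stage idea, lexical pre-ranking followed by LLM re-ranking, is shown below; the rank_bm25 package, the word-overlap stub standing in for the LLM relevance scorer, and the cut-off values are assumptions.

```python
# Sketch of a two-stage retrieval pipeline in the spirit described above:
# a lexical pre-ranker (BM25) narrows the candidate pool, then an LLM
# assigns relevance scores to re-rank the survivors. The rank_bm25 package,
# the prompt-free stubbed scorer, and the cut-offs are illustrative assumptions.
from rank_bm25 import BM25Okapi

def llm_relevance(query: str, doc: str) -> float:
    """Stand-in for an LLM call that returns a 0-1 relevance score."""
    q_terms = set(query.lower().split())
    return len(q_terms & set(doc.lower().split())) / (len(q_terms) or 1)

def two_stage_retrieve(query, corpus, k_pre=50, k_final=5):
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
    scores = bm25.get_scores(query.lower().split())
    # Stage 1: keep the top-k_pre candidates by lexical score.
    candidates = sorted(range(len(corpus)), key=lambda i: -scores[i])[:k_pre]
    # Stage 2: re-rank candidates with the (stubbed) LLM relevance scorer.
    reranked = sorted(candidates, key=lambda i: -llm_relevance(query, corpus[i]))
    return [corpus[i] for i in reranked[:k_final]]

corpus = ["the tenant breached the lease agreement",
          "the court dismissed the appeal",
          "damages were awarded for breach of contract"]
print(two_stage_retrieve("breach of a lease contract", corpus, k_pre=3, k_final=2))
```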

[5] SciGPT: A Large Language Model for Scientific Literature Understanding and Knowledge Discovery

Fengyu She, Nan Wang, Hongfei Wu, Ziyi Wan, Jingmian Wang, Chang Wang

Main category: cs.CL

TL;DR: SciGPT is a domain-adapted foundation model for scientific literature that outperforms GPT-4o on scientific tasks while reducing memory consumption by 55% for long documents.

Motivation: Address the limitations of general-purpose LLMs in capturing scientific domain-specific nuances and handling complex scientific tasks for interdisciplinary research.

Method: Built on Qwen3 architecture with three innovations: low-cost domain distillation pipeline, Sparse Mixture-of-Experts attention mechanism, and knowledge-aware adaptation with domain ontologies.

Result: Outperforms GPT-4o on ScienceBench in core scientific tasks including sequence labeling, generation, and inference, with strong robustness in unseen scientific tasks.

Conclusion: SciGPT demonstrates potential to facilitate AI-augmented scientific discovery by effectively handling scientific literature with improved efficiency and domain-specific understanding.

Abstract: Scientific literature is growing exponentially, creating a critical bottleneck for researchers to efficiently synthesize knowledge. While general-purpose Large Language Models (LLMs) show potential in text processing, they often fail to capture scientific domain-specific nuances (e.g., technical jargon, methodological rigor) and struggle with complex scientific tasks, limiting their utility for interdisciplinary research. To address these gaps, this paper presents SciGPT, a domain-adapted foundation model for scientific literature understanding and ScienceBench, an open source benchmark tailored to evaluate scientific LLMs. Built on the Qwen3 architecture, SciGPT incorporates three key innovations: (1) low-cost domain distillation via a two-stage pipeline to balance performance and efficiency; (2) a Sparse Mixture-of-Experts (SMoE) attention mechanism that cuts memory consumption by 55% for 32,000-token long-document reasoning; and (3) knowledge-aware adaptation integrating domain ontologies to bridge interdisciplinary knowledge gaps. Experimental results on ScienceBench show that SciGPT outperforms GPT-4o in core scientific tasks including sequence labeling, generation, and inference. It also exhibits strong robustness in unseen scientific tasks, validating its potential to facilitate AI-augmented scientific discovery.

[6] CommonVoice-SpeechRE and RPG-MoGe: Advancing Speech Relation Extraction with a New Dataset and Multi-Order Generative Framework

Jinzhong Ning, Paerhati Tulajiang, Yingying Le, Yijia Zhang, Yuanyuan Sun, Hongfei Lin, Haifeng Liu

Main category: cs.CL

TL;DR: This paper introduces CommonVoice-SpeechRE, a large-scale real-human speech dataset for Speech Relation Extraction, and proposes RPG-MoGe framework with multi-order generation and relation prompts to improve performance.

Motivation: Existing SpeechRE datasets rely heavily on synthetic data and lack diversity, while current models suffer from rigid generation templates and weak semantic alignment, limiting their real-world performance.

Method: Proposed RPG-MoGe framework with: 1) multi-order triplet generation ensemble strategy using diverse element orders, and 2) CNN-based latent relation prediction heads that generate explicit relation prompts for cross-modal alignment.

Result: The approach outperforms state-of-the-art methods, providing both a benchmark dataset (nearly 20,000 real-human speech samples) and an effective solution for real-world SpeechRE.

Conclusion: The work establishes a new benchmark for SpeechRE research with a large-scale real-human dataset and demonstrates superior performance through the proposed multi-order generative ensemble framework with relation prompt guidance.

Abstract: Speech Relation Extraction (SpeechRE) aims to extract relation triplets directly from speech. However, existing benchmark datasets rely heavily on synthetic data, lacking sufficient quantity and diversity of real human speech. Moreover, existing models also suffer from rigid single-order generation templates and weak semantic alignment, substantially limiting their performance. To address these challenges, we introduce CommonVoice-SpeechRE, a large-scale dataset comprising nearly 20,000 real-human speech samples from diverse speakers, establishing a new benchmark for SpeechRE research. Furthermore, we propose the Relation Prompt-Guided Multi-Order Generative Ensemble (RPG-MoGe), a novel framework that features: (1) a multi-order triplet generation ensemble strategy, leveraging data diversity through diverse element orders during both training and inference, and (2) CNN-based latent relation prediction heads that generate explicit relation prompts to guide cross-modal alignment and accurate triplet generation. Experiments show our approach outperforms state-of-the-art methods, providing both a benchmark dataset and an effective solution for real-world SpeechRE. The source code and dataset are publicly available at https://github.com/NingJinzhong/SpeechRE_RPG_MoGe.

[7] No for Some, Yes for Others: Persona Prompts and Other Sources of False Refusal in Language Models

Flor Miriam Plaza-del-Arco, Paul Röttger, Nino Scherrer, Emanuele Borgonovo, Elmar Plischke, Dirk Hovy

Main category: cs.CL

TL;DR: Persona prompting in LLMs can cause false refusals, but the effect is smaller in more capable models and influenced by model choice and task type rather than sociodemographic personas alone.

Motivation: To quantify the impact of sociodemographic personas on false refusal rates in LLMs, as previous work suggested persona prompting leads to false request refusals but lacked comprehensive measurement.

Method: Tested 15 sociodemographic personas (gender, race, religion, disability) across 16 models, 3 tasks (NLI, politeness, offensiveness classification), and 9 prompt paraphrases using a Monte Carlo-based method for efficient sampling.

Result: More capable models show less persona impact on refusal rates. Model choice and task significantly influence false refusals, especially in sensitive content tasks. Certain personas increase false refusal in some models, suggesting biases in alignment strategies.

Conclusion: Persona effects on false refusals have been overestimated and are more influenced by model capabilities and task context than sociodemographic factors alone.

Abstract: Large language models (LLMs) are increasingly integrated into our daily lives and personalized. However, LLM personalization might also increase unintended side effects. Recent work suggests that persona prompting can lead models to falsely refuse user requests. However, no work has fully quantified the extent of this issue. To address this gap, we measure the impact of 15 sociodemographic personas (based on gender, race, religion, and disability) on false refusal. To control for other factors, we also test 16 different models, 3 tasks (Natural Language Inference, politeness, and offensiveness classification), and nine prompt paraphrases. We propose a Monte Carlo-based method to quantify this issue in a sample-efficient manner. Our results show that as models become more capable, personas impact the refusal rate less and less. Certain sociodemographic personas increase false refusal in some models, which suggests underlying biases in the alignment strategies or safety mechanisms. However, we find that the model choice and task significantly influence false refusals, especially in sensitive content tasks. Our findings suggest that persona effects have been overestimated, and might be due to other factors.

[8] Culturally transmitted color categories in LLMs reflect a learning bias toward efficient compression

Nathaniel Imel, Noga Zaslavsky

Main category: cs.CL

TL;DR: LLMs can evolve human-like semantic categorization systems through Information Bottleneck efficiency, similar to how human languages achieve near-optimal compression.

Motivation: To investigate whether LLMs, which are not explicitly trained for optimal semantic compression, can develop efficient human-like semantic categorization systems like those found in natural languages.

Method: Replicated two human behavioral studies using Gemini 2.0-flash and Llama 3.3-70B-Instruct: 1) English color-naming study comparing LLM performance with native speakers, 2) Simulated cultural evolution of pseudo color-naming systems via iterated in-context language learning.

Result: Gemini aligned well with English speakers’ naming patterns and achieved high IB-efficiency, while Llama produced an efficient but lower-complexity system than English. Both LLMs iteratively restructured random systems toward greater IB-efficiency and cross-linguistic alignment, similar to humans.

Conclusion: LLMs are capable of evolving perceptually grounded, human-like semantic systems driven by the same Information Bottleneck efficiency principle that governs semantic efficiency across human languages.

Abstract: Converging evidence suggests that systems of semantic categories across human languages achieve near-optimal compression via the Information Bottleneck (IB) complexity-accuracy principle. Large language models (LLMs) are not trained for this objective, which raises the question: are LLMs capable of evolving efficient human-like semantic systems? To address this question, we focus on the domain of color as a key testbed of cognitive theories of categorization and replicate with LLMs (Gemini 2.0-flash and Llama 3.3-70B-Instruct) two influential human behavioral studies. First, we conduct an English color-naming study, showing that Gemini aligns well with the naming patterns of native English speakers and achieves a significantly high IB-efficiency score, while Llama exhibits an efficient but lower complexity system compared to English. Second, to test whether LLMs simply mimic patterns in their training data or actually exhibit a human-like inductive bias toward IB-efficiency, we simulate cultural evolution of pseudo color-naming systems in LLMs via iterated in-context language learning. We find that akin to humans, LLMs iteratively restructure initially random systems towards greater IB-efficiency and increased alignment with patterns observed across the world’s languages. These findings demonstrate that LLMs are capable of evolving perceptually grounded, human-like semantic systems, driven by the same fundamental principle that governs semantic efficiency across human languages.
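
The IB analysis scores a naming system by its complexity, the mutual information I(M;W) between meanings (color chips) and words, traded off against accuracy. A small numpy sketch of the complexity term is given below; the uniform prior and the toy 3-chip, 2-word naming matrices are assumptions, and the full IB accuracy/bound computation is not reproduced.

```python
# Complexity of a naming system as I(M;W): mutual information between
# meanings (color chips M) and words W, given an encoder q(w|m) and a
# prior p(m). The uniform prior and toy naming matrices are assumptions.
import numpy as np

def complexity(q_w_given_m, p_m):
    """I(M;W) in bits for encoder q(w|m) and prior p(m)."""
    joint = p_m[:, None] * q_w_given_m           # p(m, w)
    p_w = joint.sum(axis=0)                      # marginal p(w)
    ratio = np.divide(joint, p_m[:, None] * p_w[None, :],
                      out=np.ones_like(joint), where=joint > 0)
    return float((joint * np.log2(ratio)).sum())

p_m = np.ones(3) / 3                             # uniform prior over 3 chips
graded = np.array([[0.9, 0.1],                   # probabilistic naming system
                   [0.5, 0.5],
                   [0.1, 0.9]])
hard = np.array([[1.0, 0.0],                     # deterministic partition
                 [1.0, 0.0],
                 [0.0, 1.0]])
print("graded system:", complexity(graded, p_m), "bits")
print("hard partition:", complexity(hard, p_m), "bits")
```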

[9] MERLIN: Multi-Stage Curriculum Alignment for Multilingual Encoder and LLM Fusion

Kosei Uemura, David Guzmán, Quang Phuoc Nguyen, Jesujoba Oluwadara Alabi, En-shiun Annie Lee, David Ifeoluwa Adelani

Main category: cs.CL

TL;DR: MERLIN is a two-stage model-stacking framework that uses curriculum learning and DoRA weights to significantly improve reasoning accuracy in low-resource languages, outperforming existing methods and GPT-4o-mini.

Motivation: Large language models struggle with complex reasoning in low-resource languages (LRLs), and existing encoder-plus-decoder methods leave a large performance gap on LRLs despite working well on mid/high-resource languages.

Method: Two-stage model-stacking framework applying curriculum learning strategy (from general bilingual bitext to task-specific data) and adapting only a small set of DoRA weights.

Result: +12.9 pp improvement over MindMerger on AfriMGSM benchmark, outperforms GPT-4o-mini. Consistent gains on MGSM (+0.9 pp) and MSVAMP (+2.8 pp) across both low and high-resource settings.

Conclusion: MERLIN demonstrates effective cross-lingual reasoning capabilities, particularly for low-resource languages, through its curriculum learning approach and parameter-efficient adaptation strategy.

Abstract: Large language models excel in English but still struggle with complex reasoning in many low-resource languages (LRLs). Existing encoder-plus-decoder methods such as LangBridge and MindMerger raise accuracy on mid and high-resource languages, yet they leave a large gap on LRLs. We present MERLIN, a two-stage model-stacking framework that applies a curriculum learning strategy – from general bilingual bitext to task-specific data – and adapts only a small set of DoRA weights. On the AfriMGSM benchmark MERLIN improves exact-match accuracy by +12.9 pp over MindMerger and outperforms GPT-4o-mini. It also yields consistent gains on MGSM and MSVAMP (+0.9 and +2.8 pp), demonstrating effectiveness across both low and high-resource settings.

[10] Bias after Prompting: Persistent Discrimination in Large Language Models

Nivedha Sivakumar, Natalie Mackraz, Samira Khorshidi, Krishna Patel, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff

Main category: cs.CL

TL;DR: Biases in pre-trained LLMs can transfer through prompt adaptations, and existing prompt-based mitigation methods fail to consistently prevent this bias transfer across models, tasks, and demographics.

Motivation: To challenge the assumption that biases do not transfer from pre-trained LLMs to adapted models, particularly focusing on prompt adaptations which are widely used in real-world applications.

Method: Studied bias transfer hypothesis in causal models under prompt adaptations, examining correlation between intrinsic biases and those after prompt adaptation across demographics (gender, age, religion) and tasks (co-reference resolution, question answering). Evaluated various few-shot composition parameters and prompt-based debiasing strategies.

Result: Found moderate to strong correlations between intrinsic biases and post-adaptation biases (rho >= 0.94 for gender, >= 0.98 for age, >= 0.69 for religion). Biases remained strongly correlated across different few-shot parameters (rho >= 0.90). No prompt-based debiasing strategy consistently reduced bias transfer.

Conclusion: Correcting bias in intrinsic models may be necessary to prevent bias propagation to downstream tasks, as prompt adaptations alone cannot reliably mitigate bias transfer.

Abstract: A dangerous assumption that can be made from prior work on the bias transfer hypothesis (BTH) is that biases do not transfer from pre-trained large language models (LLMs) to adapted models. We invalidate this assumption by studying the BTH in causal models under prompt adaptations, as prompting is an extremely popular and accessible adaptation strategy used in real-world applications. In contrast to prior work, we find that biases can transfer through prompting and that popular prompt-based mitigation methods do not consistently prevent biases from transferring. Specifically, the correlation between intrinsic biases and those after prompt adaptation remain moderate to strong across demographics and tasks – for example, gender (rho >= 0.94) in co-reference resolution, and age (rho >= 0.98) and religion (rho >= 0.69) in question answering. Further, we find that biases remain strongly correlated when varying few-shot composition parameters, such as sample size, stereotypical content, occupational distribution and representational balance (rho >= 0.90). We evaluate several prompt-based debiasing strategies and find that different approaches have distinct strengths, but none consistently reduce bias transfer across models, tasks or demographics. These results demonstrate that correcting bias, and potentially improving reasoning ability, in intrinsic models may prevent propagation of biases to downstream tasks.
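
The reported rho values are rank correlations between bias scores measured on the intrinsic model and the same scores measured after prompt adaptation. A minimal scipy sketch follows; the per-subgroup scores are placeholders, and the bias metric itself is defined in the paper, not here.

```python
# The reported rho values are Spearman rank correlations between per-group
# bias scores measured intrinsically and after a prompt adaptation.
# The scores below are placeholders, not results from the paper.
from scipy.stats import spearmanr

intrinsic_bias = [0.12, 0.35, 0.08, 0.51, 0.27]   # e.g. one score per subgroup
prompted_bias  = [0.15, 0.33, 0.10, 0.55, 0.24]   # same subgroups, after prompting

rho, pvalue = spearmanr(intrinsic_bias, prompted_bias)
print(f"Spearman rho = {rho:.2f} (p = {pvalue:.3f})")
```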

[11] Verbalized Algorithms

Supriya Lall, Christian Farrell, Hari Pathanjaly, Marko Pavic, Sarvesh Chezhian, Masataro Asai

Main category: cs.CL

TL;DR: Verbalized Algorithms (VAs) use classical algorithms with LLMs as simple operation oracles instead of one-shot querying for complex reasoning tasks.

Motivation: To improve reliability of LLM reasoning by decomposing complex tasks into simple operations that LLMs can handle more reliably, leveraging well-established classical algorithms.

Method: Decompose tasks into elementary operations on natural language strings, use LLMs as specialized oracles (e.g., binary comparison for sorting), and implement known algorithms like bitonic sorting networks.

Result: Demonstrated effectiveness on sorting and clustering tasks, showing improved reliability over one-shot LLM querying.

Conclusion: Verbalized Algorithms provide a more reliable approach to LLM-based reasoning by constraining LLMs to simple operations within proven algorithmic frameworks.

Abstract: Instead of querying LLMs in a one-shot manner and hoping to get the right answer for a reasoning task, we propose a paradigm we call \emph{verbalized algorithms} (VAs), which leverage classical algorithms with established theoretical understanding. VAs decompose a task into simple elementary operations on natural language strings that they should be able to answer reliably, and limit the scope of LLMs to only those simple tasks. For example, for sorting a series of natural language strings, \emph{verbalized sorting} uses an LLM as a binary comparison oracle in a known and well-analyzed sorting algorithm (e.g., bitonic sorting network). We demonstrate the effectiveness of this approach on sorting and clustering tasks.
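
The verbalized-sorting idea, a classical sort whose only LLM involvement is answering binary precedence queries, can be sketched as follows. The comparator stub (string length) stands in for an actual LLM call, and Python's built-in comparison sort replaces the bitonic sorting network used in the paper.

```python
# Verbalized sorting: a classical sort drives the control flow and an LLM
# answers only binary "does a precede b?" queries. The llm_precedes stub
# (compare by length) stands in for a real LLM call; the paper uses a
# bitonic sorting network, while this sketch uses Python's built-in sort.
from functools import cmp_to_key

def llm_precedes(a: str, b: str) -> bool:
    """Stub oracle: replace with an LLM prompt such as
    'Which text is shorter, A or B?' and parse the answer."""
    return len(a) <= len(b)

def verbalized_sort(items):
    def cmp(a, b):
        if a == b:
            return 0
        return -1 if llm_precedes(a, b) else 1
    return sorted(items, key=cmp_to_key(cmp))

texts = ["a fairly long sentence about law",
         "short note",
         "medium length remark here"]
print(verbalized_sort(texts))
```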

[12] Balancing Quality and Variation: Spam Filtering Distorts Data Label Distributions

Eve Fleisig, Matthias Orlikowski, Philipp Cimiano, Dan Klein

Main category: cs.CL

TL;DR: Existing annotator filtering methods designed for single ground-truth tasks often remove legitimate disagreement instead of spam, harming label diversity in subjective tasks where variation should be preserved.

Motivation: To balance annotator reliability and representation in machine learning datasets, ensuring diverse opinions are preserved while filtering out actual spam or low-quality responses.

Method: Empirical evaluation of various heuristics for annotator filtering on subjective tasks, analyzing how they affect label variation preservation and testing performance on synthetic spam data.

Result: Conservative filtering (<5% removal) works best; most methods increase error beyond this point. Spammers are often distributionally indistinguishable from real annotators, with distinguishable spammers giving fixed (not random) answers.

Conclusion: Spam removal methods need to account for label diversity since spammers tend to be less random than non-spammers, reversing the intuition of existing filtering approaches.

Abstract: For machine learning datasets to accurately represent diverse opinions in a population, they must preserve variation in data labels while filtering out spam or low-quality responses. How can we balance annotator reliability and representation? We empirically evaluate how a range of heuristics for annotator filtering affect the preservation of variation on subjective tasks. We find that these methods, designed for contexts in which variation from a single ground-truth label is considered noise, often remove annotators who disagree instead of spam annotators, introducing suboptimal tradeoffs between accuracy and label diversity. We find that conservative settings for annotator removal (<5%) are best, after which all tested methods increase the mean absolute error from the true average label. We analyze performance on synthetic spam to observe that these methods often assume spam annotators are less random than real spammers tend to be: most spammers are distributionally indistinguishable from real annotators, and the minority that are distinguishable tend to give fixed answers, not random ones. Thus, tasks requiring the preservation of variation reverse the intuition of existing spam filtering methods: spammers tend to be less random than non-spammers, so metrics that assume variation is spam fare worse. These results highlight the need for spam removal methods that account for label diversity.
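
The core measurement can be sketched as: remove the most deviant annotators at increasing rates, then check how far the filtered average label drifts from the average over all annotators. The deviation-from-mean heuristic below is only one of the filtering heuristics the paper evaluates, and the data are random placeholders.

```python
# Sketch of the evaluation logic: filter out the annotators whose labels
# deviate most from the item means, then measure how far the filtered
# average label drifts from the average over all annotators. The
# deviation-from-mean heuristic and the random data are assumptions.
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(1, 6, size=(50, 200)).astype(float)   # annotators x items

def filtered_mae(labels, removal_rate):
    item_mean = labels.mean(axis=0)                          # "true" average label
    deviation = np.abs(labels - item_mean).mean(axis=1)      # per-annotator deviation
    keep = deviation.argsort()[: int(len(labels) * (1 - removal_rate))]
    return np.abs(labels[keep].mean(axis=0) - item_mean).mean()

for rate in (0.0, 0.05, 0.10, 0.25):
    print(f"remove {rate:.0%}: MAE from true average = {filtered_mae(labels, rate):.3f}")
```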

[13] Towards Knowledge-Aware Document Systems: Modeling Semantic Coverage Relations via Answerability Detection

Yehudit Aperstein, Alon Gottlib, Gal Benita, Alexander Apartsin

Main category: cs.CL

TL;DR: A framework for modeling Semantic Coverage Relations (SCR) using QA-based approach to classify document pairs into equivalence, inclusion, or semantic overlap categories, with discriminative models outperforming generative approaches.

Motivation: Understanding how information is shared across documents regardless of format is critical for information retrieval, summarization, and content alignment tasks.

Method: QA-based approach using answerability of shared questions across documents as indicator of semantic coverage. Synthetic dataset from SQuAD corpus with paraphrased passages and controlled information omission. Benchmarking generative models and training transformer-based classifiers.

Result: Discriminative models significantly outperform generative approaches. RoBERTa-base achieved highest accuracy (61.4%) and Random Forest showed best balance (macro-F1 52.9%). QA provides effective lens for assessing semantic relations across diverse texts.

Conclusion: The approach effectively models semantic coverage relations, showing current models’ capacity to reason about information beyond surface similarity. Dataset and code are publicly available for reproducibility.

Abstract: Understanding how information is shared across documents, regardless of the format in which it is expressed, is critical for tasks such as information retrieval, summarization, and content alignment. In this work, we introduce a novel framework for modelling Semantic Coverage Relations (SCR), which classifies document pairs based on how their informational content aligns. We define three core relation types: equivalence, where both texts convey the same information using different textual forms or styles; inclusion, where one document fully contains the information of another and adds more; and semantic overlap, where each document presents partially overlapping content. To capture these relations, we adopt a question answering (QA)-based approach, using the answerability of shared questions across documents as an indicator of semantic coverage. We construct a synthetic dataset derived from the SQuAD corpus by paraphrasing source passages and selectively omitting information, enabling precise control over content overlap. This dataset allows us to benchmark generative language models and train transformer-based classifiers for SCR prediction. Our findings demonstrate that discriminative models significantly outperform generative approaches, with the RoBERTa-base model achieving the highest accuracy of 61.4% and the Random Forest-based model showing the best balance with a macro-F1 score of 52.9%. The results show that QA provides an effective lens for assessing semantic relations across stylistically diverse texts, offering insights into the capacity of current models to reason about information beyond surface similarity. The dataset and code developed in this study are publicly available to support reproducibility.
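
The QA-based relation labeling reduces to a decision rule over answerability fractions: what share of questions derived from each document the other document can answer. The stubbed answerability check and the 0.9 threshold in the sketch below are assumptions.

```python
# Sketch of the answerability-based decision rule for semantic coverage
# relations: equivalence, inclusion, or overlap are read off from how many
# of each document's questions the other document can answer. The stubbed
# answerability check and the 0.9 threshold are illustrative assumptions.

def is_answerable(question: str, document: str) -> bool:
    """Stub: replace with a QA model that checks answerability."""
    return question.lower() in document.lower()

def coverage_relation(questions_a, questions_b, doc_a, doc_b, thresh=0.9):
    a_covered_by_b = sum(is_answerable(q, doc_b) for q in questions_a) / len(questions_a)
    b_covered_by_a = sum(is_answerable(q, doc_a) for q in questions_b) / len(questions_b)
    if a_covered_by_b >= thresh and b_covered_by_a >= thresh:
        return "equivalence"
    if a_covered_by_b >= thresh or b_covered_by_a >= thresh:
        return "inclusion"
    return "semantic overlap"

doc_a = "the tenant pays rent monthly"
doc_b = "the tenant pays rent monthly and the landlord covers repairs"
print(coverage_relation(["pays rent"], ["pays rent", "covers repairs"], doc_a, doc_b))
# -> "inclusion": doc_b contains doc_a's information and adds more
```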

[14] Toward Subtrait-Level Model Explainability in Automated Writing Evaluation

Alejandro Andrade-Lotero, Lee Becker, Joshua Southerland, Scott Hellman

Main category: cs.CL

TL;DR: Using generative language models to assess writing subtraits (latent components) for explainable automated scoring, showing modest correlations between human and automated subtrait scores.

Motivation: To enhance transparency of automated writing scores by providing detailed explanations through subtrait assessment, helping educators and students understand scoring decisions.

Method: Prototyping explainability and subtrait scoring using generative language models to analyze latent trait components in writing assessments.

Result: Modest correlation observed between human subtrait and trait scores, and between automated and human subtrait scores.

Conclusion: The approach successfully provides detailed scoring explanations to demystify automated writing assessments for educational stakeholders.

Abstract: Subtrait (latent-trait components) assessment presents a promising path toward enhancing transparency of automated writing scores. We prototype explainability and subtrait scoring with generative language models and show modest correlation between human subtrait and trait scores, and between automated and human subtrait scores. Our approach provides details to demystify scores for educators and students.

[15] Automatic Detection of Inauthentic Templated Responses in English Language Assessments

Yashad Samant, Lee Becker, Scott Hellman, Bradley Behan, Sarah Hughes, Joshua Southerland

Main category: cs.CL

TL;DR: Automated detection of templated responses in English language assessments using machine learning to prevent gaming of automated scoring systems

Motivation: Low-skill test takers use memorized templates to fool automated scoring systems in high-stakes English language assessments

Method: Machine learning-based approach for detecting inauthentic, templated responses (AuDITR task)

Result: Not specified in the abstract

Conclusion: Highlights the importance of regularly updating detection models in production environments

Abstract: In high-stakes English Language Assessments, low-skill test takers may employ memorized materials called "templates" on essay questions to "game" or fool the automated scoring system. In this study, we introduce the automated detection of inauthentic, templated responses (AuDITR) task, describe a machine learning-based approach to this task and illustrate the importance of regularly updating these models in production.

[16] So let’s replace this phrase with insult… Lessons learned from generation of toxic texts with LLMs

Sergey Pletenev, Daniil Moskovskiy, Alexander Panchenko

Main category: cs.CL

TL;DR: LLM-generated synthetic toxic data performs worse than human data for text detoxification training, with up to 30% performance drop due to limited lexical diversity in generated toxic content.

Motivation: To explore whether LLM-generated synthetic toxic data can effectively replace human-generated data for training detoxification models, particularly in sensitive domains where human data collection is challenging.

Method: Used Llama 3 and Qwen activation-patched models to generate synthetic toxic counterparts for neutral texts from ParaDetox and SST-2 datasets, then compared performance of models fine-tuned on synthetic vs human data.

Result: Models trained on synthetic data consistently underperform those trained on human data, with up to 30% performance drop in joint metrics, due to LLMs generating toxic content with limited, repetitive vocabulary that lacks the nuance of human toxicity.

Conclusion: Current LLMs have significant limitations in generating diverse toxic content, highlighting the continued importance of human-annotated data for building robust detoxification systems.

Abstract: Modern Large Language Models (LLMs) are excellent at generating synthetic data. However, their performance in sensitive domains such as text detoxification has not received proper attention from the scientific community. This paper explores the possibility of using LLM-generated synthetic toxic data as an alternative to human-generated data for training models for detoxification. Using Llama 3 and Qwen activation-patched models, we generated synthetic toxic counterparts for neutral texts from ParaDetox and SST-2 datasets. Our experiments show that models fine-tuned on synthetic data consistently perform worse than those trained on human data, with a drop in performance of up to 30% in joint metrics. The root cause is identified as a critical lexical diversity gap: LLMs generate toxic content using a small, repetitive vocabulary of insults that fails to capture the nuances and variety of human toxicity. These findings highlight the limitations of current LLMs in this domain and emphasize the continued importance of diverse, human-annotated data for building robust detoxification systems.

[17] Low-Resource Fine-Tuning for Multi-Task Structured Information Extraction with a Billion-Parameter Instruction-Tuned Model

Yu Cheng Chih, Yong Hao Hou

Main category: cs.CL

TL;DR: ETLCH is a billion-parameter LLaMA-based model fine-tuned with LoRA on only a few hundred to one thousand samples per task for structured data extraction (JSON extraction, knowledge graph extraction, and NER), outperforming strong baselines while remaining computationally efficient.

Motivation: Deploying large LLMs for structured data extraction is impractical for smaller teams due to high costs and dataset preparation difficulties. There's limited evidence on whether smaller models can work reliably under low-resource, multi-task conditions.

Method: Developed ETLCH - a billion-parameter LLaMA-based model fine-tuned using low-rank adaptation (LoRA) on only a few hundred to one thousand samples per task for JSON extraction, knowledge graph extraction, and named entity recognition.

Result: ETLCH outperforms strong baselines across most evaluation metrics, with substantial gains observed even at the lowest data scale. It delivers stable and accurate structured outputs at a fraction of the computational cost.

Conclusion: Well-tuned small models can deliver reliable structured data extraction capabilities in resource-constrained environments, enabling cost-effective information extraction pipelines without requiring large architectures or massive datasets.

Abstract: Deploying large language models (LLMs) for structured data extraction in domains such as financial compliance reporting, legal document analytics, and multilingual knowledge base construction is often impractical for smaller teams due to the high cost of running large architectures and the difficulty of preparing large, high-quality datasets. Most recent instruction-tuning studies focus on seven-billion-parameter or larger models, leaving limited evidence on whether much smaller models can work reliably under low-resource, multi-task conditions. This work presents ETLCH, a billion-parameter LLaMA-based model fine-tuned with low-rank adaptation on only a few hundred to one thousand samples per task for JSON extraction, knowledge graph extraction, and named entity recognition. Despite its small scale, ETLCH outperforms strong baselines across most evaluation metrics, with substantial gains observed even at the lowest data scale. These findings demonstrate that well-tuned small models can deliver stable and accurate structured outputs at a fraction of the computational cost, enabling cost-effective and reliable information extraction pipelines in resource-constrained environments.
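
A LoRA setup of this kind can be reproduced with the Hugging Face peft library; in the sketch below the base checkpoint, rank, and target modules are assumptions, since the paper's exact configuration is not given in the abstract.

```python
# Minimal LoRA fine-tuning setup in the spirit of the paper, using the
# Hugging Face peft library. The checkpoint name, rank, target modules,
# and training details are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # assumed ~1B LLaMA-style model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],      # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()            # a small fraction of total weights
# Training then proceeds with any standard causal-LM loop or the HF Trainer
# on a few hundred to a thousand (instruction, structured output) pairs per task.
```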

[18] Adversarial Attacks Against Automated Fact-Checking: A Survey

Fanzhen Liu, Alsharif Abuadbba, Kristen Moore, Surya Nepal, Cecile Paris, Jia Wu, Jian Yang, Quan Z. Sheng

Main category: cs.CL

TL;DR: Survey paper on adversarial attacks against automated fact-checking systems, analyzing attack strategies, model vulnerabilities, defense mechanisms, and open research challenges.

Motivation: Automated fact-checking systems are vulnerable to adversarial attacks that manipulate claims, evidence, or claim-evidence pairs, undermining their reliability in combating misinformation.

Method: Comprehensive review and categorization of existing adversarial attack methodologies targeting fact-checking systems, evaluation of their impact, and examination of adversary-aware defense mechanisms.

Result: Identifies critical vulnerabilities in current fact-checking models and highlights the urgent need for more resilient frameworks that can withstand adversarial manipulations.

Conclusion: There is a pressing need for robust fact-checking frameworks that maintain high verification accuracy against adversarial attacks to preserve information reliability in the misinformation era.

Abstract: In an era where misinformation spreads freely, fact-checking (FC) plays a crucial role in verifying claims and promoting reliable information. While automated fact-checking (AFC) has advanced significantly, existing systems remain vulnerable to adversarial attacks that manipulate or generate claims, evidence, or claim-evidence pairs. These attacks can distort the truth, mislead decision-makers, and ultimately undermine the reliability of FC models. Despite growing research interest in adversarial attacks against AFC systems, a comprehensive, holistic overview of key challenges remains lacking. These challenges include understanding attack strategies, assessing the resilience of current models, and identifying ways to enhance robustness. This survey provides the first in-depth review of adversarial attacks targeting FC, categorizing existing attack methodologies and evaluating their impact on AFC systems. Additionally, we examine recent advancements in adversary-aware defenses and highlight open research questions that require further exploration. Our findings underscore the urgent need for resilient FC frameworks capable of withstanding adversarial manipulations in pursuit of preserving high verification accuracy.

[19] Acquiescence Bias in Large Language Models

Daniel Braun

Main category: cs.CL

TL;DR: LLMs show a bias towards answering ’no’ in surveys, opposite to human acquiescence bias where people tend to agree regardless of beliefs.

Motivation: Since LLMs are trained on human data and are influenceable by input changes, researchers wanted to investigate if they exhibit the same acquiescence bias (tendency to agree) that humans show in surveys.

Method: Conducted a study across different LLM models, tasks, and languages (English, German, Polish) to test for acquiescence bias patterns.

Result: Contrary to human behavior, LLMs displayed a consistent bias towards answering ’no’ regardless of whether it indicated agreement or disagreement with statements.

Conclusion: LLMs exhibit an opposite bias pattern to human acquiescence, showing a systematic tendency to respond negatively rather than positively in survey-like contexts.

Abstract: Acquiescence bias, i.e. the tendency of humans to agree with statements in surveys, independent of their actual beliefs, is well researched and documented. Since Large Language Models (LLMs) have been shown to be very influenceable by relatively small changes in input and are trained on human-generated data, it is reasonable to assume that they could show a similar tendency. We present a study investigating the presence of acquiescence bias in LLMs across different models, tasks, and languages (English, German, and Polish). Our results indicate that, contrary to humans, LLMs display a bias towards answering no, regardless of whether it indicates agreement or disagreement.

[20] Simulating Identity, Propagating Bias: Abstraction and Stereotypes in LLM-Generated Text

Pia Sommerauer, Giulia Rambelli, Tommaso Caselli

Main category: cs.CL

TL;DR: Persona-prompting in LLMs doesn’t effectively modulate linguistic abstraction levels and may propagate stereotypes even when simulating marginalized groups.

Motivation: To investigate whether persona-prompting leads to different levels of linguistic abstraction (a marker of stereotyping) when LLMs generate texts about socio-demographic categories.

Method: Analyzed outputs from six open-weight LLMs under three prompting conditions, comparing 11 persona-driven responses to generic AI assistant responses using the Linguistic Expectancy Bias framework. Introduced Self-Stereo dataset from Reddit. Measured abstraction through concreteness, specificity, and negation metrics.

Result: Persona-prompting has limited effectiveness in modulating abstraction in language. The approach confirms criticisms about personas as representative of socio-demographic groups and raises concerns about stereotype propagation even when simulating marginalized voices.

Conclusion: Persona-prompting shows limitations in controlling linguistic stereotyping markers and may inadvertently reinforce stereotypes, questioning the ecological validity of using personas to represent social groups in LLMs.

Abstract: Persona-prompting is a growing strategy to steer LLMs toward simulating particular perspectives or linguistic styles through the lens of a specified identity. While this method is often used to personalize outputs, its impact on how LLMs represent social groups remains underexplored. In this paper, we investigate whether persona-prompting leads to different levels of linguistic abstraction - an established marker of stereotyping - when generating short texts linking socio-demographic categories with stereotypical or non-stereotypical attributes. Drawing on the Linguistic Expectancy Bias framework, we analyze outputs from six open-weight LLMs under three prompting conditions, comparing 11 persona-driven responses to those of a generic AI assistant. To support this analysis, we introduce Self-Stereo, a new dataset of self-reported stereotypes from Reddit. We measure abstraction through three metrics: concreteness, specificity, and negation. Our results highlight the limits of persona-prompting in modulating abstraction in language, confirming criticisms about the ecology of personas as representative of socio-demographic groups and raising concerns about the risk of propagating stereotypes even when seemingly evoking the voice of a marginalized group.

[21] Too Helpful, Too Harmless, Too Honest or Just Right?

Gautam Siddharth Kashyap, Mark Dras, Usman Naseem

Main category: cs.CL

TL;DR: TrinityX is a modular alignment framework that uses Mixture of Calibrated Experts to improve LLM alignment across Helpfulness, Harmlessness, and Honesty dimensions simultaneously, achieving significant performance gains while reducing computational costs.

Motivation: Existing methods optimize individual alignment dimensions in isolation, leading to trade-offs and inconsistent behavior. MoE architectures offer modularity but suffer from poorly calibrated routing, limiting effectiveness in alignment tasks.

Method: Incorporates Mixture of Calibrated Experts (MoCaE) within Transformer architecture with separately trained experts for each HHH dimension, using calibrated task-adaptive routing mechanism to combine expert signals into unified alignment-aware representation.

Result: Outperforms baselines with 32.5% win rate improvement, 33.9% safety score improvement, and 28.4% truthfulness improvement. Reduces memory usage and inference latency by over 40% compared to prior MoE approaches.

Conclusion: TrinityX effectively addresses multi-dimensional alignment challenges through calibrated expert routing, demonstrating strong performance improvements, computational efficiency, and generalization across diverse LLM backbones.

Abstract: Large Language Models (LLMs) exhibit strong performance across a wide range of NLP tasks, yet aligning their outputs with the principles of Helpfulness, Harmlessness, and Honesty (HHH) remains a persistent challenge. Existing methods often optimize for individual alignment dimensions in isolation, leading to trade-offs and inconsistent behavior. While Mixture-of-Experts (MoE) architectures offer modularity, they suffer from poorly calibrated routing, limiting their effectiveness in alignment tasks. We propose TrinityX, a modular alignment framework that incorporates a Mixture of Calibrated Experts (MoCaE) within the Transformer architecture. TrinityX leverages separately trained experts for each HHH dimension, integrating their outputs through a calibrated, task-adaptive routing mechanism that combines expert signals into a unified, alignment-aware representation. Extensive experiments on three standard alignment benchmarks-Alpaca (Helpfulness), BeaverTails (Harmlessness), and TruthfulQA (Honesty)-demonstrate that TrinityX outperforms strong baselines, achieving relative improvements of 32.5% in win rate, 33.9% in safety score, and 28.4% in truthfulness. In addition, TrinityX reduces memory usage and inference latency by over 40% compared to prior MoE-based approaches. Ablation studies highlight the importance of calibrated routing, and cross-model evaluations confirm TrinityX’s generalization across diverse LLM backbones.
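
The routing idea, separately trained experts whose outputs are mixed by calibrated, task-adaptive weights, can be sketched on toy tensors. The temperature-scaled softmax used here as the calibration step is a simple stand-in for the paper's MoCaE routing, and all dimensions are arbitrary.

```python
# Toy sketch of combining per-dimension expert representations with
# calibrated routing weights. The temperature-scaled softmax is a simple
# stand-in for the paper's calibrated routing; dimensions are arbitrary.
import torch
import torch.nn as nn

d_model, n_experts = 64, 3             # helpfulness, harmlessness, honesty
experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
router = nn.Linear(d_model, n_experts)
temperature = 2.0                      # >1 softens over-confident routing

x = torch.randn(8, d_model)            # batch of sequence-level features
weights = torch.softmax(router(x) / temperature, dim=-1)       # (8, 3)
expert_out = torch.stack([e(x) for e in experts], dim=1)       # (8, 3, d_model)
fused = (weights.unsqueeze(-1) * expert_out).sum(dim=1)        # (8, d_model)
print(fused.shape, weights[0])
```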

[22] CM-Align: Consistency-based Multilingual Alignment for Large Language Models

Xue Zhang, Yunlong Liang, Fandong Meng, Songming Zhang, Yufeng Chen, Jinan Xu, Jie Zhou

Main category: cs.CL

TL;DR: CM-Align improves multilingual alignment in LLMs by using consistency-based data selection to create high-quality preference pairs, addressing issues with noisy English references and biased construction methods.

Motivation: Current methods show performance gap between English and other languages in LLM alignment, using potentially low-quality English responses as references and biased approaches for multilingual preference construction.

Method: Consistency-guided English reference selection and cross-lingual consistency-based multilingual preference data construction to filter high-quality training data.

Result: Experimental results on three LLMs and three tasks demonstrate effectiveness and superiority of the method.

Conclusion: The approach shows necessity of constructing high-quality preference data for improved multilingual alignment performance.

Abstract: Current large language models (LLMs) generally show a significant performance gap in alignment between English and other languages. To bridge this gap, existing research typically leverages the model’s responses in English as a reference to select the best/worst responses in other languages, which are then used for Direct Preference Optimization (DPO) training. However, we argue that there are two limitations in the current methods that result in noisy multilingual preference data and further limited alignment performance: 1) Not all English responses are of high quality, and using a response with low quality may mislead the alignment for other languages. 2) Current methods usually use biased or heuristic approaches to construct multilingual preference pairs. To address these limitations, we design a consistency-based data selection method to construct high-quality multilingual preference data for improving multilingual alignment (CM-Align). Specifically, our method includes two parts: consistency-guided English reference selection and cross-lingual consistency-based multilingual preference data construction. Experimental results on three LLMs and three common tasks demonstrate the effectiveness and superiority of our method, which further indicates the necessity of constructing high-quality preference data.

[23] LLM Ensemble for RAG: Role of Context Length in Zero-Shot Question Answering for BioASQ Challenge

Dima Galat, Diego Molla-Aliod

Main category: cs.CL

TL;DR: Ensemble of zero-shot LLMs achieves state-of-the-art performance on biomedical Yes/No QA without fine-tuning, outperforming individual models and rivaling domain-tuned systems through effective RAG pipelines.

Motivation: Biomedical QA requires precise interpretation of specialized knowledge from a vast, complex, and rapidly evolving corpus, posing significant challenges that need scalable solutions without costly fine-tuning.

Method: Uses ensemble approach aggregating outputs from multiple LLM variants (including Anthropic and Google models) with Retrieval-Augmented Generation (RAG) pipelines, focusing on zero-shot learning without labeled data or fine-tuning.

Result: Ensembles outperform individual LLMs and in some cases rival or surpass domain-tuned systems on BioASQ challenge tasks, while maintaining generalizability. Identified relationship between context length and performance - expanded contexts risk information dilution and model disorientation.

Conclusion: Ensemble-based zero-shot approaches with effective RAG pipelines provide practical and scalable alternative to domain-tuned systems for biomedical QA, emphasizing precise retrieval as critical foundation for ensuring LLMs operate within relevant information boundaries.

Abstract: Biomedical question answering (QA) poses significant challenges due to the need for precise interpretation of specialized knowledge drawn from a vast, complex, and rapidly evolving corpus. In this work, we explore how large language models (LLMs) can be used for information retrieval (IR), and an ensemble of zero-shot models can accomplish state-of-the-art performance on a domain-specific Yes/No QA task. Evaluating our approach on the BioASQ challenge tasks, we show that ensembles can outperform individual LLMs and in some cases rival or surpass domain-tuned systems - all while preserving generalizability and avoiding the need for costly fine-tuning or labeled data. Our method aggregates outputs from multiple LLM variants, including models from Anthropic and Google, to synthesize more accurate and robust answers. Moreover, our investigation highlights a relationship between context length and performance: while expanded contexts are meant to provide valuable evidence, they simultaneously risk information dilution and model disorientation. These findings emphasize IR as a critical foundation in Retrieval-Augmented Generation (RAG) approaches for biomedical QA systems. Precise, focused retrieval remains essential for ensuring LLMs operate within relevant information boundaries when generating answers from retrieved documents. Our results establish that ensemble-based zero-shot approaches, when paired with effective RAG pipelines, constitute a practical and scalable alternative to domain-tuned systems for biomedical question answering.
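
The ensemble step amounts to majority voting over the yes/no answers each zero-shot model returns when shown the question plus the retrieved snippets. In the sketch below the model call is a stub; real usage would call the Anthropic and Google APIs with a suitable prompt.

```python
# Sketch of the ensemble step: each zero-shot model answers yes/no given the
# question plus retrieved snippets, and the final answer is a majority vote.
# The model call is a stub; real usage would hit the Anthropic/Google APIs.
from collections import Counter

def ask_model(model_name: str, question: str, context: str) -> str:
    """Stub for an LLM call that must return 'yes' or 'no'."""
    return "yes" if "increase" in context.lower() else "no"

def ensemble_yes_no(question, retrieved_snippets, models):
    context = "\n".join(retrieved_snippets)        # keep retrieval focused
    votes = [ask_model(m, question, context) for m in models]
    return Counter(votes).most_common(1)[0][0], votes

answer, votes = ensemble_yes_no(
    "Does drug X increase risk of Y?",
    ["Trial data suggest drug X may increase the risk of Y."],
    ["model-a", "model-b", "model-c"])
print(answer, votes)
```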

[24] Memorization in Large Language Models in Medicine: Prevalence, Characteristics, and Implications

Anran Li, Lingfei Qian, Mengmeng Du, Yu Yin, Yan Hu, Zihao Sun, Yihang Fu, Erica Stutz, Xuguang Ai, Qianqian Xie, Rui Zhu, Jimin Huang, Yifan Yang, Siru Liu, Yih-Chung Tham, Lucila Ohno-Machado, Hyunghoon Cho, Zhiyong Lu, Hua Xu, Qingyu Chen

Main category: cs.CL

TL;DR: First comprehensive evaluation of LLM memorization in medicine, showing it’s more prevalent than in general domain and can be beneficial, uninformative, or harmful.

Motivation: To understand the extent and impact of LLM memorization in medical applications, as memorization affects both development and adoption of LLMs in medicine.

Method: Systematic analysis of three adaptation scenarios: continued pretraining on medical corpora, fine-tuning on standard medical benchmarks, and fine-tuning on real-world clinical data (13,000+ inpatient records).
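
The summary does not spell out how memorization is measured, so the sketch below shows a common prefix-continuation probe as an illustrative stand-in: prompt the adapted model with the start of a training record and check how much of the true continuation it reproduces. The `generate` wrapper and whitespace tokenization are assumptions, not the paper's protocol.

```python
def verbatim_overlap(record, generate, prefix_tokens=64):
    """Illustrative memorization probe for an adapted model.

    `generate(prompt, max_tokens)` is a hypothetical wrapper around the
    fine-tuned or continually pretrained model; whitespace tokenization
    keeps the sketch dependency-free.
    """
    tokens = record.split()
    prefix, reference = tokens[:prefix_tokens], tokens[prefix_tokens:]
    completion = generate(" ".join(prefix), max_tokens=len(reference)).split()
    # Fraction of the held-out continuation reproduced verbatim, position by position.
    matches = sum(1 for a, b in zip(completion, reference) if a == b)
    return matches / max(len(reference), 1)
```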

Result: Memorization is prevalent across all adaptation scenarios and significantly higher than in general domain. Three types identified: beneficial (clinical guidelines), uninformative (disclaimers), and harmful (sensitive patient data).

Conclusion: Practical recommendations provided to facilitate beneficial memorization, minimize uninformative memorization, and mitigate harmful memorization to prevent patient data leakage.

Abstract: Large Language Models (LLMs) have demonstrated significant potential in medicine. To date, LLMs have been widely applied to tasks such as diagnostic assistance, medical question answering, and clinical information synthesis. However, a key open question remains: to what extent do LLMs memorize medical training data. In this study, we present the first comprehensive evaluation of memorization of LLMs in medicine, assessing its prevalence (how frequently it occurs), characteristics (what is memorized), volume (how much content is memorized), and potential downstream impacts (how memorization may affect medical applications). We systematically analyze common adaptation scenarios: (1) continued pretraining on medical corpora, (2) fine-tuning on standard medical benchmarks, and (3) fine-tuning on real-world clinical data, including over 13,000 unique inpatient records from Yale New Haven Health System. The results demonstrate that memorization is prevalent across all adaptation scenarios and significantly higher than reported in the general domain. Memorization affects both the development and adoption of LLMs in medicine and can be categorized into three types: beneficial (e.g., accurate recall of clinical guidelines and biomedical references), uninformative (e.g., repeated disclaimers or templated medical document language), and harmful (e.g., regeneration of dataset-specific or sensitive clinical content). Based on these findings, we offer practical recommendations to facilitate beneficial memorization that enhances domain-specific reasoning and factual accuracy, minimize uninformative memorization to promote deeper learning beyond surface-level patterns, and mitigate harmful memorization to prevent the leakage of sensitive or identifiable patient information.

[25] OTESGN:Optimal Transport Enhanced Syntactic-Semantic Graph Networks for Aspect-Based Sentiment Analysis

Xinfeng Liao, Xuanqi Chen, Lianxi Wang, Jiahuan Yang, Zhuowei Chen, Ziying Rong

Main category: cs.CL

TL;DR: OTESGN model uses Optimal Transport and syntactic-semantic collaboration to improve aspect-based sentiment analysis by better capturing nonlinear relationships and filtering noise.

DetailsMotivation: Existing ABSA methods struggle with complex semantic relationships and nonlinear associations, allowing noisy similarity from irrelevant words to obscure key opinion terms.

Method: Proposes Optimal Transport Enhanced Syntactic-Semantic Graph Network (OTESGN) with Syntactic-Semantic Collaborative Attention, including Syntactic Graph-Aware Attention and Semantic Optimal Transport Attention, plus Adaptive Attention Fusion and contrastive regularization.
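
The summary names a Semantic Optimal Transport Attention but not its exact formulation; below is a generic entropic-OT sketch (Sinkhorn iterations over a token-similarity matrix with uniform marginals) to illustrate how a transport plan can replace dot-product attention weights. It is an illustration of the general technique, not OTESGN's module.

```python
import numpy as np

def sinkhorn_attention(sim, eps=0.1, n_iters=50):
    """Turn a token-similarity matrix into transport-based attention weights.

    Entropic optimal transport with uniform marginals, solved by Sinkhorn
    iterations; a generic illustration, not OTESGN's exact formulation.
    """
    n, m = sim.shape
    K = np.exp(sim / eps)                 # Gibbs kernel: high similarity -> cheap transport
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):              # alternate scaling to match the marginals
        u = a / (K @ v)
        v = b / (K.T @ u)
    plan = u[:, None] * K * v[None, :]    # transport plan between the two token sets
    return plan / plan.sum(axis=1, keepdims=True)  # row-normalize into attention weights
```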

Result: Achieves state-of-the-art results: +1.01% F1 on Twitter and +1.30% F1 on Laptop14 benchmarks compared to previous best models.

Conclusion: OTESGN effectively captures sentiment signals obscured by irrelevant tokens; ablation studies and visual analyses confirm its precise localization of opinion words and its resistance to noise.

Abstract: Aspect-based sentiment analysis (ABSA) aims to identify aspect terms and determine their sentiment polarity. While dependency trees combined with contextual semantics effectively identify aspect sentiment, existing methods relying on syntax trees and aspect-aware attention struggle to model complex semantic relationships. Their dependence on linear dot-product features fails to capture nonlinear associations, allowing noisy similarity from irrelevant words to obscure key opinion terms. Motivated by Differentiable Optimal Matching, we propose the Optimal Transport Enhanced Syntactic-Semantic Graph Network (OTESGN), which introduces a Syntactic-Semantic Collaborative Attention. It comprises a Syntactic Graph-Aware Attention for mining latent syntactic dependencies and modeling global syntactic topology, as well as a Semantic Optimal Transport Attention designed to uncover fine-grained semantic alignments amidst textual noise, thereby accurately capturing sentiment signals obscured by irrelevant tokens. A Adaptive Attention Fusion module integrates these heterogeneous features, and contrastive regularization further improves robustness. Experiments demonstrate that OTESGN achieves state-of-the-art results, outperforming previous best models by +1.01% F1 on Twitter and +1.30% F1 on Laptop14 benchmarks. Ablative studies and visual analyses corroborate its efficacy in precise localization of opinion words and noise resistance.

[26] X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates

Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park

Main category: cs.CL

TL;DR: Automated framework using evolutionary search to discover and optimize multi-turn-to-single-turn red-teaming templates, achieving 44.8% success rate on GPT-4.1 with transferable structural improvements.

DetailsMotivation: Prior work on M2S compression relied on manually written templates, which limits scalability and effectiveness. There's a need for automated methods to discover optimal prompt structures for red-teaming.

Method: X-Teaming Evolutionary M2S framework that uses language-model-guided evolution to discover and optimize M2S templates. Combines smart sampling from 12 sources with LLM-as-judge evaluation inspired by StrongREJECT, maintaining selection pressure with success threshold θ=0.70.
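
A schematic generation-and-selection loop using the θ = 0.70 threshold mentioned in the summary; `mutate` (an LLM call that rewrites a template) and `judge` (a StrongREJECT-style 0-1 scorer) are hypothetical stand-ins for the framework's actual components.

```python
def evolve_templates(seed_templates, behaviors, mutate, judge,
                     generations=5, theta=0.70):
    """Language-model-guided evolutionary search over M2S templates (sketch).

    `mutate(template)` asks an LLM for a structural rewrite and
    `judge(prompt)` returns a score in [0, 1]; templates survive only if
    their mean score across behaviors reaches the threshold theta.
    """
    population = list(seed_templates)
    for _ in range(generations):
        candidates = population + [mutate(t) for t in population]
        scored = []
        for template in candidates:
            scores = [judge(template.format(behavior=b)) for b in behaviors]
            scored.append((sum(scores) / len(scores), template))
        # Keep only templates that clear the success threshold, best first.
        survivors = [t for s, t in sorted(scored, reverse=True) if s >= theta]
        population = survivors or population  # fall back if nothing survives
    return population
```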

Result: Achieved 44.8% overall success (103/230) on GPT-4.1, generated five evolutionary generations and two new template families. Cross-model evaluation showed structural gains transfer but vary by target model, with some models scoring zero. Found positive correlation between prompt length and success score.

Conclusion: Structure-level search provides reproducible route to stronger single-turn probes, highlighting importance of threshold calibration and cross-model evaluation for robust red-teaming template optimization.

Abstract: Multi-turn-to-single-turn (M2S) compresses iterative red-teaming into one structured prompt, but prior work relied on a handful of manually written templates. We present X-Teaming Evolutionary M2S, an automated framework that discovers and optimizes M2S templates through language-model-guided evolution. The system pairs smart sampling from 12 sources with an LLM-as-judge inspired by StrongREJECT and records fully auditable logs. Maintaining selection pressure by setting the success threshold to $\theta = 0.70$, we obtain five evolutionary generations, two new template families, and 44.8% overall success (103/230) on GPT-4.1. A balanced cross-model panel of 2,500 trials (judge fixed) shows that structural gains transfer but vary by target; two models score zero at the same threshold. We also find a positive coupling between prompt length and score, motivating length-aware judging. Our results demonstrate that structure-level search is a reproducible route to stronger single-turn probes and underscore the importance of threshold calibration and cross-model evaluation. Code, configurations, and artifacts are available at https://github.com/hyunjun1121/M2S-x-teaming.

[27] Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling

Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Pérez, Laurent Mazaré, Alexandre Défossez

Main category: cs.CL

TL;DR: DSM is a streaming sequence-to-sequence method that uses delayed alignment between multimodal streams to enable real-time inference with state-of-the-art performance on ASR and TTS tasks.

DetailsMotivation: Traditional sequence-to-sequence models operate offline, consuming complete inputs before generating outputs. Streaming approaches require complex policies for input/output timing. DSM aims to provide a simpler, more flexible streaming solution that handles arbitrary multimodal sequences.

Method: Uses decoder-only language model with time-aligned streams. Moves alignment to pre-processing by introducing appropriate delays between input and output streams. Enables streaming inference for any input combination without complex timing policies.
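
A toy illustration of the delay idea, assuming two already time-aligned token streams and a placeholder pad symbol; the real model operates on audio/text codec streams rather than strings.

```python
def delay_stream(tokens, delay, pad="<pad>"):
    """Shift a time-aligned stream `delay` steps later by left-padding it."""
    return [pad] * delay + list(tokens)

def interleave(stream_a, stream_b, pad="<pad>"):
    """Zip two aligned streams into one frame-per-step sequence for a
    decoder-only model, right-padding the shorter stream."""
    length = max(len(stream_a), len(stream_b))
    a = list(stream_a) + [pad] * (length - len(stream_a))
    b = list(stream_b) + [pad] * (length - len(stream_b))
    return list(zip(a, b))

# ASR-style setup (illustrative): the text stream is delayed relative to the
# audio stream, so the model has consumed a few audio frames before it must
# emit the corresponding text; delaying the audio stream instead gives a
# TTS-style setup.
audio = [f"frame{i}" for i in range(6)]
text = list("hi")
steps = interleave(audio, delay_stream(text, delay=2))
```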

Result: Achieves state-of-the-art performance and latency on automatic speech recognition (ASR) and text-to-speech (TTS) tasks. Supports arbitrary long sequences and remains competitive with offline baselines.

Conclusion: DSM provides a flexible, high-performance streaming solution for multimodal sequence-to-sequence tasks, eliminating the need for complex timing policies while maintaining competitive results with offline methods.

Abstract: We introduce Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning. Sequence-to-sequence generation is often cast in an offline manner, where the model consumes the complete input sequence before generating the first output timestep. Alternatively, streaming sequence-to-sequence approaches rely on learning a policy for choosing when to advance on the input stream, or write to the output stream. DSM instead models already time-aligned streams with a decoder-only language model. By moving the alignment to a pre-processing step, and introducing appropriate delays between streams, DSM provides streaming inference of arbitrary output sequences, from any input combination, making it applicable to many sequence-to-sequence problems. In particular, given text and audio streams, automatic speech recognition (ASR) corresponds to the text stream being delayed, while the opposite gives a text-to-speech (TTS) model. We perform extensive experiments for these two major sequence-to-sequence tasks, showing that DSM provides state-of-the-art performance and latency while supporting arbitrarily long sequences, being even competitive with offline baselines. Code, samples and demos are available at https://github.com/kyutai-labs/delayed-streams-modeling

[28] Do All Autoregressive Transformers Remember Facts the Same Way? A Cross-Architecture Analysis of Recall Mechanisms

Minyeong Choe, Haehyun Cho, Changho Seo, Hyunil Kim

Main category: cs.CL

TL;DR: Qwen-based models show different factual recall patterns than GPT-style models, with attention modules in early layers contributing more than MLPs, indicating architectural variations lead to different mechanisms even within autoregressive Transformers.

DetailsMotivation: To understand how factual associations are stored and retrieved in Transformer models and determine if prior findings about GPT-style models (MLP dominance in early layers) generalize across different autoregressive architectures.

Method: Comprehensive evaluation across multiple models (GPT, LLaMA, Qwen, DeepSeek) analyzing where and how factual information is encoded and accessed.

Result: Qwen-based models behave differently - attention modules in the earliest layers contribute more to factual recall than MLP modules, unlike the pattern observed in GPT-style models.

Conclusion: Architectural variations within the autoregressive Transformer family can lead to fundamentally different mechanisms of factual recall, challenging the generalization of findings across different model architectures.

Abstract: Understanding how Transformer-based language models store and retrieve factual associations is critical for improving interpretability and enabling targeted model editing. Prior work, primarily on GPT-style models, has identified MLP modules in early layers as key contributors to factual recall. However, it remains unclear whether these findings generalize across different autoregressive architectures. To address this, we conduct a comprehensive evaluation of factual recall across several models – including GPT, LLaMA, Qwen, and DeepSeek – analyzing where and how factual information is encoded and accessed. Consequently, we find that Qwen-based models behave differently from previous patterns: attention modules in the earliest layers contribute more to factual recall than MLP modules. Our findings suggest that even within the autoregressive Transformer family, architectural variations can lead to fundamentally different mechanisms of factual recall.

[29] Evaluating LLMs Without Oracle Feedback: Agentic Annotation Evaluation Through Unsupervised Consistency Signals

Cheng Chen, Haiyan Yin, Ivor Tsang

Main category: cs.CL

TL;DR: Proposes CAI Ratio metric for unsupervised evaluation of LLM annotation quality using student-teacher collaboration and majority voting without oracle feedback.

DetailsMotivation: Evaluating LLM annotation quality is challenging in dynamic, unsupervised environments where oracle feedback is scarce and conventional methods fail.

Method: Agentic annotation paradigm with student model collaborating with LLM teacher, using user preference-based majority voting to assess consistency of LLM outputs.
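
The precise definition of the CAI Ratio is not given in this summary; the sketch below assumes one plausible reading, counting an item as consistent when the student's majority-voted label agrees with the teacher LLM's label.

```python
from collections import Counter

def cai_ratio(teacher_labels, student_votes):
    """Illustrative consistency ratio between a noisy LLM teacher and a
    student feedback model.

    `teacher_labels[i]` is the LLM's label for item i; `student_votes[i]` is
    a list of candidate labels from preference-based voting. The exact CAI
    definition in the paper may differ from this sketch.
    """
    consistent = 0
    for teacher, votes in zip(teacher_labels, student_votes):
        majority = Counter(votes).most_common(1)[0][0]
        consistent += int(majority == teacher)
    inconsistent = len(teacher_labels) - consistent
    return consistent / max(consistent + inconsistent, 1)
```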

Result: CAI Ratio shows strong positive correlation with LLM accuracy across ten NLP datasets and four LLMs, enabling reliable model selection.

Conclusion: CAI Ratio is an essential tool for unsupervised evaluation and model selection in real-world settings where traditional evaluation methods are impractical.

Abstract: Large Language Models (LLMs), when paired with prompt-based tasks, have significantly reduced data annotation costs and reliance on human annotators. However, evaluating the quality of their annotations remains challenging in dynamic, unsupervised environments where oracle feedback is scarce and conventional methods fail. To address this challenge, we propose a novel agentic annotation paradigm, where a student model collaborates with a noisy teacher (the LLM) to assess and refine annotation quality without relying on oracle feedback. The student model, acting as an unsupervised feedback mechanism, employs a user preference-based majority voting strategy to evaluate the consistency of the LLM outputs. To systematically measure the reliability of LLM-generated annotations, we introduce the Consistent and Inconsistent (CAI) Ratio, a novel unsupervised evaluation metric. The CAI Ratio not only quantifies the annotation quality of the noisy teacher under limited user preferences but also plays a critical role in model selection, enabling the identification of robust LLMs in dynamic, unsupervised environments. Applied to ten open-domain NLP datasets across four LLMs, the CAI Ratio demonstrates a strong positive correlation with LLM accuracy, establishing it as an essential tool for unsupervised evaluation and model selection in real-world settings.

[30] MoVoC: Morphology-Aware Subword Construction for Geez Script Languages

Hailay Kidu Teklehaymanot, Dren Fazlija, Wolfgang Nejdl

Main category: cs.CL

TL;DR: MoVoC is a morpheme-aware subword tokenization method that integrates supervised morphological analysis with BPE to preserve morphological boundaries in low-resource Geez script languages, showing improved linguistic fidelity despite no significant translation gains.

DetailsMotivation: Subword tokenization methods often fail to preserve morphological boundaries, which is particularly problematic for low-resource, morphologically complex languages like those using the Geez script.

Method: Developed MoVoC (Morpheme-aware Subword Vocabulary Construction) and trained MoVoC-Tok tokenizer that combines morpheme-based segmentation with Byte Pair Encoding (BPE) tokens to maintain morphological integrity while preserving lexical meaning.
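
A minimal sketch of hybrid segmentation, greedily matching attested morphemes from the left and falling back to BPE for whatever remains; the morpheme lexicon and `bpe_encode` tokenizer are hypothetical, not the released MoVoC-Tok artifacts.

```python
def segment(word, morphemes, bpe_encode):
    """Hybrid morpheme-then-BPE segmentation (sketch).

    `morphemes` is a set of attested morphs from supervised analysis and
    `bpe_encode(text)` is a hypothetical trained BPE tokenizer.
    """
    pieces, rest = [], word
    while rest:
        # Longest-prefix match against the morpheme lexicon.
        match = next((rest[:i] for i in range(len(rest), 0, -1)
                      if rest[:i] in morphemes), None)
        if match is None:
            pieces.extend(bpe_encode(rest))  # hand the residue to BPE
            break
        pieces.append(match)
        rest = rest[len(match):]
    return pieces
```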

Result: While no significant improvements in automatic translation quality were observed, the method showed consistent improvements in intrinsic metrics (MorphoScore and Boundary Precision), demonstrating enhanced linguistic fidelity and token efficiency.

Conclusion: Morphology-aware segmentation provides value for linguistic fidelity in low-resource languages. The authors release manually annotated morpheme data for four Geez script languages and vocabulary for two languages to support further research.

Abstract: Subword-based tokenization methods often fail to preserve morphological boundaries, a limitation especially pronounced in low-resource, morphologically complex languages such as those written in the Geez script. To address this, we present MoVoC (Morpheme-aware Subword Vocabulary Construction) and train MoVoC-Tok, a tokenizer that integrates supervised morphological analysis into the subword vocabulary. This hybrid segmentation approach combines morpheme-based and Byte Pair Encoding (BPE) tokens to preserve morphological integrity while maintaining lexical meaning. To tackle resource scarcity, we curate and release manually annotated morpheme data for four Geez script languages and a morpheme-aware vocabulary for two of them. While the proposed tokenization method does not lead to significant gains in automatic translation quality, we observe consistent improvements in intrinsic metrics, MorphoScore, and Boundary Precision, highlighting the value of morphology-aware segmentation in enhancing linguistic fidelity and token efficiency. Our morpheme-annotated datasets and tokenizer will be publicly available to support further research in low-resource, morphologically rich languages. Our code and data are available on GitHub: https://github.com/hailaykidu/MoVoC

[31] Building High-Quality Datasets for Portuguese LLMs: From Common Crawl Snapshots to Industrial-Grade Corpora

Thales Sales Almeida, Rodrigo Nogueira, Helio Pedrini

Main category: cs.CL

TL;DR: Scalable methods for building web-based multilingual corpora, with Portuguese case study showing 120B token corpus achieves competitive results through language-specific filtering and continual pretraining.

DetailsMotivation: Address the gap in understanding how to construct effective training corpora for non-English languages in LLM development, as most existing work focuses on English.

Method: Developed scalable web-based corpus building methods, applied to Portuguese with language-specific filtering pipelines (including STEM and toxic content classifiers), and used continual pretraining setup to transition English-trained models to target language.
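
A schematic per-document filter in the spirit of the described pipeline; the classifier wrappers, target language code, and thresholds are placeholders rather than the values used for the Portuguese corpus.

```python
def keep_document(text, lang_id, toxicity_score, quality_score,
                  target_lang="pt", min_chars=200,
                  max_toxicity=0.2, min_quality=0.5):
    """Schematic filter for web-crawled documents (sketch).

    `lang_id(text)` returns a language code; `toxicity_score(text)` and
    `quality_score(text)` (e.g. a STEM/education classifier) return
    probabilities in [0, 1]. All three are hypothetical wrappers.
    """
    if len(text) < min_chars:
        return False                      # drop near-empty pages
    if lang_id(text) != target_lang:
        return False                      # keep only the target language
    if toxicity_score(text) > max_toxicity:
        return False                      # drop toxic content
    return quality_score(text) >= min_quality  # keep documents rated as high quality
```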

Result: Built a 120B-token Portuguese corpus whose results are competitive with an industrial-grade corpus. Showed that language-specific filtering and adapting the model to the target language lead to performance improvements.

Conclusion: High-quality, language-specific data is crucial for multilingual LLM performance. The methods developed are applicable to other languages and provide valuable insights for multilingual LLM development.

Abstract: The performance of large language models (LLMs) is deeply influenced by the quality and composition of their training data. While much of the existing work has centered on English, there remains a gap in understanding how to construct effective training corpora for other languages. We explore scalable methods for building web-based corpora for LLMs. We apply them to build a new 120B token corpus in Portuguese that achieves competitive results to an industrial-grade corpus. Using a continual pretraining setup, we study how different data selection and preprocessing strategies affect LLM performance when transitioning a model originally trained in English to another language. Our findings demonstrate the value of language-specific filtering pipelines, including classifiers for education, science, technology, engineering, and mathematics (STEM), as well as toxic content. We show that adapting a model to the target language leads to performance improvements, reinforcing the importance of high-quality, language-specific data. While our case study focuses on Portuguese, our methods are applicable to other languages, offering insights for multilingual LLM development.

[32] Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation

Joachim Baumann, Paul Röttger, Aleksandra Urman, Albert Wendsjö, Flor Miriam Plaza-del-Arco, Johannes B. Gruber, Dirk Hovy

Main category: cs.CL

TL;DR: LLM hacking refers to systematic biases and errors introduced by researcher implementation choices (model selection, prompting, temperature) in LLM-based data annotation, leading to incorrect statistical conclusions in approximately 1/3 to 1/2 of hypotheses tested.

DetailsMotivation: To quantify the risk of biased results in social science research when using LLMs for data annotation tasks, as implementation choices can introduce systematic errors that propagate to downstream analyses.

Method: Replicated 37 data annotation tasks from 21 published studies using 18 different LLM models, analyzing 13 million labels and testing 2,361 realistic hypotheses to measure how researcher choices affect statistical conclusions.
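
To make the failure mode concrete, here is a toy sweep over annotation configurations for a single hypothesis; the `annotate` wrapper and the two-sample t-test are illustrative choices, not the study's actual estimation procedure.

```python
from itertools import product
from scipy.stats import ttest_ind

def significance_sweep(texts, groups, annotate, models, prompts, temperatures):
    """Test the same hypothesis ("group A scores higher than group B") under
    every annotation configuration (sketch).

    `annotate(text, model, prompt, temperature)` is a hypothetical wrapper
    returning a numeric label; `groups[i]` is "A" or "B" for texts[i].
    """
    p_values = {}
    for model, prompt, temp in product(models, prompts, temperatures):
        labels = [annotate(t, model, prompt, temp) for t in texts]
        a = [l for l, g in zip(labels, groups) if g == "A"]
        b = [l for l, g in zip(labels, groups) if g == "B"]
        _, p = ttest_ind(a, b)
        p_values[(model, prompt, temp)] = p
    # If p straddles 0.05 across configurations, the statistical conclusion
    # depends on researcher choices: the risk quantified at scale in the paper.
    return p_values
```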

Result: Found incorrect conclusions in ~33% of hypotheses for state-of-the-art models and ~50% for small language models. Higher task performance reduces but does not eliminate the risk. Intentional hacking is alarmingly simple: a few models and a handful of prompt paraphrases can make almost anything appear statistically significant.

Conclusion: LLM hacking poses significant risks to research validity. Human annotations remain crucial for reducing false positives, and findings near significance thresholds require rigorous verification. Common correction techniques are largely ineffective at mitigating these risks.

Abstract: Large language models (LLMs) are rapidly transforming social science research by enabling the automation of labor-intensive tasks like data annotation and text analysis. However, LLM outputs vary significantly depending on the implementation choices made by researchers (e.g., model selection, prompting strategy, or temperature settings). Such variation can introduce systematic biases and random errors, which propagate to downstream analyses and cause Type I, Type II, Type S, or Type M errors. We call this LLM hacking. We quantify the risk of LLM hacking by replicating 37 data annotation tasks from 21 published social science research studies with 18 different models. Analyzing 13 million LLM labels, we test 2,361 realistic hypotheses to measure how plausible researcher choices affect statistical conclusions. We find incorrect conclusions based on LLM-annotated data in approximately one in three hypotheses for state-of-the-art models, and in half the hypotheses for small language models. While our findings show that higher task performance and better general model capabilities reduce LLM hacking risk, even highly accurate models do not completely eliminate it. The risk of LLM hacking decreases as effect sizes increase, indicating the need for more rigorous verification of findings near significance thresholds. Our extensive analysis of LLM hacking mitigation techniques emphasizes the importance of human annotations in reducing false positive findings and improving model selection. Surprisingly, common regression estimator correction techniques are largely ineffective in reducing LLM hacking risk, as they heavily trade off Type I vs. Type II errors. Beyond accidental errors, we find that intentional LLM hacking is unacceptably simple. With few LLMs and just a handful of prompt paraphrases, anything can be presented as statistically significant.

[33] A Survey of Reinforcement Learning for Large Reasoning Models

Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, Yu Fu, Xingtai Lv, Yuchen Zhang, Sihang Zeng, Shang Qu, Haozhan Li, Shijie Wang, Yuru Wang, Xinwei Long, Fangfu Liu, Xiang Xu, Jiaze Ma, Xuekai Zhu, Ermo Hua, Yihao Liu, Zonglin Li, Huayu Chen, Xiaoye Qu, Yafu Li, Weize Chen, Zhenzhao Yuan, Junqi Gao, Dong Li, Zhiyuan Ma, Ganqu Cui, Zhiyuan Liu, Biqing Qi, Ning Ding, Bowen Zhou

Main category: cs.CL

TL;DR: Survey paper reviewing recent advances in Reinforcement Learning for reasoning with Large Language Models, examining foundational components, challenges, and future directions for scaling RL towards Artificial SuperIntelligence.

DetailsMotivation: RL has emerged as a foundational methodology for transforming LLMs into LRMs (Large Reasoning Models), but faces scaling challenges in computational resources, algorithm design, training data, and infrastructure that need to be addressed.

Method: The paper conducts a comprehensive survey examining research applying RL to LLMs and LRMs for reasoning abilities, including foundational components, core problems, training resources, and downstream applications since the release of DeepSeek-R1.

Result: The survey identifies that RL has achieved remarkable success in advancing LLM capabilities for complex logical tasks like mathematics and coding, but scaling RL for LRMs faces significant foundational challenges.

Conclusion: The review aims to promote future research on RL for broader reasoning models by reassessing the field’s trajectory and exploring strategies to enhance scalability toward Artificial SuperIntelligence.

Abstract: In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into LRMs. With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI). In particular, we examine research applying RL to LLMs and LRMs for reasoning abilities, especially since the release of DeepSeek-R1, including foundational components, core problems, training resources, and downstream applications, to identify future opportunities and directions for this rapidly evolving area. We hope this review will promote future research on RL for broader reasoning models. Github: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs

[34] Baba Is AI: Break the Rules to Beat the Benchmark

Nathan Cloos, Meagan Jens, Michelangelo Naim, Yen-Ling Kuo, Ignacio Cases, Andrei Barbu, Christopher J. Cueva

Main category: cs.CL

TL;DR: LLMs fail dramatically at rule manipulation tasks in the Baba Is You game, showing poor generalization when rules must be combined and modified.

DetailsMotivation: To test AI systems' ability to solve problems both by following existing rules and by creatively redefining them, mirroring human problem-solving.

Method: Developed a benchmark based on the game Baba Is You in which agents manipulate both objects and rule tiles; tested GPT-4o, Gemini-1.5-Pro, and Gemini-1.5-Flash.

Result: All three state-of-the-art multi-modal LLMs failed dramatically when generalization required rules to be manipulated and combined.

Conclusion: Current LLMs struggle with tasks requiring rule redefinition and creative problem-solving beyond standard rule-following.

Abstract: Humans solve problems by following existing rules and procedures, and also by leaps of creativity to redefine those rules and objectives. To probe these abilities, we developed a new benchmark based on the game Baba Is You where an agent manipulates both objects in the environment and rules, represented by movable tiles with words written on them, to reach a specified goal and win the game. We test three state-of-the-art multi-modal large language models (OpenAI GPT-4o, Google Gemini-1.5-Pro and Gemini-1.5-Flash) and find that they fail dramatically when generalization requires that the rules of the game must be manipulated and combined.

[35] Localizing Factual Inconsistencies in Attributable Text Generation

Arie Cattan, Paul Roit, Shiyue Zhang, David Wan, Roee Aharoni, Idan Szpektor, Mohit Bansal, Ido Dagan

Main category: cs.CL

TL;DR: QASemConsistency is a fine-grained method for detecting factual inconsistencies in text generation by decomposing text into predicate-argument propositions as QA pairs and checking support from reference texts.

DetailsMotivation: Existing methods for detecting hallucinations in model-generated texts fail to precisely pinpoint errors at fine-grained levels, creating a need for more precise localization of factual inconsistencies.

Method: Inspired by Neo-Davidsonian formal semantics, the method decomposes generated text into minimal predicate-argument propositions expressed as simple QA pairs, then assesses each QA pair’s support from trusted reference texts.
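
A minimal sketch of the checking step, assuming a hypothetical `decompose_to_qa` function for the predicate-argument decomposition and an NLI-style `entails` check; the declarative rendering of each QA pair is a naive placeholder, not the paper's released tooling.

```python
def unsupported_propositions(generated_text, reference_text, decompose_to_qa, entails):
    """Localize unsupported content (sketch): decompose the generated text into
    predicate-argument QA pairs and keep those the reference does not support.

    `decompose_to_qa(text)` yields (question, answer) pairs and
    `entails(premise, hypothesis)` is a boolean NLI-style check; both are
    hypothetical wrappers.
    """
    flagged = []
    for question, answer in decompose_to_qa(generated_text):
        hypothesis = f"{question} {answer}"   # naive declarative rendering of the QA pair
        if not entails(reference_text, hypothesis):
            flagged.append((question, answer))  # this proposition is a localized inconsistency
    return flagged
```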

Result: Achieved substantial inter-annotator agreement on crowdsourced annotations of granular consistency errors, created benchmark with 3K+ instances, and showed factual consistency scores correlate well with human judgments. Also implemented automated detection methods using supervised entailment models and LLMs.

Conclusion: QASemConsistency provides an effective formalism for fine-grained localization of factual inconsistencies in attributable text generation, enabling both human annotation and automated detection with strong correlation to human judgment.

Abstract: There has been an increasing interest in detecting hallucinations in model-generated texts, both manually and automatically, at varying levels of granularity. However, most existing methods fail to precisely pinpoint the errors. In this work, we introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation, at a fine-grained level. Drawing inspiration from Neo-Davidsonian formal semantics, we propose decomposing the generated text into minimal predicate-argument level propositions, expressed as simple question-answer (QA) pairs, and assess whether each individual QA pair is supported by a trusted reference text. As each QA pair corresponds to a single semantic relation between a predicate and an argument, QASemConsistency effectively localizes the unsupported information. We first demonstrate the effectiveness of the QASemConsistency methodology for human annotation, by collecting crowdsourced annotations of granular consistency errors, while achieving a substantial inter-annotator agreement. This benchmark includes more than 3K instances spanning various tasks of attributable text generation. We also show that QASemConsistency yields factual consistency scores that correlate well with human judgments. Finally, we implement several methods for automatically detecting localized factual inconsistencies, with both supervised entailment models and LLMs.

[36] TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, Graham Neubig

Main category: cs.CL

TL;DR: TheAgentCompany benchmark evaluates AI agents’ performance on professional workplace tasks, finding current systems can autonomously complete 30% of tasks but struggle with complex long-horizon work.

DetailsMotivation: To measure AI agents' capabilities in performing real-world professional tasks and understand their potential impact on workforce automation and labor markets.

Method: Created an extensible benchmark with a self-contained environment simulating a small software company, testing baseline agents using both closed API-based and open-weights language models on various workplace tasks.

Result: The most competitive agent completed 30% of tasks autonomously, showing that simpler tasks can be automated but complex long-horizon tasks remain challenging for current systems.

Conclusion: Current AI agents show promise for automating simpler professional tasks but significant limitations remain for complex work, providing a nuanced view of AI’s current workplace automation capabilities.

Abstract: We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. But how performant are AI agents at accelerating or even autonomously performing work-related tasks? The answer to this question has important implications both for industry looking to adopt AI into their workflows and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the progress of these LLM agents’ performance on performing real-world professional tasks, in this paper we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that the most competitive agent can complete 30% of tasks autonomously. This paints a nuanced picture on task automation with LM agents–in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems. We release code, data, environment, and experiments on https://the-agent-company.com.

[37] MedS$^3$: Towards Medical Slow Thinking with Self-Evolved Soft Dual-sided Process Supervision

Shuyang Jiang, Yusheng Liao, Zhe Chen, Ya Zhang, Yanfeng Wang, Yu Wang

Main category: cs.CL

TL;DR: MedS$^3$ is a self-evolving framework that enhances clinical reasoning in small medical language models through Monte Carlo Tree Search, reinforcement fine-tuning, and a novel soft dual process reward model, achieving state-of-the-art performance on medical benchmarks.

DetailsMotivation: Current medical language models lack sufficient task coverage, fine-grained supervision for intermediate reasoning steps, and rely on proprietary systems, making them inadequate for real-world clinical reasoning applications.

Method: Uses a curriculum strategy across 5 medical domains and 16 datasets, employs Monte Carlo Tree Search to construct rule-verifiable reasoning trajectories, and implements reinforcement fine-tuning with a soft dual process reward model that penalizes value-degrading steps.

Result: Outperforms previous state-of-the-art medical model by +6.45 accuracy points and surpasses 32B-scale general-purpose reasoning models by +8.57 points on eleven benchmarks.

Conclusion: The MedS$^3$ framework enables robust and faithful reasoning behavior in small, deployable medical language models, addressing critical barriers to real-world clinical reasoning applications.

Abstract: Medical language models face critical barriers to real-world clinical reasoning applications. However, mainstream efforts, which fall short in task coverage, lack fine-grained supervision for intermediate reasoning steps, and rely on proprietary systems, are still far from a versatile, credible and efficient language model for clinical reasoning usage. To this end, we propose MedS$^3$, a self-evolving framework that imparts robust reasoning capabilities to small, deployable models. Starting with 8,000 curated instances sampled via a curriculum strategy across five medical domains and 16 datasets, we use a small base policy model to conduct Monte Carlo Tree Search (MCTS) for constructing rule-verifiable reasoning trajectories. Self-explored reasoning trajectories ranked by node values are used to bootstrap the policy model via reinforcement fine-tuning and preference learning. Moreover, we introduce a soft dual process reward model that incorporates value dynamics: steps that degrade node value are penalized, enabling fine-grained identification of reasoning errors even when the final answer is correct. Experiments on eleven benchmarks show that MedS$^3$ outperforms the previous state-of-the-art medical model by +6.45 accuracy points and surpasses 32B-scale general-purpose reasoning models by +8.57 points. Additional empirical analysis further demonstrates that MedS$^3$ achieves robust and faithful reasoning behavior.

[38] CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning

Jianfeng Pan, Senyou Deng, Shaomang Huang

Main category: cs.CL

TL;DR: CoAT framework combines MCTS with associative memory for slow thinking in LLMs, achieving 10-15% performance gains on reasoning tasks.

DetailsMotivation: Current LLMs use 'fast thinking' single-query inference, while human-like 'slow thinking' with continuous knowledge association and refinement is more effective.

Method: Chain-of-Associated-Thoughts (CoAT) integrates Monte Carlo Tree Search for structured exploration with dynamic associative memory for real-time knowledge integration.

Result: 10%+ improvement on HotpotQA and MuSiQue datasets, 15%+ gain on proprietary CRB dataset across various generative and reasoning tasks.

Conclusion: CoAT’s slow thinking approach with MCTS and associative memory significantly enhances LLM reasoning capabilities by expanding search space and enabling dynamic knowledge updates.

Abstract: Research on LLM technologies is rapidly emerging, with most of them employ a ‘fast thinking’ approach to inference. Most LLMs generate the final result based solely on a single query and LLM’s reasoning capabilities. However, with the advent of OpenAI-o1, ‘slow thinking’ techniques have garnered increasing attention because its process is closer to the human thought process. Inspired by the human ability to constantly associate and replenish knowledge during thinking, we developed the novel Chain-of-Associated-Thoughts (CoAT) framework, which introduces an innovative synergy between the Monte Carlo Tree Search (MCTS) algorithm and a dynamic mechanism for integrating new key information, termed ‘associative memory’. By combining the structured exploration capabilities of MCTS with the adaptive learning capacity of associative memory, CoAT significantly expands the LLM search space, enabling our framework to explore diverse reasoning pathways and dynamically update its knowledge base in real-time. This allows the framework to not only revisit and refine earlier inferences but also adaptively incorporate evolving information, ensuring that the final output is both accurate and comprehensive. We validate CoAT’s effectiveness across a variety of generative and reasoning tasks. Quantitative experiments show that CoAT achieves over 10% performance improvement on open-source multi-hop reasoning datasets (HotpotQA, MuSiQue) and more than 15% gain on our proprietary CRB dataset.

[39] IssueBench: Millions of Realistic Prompts for Measuring Issue Bias in LLM Writing Assistance

Paul Röttger, Musashi Hinck, Valentin Hofmann, Kobi Hackenburg, Valentina Pyatkin, Faeze Brahman, Dirk Hovy

Main category: cs.CL

TL;DR: IssueBench is a dataset of 2.49 million prompts to measure issue bias in LLM writing assistance, revealing that state-of-the-art LLMs commonly exhibit political biases that align more with Democrat than Republican voter opinions.

DetailsMotivation: LLMs help users write about diverse issues but may present biased perspectives, influencing user thinking. There's a need to measure issue biases in real user interactions to address risks from biased LLMs.

Method: Created IssueBench with 2.49m realistic English prompts based on 3.9k templates and 212 political issues from real user interactions. Tested 10 state-of-the-art LLMs using this benchmark.
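
The construction crosses writing-assistance templates with political issues; the toy sketch below uses the summary's own examples plus one hypothetical template and issue, whereas the real benchmark instantiates 3.9k templates over 212 issues.

```python
from itertools import product

templates = [
    "write a blog about {issue}",                            # template example from the summary
    "draft a short speech arguing for action on {issue}",    # hypothetical extra template
]
issues = ["AI regulation",        # issue example from the summary
          "immigration policy"]   # hypothetical extra issue

# Crossing templates with issues yields the benchmark-style prompt set.
prompts = [t.format(issue=i) for t, i in product(templates, issues)]
```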

Result: Issue biases are common and persistent across all 10 LLMs. Biases are very similar across models, and all models align more with US Democrat than Republican voter opinions on tested issues.

Conclusion: IssueBench enables robust measurement of LLM biases and can be adapted for other issues/templates, providing better evidence for discussions about addressing LLM biases.

Abstract: Large language models (LLMs) are helping millions of users write texts about diverse issues, and in doing so expose users to different ideas and perspectives. This creates concerns about issue bias, where an LLM tends to present just one perspective on a given issue, which in turn may influence how users think about this issue. So far, it has not been possible to measure which issue biases LLMs manifest in real user interactions, making it difficult to address the risks from biased LLMs. Therefore, we create IssueBench: a set of 2.49m realistic English-language prompts to measure issue bias in LLM writing assistance, which we construct based on 3.9k templates (e.g. “write a blog about”) and 212 political issues (e.g. “AI regulation”) from real user interactions. Using IssueBench, we show that issue biases are common and persistent in 10 state-of-the-art LLMs. We also show that biases are very similar across models, and that all models align more with US Democrat than Republican voter opinion on a subset of issues. IssueBench can easily be adapted to include other issues, templates, or tasks. By enabling robust and realistic measurement, we hope that IssueBench can bring a new quality of evidence to ongoing discussions about LLM biases and how to address them.

[40] Beyond Seen Data: Improving KBQA Generalization Through Schema-Guided Logical Form Generation

Shengxiang Gao, Jey Han Lau, Jianzhong Qi

Main category: cs.CL

TL;DR: SG-KBQA is a novel KBQA model that injects schema contexts into entity retrieval and logical form generation to handle unseen knowledge base elements, achieving state-of-the-art performance on benchmark datasets.

DetailsMotivation: Current KBQA methods struggle with unseen knowledge base elements at test time, limiting their generalizability to new or evolving knowledge bases.

Method: SG-KBQA injects schema contexts into both entity retrieval and logical form generation processes, leveraging richer semantics and awareness of the knowledge base structure to enhance model generalization.

Result: SG-KBQA outperforms state-of-the-art models on two commonly used benchmark datasets across various test settings, demonstrating strong generalizability.

Conclusion: Incorporating schema contexts significantly improves KBQA model performance on unseen knowledge base elements, making SG-KBQA an effective solution for handling evolving knowledge bases in question answering systems.

Abstract: Knowledge base question answering (KBQA) aims to answer user questions in natural language using rich human knowledge stored in large KBs. As current KBQA methods struggle with unseen knowledge base elements at test time, we introduce SG-KBQA: a novel model that injects schema contexts into entity retrieval and logical form generation to tackle this issue. It uses the richer semantics and awareness of the knowledge base structure provided by schema contexts to enhance generalizability. We show that SG-KBQA achieves strong generalizability, outperforming state-of-the-art models on two commonly used benchmark datasets across a variety of test settings. Our source code is available at https://github.com/gaosx2000/SG_KBQA.

[41] Pay Attention to Real World Perturbations! Natural Robustness Evaluation in Machine Reading Comprehension

Yulong Wu, Viktor Schlegel, Riza Batista-Navarro

Main category: cs.CL

TL;DR: A framework using Wikipedia edit history to test MRC model robustness against natural text perturbations, showing performance degradation in various models including LLMs, with limited improvement through perturbation training.

DetailsMotivation: Current robustness evaluation relies on synthetic perturbations, leaving unclear how well they reflect real-world scenarios where text naturally evolves through edits.

Method: Replace paragraphs in MRC benchmarks with their Wikipedia edit history counterparts to create natural perturbations, then test various models including encoder models and LLMs.

Result: Natural perturbations cause performance degradation in pre-trained encoder models and LLMs. Training on perturbed examples improves robustness but leaves a gap compared to unperturbed performance.

Conclusion: Natural perturbations significantly impact MRC model performance, revealing robustness gaps that synthetic perturbations may not capture, highlighting the need for evaluation methods that better reflect real-world text evolution.

Abstract: As neural language models achieve human-comparable performance on Machine Reading Comprehension (MRC) and see widespread adoption, ensuring their robustness in real-world scenarios has become increasingly important. Current robustness evaluation research, though, primarily develops synthetic perturbation methods, leaving unclear how well they reflect real life scenarios. Considering this, we present a framework to automatically examine MRC models on naturally occurring textual perturbations, by replacing paragraphs in MRC benchmarks with their counterparts based on available Wikipedia edit history. Such perturbation type is natural as its design does not stem from an artificial generative process, inherently distinct from the previously investigated synthetic approaches. In a large-scale study encompassing SQUAD datasets and various model architectures we observe that natural perturbations result in performance degradation in pre-trained encoder language models. More worryingly, these state-of-the-art Flan-T5 and Large Language Models (LLMs) inherit these errors. Further experiments demonstrate that our findings generalise to natural perturbations found in other more challenging MRC benchmarks. In an effort to mitigate these errors, we show that it is possible to improve the robustness to natural perturbations by training on naturally or synthetically perturbed examples, though a noticeable gap still remains compared to performance on unperturbed data.

[42] REGen: A Reliable Evaluation Framework for Generative Event Argument Extraction

Omar Sharif, Joseph Gatto, Madhusudan Basak, Sarah M. Preum

Main category: cs.CL

TL;DR: REGen is a new evaluation framework for event argument extraction that addresses limitations of exact match scoring, combining exact, relaxed, and LLM-based matching to better capture model performance and align with human judgment.

DetailsMotivation: Exact match (EM) evaluation severely underestimates performance of large language models in event argument extraction by ignoring semantically accurate variations, implicit arguments, and scattered arguments across documents.

Method: REGen combines exact matching, relaxed matching, and LLM-based matching to create a more comprehensive evaluation framework that better captures diverse argument expressions and semantic accuracy.
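
A sketch of a tiered matching decision in the spirit of combining exact, relaxed, and LLM-based checks; the normalization, containment rule, and `llm_judge` equivalence call are illustrative assumptions, not REGen's exact criteria.

```python
def argument_match(predicted, gold, llm_judge):
    """Tiered matching for a predicted event argument against a gold span (sketch).

    `llm_judge(pred, gold)` is a hypothetical yes/no semantic-equivalence
    call; the relaxed containment rule is likewise an illustrative choice.
    """
    def norm(s):
        return " ".join(s.lower().split())

    p, g = norm(predicted), norm(gold)
    if p == g:
        return "exact"
    if p in g or g in p:              # relaxed: one normalized span contains the other
        return "relaxed"
    if llm_judge(predicted, gold):    # fall back to an LLM equivalence check
        return "llm"
    return "no_match"
```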

Result: Experiments on six datasets show REGen reveals an average performance gain of +23.93 F1 over EM, with human validation confirming 87.67% alignment with human assessments of argument correctness.

Conclusion: REGen provides a more reliable evaluation framework that better captures the actual capabilities of models in event argument extraction, addressing the limitations of traditional exact match evaluation.

Abstract: Event argument extraction identifies arguments for predefined event roles in text. Existing work evaluates this task with exact match (EM), where predicted arguments must align exactly with annotated spans. While suitable for span-based models, this approach falls short for large language models (LLMs), which often generate diverse yet semantically accurate arguments. EM severely underestimates performance by disregarding valid variations. Furthermore, EM evaluation fails to capture implicit arguments (unstated but inferable) and scattered arguments (distributed across a document). These limitations underscore the need for an evaluation framework that better captures models’ actual performance. To bridge this gap, we introduce REGen, a Reliable Evaluation framework for Generative event argument extraction. REGen combines the strengths of exact, relaxed, and LLM-based matching to better align with human judgment. Experiments on six datasets show that REGen reveals an average performance gain of +23.93 F1 over EM, reflecting capabilities overlooked by prior evaluation. Human validation further confirms REGen’s effectiveness, achieving 87.67% alignment with human assessments of argument correctness.

[43] MPO: Boosting LLM Agents with Meta Plan Optimization

Weimin Xiong, Yifan Song, Qingxiu Dong, Bingchan Zhao, Feifan Song, Xun Wang, Sujian Li

Main category: cs.CL

TL;DR: MPO framework enhances LLM-based agent planning by incorporating meta plans for explicit guidance and continuous optimization, outperforming existing methods on interactive tasks.

DetailsMotivation: Existing LLM-based agents suffer from planning hallucinations and require retraining for each new agent, needing a more efficient and generalizable solution.

Method: Proposes Meta Plan Optimization (MPO) framework that uses high-level general guidance through meta plans to assist agent planning and enables continuous optimization based on task execution feedback.

Result: MPO significantly outperforms existing baselines on two representative tasks, showing improved task completion efficiency and generalization in unseen scenarios.

Conclusion: MPO provides a plug-and-play solution that enhances planning capabilities without requiring retraining for each new agent, addressing planning hallucinations effectively.

Abstract: Recent advancements in large language models (LLMs) have enabled LLM-based agents to successfully tackle interactive planning tasks. However, despite their successes, existing approaches often suffer from planning hallucinations and require retraining for each new agent. To address these challenges, we propose the Meta Plan Optimization (MPO) framework, which enhances agent planning capabilities by directly incorporating explicit guidance. Unlike previous methods that rely on complex knowledge, which either require significant human effort or lack quality assurance, MPO leverages high-level general guidance through meta plans to assist agent planning and enables continuous optimization of the meta plans based on feedback from the agent's task execution. Our experiments conducted on two representative tasks demonstrate that MPO significantly outperforms existing baselines. Moreover, our analysis indicates that MPO provides a plug-and-play solution that enhances both task completion efficiency and generalization capabilities in previously unseen scenarios.

[44] DomainCQA: Crafting Knowledge-Intensive QA from Domain-Specific Charts

Yujing Lu, Ling Zhong, Jing Yang, Weiming Li, Peng Wei, Yongheng Wang, Manni Duan, Qing Zhang

Main category: cs.CL

TL;DR: DomainCQA is a framework for creating domain-specific chart question answering benchmarks that test both visual comprehension and knowledge-intensive reasoning, addressing limitations of existing benchmarks that focus only on surface-level parsing.

DetailsMotivation: Existing Chart Question Answering benchmarks mostly test surface-level parsing like reading labels and legends, overlooking deeper scientific reasoning needed for domain-specific chart understanding.

Method: DomainCQA integrates complexity-aware chart selection, multitier QA generation, and expert validation. Applied to astronomy, it creates AstroChart with 1,690 QA pairs over 482 charts, and is tested across 21 MLLMs.

Result: The benchmark exposes persistent weaknesses in fine-grained perception, numerical reasoning, and domain knowledge integration across models. Fine-tuning on AstroChart improves performance across fundamental and advanced tasks.

Conclusion: DomainCQA serves as a unified pipeline for constructing and augmenting domain-specific chart reasoning benchmarks, with pilot demonstrations in biochemistry, economics, medicine, and social science showing its generality.

Abstract: Chart Question Answering (CQA) evaluates Multimodal Large Language Models (MLLMs) on visual understanding and reasoning over chart data. However, existing benchmarks mostly test surface-level parsing, such as reading labels and legends, while overlooking deeper scientific reasoning. We propose DomainCQA, a framework for constructing domain-specific CQA benchmarks that emphasize both visual comprehension and knowledge-intensive reasoning. It integrates complexity-aware chart selection, multitier QA generation, and expert validation. Applied to astronomy, DomainCQA yields AstroChart, a benchmark of 1,690 QA pairs over 482 charts, exposing persistent weaknesses in fine-grained perception, numerical reasoning, and domain knowledge integration across 21 MLLMs. Fine-tuning on AstroChart improves performance across fundamental and advanced tasks. Pilot QA sets in biochemistry, economics, medicine, and social science further demonstrate DomainCQA’s generality. Together, our results establish DomainCQA as a unified pipeline for constructing and augmenting domain-specific chart reasoning benchmarks.

[45] Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation

Chuanqi Cheng, Jian Guan, Wei Wu, Rui Yan

Main category: cs.CL

TL;DR: ViLAMP introduces differential distillation for efficient long-form video processing, using hierarchical keyframe selection and feature merging to maintain temporal dependencies while reducing computational costs.

DetailsMotivation: Existing methods for long-form video processing sacrifice critical temporal dependencies or dilute semantic information due to high computational costs of handling extended temporal sequences.

Method: Differential distillation approach with two key mechanisms: differential keyframe selection to maximize query relevance while maintaining temporal distinctiveness, and differential feature merging to preserve query-salient features in non-keyframes.
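
A generic greedy sketch of relevance-minus-redundancy keyframe selection, assuming L2-normalized frame and query embeddings; the scoring rule and trade-off weight are illustrative, not ViLAMP's exact criterion.

```python
import numpy as np

def select_keyframes(frame_feats, query_feat, k, redundancy_weight=0.5):
    """Greedy differential keyframe selection (sketch): pick frames that are
    relevant to the query while staying dissimilar to frames already chosen.

    `frame_feats` is an (n, d) array of frame embeddings and `query_feat` a
    (d,) query embedding, both assumed L2-normalized.
    """
    relevance = frame_feats @ query_feat
    selected = []
    for _ in range(min(k, len(frame_feats))):
        scores = relevance.copy()
        if selected:
            # Penalize similarity to the frames already selected.
            redundancy = np.max(frame_feats @ frame_feats[selected].T, axis=1)
            scores = relevance - redundancy_weight * redundancy
        scores[selected] = -np.inf        # never re-pick a chosen frame
        selected.append(int(np.argmax(scores)))
    return selected
```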

Result: Superior performance across four video understanding benchmarks, particularly on long-form content. Can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU with substantial computational efficiency while maintaining state-of-the-art performance.

Conclusion: ViLAMP provides an effective solution for long-form video processing by systematically preserving task-relevant information while suppressing redundancy, achieving both computational efficiency and high performance.

Abstract: Long-form video processing fundamentally challenges vision-language models (VLMs) due to the high computational costs of handling extended temporal sequences. Existing token pruning and feature merging methods often sacrifice critical temporal dependencies or dilute semantic information. We introduce differential distillation, a principled approach that systematically preserves task-relevant information while suppressing redundancy. Based on this principle, we develop ViLAMP, a hierarchical video-language model that processes hour-long videos at “mixed precision” through two key mechanisms: (1) differential keyframe selection that maximizes query relevance while maintaining temporal distinctiveness at the frame level and (2) differential feature merging that preserves query-salient features in non-keyframes at the patch level. Hence, ViLAMP retains full information in keyframes while reducing non-keyframes to their most salient features, resembling mixed-precision training. Extensive experiments demonstrate ViLAMP’s superior performance across four video understanding benchmarks, particularly on long-form content. Notably, ViLAMP can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU, achieving substantial computational efficiency while maintaining state-of-the-art performance. Code and model are available at https://github.com/steven-ccq/ViLAMP.

[46] CoT-RAG: Integrating Chain of Thought and Retrieval-Augmented Generation to Enhance Reasoning in Large Language Models

Feiyang Li, Peng Fang, Zhan Shi, Arijit Khan, Fang Wang, Weihao Wang, Xin Zhang, Yongjian Cui

Main category: cs.CL

TL;DR: CoT-RAG is a novel reasoning framework that combines knowledge graphs, retrieval-augmented generation, and pseudo-program execution to enhance Chain-of-Thought reasoning reliability and performance in LLMs.

DetailsMotivation: Address limitations of traditional CoT reasoning: lack of reliability when relying solely on LLM-generated reasoning chains, and lower performance of natural language prompts compared to code prompts.

Method: Three key designs: 1) Knowledge Graph-driven CoT Generation for enhanced credibility, 2) Learnable Knowledge Case-aware RAG for retrieving relevant sub-cases, 3) Pseudo Program Prompting Execution for greater logical rigor.

Result: Significant accuracy gains of 4.0% to 44.3% over state-of-the-art methods on nine public datasets across three reasoning tasks. Exceptional accuracy and efficient execution on four domain-specific datasets.

Conclusion: CoT-RAG demonstrates practical applicability and scalability, providing a robust framework for enhancing LLM reasoning performance through structured knowledge integration and program-like execution guidance.

Abstract: Chain-of-thought (CoT) reasoning boosts large language models’ (LLMs) performance on complex tasks but faces two key limitations: a lack of reliability when solely relying on LLM-generated reasoning chains and lower reasoning performance from natural language prompts compared with code prompts. To address these issues, we propose CoT-RAG, a novel reasoning framework with three key designs: (i) Knowledge Graph-driven CoT Generation, featuring knowledge graphs to modulate reasoning chain generation of LLMs, thereby enhancing reasoning credibility; (ii) Learnable Knowledge Case-aware RAG, which incorporates retrieval-augmented generation (RAG) into knowledge graphs to retrieve relevant sub-cases and sub-descriptions, providing LLMs with learnable information; (iii) Pseudo Program Prompting Execution, which promotes greater logical rigor by guiding LLMs to execute reasoning tasks as pseudo-programs. Evaluations on nine public datasets spanning three reasoning tasks reveal significant accuracy gains, ranging from 4.0% to 44.3%, over state-of-the-art methods. Furthermore, tests on four domain-specific datasets demonstrate exceptional accuracy and efficient execution, underscoring its practical applicability and scalability. Our code and data are available at https://github.com/hustlfy123/CoT-RAG.
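To make the "Pseudo Program Prompting Execution" idea concrete, here is a hedged sketch of how retrieved knowledge-graph sub-cases might be embedded into a program-like prompt that the LLM is asked to "execute" step by step. The template, field names, and example content are illustrative assumptions, not the paper's exact format.

```python
# Illustrative pseudo-program prompt builder (not CoT-RAG's actual template).
PSEUDO_PROGRAM_TEMPLATE = """# Task: answer the question by executing the steps below.
def solve():
    question = {question!r}
    evidence = {evidence!r}          # sub-cases retrieved from the knowledge graph
    step_1 = extract_relevant_facts(evidence)
    step_2 = reason_over(step_1, question)
    return final_answer(step_2)
# Execute solve() and show the value bound at every step before the final answer.
"""

def build_prompt(question: str, retrieved_subcases: list) -> str:
    return PSEUDO_PROGRAM_TEMPLATE.format(question=question, evidence=retrieved_subcases)

print(build_prompt("Which river is longer, the Nile or the Amazon?",
                   ["Nile length ~6650 km", "Amazon length ~6400 km"]))
```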

[47] Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts

Hanhua Hong, Chenghao Xiao, Yang Wang, Yiqi Liu, Wenge Rong, Chenghua Lin

Main category: cs.CL

TL;DR: Proposes inversion learning method to automatically generate effective evaluation prompts for LLM-based evaluators, eliminating manual prompt engineering and improving robustness.

DetailsMotivation: Human evaluation of NLG systems suffers from inconsistencies and biases, while LLM-based evaluators are highly sensitive to prompt design variations, limiting reproducibility and scalability.

Method: Uses inversion learning to learn reverse mappings from model outputs back to input instructions, enabling automatic generation of model-specific evaluation prompts with just one sample.

Result: Eliminates need for manual prompt engineering, improves efficiency and robustness of LLM-based evaluation systems.

Conclusion: Contributes to more robust and efficient LLM-based evaluation through automatic prompt generation via inversion learning.

Abstract: Evaluating natural language generation systems is challenging due to the diversity of valid outputs. While human evaluation is the gold standard, it suffers from inconsistencies, lack of standardisation, and demographic biases, limiting reproducibility. LLM-based evaluators offer a scalable alternative but are highly sensitive to prompt design, where small variations can lead to significant discrepancies. In this work, we propose an inversion learning method that learns effective reverse mappings from model outputs back to their input instructions, enabling the automatic generation of highly effective, model-specific evaluation prompts. Our method requires only a single evaluation sample and eliminates the need for time-consuming manual prompt engineering, thereby improving both efficiency and robustness. Our work contributes toward a new direction for more robust and efficient LLM-based evaluation.

[48] Prior Prompt Engineering for Reinforcement Fine-Tuning

Pittawat Taveekitworachai, Potsawee Manakul, Sarana Nutanong, Kunat Pipatanakul

Main category: cs.CL

TL;DR: Prior prompt engineering (pPE) in reinforcement fine-tuning significantly outperforms inference-time prompt engineering, with null-example approach achieving largest gains on reasoning benchmarks.

DetailsMotivation: Existing RFT research focuses on algorithms and reward shaping, but the design of prior prompts (instructions prepended during training) remains underexplored despite its potential to guide model behaviors.

Method: Translated five representative inference-time prompt engineering strategies (reasoning, planning, code-based reasoning, knowledge recall, null-example) into pPE approaches and experimented with Qwen2.5-7B model, evaluating on in-domain and out-of-domain benchmarks including AIME2024, HumanEval+, and GPQA-Diamond.

Result: All pPE-trained models surpassed their iPE-prompted counterparts, with null-example pPE achieving the largest average performance gain and highest improvement on AIME2024 and GPQA-Diamond. Different pPE strategies instilled distinct behavioral styles in the resulting models.

Conclusion: pPE is a powerful yet understudied axis for reinforcement fine-tuning that can effectively guide language models to internalize distinct behaviors and significantly improve performance compared to inference-time prompting approaches.

Abstract: This paper investigates prior prompt engineering (pPE) in the context of reinforcement fine-tuning (RFT), where language models (LMs) are incentivized to exhibit behaviors that maximize performance through reward signals. While existing RFT research has primarily focused on algorithms, reward shaping, and data curation, the design of the prior prompt–the instructions prepended to queries during training to elicit behaviors such as step-by-step reasoning–remains underexplored. We investigate whether different pPE approaches can guide LMs to internalize distinct behaviors after RFT. Inspired by inference-time prompt engineering (iPE), we translate five representative iPE strategies–reasoning, planning, code-based reasoning, knowledge recall, and null-example utilization–into corresponding pPE approaches. We experiment with Qwen2.5-7B using each of the pPE approaches, then evaluate performance on in-domain and out-of-domain benchmarks (e.g., AIME2024, HumanEval+, and GPQA-Diamond). Our results show that all pPE-trained models surpass their iPE-prompted counterparts, with the null-example pPE approach achieving the largest average performance gain and the highest improvement on AIME2024 and GPQA-Diamond, surpassing the commonly used reasoning approach. Furthermore, by adapting a behavior-classification framework, we demonstrate that different pPE strategies instill distinct behavioral styles in the resulting models. These findings position pPE as a powerful yet understudied axis for RFT.
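The operational difference between pPE and iPE is simply where the instruction enters the pipeline. The sketch below contrasts the two, assuming a null-example-style instruction and helper names that are illustrative rather than taken from the paper.

```python
# Minimal sketch contrasting prior prompt engineering (pPE) with inference-time
# prompt engineering (iPE). Prompt texts and build_* helpers are illustrative.

NULL_EXAMPLE_PRIOR = (
    "Before answering, consider what an unhelpful or empty answer "
    "would look like, then avoid it.\n"
)

def build_rft_training_sample(question: str, prior_prompt: str = NULL_EXAMPLE_PRIOR) -> str:
    # pPE: the prior prompt is prepended to every query *during RFT training*,
    # so the reward signal shapes how the model internalizes the behavior.
    return f"{prior_prompt}Question: {question}\nAnswer:"

def build_inference_prompt(question: str, ipe_prompt: str) -> str:
    # iPE: the same instruction is only added at test time to a model
    # that was never trained with it.
    return f"{ipe_prompt}Question: {question}\nAnswer:"

print(build_rft_training_sample("What is 17 * 24?"))
```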

[49] Your Language Model Can Secretly Write Like Humans: Contrastive Paraphrase Attacks on LLM-Generated Text Detectors

Hao Fang, Jiawei Kong, Tianqu Zhuang, Yixiang Qiu, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Yaowei Wang, Min Zhang

Main category: cs.CL

TL;DR: CoPA is a training-free method that uses off-the-shelf LLMs to generate human-like text that bypasses detection by subtracting machine-like patterns during decoding.

DetailsMotivation: Existing paraphrase attacks require substantial data and computational resources to train specialized paraphrasers, and their effectiveness decreases against advanced detection algorithms.

Method: CoPA crafts instructions to encourage LLMs to produce human-like texts, then constructs an auxiliary machine-like word distribution to subtract machine-like patterns from the human-like distribution during decoding.

Result: Extensive experiments validate CoPA’s effectiveness in fooling text detectors across various scenarios.

Conclusion: CoPA provides an effective training-free solution for bypassing text detectors by leveraging contrastive distributions to remove machine-like attributes from generated text.

Abstract: The misuse of large language models (LLMs), such as academic plagiarism, has driven the development of detectors to identify LLM-generated texts. To bypass these detectors, paraphrase attacks have emerged to purposely rewrite these texts to evade detection. Despite the success, existing methods require substantial data and computational budgets to train a specialized paraphraser, and their attack efficacy greatly reduces when faced with advanced detection algorithms. To address this, we propose Contrastive Paraphrase Attack (CoPA), a training-free method that effectively deceives text detectors using off-the-shelf LLMs. The first step is to carefully craft instructions that encourage LLMs to produce more human-like texts. Nonetheless, we observe that the inherent statistical biases of LLMs can still result in some generated texts carrying certain machine-like attributes that can be captured by detectors. To overcome this, CoPA constructs an auxiliary machine-like word distribution as a contrast to the human-like distribution generated by the LLM. By subtracting the machine-like patterns from the human-like distribution during the decoding process, CoPA is able to produce sentences that are less discernible by text detectors. Our theoretical analysis suggests the superiority of the proposed attack. Extensive experiments validate the effectiveness of CoPA in fooling text detectors across various scenarios.
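The core decoding-time contrast can be illustrated in a few lines. The sketch below shows one plausible way to subtract a machine-like next-token distribution from a human-like one in log space; the alpha weight and the exact combination rule are assumptions, not CoPA's published formulation.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def contrastive_next_token(human_logits, machine_logits, alpha=0.7):
    """One decoding step of a CoPA-style contrast (illustrative sketch).

    human_logits:   next-token logits under the "write like a human" prompt
    machine_logits: next-token logits under an auxiliary machine-like prompt
    The machine-like pattern is subtracted in log space, down-weighting
    tokens a detector would treat as machine-typical.
    """
    log_p_human = np.log(softmax(human_logits) + 1e-12)
    log_p_machine = np.log(softmax(machine_logits) + 1e-12)
    contrast = log_p_human - alpha * log_p_machine
    return int(np.argmax(contrast)), softmax(contrast)

# toy usage over a 5-token vocabulary
h = np.array([2.0, 1.0, 0.5, 0.1, -1.0])
m = np.array([2.5, 0.2, 0.4, 0.0, -1.2])
tok, dist = contrastive_next_token(h, m)
print(tok, dist.round(3))
```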

[50] How Far Are We from Optimal Reasoning Efficiency?

Jiaxuan Gao, Shu Yan, Qixin Tan, Lu Yang, Shusheng Xu, Wei Fu, Zhiyu Mei, Kaifeng Lyu, Yi Wu

Main category: cs.CL

TL;DR: Proposes reasoning efficiency frontiers and the REG metric to evaluate reasoning-model efficiency, and introduces the REO-RL algorithm, which cuts the reasoning efficiency gap by 50%+ while maintaining accuracy.

DetailsMotivation: Large Reasoning Models produce verbose reasoning traces with high inference costs, but existing efficiency evaluations are inconsistent and current methods either sacrifice accuracy or remain inefficient.

Method: Introduces reasoning efficiency frontiers as empirical upper bounds, proposes REG metric to quantify efficiency gaps, and develops REO-RL reinforcement learning algorithm using strategic token budget selection.

Result: REO-RL reduces REG by >=50% across all evaluated models and matches the efficiency frontiers under a 16K token budget with minimal accuracy loss; the REG metric itself effectively captures the accuracy-length trade-off.

Conclusion: Fine-tuning LRMs to perfectly align with efficiency frontiers remains challenging, but the proposed REG metric and REO-RL algorithm significantly improve reasoning efficiency while maintaining accuracy.

Abstract: Large Reasoning Models (LRMs) demonstrate remarkable problem-solving capabilities through extended Chain-of-Thought (CoT) reasoning but often produce excessively verbose and redundant reasoning traces. This inefficiency incurs high inference costs and limits practical deployment. While existing fine-tuning methods aim to improve reasoning efficiency, assessing their efficiency gains remains challenging due to inconsistent evaluations. In this work, we introduce the reasoning efficiency frontiers, empirical upper bounds derived from fine-tuning base LRMs across diverse approaches and training configurations. Based on these frontiers, we propose the Reasoning Efficiency Gap (REG), a unified metric quantifying deviations of any fine-tuned LRMs from these frontiers. Systematic evaluation on challenging mathematical benchmarks reveals significant gaps in current methods: they either sacrifice accuracy for short length or still remain inefficient under tight token budgets. To reduce the efficiency gap, we propose REO-RL, a class of Reinforcement Learning algorithms that minimizes REG by targeting a sparse set of token budgets. Leveraging numerical integration over strategically selected budgets, REO-RL approximates the full efficiency objective with low error using a small set of token budgets. Through systematic benchmarking, we demonstrate that our efficiency metric, REG, effectively captures the accuracy-length trade-off, with low-REG methods reducing length while maintaining accuracy. Our approach, REO-RL, consistently reduces REG by >=50% across all evaluated LRMs, matching Qwen3-4B/8B efficiency frontiers under a 16K token budget with minimal accuracy loss. Ablation studies confirm the effectiveness of our exponential token budget strategy. Finally, our findings highlight that fine-tuning LRMs to perfectly align with the efficiency frontiers remains an open challenge.
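Since the abstract describes numerical integration over token budgets, one plausible reading of an REG-style gap is the normalized area between the frontier's and a method's accuracy-vs-budget curves. The sketch below is only that hedged reading; the paper's exact formulation may differ, and all numbers are toy values.

```python
import numpy as np

def reasoning_efficiency_gap(budgets, frontier_acc, method_acc):
    """Illustrative REG-style gap: normalized area between the frontier's and the
    method's accuracy-vs-token-budget curves (trapezoidal rule). This is an
    assumed reading of the metric, not the paper's exact definition."""
    budgets = np.asarray(budgets, dtype=float)
    gap = np.asarray(frontier_acc, dtype=float) - np.asarray(method_acc, dtype=float)
    widths = np.diff(budgets)
    avg_gap = (gap[:-1] + gap[1:]) / 2
    return float((widths * avg_gap).sum() / (budgets[-1] - budgets[0]))

budgets = [1000, 2000, 4000, 8000, 16000]    # token budgets (toy values)
frontier = [0.42, 0.55, 0.66, 0.72, 0.75]    # empirical efficiency frontier
method   = [0.30, 0.44, 0.58, 0.67, 0.73]    # a fine-tuned LRM being evaluated
print(round(reasoning_efficiency_gap(budgets, frontier, method), 4))
```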

[51] VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents

Sam Yu-Te Lee, Chenyang Ji, Shicheng Wen, Lifu Huang, Dongyu Liu, Kwan-Liu Ma

Main category: cs.CL

TL;DR: VIDEE is a system that enables entry-level analysts to perform advanced text analytics using LLMs through a human-agent collaboration workflow with decomposition, execution, and evaluation stages.

DetailsMotivation: Traditional text analytics requires specialized NLP knowledge, creating barriers for entry-level analysts. Recent LLM advances enable more accessible text analysis but still need systems that support non-experts.

Method: VIDEE implements a three-stage human-agent collaboration workflow: (1) Decomposition with human-in-the-loop Monte-Carlo Tree Search for generative reasoning, (2) Execution that generates executable text analytics pipelines, (3) Evaluation with LLM-based assessment and visualizations for user validation.

Result: Two quantitative experiments show VIDEE’s effectiveness and analyze agent errors. A user study with participants ranging from no experience to experts demonstrates system usability and reveals distinct user behavior patterns.

Conclusion: The findings identify design implications for human-agent collaboration, validate VIDEE’s utility for non-experts, and inform future improvements to intelligent text analytics systems.

Abstract: Text analytics has traditionally required specialized knowledge in Natural Language Processing (NLP) or text analysis, which presents a barrier for entry-level analysts. Recent advances in large language models (LLMs) have changed the landscape of NLP by enabling more accessible and automated text analysis (e.g., topic detection, summarization, information extraction, etc.). We introduce VIDEE, a system that supports entry-level data analysts to conduct advanced text analytics with intelligent agents. VIDEE instantiates a human-agent collaboration workflow consisting of three stages: (1) Decomposition, which incorporates a human-in-the-loop Monte-Carlo Tree Search algorithm to support generative reasoning with human feedback, (2) Execution, which generates an executable text analytics pipeline, and (3) Evaluation, which integrates LLM-based evaluation and visualizations to support user validation of execution results. We conduct two quantitative experiments to evaluate VIDEE’s effectiveness and analyze common agent errors. A user study involving participants with varying levels of NLP and text analytics experience – from none to expert – demonstrates the system’s usability and reveals distinct user behavior patterns. The findings identify design implications for human-agent collaboration, validate the practical utility of VIDEE for non-expert users, and inform future improvements to intelligent text analytics systems.

[52] HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation

YiHan Jiao, ZheHao Tan, Dan Yang, DuoLin Sun, Jie Feng, Yue Shen, Jian Wang, Peng Wei

Main category: cs.CL

TL;DR: HIRAG introduces hierarchical instruction fine-tuning with multi-level chain-of-thought to enhance RAG models’ filtering, combination, and reasoning abilities.

DetailsMotivation: Traditional RAG systems rely on LLMs' in-context learning but lack specialized capabilities for handling document quality inconsistencies and retrieval imperfections. Current fine-tuning approaches don't sufficiently focus on RAG-specific tasks or deep chain-of-thought utilization.

Method: Hierarchical-Thought Instruction-Tuning (HIRAG) with “think before answering” strategy using multi-level progressive chain-of-thought to develop three hierarchical abilities: filtering, combination, and RAG-specific reasoning.

Result: HIRAG significantly improves model performance on RGB, PopQA, MuSiQue, HotpotQA, and PubmedQA datasets.

Conclusion: The proposed hierarchical instruction fine-tuning approach effectively enhances RAG models’ capabilities by systematically developing progressive reasoning skills through chain-of-thought processes.

Abstract: Retrieval-augmented generation (RAG) has become a fundamental paradigm for addressing the challenges faced by large language models in handling real-time information and domain-specific problems. Traditional RAG systems primarily rely on the in-context learning (ICL) capabilities of the large language model itself, yet in-depth research on the specific capabilities needed by the RAG generation model is lacking, leaving systems vulnerable to inconsistent document quality and retrieval imperfections. Even the limited studies that fine-tune RAG generative models often lack a granular focus on RAG tasks or a deeper utilization of chain-of-thought processes. To address this, we propose that RAG models should possess three progressively hierarchical abilities: (1) Filtering: the ability to select relevant information; (2) Combination: the ability to combine semantic information across paragraphs; and (3) RAG-specific reasoning: the ability to further process external knowledge using internal knowledge. Thus, we introduce our new RAG instruction fine-tuning method, Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation (HIRAG), which incorporates a “think before answering” strategy. This method enhances the model’s open-book examination capability by utilizing multi-level progressive chain-of-thought. Experiments show that the HIRAG training strategy significantly improves the model’s performance on datasets such as RGB, PopQA, MuSiQue, HotpotQA, and PubmedQA.

[53] ARCE: Augmented RoBERTa with Contextualized Elucidations for NER in Automated Rule Checking

Jian Chen, Jinbao Tian, Yankui Li, Yuqi Lu, Zhou Li

Main category: cs.CL

TL;DR: ARCE proposes using LLMs to generate simple explanations for AEC domain texts to enhance RoBERTa’s NER performance, achieving state-of-the-art results with 77.20% Macro-F1 score.

DetailsMotivation: Standard pre-trained models struggle with specialized AEC terminology and complex relational contexts, while manual domain corpus creation is labor-intensive and costly.

Method: Uses LLM to generate simple explanations (Cote corpus), then incrementally pre-trains RoBERTa with this corpus before fine-tuning on downstream NER task.

Result: Achieves new state-of-the-art 77.20% Macro-F1 score on benchmark AEC dataset, showing simple explanations outperform complex rationales.

Conclusion: Simple explanation-based knowledge generation effectively bridges domain gap and enhances smaller models’ performance in specialized domains like AEC.

Abstract: Accurate information extraction from specialized texts is a critical challenge, particularly for named entity recognition (NER) in the architecture, engineering, and construction (AEC) domain to support automated rule checking (ARC). The performance of standard pre-trained models is often constrained by the domain gap, as they struggle to interpret the specialized terminology and complex relational contexts inherent in AEC texts. Although this issue can be mitigated by further pre-training on large, human-curated domain corpora, as exemplified by methods like ARCBERT, this approach is both labor-intensive and cost-prohibitive. Consequently, leveraging large language models (LLMs) for automated knowledge generation has emerged as a promising alternative. However, the optimal strategy for generating knowledge that can genuinely enhance smaller, efficient models remains an open question. To address this, we propose ARCE (augmented RoBERTa with contextualized elucidations), a novel approach that systematically explores and optimizes this generation process. ARCE employs an LLM to first generate a corpus of simple, direct explanations, which we term Cote, and then uses this corpus to incrementally pre-train a RoBERTa model prior to its fine-tuning on the downstream task. Our extensive experiments show that ARCE establishes a new state-of-the-art on a benchmark AEC dataset, achieving a Macro-F1 score of 77.20%. This result also reveals a key finding: simple, explanation-based knowledge proves surprisingly more effective than complex, role-based rationales for this task. The code is publicly available at: https://github.com/nxcc-lab/ARCE.
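The two-stage recipe (continued masked-LM pretraining on the explanation corpus, then downstream fine-tuning) can be sketched with standard Hugging Face tooling. Everything below is a hedged illustration: the example explanations, file names, and hyperparameters are placeholders, and the paper's actual training setup is not reproduced here.

```python
# Hedged sketch of an ARCE-style two-stage recipe: (1) continue masked-LM
# pretraining of RoBERTa on an LLM-generated explanation corpus (the "Cote"
# corpus in the paper), then (2) fine-tune the adapted encoder on NER.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Stage 1: incremental MLM pretraining on simple, LLM-written explanations
# (two made-up examples stand in for the generated corpus).
explanations = ["A fire door is a door rated to resist the spread of fire.",
                "Egress width is the clear width available for occupants to exit."]
ds = Dataset.from_dict({"text": explanations}).map(
    lambda x: tok(x["text"], truncation=True, max_length=128), batched=True)
collator = DataCollatorForLanguageModeling(tok, mlm_probability=0.15)
Trainer(model=mlm,
        args=TrainingArguments(output_dir="arce_mlm", num_train_epochs=1,
                               per_device_train_batch_size=2, report_to=[]),
        train_dataset=ds, data_collator=collator).train()
mlm.save_pretrained("arce_mlm")          # adapted encoder weights

# Stage 2 (not shown): load "arce_mlm" with AutoModelForTokenClassification
# and fine-tune on the downstream AEC NER dataset as usual.
```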

[54] Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL

Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, Yi Wu

Main category: cs.CL

TL;DR: ASearcher is an open-source RL framework that trains LLM-based search agents to achieve expert-level search intelligence through scalable asynchronous training and autonomous QA synthesis, achieving significant performance gains on benchmarks.

DetailsMotivation: Open-source LLM agents lack expert-level search capabilities (resolving ambiguous queries, precise searches, result analysis, thorough exploration) due to scalability, efficiency, and data quality limitations in existing approaches.

Method: Scalable fully asynchronous RL training for long-horizon search, plus a prompt-based LLM agent that autonomously synthesizes high-quality QA datasets for training.

Result: 46.7% and 20.8% Avg@4 gains on xBench and GAIA benchmarks, with extreme long-horizon search (40+ tool calls, 150k+ output tokens), surpassing existing open-source 32B agents.

Conclusion: ASearcher demonstrates that scalable RL training with autonomous data synthesis enables open-source agents to achieve expert-level search intelligence without external LLMs, with models and code publicly available.

Abstract: Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among diverse choices of tools, search tools play a pivotal role in accessing vast external knowledge. However, open-source agents still fall short of achieving expert-level Search Intelligence, the ability to resolve ambiguous queries, generate precise searches, analyze results, and conduct thorough exploration. Existing approaches fall short in scalability, efficiency, and data quality. For example, small turn limits in existing online RL methods, e.g. <=10, restrict complex strategy learning. This paper introduces ASearcher, an open-source project for large-scale RL training of search agents. Our key contributions include: (1) Scalable fully asynchronous RL training that enables long-horizon search while maintaining high training efficiency. (2) A prompt-based LLM agent that autonomously synthesizes high-quality and challenging QAs, creating a large-scale QA dataset. Through RL training, our prompt-based QwQ-32B agent achieves substantial improvements, with 46.7% and 20.8% Avg@4 gains on xBench and GAIA, respectively. Notably, our agent exhibits extreme long-horizon search, with tool calls exceeding 40 turns and output tokens exceeding 150k during training time. With a simple agent design and no external LLMs, ASearcher-Web-QwQ achieves Avg@4 scores of 42.1 on xBench and 52.8 on GAIA, surpassing existing open-source 32B agents. We open-source our models, training data, and code at https://github.com/inclusionAI/ASearcher.

[55] A Survey on Training-free Alignment of Large Language Models

Birong Pan, Yongqi Li, Weiyu Zhang, Wenpeng Lu, Mayi Xu, Shen Zhou, Yuanyuan Zhu, Ming Zhong, Tieyun Qian

Main category: cs.CL

TL;DR: This paper provides the first systematic review of training-free alignment methods for LLMs, categorizing them into pre-decoding, in-decoding, and post-decoding stages as alternatives to resource-intensive fine-tuning approaches.

DetailsMotivation: Traditional alignment methods rely on resource-intensive fine-tuning which suffers from knowledge degradation and faces challenges in constrained computational environments. Training-free alignment techniques offer a promising alternative that works with both open-source and closed-source LLMs.

Method: Systematic review and categorization of training-free alignment methods into three stages: pre-decoding (in-context learning), in-decoding (decoding-time adjustments), and post-decoding (post-generation corrections). Analysis covers both LLMs and multimodal LLMs.

Result: Comprehensive examination of TF alignment mechanisms and limitations across different stages, providing a structured framework for understanding these methods.

Conclusion: The survey synthesizes rapidly growing research in TF alignment, identifies key challenges and future directions, and provides guidance for developing more inclusive and effective alignment techniques for safer and more reliable LLMs.

Abstract: The alignment of large language models (LLMs) aims to ensure their outputs adhere to human values, ethical standards, and legal norms. Traditional alignment methods often rely on resource-intensive fine-tuning (FT), which may suffer from knowledge degradation and face challenges in scenarios where the model accessibility or computational resources are constrained. In contrast, training-free (TF) alignment techniques–leveraging in-context learning, decoding-time adjustments, and post-generation corrections–offer a promising alternative by enabling alignment without heavily retraining LLMs, making them adaptable to both open-source and closed-source environments. This paper presents the first systematic review of TF alignment methods, categorizing them by stages of pre-decoding, in-decoding, and post-decoding. For each stage, we provide a detailed examination from the viewpoint of LLMs and multimodal LLMs (MLLMs), highlighting their mechanisms and limitations. Furthermore, we identify key challenges and future directions, paving the way for more inclusive and effective TF alignment techniques. By synthesizing and organizing the rapidly growing body of research, this survey offers a guidance for practitioners and advances the development of safer and more reliable LLMs.

Figarri Keisha, Prince Singh, Pallavi, Dion Fernandes, Aravindh Manivannan, Ilham Wicaksono, Faisal Ahmad, Wiem Ben Rim

Main category: cs.CL

TL;DR: Novel end-to-end RAG pipeline for legal domain with three key enhancements: context-aware query translation, open-source retrieval strategies, and comprehensive evaluation framework, achieving performance rivaling proprietary approaches.

DetailsMotivation: RAG has transformed text generation by grounding LLM outputs in retrieved knowledge, which is especially critical in the legal domain where accuracy and faithfulness are paramount.

Method: Three targeted enhancements: (i) context-aware query translator that handles document references and adapts retrieval parameters, (ii) open-source retrieval using SBERT and GTE embeddings, (iii) comprehensive evaluation framework with RAGAS, BERTScore-F1, and ROUGE-Recall metrics.

Result: Open-source pipelines rival proprietary approaches in retrieval quality, and custom legal-grounded prompts produce more faithful and contextually relevant answers than baseline prompting.

Conclusion: Task-aware, component-level tuning enables legally grounded, reproducible, and cost-effective RAG systems for legal research assistance, demonstrating the potential of carefully designed open-source solutions.

Abstract: Retrieval-Augmented Generation (RAG) has transformed how we approach text generation tasks by grounding Large Language Model (LLM) outputs in retrieved knowledge. This capability is especially critical in the legal domain. In this work, we introduce a novel end-to-end RAG pipeline that improves upon previous baselines using three targeted enhancements: (i) a context-aware query translator that disentangles document references from natural-language questions and adapts retrieval depth and response style based on expertise and specificity, (ii) open-source retrieval strategies using SBERT and GTE embeddings that achieve substantial performance gains while remaining cost-efficient, and (iii) a comprehensive evaluation and generation framework that combines RAGAS, BERTScore-F1, and ROUGE-Recall to assess semantic alignment and faithfulness across models and prompt designs. Our results show that carefully designed open-source pipelines can rival proprietary approaches in retrieval quality, while a custom legal-grounded prompt consistently produces more faithful and contextually relevant answers than baseline prompting. Taken together, these contributions demonstrate the potential of task-aware, component-level tuning to deliver legally grounded, reproducible, and cost-effective RAG systems for legal research assistance.
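The open-source retrieval component (SBERT/GTE embeddings) is straightforward to illustrate. The sketch below uses the sentence-transformers library with a public GTE checkpoint and made-up legal passages; the specific model name, passages, and top-k value are assumptions, not the configuration reported in the paper.

```python
# Hedged sketch of cosine-similarity retrieval over legal passages.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("thenlper/gte-base")   # or an SBERT checkpoint

passages = [
    "Section 12 limits liability for damages arising from negligence.",
    "The statute of limitations for contract claims is six years.",
    "A party may terminate the agreement upon thirty days written notice.",
]
query = "How long do I have to bring a claim for breach of contract?"

passage_embs = encoder.encode(passages, convert_to_tensor=True, normalize_embeddings=True)
query_emb = encoder.encode(query, convert_to_tensor=True, normalize_embeddings=True)

hits = util.semantic_search(query_emb, passage_embs, top_k=2)[0]
for hit in hits:
    print(round(hit["score"], 3), passages[hit["corpus_id"]])
```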

[57] Subjective Behaviors and Preferences in LLM: Language of Browsing

Sai Sundaresan, Harshita Chopra, Atanu R. Sinha, Koustava Goswami, Nagasai Saketh Naidu, Raghav Karan, N Anushka

Main category: cs.CL

TL;DR: Small LMs with page-level tokenization outperform large LMs for browsing behavior modeling, and cluster-specific training (HeTLM) with heterogeneous parameters beats single LMs while reducing performance variance.

DetailsMotivation: Questioning whether large language models can adequately capture users' subjective browsing behaviors and preferences, which form unique "languages" without natural language structure.

Method: Introduces HeTLM (Heterogeneity aware Training of Language Model) with clusterwise training using page-level tokenizer and heterogeneous cluster-specific parameters instead of a single LM.

Result: Small LM with page-level tokenizer outperforms large pretrained/finetuned LMs; HeTLM with cluster-specific parameters beats single LMs with same parameter count; achieves higher mean and lower variance in generation.

Conclusion: Cluster-specific training approach better captures subjective user behaviors, improves alignment, and demonstrates that smaller, specialized models can outperform large general-purpose LMs for browsing behavior modeling.

Abstract: A Large Language Model (LLM) offers versatility across domains and tasks, purportedly benefiting users with a wide variety of behaviors and preferences. We question this perception about an LLM when users have inherently subjective behaviors and preferences, as seen in their ubiquitous and idiosyncratic browsing of websites or apps. The sequential behavior logs of pages, thus generated, form something akin to each user’s self-constructed “language”, albeit without the structure and grammar imbued in natural languages. We ask: (i) Can a small LM represent the “language of browsing” better than a large LM? (ii) Can an LM with a single set of parameters (or, single LM) adequately capture myriad users’ heterogeneous, subjective behaviors and preferences? (iii) Can a single LM with high average performance, yield low variance in performance to make alignment good at user level? We introduce clusterwise LM training, HeTLM (Heterogeneity aware Training of Language Model), appropriate for subjective behaviors. We find that (i) a small LM trained using a page-level tokenizer outperforms large pretrained or finetuned LMs; (ii) HeTLM with heterogeneous cluster specific set of parameters outperforms a single LM of the same family, controlling for the number of parameters; and (iii) a higher mean and a lower variance in generation ensue, implying improved alignment.

[58] A Dynamic Fusion Model for Consistent Crisis Response

Xiaoying Song, Anirban Saha Anik, Eduardo Blanco, Vanessa Frias-Martinez, Lingzi Hong

Main category: cs.CL

TL;DR: Proposes a novel metric and fusion-based approach for maintaining consistent response style in crisis communication language models, reducing stylistic variation while maintaining quality.

DetailsMotivation: Address the critical need for stylistic consistency in automated crisis communication responses to build trust with affected populations, as current methods often overlook this important factor.

Method: Two-stage process: 1) Assess style of candidate responses using a novel consistency metric, 2) Optimize and integrate responses through instance-level fusion to reduce stylistic variation while maintaining quality.

Result: Experimental results across multiple datasets show the approach consistently outperforms baselines in both response quality and stylistic uniformity.

Conclusion: The proposed fusion-based generation method successfully addresses the style consistency problem in crisis communication, enabling more trustworthy automated responses through reduced stylistic variation.

Abstract: In response to the urgent need for effective communication with crisis-affected populations, automated responses driven by language models have been proposed to assist in crisis communications. A critical yet often overlooked factor is the consistency of response style, which could affect the trust of affected individuals in responders. Despite its importance, few studies have explored methods for maintaining stylistic consistency across generated responses. To address this gap, we propose a novel metric for evaluating style consistency and introduce a fusion-based generation approach grounded in this metric. Our method employs a two-stage process: it first assesses the style of candidate responses and then optimizes and integrates them at the instance level through a fusion process. This enables the generation of high-quality responses while significantly reducing stylistic variation between instances. Experimental results across multiple datasets demonstrate that our approach consistently outperforms baselines in both response quality and stylistic uniformity.

[59] Speaking at the Right Level: Literacy-Controlled Counterspeech Generation with RAG-RL

Xiaoying Song, Anirban Saha Anik, Dibakar Barua, Pengcheng Luo, Junhua Ding, Lingzi Hong

Main category: cs.CL

TL;DR: A framework using RAG with RL to generate health misinformation counterspeech tailored to different literacy levels, outperforming uniform baseline approaches.

DetailsMotivation: Health misinformation online threatens public health, and existing counterspeech methods produce uniform responses that ignore audience health literacy levels, affecting accessibility and effectiveness.

Method: Controlled-Literacy framework combining retrieval-augmented generation (RAG) with reinforcement learning (RL) to retrieve knowledge aligned with specific health literacy levels and optimize counterspeech using reward functions based on user preferences and readability.

Result: Experiment results show Controlled-Literacy outperforms baselines by generating more accessible and user-preferred counterspeech.

Conclusion: This research contributes to more equitable and impactful public health communication by improving the accessibility and comprehension of counterspeech to health misinformation.

Abstract: Health misinformation spreading online poses a significant threat to public health. Researchers have explored methods for automatically generating counterspeech to health misinformation as a mitigation strategy. Existing approaches often produce uniform responses, ignoring that the health literacy level of the audience could affect the accessibility and effectiveness of counterspeech. We propose a Controlled-Literacy framework using retrieval-augmented generation (RAG) with reinforcement learning (RL) to generate tailored counterspeech adapted to different health literacy levels. In particular, we retrieve knowledge aligned with specific health literacy levels, enabling accessible and factual information to support generation. We design a reward function incorporating subjective user preferences and objective readability-based rewards to optimize counterspeech to the target health literacy level. Experiment results show that Controlled-Literacy outperforms baselines by generating more accessible and user-preferred counterspeech. This research contributes to more equitable and impactful public health communication by improving the accessibility and comprehension of counterspeech to health misinformation.
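The reward design described above combines a subjective preference term with an objective readability term. The sketch below is a hedged illustration of that combination: the weights, the Flesch Reading Ease proxy (via the textstat package), and the target bands per literacy level are all assumptions, not the paper's reward function.

```python
# Hedged sketch of a literacy-controlled reward.
import textstat

TARGET_FRE = {"low": 80.0, "medium": 65.0, "high": 45.0}   # target reading ease per literacy level (illustrative)

def literacy_reward(response: str, preference_score: float,
                    literacy_level: str = "low", w_pref: float = 0.6, w_read: float = 0.4) -> float:
    fre = textstat.flesch_reading_ease(response)
    # readability term is 1 when the text hits the target band, decaying with distance
    readability = max(0.0, 1.0 - abs(fre - TARGET_FRE[literacy_level]) / 50.0)
    return w_pref * preference_score + w_read * readability

resp = ("This claim is not true. Vaccines go through careful safety checks "
        "before anyone can get them, and doctors keep checking afterwards.")
print(round(literacy_reward(resp, preference_score=0.8, literacy_level="low"), 3))
```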

[60] GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning

Chenglong Wang, Yongyu Mu, Hang Zhou, Yifu Huo, Ziming Zhu, Jiali Zeng, Murun Yang, Bei Li, Tong Xiao, Xiaoyang Hao, Chunliang Zhang, Fandong Meng, Jingbo Zhu

Main category: cs.CL

TL;DR: GRAM-R^2 is a generative reward model that produces both preference labels and reward rationales through self-training on unlabeled data, outperforming existing baselines across multiple tasks.

DetailsMotivation: Current reward models heavily rely on large-scale labeled preference data, and existing pre-training approaches fail to instill explicit reasoning capabilities into reward models.

Method: Proposed self-training approach leveraging unlabeled data to elicit reward reasoning. Developed GRAM-R^2, a generative reward model that produces both preference labels and accompanying reward rationales.

Result: GRAM-R^2 consistently delivers strong performance, outperforming several strong discriminative and generative baselines in response ranking, task adaptation, and reinforcement learning from human feedback.

Conclusion: GRAM-R^2 serves as a foundation model for reward reasoning and can be applied to a wide range of tasks with minimal or no additional fine-tuning, supporting downstream applications like response ranking and task-specific reward tuning.

Abstract: Significant progress in reward modeling over recent years has been driven by a paradigm shift from task-specific designs towards generalist reward models. Despite this trend, developing effective reward models remains a fundamental challenge: the heavy reliance on large-scale labeled preference data. Pre-training on abundant unlabeled data offers a promising direction, but existing approaches fall short of instilling explicit reasoning into reward models. To bridge this gap, we propose a self-training approach that leverages unlabeled data to elicit reward reasoning in reward models. Based on this approach, we develop GRAM-R$^2$, a generative reward model trained to produce not only preference labels but also accompanying reward rationales. GRAM-R$^2$ can serve as a foundation model for reward reasoning and can be applied to a wide range of tasks with minimal or no additional fine-tuning. It can support downstream applications such as response ranking and task-specific reward tuning. Experiments on response ranking, task adaptation, and reinforcement learning from human feedback demonstrate that GRAM-R$^2$ consistently delivers strong performance, outperforming several strong discriminative and generative baselines.

[61] Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth

Yang Wang, Chenghao Xiao, Chia-Yi Hsiao, Zi Yan Chang, Chi-Li Chen, Tyler Loakman, Chenghua Lin

Main category: cs.CL

TL;DR: Drivelology is “nonsense with depth” - syntactically coherent but pragmatically paradoxical language that current LLMs fail to understand despite excelling at other NLP tasks.

DetailsMotivation: To investigate LLMs' limitations in understanding layered semantic meaning that requires contextual inference, moral reasoning, or emotional interpretation beyond surface-level coherence.

Method: Created a benchmark dataset of 1,200+ curated examples across 6 languages, evaluated LLMs on classification, generation, and reasoning tasks using expert-reviewed Drivelological text.

Result: LLMs consistently fail to grasp Drivelology, confusing it with shallow nonsense, producing incoherent justifications, and missing implied rhetorical functions.

Conclusion: There’s a deep representational gap in LLMs’ pragmatic understanding, challenging the assumption that statistical fluency implies cognitive comprehension.

Abstract: We introduce Drivelology, a unique linguistic phenomenon characterised as “nonsense with depth” - utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive. While such expressions may resemble surface-level nonsense, they encode implicit meaning requiring contextual inference, moral reasoning, or emotional interpretation. We find that current large language models (LLMs), despite excelling at many natural language processing (NLP) tasks, consistently fail to grasp the layered semantics of Drivelological text. To investigate this, we construct a benchmark dataset of over 1,200 meticulously curated and diverse examples across English, Mandarin, Spanish, French, Japanese, and Korean. Each example underwent careful expert review to verify its Drivelological characteristics, involving multiple rounds of discussion and adjudication to address disagreements. Using this dataset, we evaluate a range of LLMs on classification, generation, and reasoning tasks. Our results reveal clear limitations of LLMs: models often confuse Drivelology with shallow nonsense, produce incoherent justifications, or miss implied rhetorical functions altogether. These findings highlight a deep representational gap in LLMs’ pragmatic understanding and challenge the assumption that statistical fluency implies cognitive comprehension. We release our dataset and code to facilitate further research in modelling linguistic depth beyond surface-level coherence.

[62] Measuring Bias or Measuring the Task: Understanding the Brittle Nature of LLM Gender Biases

Bufan Gao, Elisa Kreiss

Main category: cs.CL

TL;DR: LLM gender bias evaluations are sensitive to how prompts signal evaluative context, with more explicit evaluation framing eliciting different gender output distributions and discrete-choice metrics amplifying bias compared to probabilistic measures.

DetailsMotivation: As LLMs are increasingly used in socially impactful settings, concerns about gender bias have grown, but current evaluation methods often use artificial prompts that may not reflect natural language distributions, raising questions about how signaling evaluative purpose affects measured bias.

Method: Tested models under different prompt conditions that make testing context and gender-focused content salient, assessed across four task formats using both token-probability and discrete-choice metrics.

Result: Prompts that clearly align with gender bias evaluation framing produce distinct gender output distributions compared to less evaluation-framed prompts. Discrete-choice metrics tend to amplify bias relative to probabilistic measures.

Conclusion: The findings reveal brittleness in LLM gender bias evaluations and raise important questions about whether well-controlled testing designs trigger ’testing mode’ performance, challenging the ecological validity of future benchmarks.

Abstract: As LLMs are increasingly applied in socially impactful settings, concerns about gender bias have prompted growing efforts both to measure and mitigate such bias. These efforts often rely on evaluation tasks that differ from natural language distributions, as they typically involve carefully constructed task prompts that overtly or covertly signal the presence of gender bias-related content. In this paper, we examine how signaling the evaluative purpose of a task impacts measured gender bias in LLMs. Concretely, we test models under prompt conditions that (1) make the testing context salient, and (2) make gender-focused content salient. We then assess prompt sensitivity across four task formats with both token-probability and discrete-choice metrics. We find that prompts that more clearly align with (gender bias) evaluation framing elicit distinct gender output distributions compared to less evaluation-framed prompts. Discrete-choice metrics further tend to amplify bias relative to probabilistic measures. These findings not only highlight the brittleness of LLM gender bias evaluations but also open a new puzzle for the NLP benchmarking and development community: To what extent can well-controlled testing designs trigger LLM “testing mode” performance, and what does this mean for the ecological validity of future benchmarks?
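The contrast between token-probability and discrete-choice measurement is easy to see in code. The sketch below compares the next-token probabilities of two gendered continuations against simply recording which one the model ranks higher; the prompt, the GPT-2 model, and the " he"/" she" tokens are illustrative stand-ins for the paper's actual setup.

```python
# Hedged sketch of graded (token-probability) vs. discrete-choice bias readings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The engineer finished the report before"
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]           # next-token logits
probs = torch.softmax(logits, dim=-1)

he_id = tok(" he", add_special_tokens=False).input_ids[0]
she_id = tok(" she", add_special_tokens=False).input_ids[0]

p_he, p_she = probs[he_id].item(), probs[she_id].item()
print("token-probability:", round(p_he, 4), round(p_she, 4))       # graded measure
print("discrete-choice  :", "he" if p_he > p_she else "she")       # collapses the gap to a hard choice
```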

[63] ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning

Jianghao Chen, Wei Sun, Qixiang Yin, Lingxing Kong, Zhixing Tan, Jiajun Zhang

Main category: cs.CL

TL;DR: ACE-RL framework uses adaptive constraint-enhanced reinforcement learning to improve LLM long-form generation by automatically deconstructing instructions into fine-grained constraints and converting subjective quality evaluation into constraint verification.

DetailsMotivation: Address limitations in current LLM long-form generation: heavy reliance on scarce high-quality training data and focus on coarse-grained quality metrics that overlook fine-grained specifics of diverse generation scenarios.

Method: Proposes ACE-RL framework that: 1) automatically deconstructs instructions into fine-grained adaptive constraint criteria, 2) designs reward mechanism quantifying response quality based on constraint satisfaction, 3) uses reinforcement learning to guide models toward better generation capabilities.

Result: Significantly outperforms existing SFT and RL baselines by 20.70% and 7.32% on WritingBench, with top-performing model surpassing GPT-4o by 7.10%.

Conclusion: Provides more effective training paradigm for LLMs to generate high-quality content across diverse long-form generation scenarios through constraint-based reinforcement learning.

Abstract: Large Language Models (LLMs) have demonstrated remarkable progress in long-context understanding, yet they face significant challenges in high-quality long-form generation. Existing studies primarily suffer from two limitations: (1) A heavy reliance on scarce, high-quality long-form response data for supervised fine-tuning (SFT) or for pairwise preference reward in reinforcement learning (RL). (2) Focus on coarse-grained quality optimization dimensions, such as relevance, coherence, and helpfulness, overlooking the fine-grained specifics inherent to diverse long-form generation scenarios. To address these issues, we propose a framework using Adaptive Constraint-Enhanced reward for long-form generation Reinforcement Learning (ACE-RL). ACE-RL first automatically deconstructs each instruction into a set of fine-grained, adaptive constraint criteria by identifying its underlying intents and demands. Subsequently, we design a reward mechanism that quantifies the quality of long-form responses based on their satisfaction over corresponding constraints, converting subjective quality evaluation into constraint verification. Finally, we utilize reinforcement learning to guide models toward superior long-form generation capabilities. Experimental results demonstrate that our ACE-RL framework significantly outperforms existing SFT and RL baselines by 20.70% and 7.32% on WritingBench, and our top-performing model even surpasses proprietary systems like GPT-4o by 7.10%, providing a more effective training paradigm for LLMs to generate high-quality content across diverse long-form generation scenarios.
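The constraint-verification idea reduces to scoring a response by the fraction of fine-grained constraints it satisfies. The sketch below is a minimal illustration under stated assumptions: the constraint set is made up, and trivial keyword checks stand in for whatever verifier (e.g., an LLM judge) the paper actually uses.

```python
# Hedged sketch of a constraint-verification reward in [0, 1].
from typing import Callable, List

def constraint_reward(response: str, constraints: List[Callable[[str], bool]]) -> float:
    satisfied = sum(1 for check in constraints if check(response))
    return satisfied / len(constraints)         # fraction of satisfied constraints

constraints = [
    lambda r: len(r.split()) >= 150,                      # length requirement
    lambda r: "in conclusion" in r.lower(),               # has an explicit closing section
    lambda r: r.lower().count("budget") >= 2,             # stays on the requested topic
]
draft = "In conclusion, the budget plan ... " + "word " * 160 + "budget."
print(constraint_reward(draft, constraints))
```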

[64] CURE: Controlled Unlearning for Robust Embeddings – Mitigating Conceptual Shortcuts in Pre-Trained Language Models

Aysenur Kocak, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci

Main category: cs.CL

TL;DR: CURE is a lightweight framework that disentangles and suppresses conceptual shortcuts in pre-trained language models while preserving task-relevant information, achieving significant performance improvements with minimal computational overhead.

DetailsMotivation: Pre-trained language models are susceptible to spurious, concept-driven correlations that impair robustness and fairness, requiring methods to address these biases without losing essential content information.

Method: Uses a content extractor with reversal network to get concept-irrelevant representations, followed by a controllable debiasing module with contrastive learning to adjust residual conceptual cues based on task needs.

Result: Achieves +10 points F1 improvement on IMDB and +2 points on Yelp across three pre-trained architectures with minimal computational overhead.

Conclusion: Provides a flexible, unsupervised blueprint for combating conceptual biases, enabling more reliable and fair language understanding systems.

Abstract: Pre-trained language models have achieved remarkable success across diverse applications but remain susceptible to spurious, concept-driven correlations that impair robustness and fairness. In this work, we introduce CURE, a novel and lightweight framework that systematically disentangles and suppresses conceptual shortcuts while preserving essential content information. Our method first extracts concept-irrelevant representations via a dedicated content extractor reinforced by a reversal network, ensuring minimal loss of task-relevant information. A subsequent controllable debiasing module employs contrastive learning to finely adjust the influence of residual conceptual cues, enabling the model to either diminish harmful biases or harness beneficial correlations as appropriate for the target task. Evaluated on the IMDB and Yelp datasets using three pre-trained architectures, CURE achieves an absolute improvement of +10 points in F1 score on IMDB and +2 points on Yelp, while introducing minimal computational overhead. Our approach establishes a flexible, unsupervised blueprint for combating conceptual biases, paving the way for more reliable and fair language understanding systems.
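One standard way to implement the kind of "reversal network" described above is a gradient reversal layer: a concept classifier is trained on top of the content extractor, but the extractor receives flipped gradients, so concept information is pushed out of the content representation. Whether CURE uses exactly this operator is an assumption; the dimensions and the linear stand-ins below are purely illustrative.

```python
# Hedged sketch of a gradient reversal layer in PyTorch.
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float = 1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None   # flip the gradient flowing to the extractor

content_extractor = nn.Linear(768, 256)          # stand-in for the content encoder head
concept_classifier = nn.Linear(256, 2)           # tries to recover the spurious concept

hidden = torch.randn(4, 768)                     # e.g. pooled PLM outputs
content = content_extractor(hidden)
concept_logits = concept_classifier(GradReverse.apply(content, 1.0))
loss = nn.functional.cross_entropy(concept_logits, torch.tensor([0, 1, 0, 1]))
loss.backward()                                  # extractor receives reversed gradients
```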

[65] MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining

Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, Guolin Ke

Main category: cs.CL

TL;DR: MachineLearningLM is a framework that enhances LLMs’ in-context learning for ML tasks using structural causal models and knowledge distillation, achieving strong performance on tabular classification while preserving general capabilities.

DetailsMotivation: LLMs struggle to learn from many in-context examples on standard ML tasks without gradient descent, despite having broad knowledge and reasoning abilities.

Method: Continued-pretraining framework that synthesizes ML tasks from structural causal models, uses random-forest teacher for knowledge distillation, and employs token-efficient prompting for batch inference.

Result: Outperforms strong LLM baselines by ~15% on out-of-distribution tabular classification, shows many-shot scaling law with accuracy increasing from 8 to 1,024 shots, and achieves random-forest-level accuracy without task-specific training while preserving 75.4% MMLU score.

Conclusion: MachineLearningLM successfully equips general-purpose LLMs with robust in-context ML capability while maintaining their general knowledge and reasoning abilities for broader applications.

Abstract: Large language models (LLMs) possess broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to leverage many-shot demonstrations purely via in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows. Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024. We begin with a random-forest teacher, distilling tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, enabling 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference. Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare domains. It exhibits a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it attains random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.
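The token-efficient serialization of many-shot tabular tasks is the part that most directly enables 3x to 6x more in-context examples. The sketch below shows one compact format under assumptions of our own; the paper's actual serialization scheme may differ.

```python
# Hedged sketch of serializing a many-shot tabular classification task into a
# compact prompt so more in-context examples fit per context window.
def serialize_task(feature_names, shots, query_row):
    header = ",".join(feature_names) + "->label"
    lines = [",".join(str(v) for v in x) + "->" + str(y) for x, y in shots]
    query = ",".join(str(v) for v in query_row) + "->"
    return header + "\n" + "\n".join(lines) + "\n" + query

features = ["age", "income", "num_accounts"]
shots = [((34, 52000, 2), "approve"), ((61, 18000, 5), "deny"), ((45, 71000, 1), "approve")]
print(serialize_task(features, shots, (29, 43000, 3)))
```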

[66] DischargeSim: A Simulation Benchmark for Educational Doctor-Patient Communication at Discharge

Zonghai Yao, Michael Sun, Won Seok Jang, Sunjae Kwon, Soie Kwon, Hong Yu

Main category: cs.CL

TL;DR: DischargeSim is a new benchmark that evaluates LLMs’ ability to provide personalized discharge education through simulated doctor-patient conversations, revealing significant performance gaps across different patient profiles.

DetailsMotivation: Current LLM benchmarks focus on diagnostic reasoning but fail to evaluate models' ability to support patients after visits through discharge communication and education.

Method: DischargeSim simulates multi-turn conversations between LLM-driven DoctorAgents and PatientAgents with diverse psychosocial profiles across six discharge topics, evaluating dialogue quality, personalized document generation, and patient comprehension.

Result: Experiments with 18 LLMs show significant gaps in discharge education capability, with performance varying widely across patient profiles. Model size doesn’t always correlate with better education outcomes.

Conclusion: DischargeSim provides the first benchmark for evaluating LLMs in post-visit clinical education and promotes equitable, personalized patient support.

Abstract: Discharge communication is a critical yet underexplored component of patient care, where the goal shifts from diagnosis to education. While recent large language model (LLM) benchmarks emphasize in-visit diagnostic reasoning, they fail to evaluate models’ ability to support patients after the visit. We introduce DischargeSim, a novel benchmark that evaluates LLMs on their ability to act as personalized discharge educators. DischargeSim simulates post-visit, multi-turn conversations between LLM-driven DoctorAgents and PatientAgents with diverse psychosocial profiles (e.g., health literacy, education, emotion). Interactions are structured across six clinically grounded discharge topics and assessed along three axes: (1) dialogue quality via automatic and LLM-as-judge evaluation, (2) personalized document generation including free-text summaries and structured AHRQ checklists, and (3) patient comprehension through a downstream multiple-choice exam. Experiments across 18 LLMs reveal significant gaps in discharge education capability, with performance varying widely across patient profiles. Notably, model size does not always yield better education outcomes, highlighting trade-offs in strategy use and content prioritization. DischargeSim offers a first step toward benchmarking LLMs in post-visit clinical education and promoting equitable, personalized patient support.

[67] M-BRe: Discovering Training Samples for Relation Extraction from Unlabeled Texts with Large Language Models

Zexuan Li, Hongliang Dai, Piji Li

Main category: cs.CL

TL;DR: M-BRe framework combines multi-class and binary classification approaches to efficiently extract high-quality training instances for relation extraction from unlabeled texts using LLMs.

DetailsMotivation: Manual annotation for relation extraction is expensive because sentences containing the target relations are scarce. When LLMs are used instead, multi-class classification struggles to capture the semantics of every relation, while per-relation binary classification incurs prohibitive computational overhead.

Method: Three-module framework: Relation Grouping, Relation Extraction, and Label Decision. Combines advantages of multi-class and binary classification approaches to optimize both performance and efficiency.

Result: Extensive experiments confirm superior capability in discovering high-quality training samples from unlabeled texts for relation extraction.

Conclusion: M-BRe effectively addresses the limitations of existing LLM-based approaches for relation extraction, providing an efficient and high-quality solution for training data extraction.

Abstract: For Relation Extraction (RE), the manual annotation of training data may be prohibitively expensive, since the sentences that contain the target relations in texts can be very scarce and difficult to find. It is therefore beneficial to develop an efficient method that can automatically extract training instances from unlabeled texts for training RE models. Recently, large language models (LLMs) have been adopted in various natural language processing tasks, with RE also benefiting from their advances. However, when leveraging LLMs for RE with predefined relation categories, two key challenges arise. First, in a multi-class classification setting, LLMs often struggle to comprehensively capture the semantics of every relation, leading to suboptimal results. Second, although employing binary classification for each relation individually can mitigate this issue, it introduces significant computational overhead, resulting in impractical time complexity for real-world applications. Therefore, this paper proposes a framework called M-BRe to extract training instances from unlabeled texts for RE. It utilizes three modules to combine the advantages of both of the above classification approaches: Relation Grouping, Relation Extraction, and Label Decision. Extensive experiments confirm its superior capability in discovering high-quality training samples from unlabeled texts for RE.

[68] SciNLP: A Domain-Specific Benchmark for Full-Text Scientific Entity and Relation Extraction in NLP

Decheng Duan, Yingyi Zhang, Jitong Peng, Chengzhi Zhang

Main category: cs.CL

TL;DR: SciNLP is a new benchmark dataset for full-text entity and relation extraction in NLP research, featuring 60 annotated papers with 7K+ entities and 1.8K+ relations, enabling better knowledge graph construction.

DetailsMotivation: Existing datasets for scientific information extraction focus on specific publication sections due to domain complexity and high annotation costs, limiting comprehensive analysis of full scientific texts.

Method: Created SciNLP dataset with 60 manually annotated full-text NLP publications, covering 7,072 entities and 1,826 relations. Conducted comparative experiments with state-of-the-art supervised models and performed cross-dataset comparisons.

Result: Results show varying extraction capabilities across different text lengths. SciNLP achieves significant performance improvements on certain baseline models. The constructed knowledge graph has an average node degree of 3.2 per entity, indicating rich semantic information.

Conclusion: SciNLP is the first full-text entity and relation extraction dataset for the NLP domain, enabling better knowledge graph construction and enhancing downstream applications. The dataset is publicly available for research use.

Abstract: Structured information extraction from scientific literature is crucial for capturing core concepts and emerging trends in specialized fields. While existing datasets aid model development, most focus on specific publication sections due to domain complexity and the high cost of annotating scientific texts. To address this limitation, we introduce SciNLP - a specialized benchmark for full-text entity and relation extraction in the Natural Language Processing (NLP) domain. The dataset comprises 60 manually annotated full-text NLP publications, covering 7,072 entities and 1,826 relations. Compared to existing research, SciNLP is the first dataset providing full-text annotations of entities and their relationships in the NLP domain. To validate the effectiveness of SciNLP, we conducted comparative experiments with similar datasets and evaluated the performance of state-of-the-art supervised models on this dataset. Results reveal varying extraction capabilities of existing models across academic texts of different lengths. Cross-comparisons with existing datasets show that SciNLP achieves significant performance improvements on certain baseline models. Using models trained on SciNLP, we implemented automatic construction of a fine-grained knowledge graph for the NLP domain. Our KG has an average node degree of 3.2 per entity, indicating rich semantic topological information that enhances downstream applications. The dataset is publicly available at https://github.com/AKADDC/SciNLP.

cs.CV

[69] HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning

Liyang Chen, Tianxiang Ma, Jiawei Liu, Bingchuan Li, Zhuowei Chen, Lijie Liu, Xu He, Gen Li, Qian He, Zhiyong Wu

Main category: cs.CV

TL;DR: HuMo is a unified Human-Centric Video Generation framework that addresses multimodal coordination challenges through a two-stage training approach with specialized strategies for subject preservation and audio-visual synchronization.

DetailsMotivation: Existing methods struggle to effectively coordinate heterogeneous modalities (text, image, audio) due to scarce training data with paired triplet conditions and difficulty collaborating sub-tasks of subject preservation and audio-visual sync.

Method: Two-stage progressive multimodal training: 1) Minimal-invasive image injection for subject preservation, 2) Focus-by-predicting strategy for audio-visual sync with audio cross-attention, plus time-adaptive Classifier-Free Guidance for inference.

Result: Extensive experiments show HuMo surpasses specialized state-of-the-art methods in sub-tasks, establishing a unified framework for collaborative multimodal-conditioned human video generation.

Conclusion: HuMo successfully addresses multimodal coordination challenges through dataset construction and progressive training strategies, achieving superior performance in human-centric video generation from multimodal inputs.

Abstract: Human-Centric Video Generation (HCVG) methods seek to synthesize human videos from multimodal inputs, including text, image, and audio. Existing methods struggle to effectively coordinate these heterogeneous modalities due to two challenges: the scarcity of training data with paired triplet conditions and the difficulty of collaborating the sub-tasks of subject preservation and audio-visual sync with multimodal inputs. In this work, we present HuMo, a unified HCVG framework for collaborative multimodal control. For the first challenge, we construct a high-quality dataset with diverse and paired text, reference images, and audio. For the second challenge, we propose a two-stage progressive multimodal training paradigm with task-specific strategies. For the subject preservation task, to maintain the prompt following and visual generation abilities of the foundation model, we adopt the minimal-invasive image injection strategy. For the audio-visual sync task, besides the commonly adopted audio cross-attention layer, we propose a focus-by-predicting strategy that implicitly guides the model to associate audio with facial regions. For joint learning of controllabilities across multimodal inputs, building on previously acquired capabilities, we progressively incorporate the audio-visual sync task. During inference, for flexible and fine-grained multimodal control, we design a time-adaptive Classifier-Free Guidance strategy that dynamically adjusts guidance weights across denoising steps. Extensive experimental results demonstrate that HuMo surpasses specialized state-of-the-art methods in sub-tasks, establishing a unified framework for collaborative multimodal-conditioned HCVG. Project Page: https://phantom-video.github.io/HuMo.
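The time-adaptive Classifier-Free Guidance idea can be illustrated with a small sketch; the linear schedule and the start/end guidance weights below are assumptions, since the abstract does not specify HuMo's actual schedule.

```python
import numpy as np

def time_adaptive_cfg(eps_uncond, eps_cond, step, total_steps, w_start=7.5, w_end=2.0):
    # Guidance weight decays linearly over denoising steps (assumed schedule).
    w = w_start + (w_end - w_start) * (step / max(total_steps - 1, 1))
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u, eps_c = np.zeros(4), np.ones(4)      # dummy noise predictions
print(time_adaptive_cfg(eps_u, eps_c, step=0,  total_steps=50))   # strong guidance early
print(time_adaptive_cfg(eps_u, eps_c, step=49, total_steps=50))   # weaker guidance late
```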

[70] 3D and 4D World Modeling: A Survey

Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, Junyuan Deng, Kaiwen Zhang, Yang Wu, Tianyi Yan, Shenyuan Gao, Song Wang, Linfeng Li, Liang Pan, Yong Liu, Jianke Zhu, Wei Tsang Ooi, Steven C. H. Hoi, Ziwei Liu

Main category: cs.CV

TL;DR: This survey provides the first comprehensive review of 3D and 4D world modeling, establishing standardized definitions and taxonomy to address fragmented literature in the field.

DetailsMotivation: Prior work has focused on 2D image/video generation while overlooking native 3D/4D representations, and the absence of standardized definitions has led to inconsistent claims in world modeling research.

Method: The authors establish precise definitions, introduce a structured taxonomy spanning video-based (VideoGen), occupancy-based (OccGen), and LiDAR-based (LiDARGen) approaches, and systematically summarize datasets and evaluation metrics for 3D/4D settings.

Result: The survey provides a coherent foundational reference with systematic literature summary, practical applications discussion, and identification of open challenges and research directions.

Conclusion: This work aims to advance the field of 3D and 4D world modeling by providing standardized definitions, taxonomy, and comprehensive review to address current fragmentation and inconsistencies in the literature.

Abstract: World modeling has become a cornerstone in AI research, enabling agents to understand, represent, and predict the dynamic environments they inhabit. While prior work largely emphasizes generative methods for 2D image and video data, it overlooks the rapidly growing body of work that leverages native 3D and 4D representations such as RGB-D imagery, occupancy grids, and LiDAR point clouds for large-scale scene modeling. At the same time, the absence of a standardized definition and taxonomy for “world models” has led to fragmented and sometimes inconsistent claims in the literature. This survey addresses these gaps by presenting the first comprehensive review explicitly dedicated to 3D and 4D world modeling and generation. We establish precise definitions, introduce a structured taxonomy spanning video-based (VideoGen), occupancy-based (OccGen), and LiDAR-based (LiDARGen) approaches, and systematically summarize datasets and evaluation metrics tailored to 3D/4D settings. We further discuss practical applications, identify open challenges, and highlight promising research directions, aiming to provide a coherent and foundational reference for advancing the field. A systematic summary of existing literature is available at https://github.com/worldbench/survey

[71] An Explainable Deep Neural Network with Frequency-Aware Channel and Spatial Refinement for Flood Prediction in Sustainable Cities

Shahid Shafi Dar, Bharat Kaurav, Arnav Jain, Chandravardhan Singh Raghaw, Mohammad Zia Ur Rehman, Nagendra Kumar

Main category: cs.CV

TL;DR: XFloodNet is a novel deep learning framework for urban flood classification that integrates hierarchical cross-modal attention, heterogeneous convolutional multi-scale attention, and cascading transformer feature refinement, achieving state-of-the-art performance on benchmark datasets.

DetailsMotivation: Traditional flood detection methods rely on unimodal data and static rule-based systems, failing to capture dynamic flood relationships. Existing attention mechanisms and ensemble learning have limitations in hierarchical refinement, cross-modal integration, and noise adaptability.

Method: XFloodNet integrates three novel components: 1) Hierarchical Cross-Modal Gated Attention for dynamic visual-textual feature alignment, 2) Heterogeneous Convolutional Adaptive Multi-Scale Attention with frequency-enhanced channel and spatial attention, and 3) Cascading Convolutional Transformer Feature Refinement for hierarchical feature harmonization.

Result: Achieved state-of-the-art F1-scores: 93.33% on Chennai Floods, 82.24% on Rhine18 Floods, and 88.60% on Harz17 Floods, significantly surpassing existing methods.

Conclusion: XFloodNet provides a robust and noise-resistant solution for urban flood classification, effectively addressing the limitations of traditional methods through advanced deep learning techniques and multi-modal feature integration.

Abstract: In an era of escalating climate change, urban flooding has emerged as a critical challenge for sustainable cities, threatening lives, infrastructure, and ecosystems. Traditional flood detection methods are constrained by their reliance on unimodal data and static rule-based systems, which fail to capture the dynamic, non-linear relationships inherent in flood events. Furthermore, existing attention mechanisms and ensemble learning approaches exhibit limitations in hierarchical refinement, cross-modal feature integration, and adaptability to noisy or unstructured environments, resulting in suboptimal flood classification performance. To address these challenges, we present XFloodNet, a novel framework that redefines urban flood classification through advanced deep-learning techniques. XFloodNet integrates three novel components: (1) a Hierarchical Cross-Modal Gated Attention mechanism that dynamically aligns visual and textual features, enabling precise multi-granularity interactions and resolving contextual ambiguities; (2) a Heterogeneous Convolutional Adaptive Multi-Scale Attention module, which leverages frequency-enhanced channel attention and frequency-modulated spatial attention to extract and prioritize discriminative flood-related features across spectral and spatial domains; and (3) a Cascading Convolutional Transformer Feature Refinement technique that harmonizes hierarchical features through adaptive scaling and cascading operations, ensuring robust and noise-resistant flood detection. We evaluate our proposed method on three benchmark datasets: Chennai Floods, Rhine18 Floods, and Harz17 Floods. XFloodNet achieves state-of-the-art F1-scores of 93.33%, 82.24%, and 88.60%, respectively, surpassing existing methods by significant margins.

[72] Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs

Hyungjin Chung, Hyelin Nam, Jiyeon Kim, Hyojun Go, Byeongjun Park, Junho Kim, Joonseok Lee, Seongsu Ha, Byung-Hoon Kim

Main category: cs.CV

TL;DR: Video Parallel Scaling (VPS) enables VideoLLMs to process more frames without increasing context window by running parallel inference streams on disjoint frame subsets and aggregating outputs, improving temporal reasoning efficiently.

DetailsMotivation: VideoLLMs face computational bottlenecks when increasing input frames for temporal detail, causing prohibitive costs and performance degradation from long context lengths.

Method: VPS runs multiple parallel inference streams, each processing unique disjoint subsets of video frames, then aggregates output probabilities to integrate richer visual information without additional training.

Result: Extensive experiments across 2B-32B models on benchmarks show VPS consistently and significantly improves performance, scales better than alternatives like Self-consistency, and is memory-efficient.

Conclusion: VPS provides a robust framework for enhancing VideoLLMs’ temporal reasoning capabilities by effectively contracting the Chinchilla scaling law through uncorrelated visual evidence aggregation.

Abstract: Video Large Language Models (VideoLLMs) face a critical bottleneck: increasing the number of input frames to capture fine-grained temporal detail leads to prohibitive computational costs and performance degradation from long context lengths. We introduce Video Parallel Scaling (VPS), an inference-time method that expands a model’s perceptual bandwidth without increasing its context window. VPS operates by running multiple parallel inference streams, each processing a unique, disjoint subset of the video’s frames. By aggregating the output probabilities from these complementary streams, VPS integrates a richer set of visual information than is possible with a single pass. We theoretically show that this approach effectively contracts the Chinchilla scaling law by leveraging uncorrelated visual evidence, thereby improving performance without additional training. Extensive experiments across various model architectures and scales (2B-32B) on benchmarks such as Video-MME and EventHallusion demonstrate that VPS consistently and significantly improves performance. It scales more favorably than other parallel alternatives (e.g. Self-consistency) and is complementary to other decoding strategies, offering a memory-efficient and robust framework for enhancing the temporal reasoning capabilities of VideoLLMs.
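A minimal sketch of the aggregation step follows, assuming a callable that stands in for one VideoLLM inference pass and returns answer probabilities (the real system batches these streams for throughput):

```python
import numpy as np

def video_parallel_scaling(frames, run_stream, num_streams=4):
    # Partition frames into disjoint, interleaved subsets; one stream per subset.
    subsets = [frames[i::num_streams] for i in range(num_streams)]
    # Aggregate the output probabilities of the complementary streams.
    return np.mean([run_stream(s) for s in subsets], axis=0)

# Toy usage: `dummy_model` is a stand-in for a VideoLLM returning P(answer) per stream.
frames = list(range(32))
dummy_model = lambda subset: np.array([0.7, 0.3]) if 0 in subset else np.array([0.5, 0.5])
print(video_parallel_scaling(frames, dummy_model))
```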

[73] Two Stage Context Learning with Large Language Models for Multimodal Stance Detection on Climate Change

Lata Pangtey, Omkar Kabde, Shahid Shafi Dar, Nagendra Kumar

Main category: cs.CV

TL;DR: Multimodal stance detection framework combining text and visual information through hierarchical fusion, achieving state-of-the-art performance on climate change stance detection.

DetailsMotivation: Address the gap in existing stance detection approaches that focus only on textual data, while real-world social media content increasingly combines text with visual elements, requiring advanced multimodal methods.

Method: Hierarchical fusion approach using Large Language Model for stance-relevant text summaries, domain-aware image caption generator for visual content interpretation, and specialized transformer module to capture interactions between texts and images.

Result: Achieved 76.2% accuracy, 76.3% precision, 76.2% recall, and 76.2% F1-score on MultiClimate dataset, outperforming existing state-of-the-art approaches.

Conclusion: The proposed multimodal framework successfully integrates diverse modalities for robust stance classification, demonstrating superior performance in climate change-related stance detection tasks.

Abstract: With the rapid proliferation of information across digital platforms, stance detection has emerged as a pivotal challenge in social media analysis. While most of the existing approaches focus solely on textual data, real-world social media content increasingly combines text with visual elements creating a need for advanced multimodal methods. To address this gap, we propose a multimodal stance detection framework that integrates textual and visual information through a hierarchical fusion approach. Our method first employs a Large Language Model to retrieve stance-relevant summaries from source text, while a domain-aware image caption generator interprets visual content in the context of the target topic. These modalities are then jointly modeled along with the reply text, through a specialized transformer module that captures interactions between the texts and images. The proposed modality fusion framework integrates diverse modalities to facilitate robust stance classification. We evaluate our approach on the MultiClimate dataset, a benchmark for climate change-related stance detection containing aligned video frames and transcripts. We achieve accuracy of 76.2%, precision of 76.3%, recall of 76.2% and F1-score of 76.2%, respectively, outperforming existing state-of-the-art approaches.

[74] Two-Stage Swarm Intelligence Ensemble Deep Transfer Learning (SI-EDTL) for Vehicle Detection Using Unmanned Aerial Vehicles

Zeinab Ghasemi Darehnaei, Mohammad Shokouhifar, Hossein Yazdanjouei, S. M. J. Rastegar Fatemi

Main category: cs.CV

TL;DR: SI-EDTL is a two-stage swarm intelligence ensemble deep transfer learning model that combines multiple pre-trained CNN feature extractors with transfer classifiers for vehicle detection in UAV images, optimized using whale optimization algorithm.

DetailsMotivation: To develop an efficient and accurate method for detecting multiple vehicle types in UAV images, addressing the challenges of complex backgrounds and varying vehicle appearances in aerial imagery.

Method: Combines three pre-trained Faster R-CNN feature extractors (InceptionV3, ResNet50, GoogLeNet) with five transfer classifiers (KNN, SVM, MLP, C4.5, Naive Bayes) to create 15 base learners, aggregated via weighted averaging and optimized using whale optimization algorithm.

Result: Outperforms existing methods on the AU-AIR UAV dataset, achieving better accuracy, precision, and recall for vehicle detection tasks.

Conclusion: SI-EDTL demonstrates the effectiveness of combining ensemble deep transfer learning with swarm intelligence optimization for robust vehicle detection in UAV imagery.

Abstract: This paper introduces SI-EDTL, a two-stage swarm intelligence ensemble deep transfer learning model for detecting multiple vehicles in UAV images. It combines three pre-trained Faster R-CNN feature extractor models (InceptionV3, ResNet50, GoogLeNet) with five transfer classifiers (KNN, SVM, MLP, C4.5, Naïve Bayes), resulting in 15 different base learners. These are aggregated via weighted averaging to classify regions as Car, Van, Truck, Bus, or background. Hyperparameters are optimized with the whale optimization algorithm to balance accuracy, precision, and recall. Implemented in MATLAB R2020b with parallel processing, SI-EDTL outperforms existing methods on the AU-AIR UAV dataset.
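A minimal sketch of the weighted-averaging aggregation over base-learner probabilities is shown below (in NumPy rather than the paper's MATLAB; the whale-optimization step that tunes the weights is omitted and the weights here are placeholders):

```python
import numpy as np

def weighted_ensemble(probas, weights):
    # probas: (n_learners, n_regions, n_classes); weights: (n_learners,)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                              # keep the fused output a distribution
    fused = np.tensordot(w, probas, axes=1)      # weighted average over learners
    return fused.argmax(axis=-1), fused          # class index per region + fused scores

# Toy example: 3 learners, 2 candidate regions, 5 classes (Car/Van/Truck/Bus/background).
rng = np.random.default_rng(0)
probas = rng.dirichlet(np.ones(5), size=(3, 2))
labels, fused = weighted_ensemble(probas, weights=[0.5, 0.3, 0.2])
print(labels)
```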

[75] MCTED: A Machine-Learning-Ready Dataset for Digital Elevation Model Generation From Mars Imagery

Rafał Osadnik, Pablo Gómez, Eleni Bohacek, Rickbir Bahia

Main category: cs.CV

TL;DR: New MCTED dataset for Martian DEM prediction with 80,898 samples from CTX instrument data, featuring optical images, DEM patches, and masks for missing/altered data. Small U-Net trained on this dataset outperforms DepthAnythingV2 foundation model.

DetailsMotivation: Address artefacts and missing data in large-scale Martian DEMs by creating a clean, machine learning-ready dataset for elevation model prediction tasks.

Method: Developed comprehensive pipeline to process Mars orthoimage and DEM pairs from MRO CTX data, creating training/validation splits without mutual area coverage to prevent data leakage. Includes tools to mitigate original data issues.

Result: Generated 80,898 diverse samples covering Martian surface. Small U-Net architecture trained on MCTED outperforms zero-shot performance of DepthAnythingV2 foundation model for elevation prediction.

Conclusion: Specialized datasets like MCTED enable better performance than general foundation models for domain-specific tasks like Martian DEM prediction. Dataset and code are open source for community use.

Abstract: This work presents a new dataset for the Martian digital elevation model prediction task, ready for machine learning applications called MCTED. The dataset has been generated using a comprehensive pipeline designed to process high-resolution Mars orthoimage and DEM pairs from Day et al., yielding a dataset consisting of 80,898 data samples. The source images are data gathered by the Mars Reconnaissance Orbiter using the CTX instrument, providing a very diverse and comprehensive coverage of the Martian surface. Given the complexity of the processing pipelines used in large-scale DEMs, there are often artefacts and missing data points in the original data, for which we developed tools to solve or mitigate their impact. We divide the processed samples into training and validation splits, ensuring samples in both splits cover no mutual areas to avoid data leakage. Every sample in the dataset is represented by the optical image patch, DEM patch, and two mask patches, indicating values that were originally missing or were altered by us. This allows future users of the dataset to handle altered elevation regions as they please. We provide statistical insights of the generated dataset, including the spatial distribution of samples, the distributions of elevation values, slopes and more. Finally, we train a small U-Net architecture on the MCTED dataset and compare its performance to a monocular depth estimation foundation model, DepthAnythingV2, on the task of elevation prediction. We find that even a very small architecture trained on this dataset specifically, beats a zero-shot performance of a depth estimation foundation model like DepthAnythingV2. We make the dataset and code used for its generation completely open source in public repositories.
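The leakage-free split can be illustrated with a short sketch: samples are grouped by the area (tile) they come from, and whole tiles go to one split or the other. The tile-based grouping key is an assumption made for illustration.

```python
import random

def spatial_split(sample_to_tile, val_fraction=0.2, seed=0):
    # Assign entire tiles to train or validation so the two splits share no area.
    tiles = sorted(set(sample_to_tile.values()))
    random.Random(seed).shuffle(tiles)
    val_tiles = set(tiles[: max(1, int(len(tiles) * val_fraction))])
    train = [s for s, t in sample_to_tile.items() if t not in val_tiles]
    val = [s for s, t in sample_to_tile.items() if t in val_tiles]
    return train, val

train_ids, val_ids = spatial_split({"s1": "tileA", "s2": "tileA", "s3": "tileB", "s4": "tileC"})
print(train_ids, val_ids)
```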

[76] APML: Adaptive Probabilistic Matching Loss for Robust 3D Point Cloud Reconstruction

Sasan Sharifipour, Constantino Álvarez Casado, Mohammad Sabokrou, Miguel Bordallo López

Main category: cs.CV

TL;DR: APML is a differentiable approximation of Earth Mover Distance that uses Sinkhorn iterations for one-to-one point matching, achieving near-quadratic runtime and improved spatial distribution in point cloud prediction tasks.

DetailsMotivation: Existing point cloud loss functions like Chamfer Distance suffer from many-to-one correspondences causing point congestion and poor coverage, while EMD has prohibitive cubic complexity.

Method: Uses Sinkhorn iterations on temperature-scaled similarity matrix from pairwise distances, with analytical temperature computation to guarantee minimum assignment probability without manual tuning.

Result: Achieves faster convergence, superior spatial distribution (especially in low-density regions), and improved/on-par quantitative performance on ShapeNet benchmarks and WiFi CSI to 3D human point cloud generation.

Conclusion: APML provides an efficient, differentiable alternative to EMD with practical runtime comparable to Chamfer-based losses, eliminating hyperparameter search while improving point distribution quality.

Abstract: Training deep learning models for point cloud prediction tasks such as shape completion and generation depends critically on loss functions that measure discrepancies between predicted and ground-truth point sets. Commonly used functions such as Chamfer Distance (CD), HyperCD, and InfoCD rely on nearest-neighbor assignments, which often induce many-to-one correspondences, leading to point congestion in dense regions and poor coverage in sparse regions. These losses also involve non-differentiable operations due to index selection, which may affect gradient-based optimization. Earth Mover Distance (EMD) enforces one-to-one correspondences and captures structural similarity more effectively, but its cubic computational complexity limits its practical use. We propose the Adaptive Probabilistic Matching Loss (APML), a fully differentiable approximation of one-to-one matching that leverages Sinkhorn iterations on a temperature-scaled similarity matrix derived from pairwise distances. We analytically compute the temperature to guarantee a minimum assignment probability, eliminating manual tuning. APML achieves near-quadratic runtime, comparable to Chamfer-based losses, and avoids non-differentiable operations. When integrated into state-of-the-art architectures (PoinTr, PCN, FoldingNet) on ShapeNet benchmarks and on a spatiotemporal Transformer (CSI2PC) that generates 3D human point clouds from WiFi CSI measurements, APM loss yields faster convergence, superior spatial distribution, especially in low-density regions, and improved or on-par quantitative performance without additional hyperparameter search. The code is available at: https://github.com/apm-loss/apml.
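A minimal, differentiable Sinkhorn-style matching loss in the spirit of APML is sketched below (PyTorch assumed). The analytic temperature computation from the paper is replaced by a fixed `temperature` hyperparameter, and the exact loss form may differ.

```python
import torch

def soft_matching_loss(pred, gt, temperature=0.05, iters=20):
    d = torch.cdist(pred, gt)                        # pairwise distances, (N, N)
    log_k = -d / temperature                         # temperature-scaled similarity (log domain)
    log_u = torch.zeros(d.shape[0], device=d.device)
    log_v = torch.zeros(d.shape[1], device=d.device)
    for _ in range(iters):                           # Sinkhorn normalization -> near one-to-one plan
        log_u = -torch.logsumexp(log_k + log_v[None, :], dim=1)
        log_v = -torch.logsumexp(log_k + log_u[:, None], dim=0)
    plan = torch.exp(log_k + log_u[:, None] + log_v[None, :])
    return (plan * d).sum() / d.shape[0]             # expected matched distance

pred = torch.rand(128, 3, requires_grad=True)        # predicted point cloud
gt = torch.rand(128, 3)                              # ground-truth point cloud
loss = soft_matching_loss(pred, gt)
loss.backward()                                      # fully differentiable, no index selection
```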

[77] Lightweight Deep Unfolding Networks with Enhanced Robustness for Infrared Small Target Detection

Jingjing Liu, Yinchao Han, Xianchao Xiu, Jianhua Zhang, Wanquan Liu

Main category: cs.CV

TL;DR: L-RPCANet is a lightweight infrared small target detection framework using hierarchical bottleneck structure and noise reduction to achieve parameter efficiency and noise robustness.

DetailsMotivation: Existing deep unfolding networks for infrared small target detection face challenges in parameter lightweightness and noise robustness, requiring a more efficient and robust solution.

Method: Proposes L-RPCANet based on robust principal component analysis with hierarchical bottleneck structure for channel-wise feature refinement, noise reduction module for robustness, and squeeze-and-excitation networks for channel attention.

Result: Extensive experiments on ISTD datasets show superiority over state-of-the-art methods including RPCANet, DRPCANet, and RPCANet++.

Conclusion: L-RPCANet achieves excellent performance while maintaining both lightweightness and robustness in infrared small target detection.

Abstract: Infrared small target detection (ISTD) is one of the key techniques in image processing. Although deep unfolding networks (DUNs) have demonstrated promising performance in ISTD due to their model interpretability and data adaptability, existing methods still face significant challenges in parameter lightweightness and noise robustness. In this regard, we propose a highly lightweight framework based on robust principal component analysis (RPCA) called L-RPCANet. Technically, a hierarchical bottleneck structure is constructed to reduce and increase the channel dimension in the single-channel input infrared image to achieve channel-wise feature refinement, with bottleneck layers designed in each module to extract features. This reduces the number of channels in feature extraction and improves the lightweightness of network parameters. Furthermore, a noise reduction module is embedded to enhance the robustness against complex noise. In addition, squeeze-and-excitation networks (SENets) are leveraged as a channel attention mechanism to focus on the varying importance of different features across channels, thereby achieving excellent performance while maintaining both lightweightness and robustness. Extensive experiments on the ISTD datasets validate the superiority of our proposed method compared with state-of-the-art methods covering RPCANet, DRPCANet, and RPCANet++. The code will be available at https://github.com/xianchaoxiu/L-RPCANet.

[78] TransitReID: Transit OD Data Collection with Occlusion-Resistant Dynamic Passenger Re-Identification

Kaicong Huang, Talha Azfar, Jack Reilly, Ruimin Ke

Main category: cs.CV

TL;DR: TransitReID is a novel framework for automated passenger re-identification using onboard surveillance cameras to collect transit origin-destination data, featuring occlusion-robust algorithms, dynamic matching, and edge implementation for real-time privacy-preserving OD estimation.

DetailsMotivation: Current transit OD data collection methods are costly, device-dependent, or lack individual-level matching capabilities, while existing onboard surveillance cameras present an underutilized opportunity for automated data collection.

Method: Three key innovations: (1) occlusion-robust ReID algorithm with variational autoencoder-guided region-attention and selective quality feature averaging, (2) Hierarchical Storage and Dynamic Matching mechanism for dynamic gallery matching, and (3) multi-threaded edge implementation for real-time processing with local data privacy.

Result: Achieves state-of-the-art performance with 88.3% R-1 accuracy and 92.5% mAP, sustains 90% OD estimation accuracy in bus route simulations on NVIDIA Jetson edge devices, and includes a new dataset with 17,000+ images.

Conclusion: TransitReID advances both algorithmic and system-level foundations for automated transit OD collection, enabling scalable, privacy-preserving deployment in intelligent transportation systems.

Abstract: Transit Origin-Destination (OD) data are fundamental for optimizing public transit services, yet current collection methods, such as manual surveys, Bluetooth and WiFi tracking, or Automated Passenger Counters, are either costly, device-dependent, or incapable of individual-level matching. Meanwhile, onboard surveillance cameras already deployed on most transit vehicles provide an underutilized opportunity for automated OD data collection. Leveraging this, we present TransitReID, a novel framework for individual-level and occlusion-resistant passenger re-identification tailored to transit environments. Our approach introduces three key innovations: (1) an occlusion-robust ReID algorithm that integrates a variational autoencoder-guided region-attention mechanism and selective quality feature averaging to dynamically emphasize visible and discriminative body regions under severe occlusions and viewpoint variations; (2) a Hierarchical Storage and Dynamic Matching (HSDM) mechanism that transforms static gallery matching into a dynamic process for robustness, accuracy, and speed in real-world bus operations; and (3) a multi-threaded edge implementation that enables near real-time OD estimation while ensuring privacy by processing all data locally. To support research in this domain, we also construct a new TransitReID dataset with over 17,000 images captured from bus front and rear cameras under diverse occlusion and viewpoint conditions. Experimental results demonstrate that TransitReID achieves state-of-the-art performance, with R-1 accuracy of 88.3 percent and mAP of 92.5 percent, and further sustains 90 percent OD estimation accuracy in bus route simulations on NVIDIA Jetson edge devices. This work advances both the algorithmic and system-level foundations of automated transit OD collection, paving the way for scalable, privacy-preserving deployment in intelligent transportation systems.

[79] Sparse Transformer for Ultra-sparse Sampled Video Compressive Sensing

Miao Cao, Siming Zheng, Lishun Wang, Ziyang Chen, David Brady, Xin Yuan

Main category: cs.CV

TL;DR: Proposes Ultra-Sparse Sampling (USS) regime for video compressive imaging that reduces power consumption by sampling only one sub-frame per spatial location, and introduces BSTFormer transformer architecture to handle measurement mismatch.

DetailsMotivation: Current digital camera power consumption is unsustainable for gigapixel cameras at high frame rates (100-1000 fps), requiring 10-100X reduction in power per pixel through compressive measurement techniques.

Method: Ultra-Sparse Sampling (USS) where only one sub-frame is set to 1 per spatial location, built DMD encoding system, and proposed BSTFormer transformer with local block attention, global sparse attention, and temporal attention to handle measurement mismatch.

Result: Significantly outperforms all previous state-of-the-art algorithms on both simulated and real-world data, with additional advantages of higher dynamic range and fixed exposure time suitable for on-chip implementation.

Conclusion: USS strategy combined with BSTFormer provides an effective solution for low-power high-speed video compressive imaging with practical advantages for system implementation.

Abstract: Digital cameras consume ~0.1 microjoule per pixel to capture and encode video, resulting in a power usage of ~20W for a 4K sensor operating at 30 fps. Imagining gigapixel cameras operating at 100-1000 fps, the current processing model is unsustainable. To address this, physical layer compressive measurement has been proposed to reduce power consumption per pixel by 10-100X. Video Snapshot Compressive Imaging (SCI) introduces high frequency modulation in the optical sensor layer to increase effective frame rate. A commonly used sampling strategy of video SCI is Random Sampling (RS) where each mask element value is randomly set to be 0 or 1. Similarly, image inpainting (I2P) has demonstrated that images can be recovered from a fraction of the image pixels. Inspired by I2P, we propose Ultra-Sparse Sampling (USS) regime, where at each spatial location, only one sub-frame is set to 1 and all others are set to 0. We then build a Digital Micro-mirror Device (DMD) encoding system to verify the effectiveness of our USS strategy. Ideally, we can decompose the USS measurement into sub-measurements for which we can utilize I2P algorithms to recover high-speed frames. However, due to the mismatch between the DMD and CCD, the USS measurement cannot be perfectly decomposed. To this end, we propose BSTFormer, a sparse TransFormer that utilizes local Block attention, global Sparse attention, and global Temporal attention to exploit the sparsity of the USS measurement. Extensive results on both simulated and real-world data show that our method significantly outperforms all previous state-of-the-art algorithms. Additionally, an essential advantage of the USS strategy is its higher dynamic range than that of the RS strategy. Finally, from the application perspective, the USS strategy is a good choice to implement a complete video SCI system on chip due to its fixed exposure time.
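To make the USS idea concrete, the sketch below builds a mask in which each spatial location activates exactly one sub-frame, in contrast to Random Sampling where every entry is an independent 0/1 draw. The random choice of sub-frame per pixel is an assumption; the actual DMD patterns may be structured differently.

```python
import numpy as np

def uss_mask(num_subframes, height, width, seed=0):
    rng = np.random.default_rng(seed)
    chosen = rng.integers(0, num_subframes, size=(height, width))  # which sub-frame samples each pixel
    mask = np.zeros((num_subframes, height, width), dtype=np.uint8)
    y, x = np.indices((height, width))
    mask[chosen, y, x] = 1                                         # exactly one '1' along time per pixel
    return mask

m = uss_mask(num_subframes=8, height=4, width=4)
assert (m.sum(axis=0) == 1).all()   # every spatial location is sampled by exactly one sub-frame
```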

[80] GTA-Crime: A Synthetic Dataset and Generation Framework for Fatal Violence Detection with Adversarial Snippet-Level Domain Adaptation

Seongho Kim, Sejong Ryu, Hyoukjun You, Je Hyeong Hong

Main category: cs.CV

TL;DR: GTA-Crime dataset and framework for fatal video anomaly detection using GTA5 game footage, with domain adaptation to improve real-world violence detection.

DetailsMotivation: Detecting fatal incidents like shootings and stabbings in surveillance videos is difficult due to their rarity and ethical issues in collecting real data.

Method: Created synthetic dataset using Grand Theft Auto 5, developed generation framework, and proposed snippet-level domain adaptation with Wasserstein adversarial training.

Result: Experimental results show that incorporating GTA-Crime with domain adaptation strategy consistently enhances real-world fatal violence detection accuracy.

Conclusion: GTA-Crime provides an effective synthetic dataset and framework for fatal video anomaly detection, bridging the gap between synthetic and real-world features.

Abstract: Recent advancements in video anomaly detection (VAD) have enabled identification of various criminal activities in surveillance videos, but detecting fatal incidents such as shootings and stabbings remains difficult due to their rarity and ethical issues in data collection. Recognizing this limitation, we introduce GTA-Crime, a fatal video anomaly dataset and generation framework using Grand Theft Auto 5 (GTA5). Our dataset contains fatal situations such as shootings and stabbings, captured from CCTV multiview perspectives under diverse conditions including action types, weather, time of day, and viewpoints. To address the rarity of such scenarios, we also release a framework for generating these types of videos. Additionally, we propose a snippet-level domain adaptation strategy using Wasserstein adversarial training to bridge the gap between synthetic GTA-Crime features and real-world features like UCF-Crime. Experimental results validate our GTA-Crime dataset and demonstrate that incorporating GTA-Crime with our domain adaptation strategy consistently enhances real world fatal violence detection accuracy. Our dataset and the data generation framework are publicly available at https://github.com/ta-ho/GTA-Crime.

[81] RepViT-CXR: A Channel Replication Strategy for Vision Transformers in Chest X-ray Tuberculosis and Pneumonia Classification

Faisal Ahmed

Main category: cs.CV

TL;DR: RepViT-CXR uses channel replication to adapt single-channel chest X-rays for Vision Transformers, achieving state-of-the-art performance on TB and pneumonia detection across multiple datasets.

DetailsMotivation: Vision Transformers are typically pretrained on three-channel natural images but chest X-rays are grayscale (single-channel), creating a compatibility gap that needs to be addressed for effective medical image analysis.

Method: Proposed RepViT-CXR with channel replication strategy that converts single-channel CXR images into ViT-compatible format without information loss, evaluated on three benchmark datasets.

Result: Achieved 99.9% accuracy/AUC on TB-CXR dataset (surpassing Topo-CXR), 99.0% accuracy on Pediatric Pneumonia dataset (outperforming DCNN/VGG16), and 91.1% accuracy on Shenzhen TB dataset (improving over CNN methods).

Conclusion: Simple channel replication strategy enables ViTs to leverage full representational power on grayscale medical images, establishing new state-of-the-art for TB/pneumonia detection with strong clinical deployment potential.

Abstract: Chest X-ray (CXR) imaging remains one of the most widely used diagnostic tools for detecting pulmonary diseases such as tuberculosis (TB) and pneumonia. Recent advances in deep learning, particularly Vision Transformers (ViTs), have shown strong potential for automated medical image analysis. However, most ViT architectures are pretrained on natural images and require three-channel inputs, while CXR scans are inherently grayscale. To address this gap, we propose RepViT-CXR, a channel replication strategy that adapts single-channel CXR images into a ViT-compatible format without introducing additional information loss. We evaluate RepViT-CXR on three benchmark datasets. On the TB-CXR dataset, our method achieved an accuracy of 99.9% and an AUC of 99.9%, surpassing prior state-of-the-art methods such as Topo-CXR (99.3% accuracy, 99.8% AUC). For the Pediatric Pneumonia dataset, RepViT-CXR obtained 99.0% accuracy, with 99.2% recall, 99.3% precision, and an AUC of 99.0%, outperforming strong baselines including DCNN and VGG16. On the Shenzhen TB dataset, our approach achieved 91.1% accuracy and an AUC of 91.2%, marking a performance improvement over previously reported CNN-based methods. These results demonstrate that a simple yet effective channel replication strategy allows ViTs to fully leverage their representational power on grayscale medical imaging tasks. RepViT-CXR establishes a new state of the art for TB and pneumonia detection from chest X-rays, showing strong potential for deployment in real-world clinical screening systems.
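The channel-replication step itself is simple enough to show directly; a minimal PyTorch sketch, assuming a (B, 1, H, W) grayscale batch:

```python
import torch

def replicate_channels(cxr_batch):
    # Copy the single grayscale channel three times so an RGB-pretrained ViT accepts it;
    # all three channels are identical, so no information is added or lost.
    assert cxr_batch.shape[1] == 1, "expected a single-channel (grayscale) batch"
    return cxr_batch.repeat(1, 3, 1, 1)   # (B, 1, H, W) -> (B, 3, H, W)

x = torch.rand(2, 1, 224, 224)            # dummy chest X-ray batch
print(replicate_channels(x).shape)        # torch.Size([2, 3, 224, 224])
```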

[82] Symmetry Interactive Transformer with CNN Framework for Diagnosis of Alzheimer’s Disease Using Structural MRI

Zheng Yang, Yanteng Zhang, Xupeng Kou, Yang Liu, Chao Ren

Main category: cs.CV

TL;DR: A novel end-to-end network combining 3D CNN and Symmetry Interactive Transformer for Alzheimer’s disease detection, focusing on brain asymmetry caused by atrophy and achieving 92.5% accuracy on ADNI dataset.

DetailsMotivation: Existing deep learning approaches for AD diagnosis often ignore the asymmetrical characteristics caused by brain disorders or rely on pretraining, limiting their ability to capture disease-induced asymmetry patterns.

Method: Proposed network with 3D CNN Encoder and Symmetry Interactive Transformer (SIT). Uses inter-equal grid block fetch operation to align left and right hemisphere features, then processes them through SIT to focus on asymmetric regions caused by structural changes.

Result: Achieved 92.5% diagnostic accuracy on ADNI dataset, outperforming several CNN methods and CNN-transformer combinations. Visualization shows the network effectively focuses on brain atrophy regions and asymmetric pathological characteristics.

Conclusion: The method successfully captures asymmetric features induced by AD, improves diagnostic performance, and provides interpretable results by highlighting regions of brain atrophy, demonstrating effectiveness for Alzheimer’s disease detection.

Abstract: Structural magnetic resonance imaging (sMRI) combined with deep learning has achieved remarkable progress in the prediction and diagnosis of Alzheimer’s disease (AD). Existing studies have used CNNs and transformers to build well-performing networks, but most of them rely on pretraining or ignore the asymmetry caused by brain disorders. We propose an end-to-end network for detecting disease-induced asymmetry arising from left and right brain atrophy, which consists of a 3D CNN Encoder and a Symmetry Interactive Transformer (SIT). Following the inter-equal grid block fetch operation, the corresponding left and right hemisphere features are aligned and subsequently fed into the SIT for diagnostic analysis. SIT helps the model focus more on the regions of asymmetry caused by structural changes, thus improving diagnostic performance. We evaluated our method on the ADNI dataset, and the results show that the method achieves better diagnostic accuracy (92.5%) compared to several CNN methods and CNNs combined with a general transformer. The visualization results show that our network pays more attention to regions of brain atrophy, especially the asymmetric pathological characteristics induced by AD, demonstrating the interpretability and effectiveness of the method.

[83] EVDI++: Event-based Video Deblurring and Interpolation via Self-Supervised Learning

Chi Zhang, Xiang Zhang, Chenxu Jiang, Gui-Song Xia, Lei Yu

Main category: cs.CV

TL;DR: EVDI++ is a self-supervised framework that uses event cameras to deblur videos and interpolate frames, achieving state-of-the-art results through learnable double integration and adaptive fusion strategies.

DetailsMotivation: Frame-based cameras with long exposure times cause visual blurring and information loss between frames, degrading video quality. Event cameras offer high temporal resolution but need effective integration methods.

Method: Uses Learnable Double Integral (LDI) network to map reference frames to sharp images, learning-based division reconstruction for varying exposure intervals, and adaptive parameter-free fusion strategy using event confidence.

Result: Achieves state-of-the-art performance on both synthetic and real-world datasets for video deblurring and interpolation tasks, with demonstrated generalizability in real scenarios.

Conclusion: EVDI++ provides an effective unified framework that leverages event camera capabilities to address motion blur and enable frame interpolation through self-supervised learning with real-world data.

Abstract: Frame-based cameras with extended exposure times often produce perceptible visual blurring and information loss between frames, significantly degrading video quality. To address this challenge, we introduce EVDI++, a unified self-supervised framework for Event-based Video Deblurring and Interpolation that leverages the high temporal resolution of event cameras to mitigate motion blur and enable intermediate frame prediction. Specifically, the Learnable Double Integral (LDI) network is designed to estimate the mapping relation between reference frames and sharp latent images. Then, we refine the coarse results and optimize overall training efficiency by introducing a learning-based division reconstruction module, enabling images to be converted with varying exposure intervals. We devise an adaptive parameter-free fusion strategy to obtain the final results, utilizing the confidence embedded in the LDI outputs of concurrent events. A self-supervised learning framework is proposed to enable network training with real-world blurry videos and events by exploring the mutual constraints among blurry frames, latent images, and event streams. We further construct a dataset with real-world blurry images and events using a DAVIS346c camera, demonstrating the generalizability of the proposed EVDI++ in real-world scenarios. Extensive experiments on both synthetic and real-world datasets show that our method achieves state-of-the-art performance in video deblurring and interpolation tasks.

[84] Hyperspectral Mamba for Hyperspectral Object Tracking

Long Gao, Yunhe Zhang, Yan Jiang, Weiying Xie, Yunsong Li

Main category: cs.CV

TL;DR: HyMamba is a new hyperspectral object tracking network that uses state space modules to unify spectral, cross-depth, and temporal modeling, achieving state-of-the-art performance on benchmark datasets.

DetailsMotivation: Existing hyperspectral trackers fail to capture intrinsic spectral information, temporal dependencies, and cross-depth interactions, limiting their effectiveness in challenging scenarios.

Method: Proposes HyMamba network with Spectral State Integration (SSI) module and Hyperspectral Mamba (HSM) module that uses three directional scanning state space modules to learn spatial and spectral information synchronously.

Result: Achieves 73.0% AUC score and 96.3% DP@20 score on HOTC2020 dataset, demonstrating state-of-the-art performance across seven benchmark datasets.

Conclusion: HyMamba effectively addresses limitations of existing hyperspectral trackers by unifying spectral, cross-depth, and temporal modeling through state space modules, showing superior tracking performance.

Abstract: Hyperspectral object tracking holds great promise due to the rich spectral information and fine-grained material distinctions in hyperspectral images, which are beneficial in challenging scenarios. While existing hyperspectral trackers have made progress by either transforming hyperspectral data into false-color images or incorporating modality fusion strategies, they often fail to capture the intrinsic spectral information, temporal dependencies, and cross-depth interactions. To address these limitations, a new hyperspectral object tracking network equipped with Mamba (HyMamba), is proposed. It unifies spectral, cross-depth, and temporal modeling through state space modules (SSMs). The core of HyMamba lies in the Spectral State Integration (SSI) module, which enables progressive refinement and propagation of spectral features with cross-depth and temporal spectral information. Embedded within each SSI, the Hyperspectral Mamba (HSM) module is introduced to learn spatial and spectral information synchronously via three directional scanning SSMs. Based on SSI and HSM, HyMamba constructs joint features from false-color and hyperspectral inputs, and enhances them through interaction with original spectral features extracted from raw hyperspectral images. Extensive experiments conducted on seven benchmark datasets demonstrate that HyMamba achieves state-of-the-art performance. For instance, it achieves 73.0% of the AUC score and 96.3% of the DP@20 score on the HOTC2020 dataset. The code will be released at https://github.com/lgao001/HyMamba.

[85] Examining Vision Language Models through Multi-dimensional Experiments with Vision and Text Features

Saurav Sengupta, Nazanin Moradinasab, Jiebei Liu, Donald E. Brown

Main category: cs.CV

TL;DR: VLMs rely on training biases and fail on specific visual questions. Minor image/prompt changes cause large performance variations.

DetailsMotivation: VLMs often disregard visual evidence and rely on inherent biases when answering specific questions about images, leading to inaccurate responses.

Method: Developed a multi-dimensional examination framework to analyze how input characteristics (image size, object count, background color, prompt specificity) affect VLM attention and performance using open-source models.

Result: Minor modifications in image characteristics and prompt specificity cause significant changes in how VLMs formulate answers and their overall performance.

Conclusion: VLMs are highly sensitive to input variations, and systematic examination of attention patterns can help characterize and potentially mitigate their reliance on training biases.

Abstract: Recent research on Vision Language Models (VLMs) suggests that they rely on inherent biases learned during training to respond to questions about visual properties of an image. These biases are exacerbated when VLMs are asked highly specific questions that require focusing on specific areas of the image. For example, a VLM tasked with counting stars on a modified American flag (e.g., with more than 50 stars) will often disregard the visual evidence and fail to answer accurately. We build upon this research and develop a multi-dimensional examination framework to systematically determine which characteristics of the input data, including both the image and the accompanying prompt, lead to such differences in performance. Using open-source VLMs, we further examine how attention values fluctuate with varying input parameters (e.g., image size, number of objects in the image, background color, prompt specificity). This research aims to learn how the behavior of vision language models changes and to explore methods for characterizing such changes. Our results suggest, among other things, that even minor modifications in image characteristics and prompt specificity can lead to large changes in how a VLM formulates its answer and, subsequently, its overall performance.

[86] Generalized Zero-Shot Learning for Point Cloud Segmentation with Evidence-Based Dynamic Calibration

Hyeonseok Kim, Byeongkeun Kang, Yeejin Lee

Main category: cs.CV

TL;DR: E3DPC-GZSL is a novel method for generalized zero-shot 3D point cloud segmentation that reduces biased predictions toward seen classes using evidence-based uncertainty estimation and dynamic calibration.

DetailsMotivation: Address the bias problem in 3D point cloud segmentation where models tend to favor seen classes over unseen ones, especially challenging due to smaller training data scales in 3D compared to image tasks.

Method: Integrates evidence-based uncertainty estimator into classifier, uses dynamic calibrated stacking factor to adjust probabilities based on pointwise uncertainty, and employs novel training strategy merging learnable parameters with text-derived features to refine semantic space.

Result: Achieves state-of-the-art performance on generalized zero-shot semantic segmentation datasets including ScanNet v2 and S3DIS.

Conclusion: The proposed approach effectively reduces overconfident predictions toward seen classes and improves uncertainty estimation, demonstrating superior performance in 3D zero-shot segmentation tasks.

Abstract: Generalized zero-shot semantic segmentation of 3D point clouds aims to classify each point into both seen and unseen classes. A significant challenge with these models is their tendency to make biased predictions, often favoring the classes encountered during training. This problem is more pronounced in 3D applications, where the scale of the training data is typically smaller than in image-based tasks. To address this problem, we propose a novel method called E3DPC-GZSL, which reduces overconfident predictions towards seen classes without relying on separate classifiers for seen and unseen data. E3DPC-GZSL tackles the overconfidence problem by integrating an evidence-based uncertainty estimator into a classifier. This estimator is then used to adjust prediction probabilities using a dynamic calibrated stacking factor that accounts for pointwise prediction uncertainty. In addition, E3DPC-GZSL introduces a novel training strategy that improves uncertainty estimation by refining the semantic space. This is achieved by merging learnable parameters with text-derived features, thereby improving model optimization for unseen data. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance on generalized zero-shot semantic segmentation datasets, including ScanNet v2 and S3DIS.
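The dynamic calibrated stacking idea can be sketched as a per-point penalty on seen-class scores that grows with the estimated uncertainty; the linear form `gamma * uncertainty` below is an assumption, not the paper's exact calibration function.

```python
import numpy as np

def dynamic_calibrated_stacking(probs, seen_mask, uncertainty, gamma=0.3):
    # probs: (N, C) class probabilities; seen_mask: (C,) bool; uncertainty: (N,) in [0, 1].
    calibrated = probs.copy()
    calibrated[:, seen_mask] -= gamma * uncertainty[:, None]  # penalize seen classes per point
    return calibrated.argmax(axis=1)

probs = np.array([[0.50, 0.30, 0.20],    # confident point: keeps its seen-class prediction
                  [0.40, 0.35, 0.25]])   # uncertain point: nudged toward the unseen class
seen = np.array([True, True, False])
print(dynamic_calibrated_stacking(probs, seen, uncertainty=np.array([0.1, 0.9])))  # [0 2]
```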

[87] Dual-Thresholding Heatmaps to Cluster Proposals for Weakly Supervised Object Detection

Yuelin Guo, Haoyu He, Zhiyuan Chen, Zitong Huang, Renhao Lu, Lu Shi, Zejun Wang, Weizhe Zhang

Main category: cs.CV

TL;DR: A novel weakly supervised object detection framework that addresses three key limitations in existing methods through heatmap-guided proposal selection, enhanced network architecture with background class representation, and negative certainty supervision for faster convergence.

DetailsMotivation: Existing WSOD methods suffer from three main problems: 1) pseudo GT boxes either focus only on discriminative parts or fail to distinguish adjacent intra-class instances, 2) WSDDN lacks background class representation and has semantic gap between branches, 3) discarded proposals lead to slow convergence.

Method: Proposes three key components: 1) Heatmap-Guided Proposal Selector (HGPS) with dual thresholds for better proposal selection, 2) Weakly Supervised Basic Detection Network (WSBDN) with background class representation and heatmap pre-supervision, 3) Negative certainty supervision loss on ignored proposals.

Result: Achieves state-of-the-art performance with mAP/mCorLoc scores of 58.5%/81.8% on VOC 2007 and 55.6%/80.5% on VOC 2012, outperforming existing WSOD methods.

Conclusion: The proposed framework effectively addresses the three key limitations in WSOD, demonstrating superior performance through improved proposal selection, enhanced network architecture, and better optimization of ignored proposals.

Abstract: Weakly supervised object detection (WSOD) has attracted significant attention in recent years, as it does not require box-level annotations. State-of-the-art methods generally adopt a multi-module network, which employs WSDDN as the multiple instance detection network module and multiple instance refinement modules to refine performance. However, these approaches suffer from three key limitations. First, existing methods tend to generate pseudo GT boxes that either focus only on discriminative parts, failing to capture the whole object, or cover the entire object but fail to distinguish between adjacent intra-class instances. Second, the foundational WSDDN architecture lacks a crucial background class representation for each proposal and exhibits a large semantic gap between its branches. Third, prior methods discard ignored proposals during optimization, leading to slow convergence. To address these challenges, we first design a heatmap-guided proposal selector (HGPS) algorithm, which utilizes dual thresholds on heatmaps to pre-select proposals, enabling pseudo GT boxes to both capture the full object extent and distinguish between adjacent intra-class instances. We then present a weakly supervised basic detection network (WSBDN), which augments each proposal with a background class representation and uses heatmaps for pre-supervision to bridge the semantic gap between matrices. At last, we introduce a negative certainty supervision loss on ignored proposals to accelerate convergence. Extensive experiments on the challenging PASCAL VOC 2007 and 2012 datasets demonstrate the effectiveness of our framework. We achieve mAP/mCorLoc scores of 58.5%/81.8% on VOC 2007 and 55.6%/80.5% on VOC 2012, performing favorably against the state-of-the-art WSOD methods. Our code is publicly available at https://github.com/gyl2565309278/DTH-CP.
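One way to picture the dual-threshold idea is the toy rule below: keep a proposal only if it contains a strong peak (high threshold) and its average activation exceeds the low threshold; the exact HGPS rule in the paper may differ.

```python
import numpy as np

def dual_threshold_select(heatmap, proposals, t_low=0.3, t_high=0.7):
    # heatmap: (H, W) class activation map; proposals: list of (x1, y1, x2, y2) boxes.
    kept = []
    for (x1, y1, x2, y2) in proposals:
        region = heatmap[y1:y2, x1:x2]
        if region.size and region.max() > t_high and region.mean() > t_low:
            kept.append((x1, y1, x2, y2))
    return kept

hm = np.zeros((10, 10))
hm[4:6, 4:6] = 1.0                              # strong peak on the discriminative part
hm[2:8, 2:8] = np.maximum(hm[2:8, 2:8], 0.25)   # weaker response over the full object extent
print(dual_threshold_select(hm, [(2, 2, 8, 8), (0, 0, 3, 3)]))  # keeps only the full-extent box
```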

[88] An Open Benchmark Dataset for GeoAI Foundation Models for Oil Palm Mapping in Indonesia

M. Warizmi Wafiq, Peter Cutter, Ate Poortinga, Daniel Marc G. dela Torre, Karis Tenneson, Vanna Teck, Enikoe Bihari, Chanarun Saisaward, Weraphong Suaruang, Andrea McMahon, Andi Vika Faradiba Muin, Karno B. Batiran, Chairil A, Nurul Qomar, Arya Arismaya Metananda, David Ganz, David Saah

Main category: cs.CV

TL;DR: Open-access geospatial dataset of Indonesian oil palm plantations created through expert labeling of high-resolution satellite imagery (2020-2024) to support deforestation monitoring and sustainability efforts.

DetailsMotivation: Oil palm cultivation is a leading cause of deforestation in Indonesia, requiring detailed mapping to support sustainability efforts and regulatory frameworks for transparent monitoring.

Method: Expert labeling of high-resolution satellite imagery using wall-to-wall digitization over large grids, with quality ensured through multi-interpreter consensus and field validation. Includes hierarchical typology distinguishing oil palm planting stages and similar perennial crops.

Result: Polygon-based, wall-to-wall annotations across agro-ecological zones suitable for training both conventional CNNs and newer geospatial foundation models. Dataset released under CC-BY license following FAIR data principles.

Conclusion: This dataset fills a key gap in training data for remote sensing, aims to improve land cover mapping accuracy, and contributes to global deforestation reduction goals by supporting transparent monitoring of oil palm expansion.

Abstract: Oil palm cultivation remains one of the leading causes of deforestation in Indonesia. To better track and address this challenge, detailed and reliable mapping is needed to support sustainability efforts and emerging regulatory frameworks. We present an open-access geospatial dataset of oil palm plantations and related land cover types in Indonesia, produced through expert labeling of high-resolution satellite imagery from 2020 to 2024. The dataset provides polygon-based, wall-to-wall annotations across a range of agro-ecological zones and includes a hierarchical typology that distinguishes oil palm planting stages as well as similar perennial crops. Quality was ensured through multi-interpreter consensus and field validation. The dataset was created using wall-to-wall digitization over large grids, making it suitable for training and benchmarking both conventional convolutional neural networks and newer geospatial foundation models. Released under a CC-BY license, it fills a key gap in training data for remote sensing and aims to improve the accuracy of land cover type mapping. By supporting transparent monitoring of oil palm expansion, the resource contributes to global deforestation reduction goals and follows FAIR data principles.

[89] SimCroP: Radiograph Representation Learning with Similarity-driven Cross-granularity Pre-training

Rongsheng Wang, Fenghe Tang, Qingsong Yao, Rui Yan, Xu Zhang, Zhen Huang, Haoran Lai, Zhiyang He, Xiaodong Tao, Zihang Jiang, Shaohua Kevin Zhou

Main category: cs.CV

TL;DR: SimCroP is a vision-language pre-training framework for chest CT scans that uses similarity-driven alignment and cross-granularity fusion to improve radiograph interpretation by addressing spatial sparsity of lesions and complex relationships between reports and image regions.

DetailsMotivation: CT scans have spatially sparse lesion distributions with intricate structures, and there are complex implicit relationships between pathological descriptions in reports and their corresponding sub-regions in radiographs, which pose challenges for medical vision-language pre-training.

Method: The framework uses multi-modal masked modeling to optimize encoder understanding of low-level semantics, similarity-driven alignment to adaptively select and align correct patches with report sentences, and cross-granularity fusion to integrate multimodal information across instance level and word-patch level.

Result: SimCroP outperforms state-of-the-art medical self-supervised learning methods and medical vision-language pre-training methods on image classification and segmentation tasks across five public datasets.

Conclusion: The proposed SimCroP framework effectively addresses the challenges of spatial sparsity and complex relationships in CT scans, demonstrating superior performance in capturing key pathology structures and improving multi-scale downstream tasks.

Abstract: Medical vision-language pre-training shows great potential in learning representative features from massive paired radiographs and reports. However, in computed tomography (CT) scans, the distribution of lesions which contain intricate structures is characterized by spatial sparsity. Besides, the complex and implicit relationships between different pathological descriptions in each sentence of the report and their corresponding sub-regions in radiographs pose additional challenges. In this paper, we propose a Similarity-Driven Cross-Granularity Pre-training (SimCroP) framework on chest CTs, which combines similarity-driven alignment and cross-granularity fusion to improve radiograph interpretation. We first leverage multi-modal masked modeling to optimize the encoder for understanding precise low-level semantics from radiographs. Then, similarity-driven alignment is designed to pre-train the encoder to adaptively select and align the correct patches corresponding to each sentence in reports. The cross-granularity fusion module integrates multimodal information across instance level and word-patch level, which helps the model better capture key pathology structures in sparse radiographs, resulting in improved performance for multi-scale downstream tasks. SimCroP is pre-trained on a large-scale paired CT-reports dataset and validated on image classification and segmentation tasks across five public datasets. Experimental results demonstrate that SimCroP outperforms both cutting-edge medical self-supervised learning methods and medical vision-language pre-training methods. Codes and models are available at https://github.com/ToniChopp/SimCroP.

[90] Boosted Training of Lightweight Early Exits for Optimizing CNN Image Classification Inference

Yehudit Aperstein, Alexander Apartsin

Main category: cs.CV

TL;DR: BTS-EE is a sequential training method for early-exit CNNs that addresses covariance shift by aligning branch training with inference distributions, achieving 45% computation reduction with minimal accuracy loss.

DetailsMotivation: Real-time image classification on resource-constrained platforms requires balancing accuracy with latency/power budgets. Conventional early-exit training suffers from covariance shift where downstream branches are trained on full datasets but only process harder samples at inference.

Method: Boosted Training Scheme for Early Exits (BTS-EE) with sequential training where each branch is trained and calibrated before the next. Uses lightweight 1D convolution branch architecture and Class Precision Margin (CPM) calibration for per-class threshold tuning.

Result: On CINIC-10 dataset with ResNet18 backbone, BTS-EE outperforms non-boosted training across 64 configurations, achieving up to 45% computation reduction with only 2% accuracy degradation.

Conclusion: BTS-EE expands design space for deploying CNNs in real-time systems, offering practical efficiency gains for industrial inspection, embedded vision, and UAV-based monitoring applications.

Abstract: Real-time image classification on resource-constrained platforms demands inference methods that balance accuracy with strict latency and power budgets. Early-exit strategies address this need by attaching auxiliary classifiers to intermediate layers of convolutional neural networks (CNNs), allowing “easy” samples to terminate inference early. However, conventional training of early exits introduces a covariance shift: downstream branches are trained on full datasets, while at inference they process only the harder, non-exited samples. This mismatch limits efficiency–accuracy trade-offs in practice. We introduce the Boosted Training Scheme for Early Exits (BTS-EE), a sequential training approach that aligns branch training with inference-time data distributions. Each branch is trained and calibrated before the next, ensuring robustness under selective inference conditions. To further support embedded deployment, we propose a lightweight branch architecture based on 1D convolutions and a Class Precision Margin (CPM) calibration method that enables per-class threshold tuning for reliable exit decisions. Experiments on the CINIC-10 dataset with a ResNet18 backbone demonstrate that BTS-EE consistently outperforms non-boosted training across 64 configurations, achieving up to 45 percent reduction in computation with only 2 percent accuracy degradation. These results expand the design space for deploying CNNs in real-time image processing systems, offering practical efficiency gains for applications such as industrial inspection, embedded vision, and UAV-based monitoring.
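
The exit rule described above can be sketched as follows: each branch produces logits, and a sample leaves the network at the first branch whose predicted class clears that branch's per-class confidence threshold (the thresholds would come from a calibration pass such as the paper's CPM step). This is an illustrative reconstruction, not the released BTS-EE code.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def early_exit_predict(branch_logits, class_thresholds):
    """branch_logits:    list of (num_classes,) logit vectors, shallow to deep;
    class_thresholds: matching list of (num_classes,) per-class exit thresholds.
    Returns (predicted_class, index_of_exit_used)."""
    for i, logits in enumerate(branch_logits[:-1]):
        probs = softmax(logits)
        c = int(probs.argmax())
        if probs[c] >= class_thresholds[i][c]:
            return c, i                       # confident enough: leave the network here
    final = branch_logits[-1]
    return int(final.argmax()), len(branch_logits) - 1

# toy example: the first branch is already confident, so the sample exits at index 0
logits = [np.array([4.0, 0.5, 0.2]), np.array([1.0, 1.1, 0.9])]
thresholds = [np.full(3, 0.9), np.full(3, 0.0)]
print(early_exit_predict(logits, thresholds))  # (0, 0)
```

In the boosted scheme, branch i would be trained and calibrated first, and only the samples it does not exit would form the training set for branch i+1, which is what aligns training with the inference-time distribution.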

[91] Retrieval-Augmented VLMs for Multimodal Melanoma Diagnosis

Jihyun Moon, Charmgil Hong

Main category: cs.CV

TL;DR: Retrieval-augmented vision-language model for melanoma diagnosis that incorporates similar patient cases into prompts, improving accuracy without fine-tuning.

DetailsMotivation: CNNs neglect clinical metadata and require extensive preprocessing, while VLMs trained on general data lack clinical specificity for melanoma diagnosis.

Method: Proposed a retrieval-augmented VLM framework that incorporates semantically similar patient cases into diagnostic prompts without requiring model fine-tuning.

Result: Significantly improves classification accuracy and error correction over conventional baselines in melanoma diagnosis.

Conclusion: Retrieval-augmented prompting provides a robust strategy for clinical decision support in melanoma diagnosis.

Abstract: Accurate and early diagnosis of malignant melanoma is critical for improving patient outcomes. While convolutional neural networks (CNNs) have shown promise in dermoscopic image analysis, they often neglect clinical metadata and require extensive preprocessing. Vision-language models (VLMs) offer a multimodal alternative but struggle to capture clinical specificity when trained on general-domain data. To address this, we propose a retrieval-augmented VLM framework that incorporates semantically similar patient cases into the diagnostic prompt. Our method enables informed predictions without fine-tuning and significantly improves classification accuracy and error correction over conventional baselines. These results demonstrate that retrieval-augmented prompting provides a robust strategy for clinical decision support.
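
A small sketch of the retrieval-augmented prompting idea, assuming a precomputed embedding index of archived cases: the k nearest cases by cosine similarity are folded into the diagnostic prompt that accompanies the query image. The field names, prompt wording, and k are hypothetical; the paper's actual retrieval pipeline and template are not reproduced here.

```python
import numpy as np

def build_rag_prompt(query_emb, case_embs, case_records, k=3):
    """Retrieve the k most similar archived cases by cosine similarity and fold
    them into the diagnostic prompt (field names and wording are placeholders)."""
    q = query_emb / np.linalg.norm(query_emb)
    c = case_embs / np.linalg.norm(case_embs, axis=1, keepdims=True)
    top = np.argsort(-(c @ q))[:k]
    context = "\n".join(
        f"- similar case: {case_records[i]['metadata']}; diagnosis: {case_records[i]['label']}"
        for i in top)
    return (
        "You are assisting with melanoma diagnosis.\n"
        "Reference cases retrieved for this patient:\n"
        f"{context}\n"
        "Given the attached dermoscopic image and the patient metadata, "
        "classify the lesion as benign or malignant and explain briefly.")

# toy usage with three archived cases
cases = [{"metadata": "age 54, trunk lesion", "label": "malignant"},
         {"metadata": "age 31, arm lesion", "label": "benign"},
         {"metadata": "age 67, facial lesion", "label": "malignant"}]
embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(build_rag_prompt(np.array([0.9, 0.1]), embs, cases, k=2))
```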

[92] InsFusion: Rethink Instance-level LiDAR-Camera Fusion for 3D Object Detection

Zhongyu Xia, Hansong Yang, Yongtao Wang

Main category: cs.CV

TL;DR: InsFusion is a 3D object detection method that extracts proposals from both raw and fused features to mitigate accumulated errors from feature extraction, perspective transformation, and fusion processes.

DetailsMotivation: To address the problem of noise and error accumulation during basic feature extraction, perspective transformation, and feature fusion in 3D object detection from multi-view cameras and LiDAR for autonomous driving applications.

Method: Proposes InsFusion which extracts proposals from both raw and fused features, uses these proposals to query raw features, and incorporates attention mechanisms applied to raw features to reduce accumulated errors.

Result: Experiments on nuScenes dataset show InsFusion is compatible with various advanced baseline methods and achieves new state-of-the-art performance for 3D object detection.

Conclusion: InsFusion effectively mitigates error accumulation in 3D object detection pipelines and delivers superior performance on benchmark datasets.

Abstract: Three-dimensional object detection from multi-view cameras and LiDAR is a crucial component of autonomous driving and smart transportation. However, noise and error gradually accumulate during basic feature extraction, perspective transformation, and feature fusion. To address this issue, we propose InsFusion, which extracts proposals from both raw and fused features and uses these proposals to query the raw features, thereby mitigating the impact of accumulated errors. Attention mechanisms applied to the raw features further reduce this error accumulation. Experiments on the nuScenes dataset demonstrate that InsFusion is compatible with various advanced baseline methods and delivers new state-of-the-art performance for 3D object detection.

[93] Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video

Xiao Li, Qi Chen, Xiulian Peng, Kai Yu, Xie Chen, Yan Lu

Main category: cs.CV

TL;DR: A self-supervised framework that disentangles videos into motion and content components using transformer architecture with vector quantization bottleneck and diffusion models for representation learning.

DetailsMotivation: To develop a general video analysis framework that can separate dynamic motion from static content with fewer assumptions and inductive biases than previous works, enabling better video understanding and generation.

Method: Uses transformer-based architecture to generate implicit features for motion and content, incorporates low-bitrate vector quantization as information bottleneck for disentanglement, and employs denoising diffusion models with motion/content conditions for self-supervised learning.

Result: Validated on talking head videos for motion transfer and auto-regressive generation tasks, and shown to generalize to other video types like 2D cartoon character sprites.

Conclusion: Presents a novel perspective on self-supervised learning of disentangled video representations that contributes to video analysis and generation with broad applicability across different video domains.

Abstract: We propose a novel and general framework to disentangle video data into its dynamic motion and static content components. Our proposed method is a self-supervised pipeline with fewer assumptions and inductive biases than previous works: it utilizes a transformer-based architecture to jointly generate flexible implicit features for frame-wise motion and clip-wise content, and incorporates a low-bitrate vector quantization as an information bottleneck to promote disentanglement and form a meaningful discrete motion space. The bitrate-controlled latent motion and content are used as conditional inputs to a denoising diffusion model to facilitate self-supervised representation learning. We validate our disentangled representation learning framework on real-world talking head videos with motion transfer and auto-regressive motion generation tasks. Furthermore, we also show that our method can generalize to other types of video data, such as pixel sprites of 2D cartoon characters. Our work presents a new perspective on self-supervised learning of disentangled video representations, contributing to the broader field of video analysis and generation.

[94] Semantic Causality-Aware Vision-Based 3D Occupancy Prediction

Dubing Chen, Huan Zheng, Yucheng Zhou, Xianfei Li, Wenlong Liao, Tao He, Pai Peng, Jianbing Shen

Main category: cs.CV

TL;DR: Proposes a novel causal loss for end-to-end supervision of 2D-to-3D semantic occupancy prediction, making modular pipelines fully differentiable and learnable.

DetailsMotivation: Existing methods use modular pipelines optimized independently, leading to cascading errors and lack of end-to-end learning.

Method: Semantic Causality-Aware 2D-to-3D Transformation with three components: Channel-Grouped Lifting, Learnable Camera Offsets, and Normalized Convolution, all guided by a novel causal loss that regulates gradient flow from 3D to 2D.

Result: Achieves state-of-the-art performance on Occ3D benchmark with significant robustness to camera perturbations and improved 2D-to-3D semantic consistency.

Conclusion: The proposed causal loss enables holistic end-to-end supervision, making previously non-trainable components learnable and improving overall pipeline performance.

Abstract: Vision-based 3D semantic occupancy prediction is a critical task in 3D vision that integrates volumetric 3D reconstruction with semantic understanding. Existing methods, however, often rely on modular pipelines. These modules are typically optimized independently or use pre-configured inputs, leading to cascading errors. In this paper, we address this limitation by designing a novel causal loss that enables holistic, end-to-end supervision of the modular 2D-to-3D transformation pipeline. Grounded in the principle of 2D-to-3D semantic causality, this loss regulates the gradient flow from 3D voxel representations back to the 2D features. Consequently, it renders the entire pipeline differentiable, unifying the learning process and making previously non-trainable components fully learnable. Building on this principle, we propose the Semantic Causality-Aware 2D-to-3D Transformation, which comprises three components guided by our causal loss: Channel-Grouped Lifting for adaptive semantic mapping, Learnable Camera Offsets for enhanced robustness against camera perturbations, and Normalized Convolution for effective feature propagation. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the Occ3D benchmark, demonstrating significant robustness to camera perturbations and improved 2D-to-3D semantic consistency.

[95] VRAE: Vertical Residual Autoencoder for License Plate Denoising and Deblurring

Cuong Nguyen, Dung T. Tran, Hong Nguyen, Xuan-Vu Phan, Nam-Phong Nguyen

Main category: cs.CV

TL;DR: Proposes Vertical Residual Autoencoder (VRAE) for real-time license plate image enhancement in traffic surveillance, achieving significant improvements over existing methods with minimal parameter increase.

DetailsMotivation: Vehicle images in traffic surveillance often suffer from noise and blur due to adverse weather, poor lighting, or high-speed motion, which severely degrades license plate recognition accuracy, especially when plates occupy small regions in images.

Method: Vertical Residual Autoencoder (VRAE) architecture with an auxiliary block that injects input-aware features at each encoding stage to guide representation learning and preserve general information better than conventional autoencoders.

Result: Outperforms Autoencoder (AE), GAN, and Flow-Based approaches - improves PSNR by ~20%, reduces NMSE by ~50%, enhances SSIM by 1%, with only ~1% parameter increase compared to AE at same depth.

Conclusion: VRAE provides an effective real-time solution for enhancing degraded license plate images in traffic surveillance, significantly boosting recognition performance with minimal computational overhead.

Abstract: In real-world traffic surveillance, vehicle images captured under adverse weather, poor lighting, or high-speed motion often suffer from severe noise and blur. Such degradations significantly reduce the accuracy of license plate recognition systems, especially when the plate occupies only a small region within the full vehicle image. Restoring these degraded images in a fast, real-time manner is thus a crucial pre-processing step to enhance recognition performance. In this work, we propose a Vertical Residual Autoencoder (VRAE) architecture designed for the image enhancement task in traffic surveillance. The method incorporates an enhancement strategy that employs an auxiliary block, which injects input-aware features at each encoding stage to guide the representation learning process, enabling better general information preservation throughout the network compared to conventional autoencoders. Experiments on a vehicle image dataset with visible license plates demonstrate that our method consistently outperforms Autoencoder (AE), Generative Adversarial Network (GAN), and Flow-Based (FB) approaches. Compared with AE at the same depth, it improves PSNR by about 20%, reduces NMSE by around 50%, and enhances SSIM by 1%, while requiring only a marginal increase of roughly 1% in parameters.
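
To illustrate the "input-aware features at each encoding stage" idea, here is a toy PyTorch encoder in which an auxiliary branch re-injects a resized copy of the input at every stage. Channel counts, depth, and the injection-by-concatenation choice are assumptions for illustration, not the actual VRAE architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVerticalResidualEncoder(nn.Module):
    """Toy encoder: each stage sees both the learned features and a downsampled
    copy of the raw input, so input-aware context is preserved through the network."""
    def __init__(self, channels=(16, 32, 64)):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = 3
        for out_ch in channels:
            # +3 channels for the injected, resized input image
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_ch + 3, out_ch, 3, stride=2, padding=1),
                nn.ReLU(inplace=True)))
            in_ch = out_ch

    def forward(self, x):
        feat = x
        for stage in self.stages:
            aux = F.interpolate(x, size=feat.shape[-2:], mode="bilinear",
                                align_corners=False)
            feat = stage(torch.cat([feat, aux], dim=1))
        return feat

# toy usage
out = TinyVerticalResidualEncoder()(torch.randn(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 64, 8, 8])
```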

[96] Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles

Eric Slyman, Mehrab Tanjim, Kushal Kafle, Stefan Lee

Main category: cs.CV

TL;DR: MMB is a multimodal Bayesian prompt ensemble method that improves text-to-image evaluation accuracy and calibration by dynamically weighting prompts based on image characteristics.

DetailsMotivation: MLLMs used for TTI evaluation suffer from biases, overconfidence, and inconsistent performance across image domains, while standard prompt ensembling methods fail in multimodal settings.

Method: Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB) using Bayesian prompt ensemble augmented by image clustering to dynamically assign prompt weights based on visual characteristics.

Result: MMB outperforms baselines on HPSv2 and MJBench benchmarks, improving accuracy in pairwise preference judgments and greatly enhancing calibration for better uncertainty estimation.

Conclusion: Multimodal-specific strategies like MMB are crucial for reliable judge calibration and represent a promising approach for large-scale TTI evaluation.

Abstract: Multimodal large language models (MLLMs) are increasingly used to evaluate text-to-image (TTI) generation systems, providing automated judgments based on visual and textual context. However, these “judge” models often suffer from biases, overconfidence, and inconsistent performance across diverse image domains. While prompt ensembling has shown promise for mitigating these issues in unimodal, text-only settings, our experiments reveal that standard ensembling methods fail to generalize effectively for TTI tasks. To address these limitations, we propose a new multimodal-aware method called Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB). Our method uses a Bayesian prompt ensemble approach augmented by image clustering, allowing the judge to dynamically assign prompt weights based on the visual characteristics of each sample. We show that MMB improves accuracy in pairwise preference judgments and greatly enhances calibration, making it easier to gauge the judge’s true uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB outperforms existing baselines in alignment with human annotations and calibration across varied image content. Our findings highlight the importance of multimodal-specific strategies for judge calibration and suggest a promising path forward for reliable large-scale TTI evaluation.
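
The cluster-conditioned prompt weighting can be sketched as below: an image is softly assigned to clusters learned from a calibration set, the per-cluster prompt weights are mixed accordingly, and the judge's per-prompt scores are averaged under those weights. The distance-based assignment and all shapes are illustrative assumptions rather than the paper's exact Bayesian formulation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mmb_judge(image_emb, cluster_centers, prompt_weights_per_cluster, prompt_scores):
    """image_emb:                  (D,) embedding of the image being judged
    cluster_centers:            (K, D) centers from clustering a calibration set
    prompt_weights_per_cluster: (K, P) per-cluster weights over P judge prompts
    prompt_scores:              (P,) the judge's score for each prompt on this sample
    Returns a single calibrated score in the same range as prompt_scores."""
    # soft cluster assignment from (negative) squared distance to each center
    d2 = ((cluster_centers - image_emb) ** 2).sum(axis=1)
    cluster_resp = softmax(-d2)                            # (K,)
    # mix the per-cluster prompt weights, then average the prompt scores
    weights = cluster_resp @ prompt_weights_per_cluster    # (P,)
    weights = weights / weights.sum()
    return float(weights @ prompt_scores)

# toy usage: cluster 0 trusts prompt 0, cluster 1 trusts prompt 1
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
w = np.array([[0.7, 0.3], [0.2, 0.8]])
print(mmb_judge(np.array([0.9, 0.1]), centers, w, prompt_scores=np.array([0.8, 0.3])))
```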

[97] Sparse BEV Fusion with Self-View Consistency for Multi-View Detection and Tracking

Keisuke Toida, Taigo Sakai, Naoki Kato, Kazutoyo Yokota, Takeshi Nakamura, Kazuhiro Hotta

Main category: cs.CV

TL;DR: SCFusion improves multi-view multi-object tracking by addressing BEV projection issues with sparse transformation, density-aware weighting, and multi-view consistency loss, achieving state-of-the-art performance.

DetailsMotivation: Maintaining consistent object identities across multiple cameras is challenging due to viewpoint changes, lighting variations, and occlusions. BEV projection introduces feature distortion and non-uniform density that degrade tracking accuracy.

Method: Combines three techniques: 1) sparse transformation to avoid unnatural interpolation, 2) density-aware weighting for adaptive feature fusion based on spatial confidence and camera distance, 3) multi-view consistency loss to encourage discriminative feature learning before fusion.

Result: Achieves 95.9% IDF1 score on WildTrack and 89.2% MODP on MultiviewX, outperforming baseline TrackTacular method.

Conclusion: SCFusion effectively mitigates limitations of conventional BEV projection and provides a robust, accurate solution for multi-view object detection and tracking.

Abstract: Multi-View Multi-Object Tracking (MVMOT) is essential for applications such as surveillance, autonomous driving, and sports analytics. However, maintaining consistent object identities across multiple cameras remains challenging due to viewpoint changes, lighting variations, and occlusions, which often lead to tracking errors. Recent methods project features from multiple cameras into a unified Bird’s-Eye-View (BEV) space to improve robustness against occlusion. However, this projection introduces feature distortion and non-uniform density caused by variations in object scale with distance. These issues degrade the quality of the fused representation and reduce detection and tracking accuracy. To address these problems, we propose SCFusion, a framework that combines three techniques to improve multi-view feature integration. First, it applies a sparse transformation to avoid unnatural interpolation during projection. Next, it performs density-aware weighting to adaptively fuse features based on spatial confidence and camera distance. Finally, it introduces a multi-view consistency loss that encourages each camera to learn discriminative features independently before fusion. Experiments show that SCFusion achieves state-of-the-art performance, reaching an IDF1 score of 95.9% on WildTrack and a MODP of 89.2% on MultiviewX, outperforming the baseline method TrackTacular. These results demonstrate that SCFusion effectively mitigates the limitations of conventional BEV projection and provides a robust and accurate solution for multi-view object detection and tracking.
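
The density-aware weighting step can be pictured with a short NumPy sketch: per-camera BEV features are fused with weights that grow with spatial confidence and shrink with distance from the camera, then normalized over cameras. The specific weighting formula below is an illustrative assumption, not the paper's exact formulation.

```python
import numpy as np

def density_aware_fuse(bev_feats, confidence, cam_distance, eps=1e-6):
    """bev_feats:    (V, C, H, W) features projected into BEV from V cameras
    confidence:   (V, H, W) per-cell projection confidence in [0, 1]
    cam_distance: (V, H, W) distance from each BEV cell to the corresponding camera
    Returns a fused (C, H, W) BEV feature map."""
    w = confidence / (1.0 + cam_distance)            # favor confident, nearby cells
    w = w / (w.sum(axis=0, keepdims=True) + eps)     # normalize over cameras
    return (bev_feats * w[:, None]).sum(axis=0)

# toy usage: 3 cameras, 8 channels, 4x4 BEV grid
feats = np.random.rand(3, 8, 4, 4)
conf = np.random.rand(3, 4, 4)
dist = np.random.rand(3, 4, 4) * 30.0
print(density_aware_fuse(feats, conf, dist).shape)   # (8, 4, 4)
```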

[98] LD-ViCE: Latent Diffusion Model for Video Counterfactual Explanations

Payal Varshney, Adriano Lucieri, Christoph Balada, Sheraz Ahmed, Andreas Dengel

Main category: cs.CV

TL;DR: LD-ViCE is a novel framework that uses latent diffusion models to generate realistic, temporally coherent counterfactual explanations for video-based AI systems, significantly improving computational efficiency and explanation quality.

DetailsMotivation: Video-based AI systems in safety-critical domains need interpretable decisions, but existing explanation techniques lack temporal coherence, robustness, and actionable causal insights. Current counterfactual methods don't incorporate model guidance, reducing semantic fidelity.

Method: LD-ViCE operates in latent space using a state-of-the-art diffusion model to reduce computational costs, with an additional refinement step to produce realistic and interpretable counterfactuals. It incorporates guidance from the target model.

Result: LD-ViCE outperforms state-of-the-art methods with up to 68% increase in R2 score while reducing inference time by half. It generates semantically meaningful and temporally coherent explanations across three diverse video datasets.

Conclusion: LD-ViCE represents a significant advancement toward trustworthy AI deployment in safety-critical domains by providing efficient, high-quality counterfactual explanations for video-based models.

Abstract: Video-based AI systems are increasingly adopted in safety-critical domains such as autonomous driving and healthcare. However, interpreting their decisions remains challenging due to the inherent spatiotemporal complexity of video data and the opacity of deep learning models. Existing explanation techniques often suffer from limited temporal coherence, insufficient robustness, and a lack of actionable causal insights. Current counterfactual explanation methods typically do not incorporate guidance from the target model, reducing semantic fidelity and practical utility. We introduce Latent Diffusion for Video Counterfactual Explanations (LD-ViCE), a novel framework designed to explain the behavior of video-based AI models. Compared to previous approaches, LD-ViCE reduces the computational costs of generating explanations by operating in latent space using a state-of-the-art diffusion model, while producing realistic and interpretable counterfactuals through an additional refinement step. Our experiments demonstrate the effectiveness of LD-ViCE across three diverse video datasets, including EchoNet-Dynamic (cardiac ultrasound), FERV39k (facial expression), and Something-Something V2 (action recognition). LD-ViCE outperforms a recent state-of-the-art method, achieving an increase in R2 score of up to 68% while reducing inference time by half. Qualitative analysis confirms that LD-ViCE generates semantically meaningful and temporally coherent explanations, offering valuable insights into the target model behavior. LD-ViCE represents a valuable step toward the trustworthy deployment of AI in safety-critical domains.

[99] Spherical Brownian Bridge Diffusion Models for Conditional Cortical Thickness Forecasting

Ivan Stoyanov, Fabian Bongratz, Christian Wachinger

Main category: cs.CV

TL;DR: SBDM is a novel diffusion model for accurate cortical thickness trajectory forecasting on spherical brain surfaces, achieving better performance than previous methods.

DetailsMotivation: Accurate cortical thickness forecasting is crucial for detecting neurodegenerative changes and enabling early interventions, but challenging due to the brain's complex geometry and need for multi-modal integration.

Method: Proposes Spherical Brownian Bridge Diffusion Model (SBDM) with bidirectional conditional diffusion process and conditional spherical U-Net (CoS-UNet) that combines spherical convolutions with dense cross-attention.

Result: Significantly reduced prediction errors compared to previous approaches on ADNI and OASIS datasets, with ability to generate both factual and counterfactual cortical thickness trajectories.

Conclusion: SBDM provides a novel framework for exploring cortical development scenarios and offers improved accuracy for personalized cortical thickness forecasting.

Abstract: Accurate forecasting of individualized, high-resolution cortical thickness (CTh) trajectories is essential for detecting subtle cortical changes, providing invaluable insights into neurodegenerative processes and facilitating earlier and more precise intervention strategies. However, CTh forecasting is a challenging task due to the intricate non-Euclidean geometry of the cerebral cortex and the need to integrate multi-modal data for subject-specific predictions. To address these challenges, we introduce the Spherical Brownian Bridge Diffusion Model (SBDM). Specifically, we propose a bidirectional conditional Brownian bridge diffusion process to forecast CTh trajectories at the vertex level of registered cortical surfaces. Our technical contribution includes a new denoising model, the conditional spherical U-Net (CoS-UNet), which combines spherical convolutions and dense cross-attention to integrate cortical surfaces and tabular conditions seamlessly. Compared to previous approaches, SBDM achieves significantly reduced prediction errors, as demonstrated by our experiments based on longitudinal datasets from the ADNI and OASIS. Additionally, we demonstrate SBDM’s ability to generate individual factual and counterfactual CTh trajectories, offering a novel framework for exploring hypothetical scenarios of cortical development.

[100] Beyond Distribution Shifts: Adaptive Hyperspectral Image Classification at Test Time

Xia Yue, Anfeng Liu, Ning Chen, Chenjia Huang, Hui Liu, Zhou Huang, Leyuan Fang

Main category: cs.CV

TL;DR: HyperTTA is a unified framework for robust hyperspectral image classification under various degradation conditions, featuring a multi-degradation dataset, spectral-spatial transformer classifier, and lightweight test-time adaptation strategy.

DetailsMotivation: HSI classification models are highly sensitive to distribution shifts caused by real-world degradations like noise, blur, compression, and atmospheric effects, requiring robust solutions.

Method: Constructed multi-degradation hyperspectral dataset with 9 degradation types; designed spectral-spatial transformer classifier with multi-level receptive field and label smoothing; developed confidence-aware entropy-minimized LayerNorm adapter for test-time adaptation.

Result: Extensive experiments on two benchmark datasets show HyperTTA outperforms existing baselines across various degradation scenarios.

Conclusion: HyperTTA effectively enhances model robustness through its classification backbone and TTA scheme, enabling dynamic adaptation without source data or target annotations.

Abstract: Hyperspectral image (HSI) classification models are highly sensitive to distribution shifts caused by various real-world degradations such as noise, blur, compression, and atmospheric effects. To address this challenge, we propose HyperTTA, a unified framework designed to enhance model robustness under diverse degradation conditions. Specifically, we first construct a multi-degradation hyperspectral dataset that systematically simulates nine representative types of degradations, providing a comprehensive benchmark for robust classification evaluation. Based on this, we design a spectral-spatial transformer classifier (SSTC) enhanced with a multi-level receptive field mechanism and label smoothing regularization to jointly capture multi-scale spatial context and improve generalization. Furthermore, HyperTTA incorporates a lightweight test-time adaptation (TTA) strategy, the confidence-aware entropy-minimized LayerNorm adapter (CELA), which updates only the affine parameters of LayerNorm layers by minimizing prediction entropy on high-confidence unlabeled target samples. This confidence-aware adaptation prevents unreliable updates from noisy predictions, enabling robust and dynamic adaptation without access to source data or target annotations. Extensive experiments on two benchmark datasets demonstrate that HyperTTA outperforms existing baselines across a wide range of degradation scenarios, validating the effectiveness of both its classification backbone and the proposed TTA scheme. Code will be made available publicly.
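
The CELA adapter can be pictured with a short PyTorch sketch: only the affine parameters of LayerNorm layers are handed to the optimizer, and each unlabeled test batch contributes one entropy-minimization step restricted to high-confidence predictions. The confidence threshold, optimizer, and helper names are assumptions; this is not the authors' released code.

```python
import torch

def collect_layernorm_params(model):
    """Gather only the affine parameters of LayerNorm layers for adaptation."""
    params = []
    for m in model.modules():
        if isinstance(m, torch.nn.LayerNorm) and m.weight is not None:
            params += [m.weight, m.bias]
    return params

@torch.enable_grad()
def cela_adapt_step(model, x, optimizer, conf_thresh=0.8):
    """One confidence-aware entropy-minimization step on an unlabeled batch x."""
    logits = model(x)
    probs = logits.softmax(dim=-1)
    conf, _ = probs.max(dim=-1)
    mask = conf >= conf_thresh                    # keep only high-confidence samples
    if mask.sum() == 0:
        return logits                             # nothing reliable to adapt on
    p = probs[mask]
    entropy = -(p * p.clamp_min(1e-8).log()).sum(dim=-1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return logits

# usage sketch (assumes `model` is a classifier whose blocks use LayerNorm):
# model.eval()
# opt = torch.optim.SGD(collect_layernorm_params(model), lr=1e-3)
# for x in unlabeled_target_batches:
#     logits = cela_adapt_step(model, x, opt)
```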

[101] First-order State Space Model for Lightweight Image Super-resolution

Yujie Zhu, Xinyi Zhang, Yekai Lu, Guang Yang, Faming Fang, Guixu Zhang

Main category: cs.CV

TL;DR: FSSM improves Mamba-based vision models for super-resolution by modifying SSM calculation with first-order hold condition to incorporate token correlations, achieving state-of-the-art results without additional parameters.

DetailsMotivation: Most Mamba-based vision models focus on network architecture and scan paths but neglect the SSM module itself. The authors want to explore SSM's potential for lightweight super-resolution tasks.

Method: Introduce First-order State Space Model (FSSM) that applies first-order hold condition in SSMs, derives new discretized form, and analyzes cumulative error to improve token correlations.

Result: FSSM improves MambaIR performance on five benchmark datasets without increasing parameters, surpassing current lightweight SR methods and achieving state-of-the-art results.

Conclusion: The proposed FSSM modification successfully enhances SSM performance for vision tasks, demonstrating that optimizing the SSM module itself can significantly improve model capabilities in super-resolution applications.

Abstract: State space models (SSMs), particularly Mamba, have shown promise in NLP tasks and are increasingly applied to vision tasks. However, most Mamba-based vision models focus on network architecture and scan paths, with little attention to the SSM module itself. To explore the potential of SSMs, we modify the calculation process of the SSM, without increasing the number of parameters, to improve performance on lightweight super-resolution tasks. In this paper, we introduce the First-order State Space Model (FSSM) to improve the original Mamba module, enhancing performance by incorporating token correlations. We apply a first-order hold condition in SSMs, derive the new discretized form, and analyze the cumulative error. Extensive experimental results demonstrate that FSSM improves the performance of MambaIR on five benchmark datasets without increasing the number of parameters, and surpasses current lightweight SR methods, achieving state-of-the-art results.
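
For context, the zero-order-hold (ZOH) discretization used by standard Mamba-style SSMs is shown below; FSSM's contribution is to replace the ZOH assumption (the input held constant over each step) with a first-order hold that interpolates between consecutive tokens. The paper's exact first-order discretized form is not reproduced in this summary.

```latex
% Continuous state space model
h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)

% Zero-order-hold discretization with step size \Delta
\bar{A} = e^{\Delta A}, \qquad
\bar{B} = (\Delta A)^{-1}\bigl(e^{\Delta A} - I\bigr)\,\Delta B

% Discrete recurrence applied token by token
h_k = \bar{A}\,h_{k-1} + \bar{B}\,x_k, \qquad y_k = C\,h_k
```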

[102] Prompt-Driven Image Analysis with Multimodal Generative AI: Detection, Segmentation, Inpainting, and Interpretation

Kaleem Ahmad

Main category: cs.CV

TL;DR: A unified pipeline for prompt-driven image analysis that combines detection, segmentation, inpainting, and description into a single workflow with both UI and CLI interfaces.

DetailsMotivation: To create a practical, end-to-end system that converts natural language instructions into multiple image processing steps while maintaining transparency and reliability.

Method: Combines open-vocabulary detection, promptable segmentation, text-conditioned inpainting, and vision-language description in a unified workflow with debugging artifacts and both interactive UI and scriptable CLI.

Result: Achieved over 90% usable mask production with 85% accuracy in single-word prompts, with inpainting consuming 60-75% of runtime on high-end GPUs.

Conclusion: Provides a transparent, reliable pattern for assembling vision and multimodal models with clear guardrails and operational practices for improved reliability in object manipulation tasks.

Abstract: Prompt-driven image analysis converts a single natural-language instruction into multiple steps: locate, segment, edit, and describe. We present a practical case study of a unified pipeline that combines open-vocabulary detection, promptable segmentation, text-conditioned inpainting, and vision-language description into a single workflow. The system works end to end from a single prompt, retains intermediate artifacts for transparent debugging (such as detections, masks, overlays, edited images, and before and after composites), and provides the same functionality through an interactive UI and a scriptable CLI for consistent, repeatable runs. We highlight integration choices that reduce brittleness, including threshold adjustments, mask inspection with light morphology, and resource-aware defaults. In a small, single-word prompt segment, detection and segmentation produced usable masks in over 90% of cases with an accuracy above 85% based on our criteria. On a high-end GPU, inpainting makes up 60 to 75% of total runtime under typical guidance and sampling settings, which highlights the need for careful tuning. The study offers implementation-guided advice on thresholds, mask tightness, and diffusion parameters, and details version pinning, artifact logging, and seed control to support replay. Our contribution is a transparent, reliable pattern for assembling modern vision and multimodal models behind a single prompt, with clear guardrails and operational practices that improve reliability in object replacement, scene augmentation, and removal.

[103] Maximally Useful and Minimally Redundant: The Key to Self Supervised Learning for Imbalanced Data

Yash Kumar Sharma, Vineet Nair, Wilson Naik

Main category: cs.CV

TL;DR: Proposes a multi-view contrastive self-supervised learning approach using mutual information theory to handle imbalanced datasets, achieving state-of-the-art results on various benchmark datasets.

DetailsMotivation: Contrastive self-supervised learning (CSSL) performs well on balanced datasets but struggles with imbalanced datasets. The research aims to extend the multi-view framework beyond two views to address dataset imbalance issues.

Method: Theoretical justification based on mutual information for more than two views, introducing a loss function that segregates intra and inter discriminatory characteristics and filters out extreme features to learn better representations.

Result: Achieved 2% improvement on Cifar10-LT, 5% on Cifar100-LT, and 3% on Imagenet-LT using ResNet architectures, setting new state-of-the-art accuracy in self-supervised imbalanced dataset classification.

Conclusion: The multi-view objective with more than two views effectively addresses dataset imbalance in self-supervised learning, extracting better representations for tail classes and improving classification performance across various frameworks.

Abstract: The robustness of contrastive self-supervised learning (CSSL) for imbalanced datasets is largely unexplored. CSSL usually makes use of multi-view assumptions to learn discriminatory features via similar and dissimilar data samples. CSSL works well on balanced datasets, but does not generalize well for imbalanced datasets. In a very recent paper, as part of future work, Yann LeCun pointed out that the self-supervised multi-view framework can be extended to cases involving more than two views. Taking a cue from this insight, we propose a theoretical justification based on the concept of mutual information to support the more-than-two-views objective and apply it to the problem of dataset imbalance in self-supervised learning. The proposed method helps extract representative characteristics of the tail classes by segregating between intra and inter discriminatory characteristics. We introduce a loss function that helps us learn better representations by filtering out extreme features. Experimental evaluation on a variety of self-supervised frameworks (both contrastive and non-contrastive) also shows that the more-than-two-views objective works well for imbalanced datasets. We achieve a new state-of-the-art accuracy in self-supervised imbalanced dataset classification (2% improvement in Cifar10-LT using Resnet-18, 5% improvement in Cifar100-LT using Resnet-18, 3% improvement in Imagenet-LT (1k) using Resnet-50).
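
One way the "more than two views" objective can be instantiated is an InfoNCE loss averaged over all ordered pairs of V augmented views, as in the PyTorch sketch below. The temperature and pairing scheme are assumptions, and the paper's extreme-feature filtering term is omitted.

```python
import torch
import torch.nn.functional as F

def multiview_info_nce(views, temperature=0.2):
    """views: list of V tensors, each (N, D) — embeddings of the same N samples
    under V different augmentations. Averages the InfoNCE loss over all ordered
    pairs of distinct views, extending the usual two-view objective."""
    views = [F.normalize(v, dim=-1) for v in views]
    n = views[0].shape[0]
    target = torch.arange(n)
    losses = []
    for i in range(len(views)):
        for j in range(len(views)):
            if i == j:
                continue
            logits = views[i] @ views[j].T / temperature   # (N, N) similarity matrix
            losses.append(F.cross_entropy(logits, target)) # positives on the diagonal
    return torch.stack(losses).mean()

# toy usage: four views of a batch of 8 samples with 16-dim embeddings
views = [torch.randn(8, 16) for _ in range(4)]
print(multiview_info_nce(views).item())
```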

[104] A Structured Review of Underwater Object Detection Challenges and Solutions: From Traditional to Large Vision Language Models

Edwine Nabahirwa, Wei Song, Minghua Zhang, Yi Fang, Zhou Ni

Main category: cs.CV

TL;DR: This review paper systematically analyzes underwater object detection challenges and explores the potential of large vision-language models (LVLMs) to address issues like image degradation, small object detection, and data limitations.

DetailsMotivation: Underwater object detection faces numerous performance-compromising challenges that existing methods fail to fully address, particularly in complex underwater environments with image quality issues and dynamic conditions.

Method: The review categorizes UOD challenges into five key areas, analyzes progression from traditional to modern approaches, explores LVLM potential through case studies including synthetic dataset generation with DALL-E 3 and fine-tuning Florence-2 LVLM.

Result: Three key insights: current UOD methods are insufficient for challenges like image degradation and small object detection; LVLM synthetic data generation shows potential but needs refinement; LVLMs hold promise but real-time applications require further optimization research.

Conclusion: Large vision-language models represent a promising direction for advancing underwater object detection, though significant research is needed to address current limitations in realism, applicability, and real-time performance optimization.

Abstract: Underwater object detection (UOD) is vital to diverse marine applications, including oceanographic research, underwater robotics, and marine conservation. However, UOD faces numerous challenges that compromise its performance. Over the years, various methods have been proposed to address these issues, but they often fail to fully capture the complexities of underwater environments. This review systematically categorizes UOD challenges into five key areas: Image quality degradation, target-related issues, data-related challenges, computational and processing constraints, and limitations in detection methodologies. To address these challenges, we analyze the progression from traditional image processing and object detection techniques to modern approaches. Additionally, we explore the potential of large vision-language models (LVLMs) in UOD, leveraging their multi-modal capabilities demonstrated in other domains. We also present case studies, including synthetic dataset generation using DALL-E 3 and fine-tuning Florence-2 LVLM for UOD. This review identifies three key insights: (i) Current UOD methods are insufficient to fully address challenges like image degradation and small object detection in dynamic underwater environments. (ii) Synthetic data generation using LVLMs shows potential for augmenting datasets but requires further refinement to ensure realism and applicability. (iii) LVLMs hold significant promise for UOD, but their real-time application remains under-explored, requiring further research on optimization techniques.

[105] Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening

Piyush Bagad, Andrew Zisserman

Main category: cs.CV

TL;DR: The paper introduces chiral action recognition to measure time-sensitivity in video representations and proposes a self-supervised method to enhance time awareness in frozen image features using perceptual straightening principles.

DetailsMotivation: Current video embeddings poorly represent temporally opposite actions that require understanding of visual change over time, such as opening vs closing or approaching vs moving away.

Method: Self-supervised adaptation recipe using auto-encoder with latent space inspired by perceptual straightening to inject time-sensitivity into frozen image feature sequences.

Result: Outperforms larger video models on Something-Something, EPIC-Kitchens, and Charade datasets, and improves classification performance when combined with existing models.

Conclusion: The proposed method successfully creates compact, time-sensitive video representations that effectively distinguish chiral action pairs and enhances standard video classification benchmarks.

Abstract: Our objective is to develop compact video representations that are sensitive to visual change over time. To measure such time-sensitivity, we introduce a new task: chiral action recognition, where one needs to distinguish between a pair of temporally opposite actions, such as “opening vs. closing a door”, “approaching vs. moving away from something”, “folding vs. unfolding paper”, etc. Such actions (i) occur frequently in everyday life, (ii) require understanding of simple visual change over time (in object state, size, spatial position, count, etc.), and (iii) are known to be poorly represented by many video embeddings. Our goal is to build time-aware video representations that offer linear separability between these chiral pairs. To that end, we propose a self-supervised adaptation recipe to inject time-sensitivity into a sequence of frozen image features. Our model is based on an auto-encoder whose latent space has an inductive bias inspired by perceptual straightening. We show that this results in a compact but time-sensitive video representation for the proposed task across three datasets: Something-Something, EPIC-Kitchens, and Charade. Our method (i) outperforms much larger video models pre-trained on large-scale video datasets, and (ii) leads to an improvement in classification performance on standard benchmarks when combined with these existing models.

[106] MESH – Understanding Videos Like Human: Measuring Hallucinations in Large Video Models

Garry Yang, Zizhe Chen, Man Hon Wong, Haoyu Lei, Yongqiang Chen, Zhenguo Li, Kaiwen Zhou, James Cheng

Main category: cs.CV

TL;DR: MESH is a new benchmark for evaluating hallucinations in Large Video Models that uses a bottom-up QA framework aligned with human perception, showing LVMs struggle with fine details and complex actions in longer videos.

DetailsMotivation: Current video hallucination benchmarks rely on manual categorization and don't capture human perceptual processes, leaving a gap in systematic evaluation of LVMs' tendency to produce inaccurate descriptions.

Method: MESH uses a Question-Answering framework with binary and multi-choice formats containing target and trap instances, following a bottom-up approach that evaluates basic objects, coarse-to-fine subject features, and subject-action pairs.

Result: Evaluations show LVMs perform well on basic object recognition but become increasingly susceptible to hallucinations when handling fine details or aligning multiple actions involving various subjects in longer videos.

Conclusion: MESH provides an effective and comprehensive approach for systematically identifying hallucinations in video models, revealing specific weaknesses in LVMs’ ability to process complex temporal and detailed visual information.

Abstract: Large Video Models (LVMs) build on the semantic capabilities of Large Language Models (LLMs) and vision modules by integrating temporal information to better understand dynamic video content. Despite their progress, LVMs are prone to hallucinations, producing inaccurate or irrelevant descriptions. Current benchmarks for video hallucination depend heavily on manual categorization of video content, neglecting the perception-based processes through which humans naturally interpret videos. We introduce MESH, a benchmark designed to evaluate hallucinations in LVMs systematically. MESH uses a Question-Answering framework with binary and multi-choice formats incorporating target and trap instances. It follows a bottom-up approach, evaluating basic objects, coarse-to-fine subject features, and subject-action pairs, aligning with human video understanding. We demonstrate that MESH offers an effective and comprehensive approach for identifying hallucinations in videos. Our evaluations show that while LVMs excel at recognizing basic objects and features, their susceptibility to hallucinations increases markedly when handling fine details or aligning multiple actions involving various subjects in longer videos.

[107] ViewSparsifier: Killing Redundancy in Multi-View Plant Phenotyping

Robin-Nico Kampa, Fabian Deuser, Konrad Habel, Norbert Oswald

Main category: cs.CV

TL;DR: ViewSparsifier approach won both plant age prediction and leaf count estimation tasks in the GroMo Grand Challenge by using multi-view learning with random view selection to create view-invariant embeddings.

DetailsMotivation: Single-view classification/regression models often fail to capture all information needed for accurate plant phenotypic trait estimation, which affects plant health assessment and harvest readiness prediction.

Method: Used multi-view dataset with plants photographed from multiple heights and angles. Incorporated 24 views (selection vector) in random selection to learn view-invariant embeddings. Also experimented with randomized view selection across all five height levels (120 views total) using selection matrices.

Result: The ViewSparsifier approach won both tasks (Plant Age Prediction and Leaf Count Estimation) in the ACM Multimedia 2025 GroMo Grand Challenge.

Conclusion: Multi-view learning with random view selection is effective for plant phenotyping tasks, and further improvements can be achieved through expanded view selection across all height levels.

Abstract: Plant phenotyping involves analyzing observable characteristics of plants to better understand their growth, health, and development. In the context of deep learning, this analysis is often approached through single-view classification or regression models. However, these methods often fail to capture all information required for accurate estimation of target phenotypic traits, which can adversely affect plant health assessment and harvest readiness prediction. To address this, the Growth Modelling (GroMo) Grand Challenge at ACM Multimedia 2025 provides a multi-view dataset featuring multiple plants and two tasks: Plant Age Prediction and Leaf Count Estimation. Each plant is photographed from multiple heights and angles, leading to significant overlap and redundancy in the captured information. To learn view-invariant embeddings, we incorporate 24 views, referred to as the selection vector, in a random selection. Our ViewSparsifier approach won both tasks. For further improvement and as a direction for future research, we also experimented with randomized view selection across all five height levels (120 views total), referred to as selection matrices.
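
The random view selection can be pictured as drawing a binary selection vector over the views at one height, or a selection matrix over all heights and angles. The counts below (24 angles, 5 heights) mirror the numbers in this summary, but the number of views kept and the sampling details are illustrative assumptions.

```python
import numpy as np

def sample_selection_vector(num_available=24, num_selected=8, rng=None):
    """Binary selection vector over the available views; 1 marks a view that is fed
    to the encoder for this training step."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(num_available, size=num_selected, replace=False)
    vec = np.zeros(num_available, dtype=np.int64)
    vec[idx] = 1
    return vec

def sample_selection_matrix(num_heights=5, views_per_height=24, num_selected=24, rng=None):
    """Extension sketch: randomly select views jointly across all height levels."""
    rng = np.random.default_rng() if rng is None else rng
    flat = rng.choice(num_heights * views_per_height, size=num_selected, replace=False)
    mat = np.zeros((num_heights, views_per_height), dtype=np.int64)
    mat[np.unravel_index(flat, mat.shape)] = 1
    return mat

print(sample_selection_vector())
print(sample_selection_matrix().sum())  # 24 views kept out of 120
```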

[108] Vision-Language Semantic Aggregation Leveraging Foundation Model for Generalizable Medical Image Segmentation

Wenjun Yu, Yinchen Zhou, Jia-Xuan Jiang, Shubin Zeng, Yuee Li, Zhong Wang

Main category: cs.CV

TL;DR: Proposes EM Aggregation and Text-Guided Pixel Decoder to bridge semantic gap between text and medical images, improving multimodal segmentation performance in medical domain generalization.

DetailsMotivation: Multimodal models underperform in medical image segmentation due to semantic gap between abstract text prompts and fine-grained medical visual features, causing feature dispersion issues.

Method: Introduces Expectation-Maximization (EM) Aggregation to cluster features into compact semantic centers, and Text-Guided Pixel Decoder that uses domain-invariant textual knowledge to guide visual representations.

Result: Extensive experiments on cardiac and fundus datasets show consistent outperformance over state-of-the-art approaches across multiple domain generalization benchmarks.

Conclusion: The proposed semantic aggregation approach effectively addresses multimodal fusion challenges in medical imaging, significantly improving generalization capability through synergistic EM clustering and text-guided decoding.

Abstract: Multimodal models have achieved remarkable success in natural image segmentation, yet they often underperform when applied to the medical domain. Through extensive study, we attribute this performance gap to the challenges of multimodal fusion, primarily the significant semantic gap between abstract textual prompts and fine-grained medical visual features, as well as the resulting feature dispersion. To address these issues, we revisit the problem from the perspective of semantic aggregation. Specifically, we propose an Expectation-Maximization (EM) Aggregation mechanism and a Text-Guided Pixel Decoder. The former mitigates feature dispersion by dynamically clustering features into compact semantic centers to enhance cross-modal correspondence. The latter is designed to bridge the semantic gap by leveraging domain-invariant textual knowledge to effectively guide deep visual representations. The synergy between these two mechanisms significantly improves the model’s generalization ability. Extensive experiments on public cardiac and fundus datasets demonstrate that our method consistently outperforms existing SOTA approaches across multiple domain generalization benchmarks.
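
A minimal NumPy sketch of EM-style aggregation of dense features into a few compact semantic centers: the E-step softly assigns each feature to a center and the M-step recomputes centers as responsibility-weighted means. The number of centers, temperature, and iteration count are assumptions, and the text-guided decoding stage is not shown.

```python
import numpy as np

def em_aggregate(features, num_centers=8, iters=3, tau=0.05, seed=0):
    """features: (N, D) pixel/patch features.
    Returns (centers, responsibilities) with shapes (K, D) and (N, K)."""
    rng = np.random.default_rng(seed)
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    centers = f[rng.choice(len(f), num_centers, replace=False)]
    for _ in range(iters):
        # E-step: soft assignment of each feature to each center
        logits = f @ centers.T / tau
        logits = logits - logits.max(axis=1, keepdims=True)
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: centers become responsibility-weighted means, renormalized
        centers = resp.T @ f
        centers /= np.linalg.norm(centers, axis=1, keepdims=True) + 1e-8
    return centers, resp

# toy usage on 500 random 32-dim features
centers, resp = em_aggregate(np.random.randn(500, 32), num_centers=8)
print(centers.shape, resp.shape)   # (8, 32) (500, 8)
```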

[109] Improving Greenland Bed Topography Mapping with Uncertainty-Aware Graph Learning on Sparse Radar Data

Bayu Adhi Tama, Homayra Alam, Mostafa Cham, Omar Faruque, Jianwu Wang, Vandana Janeja

Main category: cs.CV

TL;DR: GraphTopoNet is a graph-learning framework that fuses heterogeneous data to create accurate subglacial bed maps of Greenland, reducing error by up to 60% compared to existing methods.

DetailsMotivation: Accurate maps of Greenland's subglacial bed are crucial for sea-level projections, but current radar observations are sparse and unevenly distributed, limiting reliability.

Method: Uses graph-learning with Monte Carlo dropout for uncertainty modeling, spatial graphs built from surface observables (elevation, velocity, mass balance) with gradient features and polynomial trends, and a hybrid loss combining confidence-weighted radar supervision with dynamically balanced regularization.

Result: Outperforms interpolation, convolutional, and graph-based baselines by up to 60% error reduction while preserving fine-scale glacial features in three Greenland subregions.

Conclusion: GraphTopoNet demonstrates how graph machine learning can convert sparse, uncertain geophysical observations into actionable knowledge at continental scale, improving reliability for climate forecasting and policy.

Abstract: Accurate maps of Greenland’s subglacial bed are essential for sea-level projections, but radar observations are sparse and uneven. We introduce GraphTopoNet, a graph-learning framework that fuses heterogeneous supervision and explicitly models uncertainty via Monte Carlo dropout. Spatial graphs built from surface observables (elevation, velocity, mass balance) are augmented with gradient features and polynomial trends to capture both local variability and broad structure. To handle data gaps, we employ a hybrid loss that combines confidence-weighted radar supervision with dynamically balanced regularization. Applied to three Greenland subregions, GraphTopoNet outperforms interpolation, convolutional, and graph-based baselines, reducing error by up to 60 percent while preserving fine-scale glacial features. The resulting bed maps improve reliability for operational modeling, supporting agencies engaged in climate forecasting and policy. More broadly, GraphTopoNet shows how graph machine learning can convert sparse, uncertain geophysical observations into actionable knowledge at continental scale.
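
The hybrid loss can be sketched as a confidence-weighted error on the sparse radar picks plus a smoothness regularizer over the predicted bed grid. The squared-error term, total-variation regularizer, and fixed balance weight below are simplified stand-ins for the paper's dynamically balanced formulation.

```python
import torch

def hybrid_loss(pred, radar_target, radar_mask, radar_conf, smooth_weight=0.1):
    """pred, radar_target: (H, W) predicted and observed bed elevation grids;
    radar_mask: (H, W) 1 where a radar observation exists, else 0;
    radar_conf: (H, W) per-observation confidence weights."""
    # supervised term: only where radar observations exist, weighted by confidence
    err = (pred - radar_target) ** 2
    sup = (radar_conf * err * radar_mask).sum() / radar_mask.sum().clamp_min(1.0)
    # regularizer: penalize abrupt jumps between neighboring grid cells
    tv = (pred[:, 1:] - pred[:, :-1]).abs().mean() + (pred[1:, :] - pred[:-1, :]).abs().mean()
    return sup + smooth_weight * tv

# toy usage: radar picks cover roughly 5% of the grid
H, W = 32, 32
pred = torch.randn(H, W, requires_grad=True)
target = torch.randn(H, W)
mask = (torch.rand(H, W) < 0.05).float()
conf = torch.rand(H, W)
loss = hybrid_loss(pred, target, mask, conf)
loss.backward()
print(float(loss))
```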

[110] Implicit Shape-Prior for Few-Shot Assisted 3D Segmentation

Mathilde Monvoisin, Louise Piecuch, Blanche Texier, Cédric Hémon, Anaïs Barateau, Jérémie Huet, Antoine Nordez, Anne-Sophie Boureau, Jean-Claude Nunes, Diana Mateus

Main category: cs.CV

TL;DR: A method to reduce manual segmentation workload in medical 3D imaging using implicit shape priors and automatic slice selection from sparse annotations.

DetailsMotivation: To alleviate the manual workload for medical professionals in complex 3D segmentation tasks like radiotherapy planning and degenerative disease diagnosis that cannot be fully automated.

Method: Introduces an implicit shape prior to segment volumes from sparse slice manual annotations generalized to multi-organ cases, with a framework for automatically selecting the most informative slices to minimize interactions.

Result: Experimental validation shows effectiveness on two medical use cases: assisted segmentation of at-risk organs for brain cancer patients and accelerating database creation for sarcopenia patients with unseen muscle shapes.

Conclusion: The proposed method successfully reduces manual segmentation burden in medical imaging by leveraging shape priors and intelligent slice selection from sparse annotations.

Abstract: The objective of this paper is to significantly reduce the manual workload required from medical professionals in complex 3D segmentation tasks that cannot yet be fully automated. For instance, in radiotherapy planning, organs at risk must be accurately identified in computed tomography (CT) or magnetic resonance imaging (MRI) scans to ensure they are spared from harmful radiation. Similarly, diagnosing age-related degenerative diseases such as sarcopenia, which involve progressive loss of muscle volume and strength, is commonly based on muscular mass measurements often obtained from manual segmentation of medical volumes. To alleviate the manual-segmentation burden, this paper introduces an implicit shape prior to segment volumes from sparse slice manual annotations, generalized to the multi-organ case, along with a simple framework for automatically selecting the most informative slices to guide and minimize the next interactions. The experimental validation shows the method’s effectiveness on two medical use cases: assisted segmentation of organs at risk for brain cancer patients, and acceleration of the creation of a new database with unseen muscle shapes for patients with sarcopenia.

[111] UOPSL: Unpaired OCT Predilection Sites Learning for Fundus Image Diagnosis Augmentation

Zhihao Zhao, Yinzheng Zhao, Junjie Yang, Xiangtong Yao, Quanmin Liang, Daniel Zapp, Kai Huang, Nassir Navab, M. Ali Nasseri

Main category: cs.CV

TL;DR: A novel unpaired multimodal framework called UOPSL that uses OCT-derived spatial priors to enhance fundus image-based disease recognition without requiring paired multimodal data.

DetailsMotivation: Acquiring paired multimodal ophthalmic images is expensive and limited, with OCT data being scarce while fundus photography is more accessible. Existing methods using only fundus or text features fail to capture fine-grained spatial information from different imaging modalities.

Method: UOPSL uses contrastive learning on unpaired OCT and fundus images to learn a predilection sites matrix in OCT latent space. This matrix captures lesion localization patterns and is then used to assist fundus image classification when OCT data is unavailable during inference.

Result: Extensive experiments on 9 diverse datasets across 28 critical categories show that the framework outperforms existing benchmarks.

Conclusion: The proposed approach successfully bridges unpaired fundus and OCT data using disease text descriptions and spatial priors, enabling improved disease recognition using only fundus images while leveraging OCT-derived knowledge.

Abstract: Significant advancements in AI-driven multimodal medical image diagnosis have led to substantial improvements in ophthalmic disease identification in recent years. However, acquiring paired multimodal ophthalmic images remains prohibitively expensive. While fundus photography is simple and cost-effective, the limited availability of OCT data and inherent modality imbalance hinder further progress. Conventional approaches that rely solely on fundus or textual features often fail to capture fine-grained spatial information, as each imaging modality provides distinct cues about lesion predilection sites. In this study, we propose a novel unpaired multimodal framework, UOPSL, that utilizes extensive OCT-derived spatial priors to dynamically identify predilection sites, enhancing fundus image-based disease recognition. Our approach bridges unpaired fundus and OCTs via extended disease text descriptions. Initially, we employ contrastive learning on a large corpus of unpaired OCT and fundus images while simultaneously learning the predilection sites matrix in the OCT latent space. Through extensive optimization, this matrix captures lesion localization patterns within the OCT feature space. During the fine-tuning or inference phase of the downstream classification task based solely on fundus images, where paired OCT data is unavailable, we eliminate OCT input and utilize the predilection sites matrix to assist in fundus image classification learning. Extensive experiments conducted on 9 diverse datasets across 28 critical categories demonstrate that our framework outperforms existing benchmarks.
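
As a rough illustration of how a learned predilection-sites matrix could assist fundus-only classification at inference, the PyTorch sketch below reweights fundus patch features with an attention-style lookup into the prior. The tensor shapes and the injection mechanism are assumptions; the paper's actual design may differ.

```python
# Hedged sketch: attention-style use of an OCT-space prior matrix on fundus features.
import torch

def apply_predilection_prior(fundus_tokens, prior_matrix):
    """fundus_tokens: (B, N, C) patch features; prior_matrix: (K, C) learned centers."""
    attn = torch.softmax(fundus_tokens @ prior_matrix.t(), dim=-1)  # (B, N, K) affinity to prior sites
    return fundus_tokens + attn @ prior_matrix                      # prior-augmented tokens, (B, N, C)
```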

[112] EfficientIML: Efficient High-Resolution Image Manipulation Localization

Jinhan Li, Haoyang He, Lei Xie, Jiangning Zhang

Main category: cs.CV

TL;DR: Proposed EfficientIML model with EfficientRWKV backbone for detecting diffusion-generated image manipulations, outperforms SOTA methods in localization, efficiency, and speed.

DetailsMotivation: Current detectors lack exposure to diffusion-based forgeries and face computational constraints with high-resolution images.

Method: Lightweight three-stage EfficientRWKV backbone combining state-space and attention networks with multi-scale supervision.

Result: Outperforms ViT-based and other SOTA lightweight baselines in localization performance, FLOPs, and inference speed.

Conclusion: Suitable for real-time forensic applications due to efficient performance on high-resolution diffusion-generated manipulations.

Abstract: With imaging devices delivering ever-higher resolutions and diffusion-based forgery methods emerging, current detectors trained only on traditional datasets (with splicing, copy-moving and object removal forgeries) lack exposure to this new manipulation type. To address this, we propose a novel high-resolution SIF dataset of 1200+ diffusion-generated manipulations with semantically extracted masks. However, the dataset’s high resolution also challenges existing methods, whose prohibitive computational complexity strains available resources. Therefore, we propose a novel EfficientIML model with a lightweight, three-stage EfficientRWKV backbone. EfficientRWKV’s hybrid state-space and attention network captures global context and local details in parallel, while a multi-scale supervision strategy enforces consistency across hierarchical predictions. Extensive evaluations on our dataset and standard benchmarks demonstrate that our approach outperforms ViT-based and other SOTA lightweight baselines in localization performance, FLOPs and inference speed, underscoring its suitability for real-time forensic applications.

[113] CLAPS: A CLIP-Unified Auto-Prompt Segmentation for Multi-Modal Retinal Imaging

Zhihao Zhao, Yinzheng Zhao, Junjie Yang, Xiangtong Yao, Quanmin Liang, Shahrooz Faghihroohi, Kai Huang, Nassir Navab, M. Ali Nasseri

Main category: cs.CV

TL;DR: CLAPS is a novel automated segmentation method that combines CLIP, GroundingDINO, and SAM to achieve unified retinal image segmentation across diverse tasks and modalities without manual prompting.

DetailsMotivation: Current medical image segmentation methods face modality ambiguity in text descriptions, reliance on manual SAM prompting, and lack of unified frameworks for diverse retinal imaging tasks.

Method: Pre-train CLIP image encoder on multi-modal retinal data, use GroundingDINO for automatic bounding box prompts, employ text prompts with modality signatures, and guide SAM for automated segmentation.

Result: Achieves performance comparable to specialized expert models and surpasses existing benchmarks across 12 datasets and 11 segmentation categories.

Conclusion: CLAPS demonstrates broad generalizability as a foundation model for automated, unified retinal image segmentation across diverse modalities and tasks.

Abstract: Recent advancements in foundation models, such as the Segment Anything Model (SAM), have significantly impacted medical image segmentation, especially in retinal imaging, where precise segmentation is vital for diagnosis. Despite this progress, current methods face critical challenges: 1) modality ambiguity in textual disease descriptions, 2) a continued reliance on manual prompting for SAM-based workflows, and 3) a lack of a unified framework, with most methods being modality- and task-specific. To overcome these hurdles, we propose CLIP-unified Auto-Prompt Segmentation (CLAPS), a novel method for unified segmentation across diverse tasks and modalities in retinal imaging. Our approach begins by pre-training a CLIP-based image encoder on a large, multi-modal retinal dataset to handle data scarcity and distribution imbalance. We then leverage GroundingDINO to automatically generate spatial bounding box prompts by detecting local lesions. To unify tasks and resolve ambiguity, we use text prompts enhanced with a unique “modality signature” for each imaging modality. Ultimately, these automated textual and spatial prompts guide SAM to execute precise segmentation, creating a fully automated and unified pipeline. Extensive experiments on 12 diverse datasets across 11 critical segmentation categories show that CLAPS achieves performance on par with specialized expert models while surpassing existing benchmarks across most metrics, demonstrating its broad generalizability as a foundation model.

[114] AdsQA: Towards Advertisement Video Understanding

Xinwei Long, Kai Tian, Peng Xu, Guoli Jia, Jingxuan Li, Sa Yang, Yihua Shao, Kaiyan Zhang, Che Jiang, Hao Xu, Yang Liu, Jiaheng Ma, Bowen Zhou

Main category: cs.CV

TL;DR: This paper introduces AdsQA, the first benchmark using advertisement videos to evaluate LLMs’ ability to perceive marketing logic and persuasive strategies beyond basic visual content. The authors also propose ReAd-R, an RL model that achieves state-of-the-art performance on this challenging task.

DetailsMotivation: To extend LLMs' specialized applications by using ad videos as a challenging test-bed that requires understanding marketing logic, persuasive strategies, and audience engagement beyond objective physical content.

Method: Created AdsQA benchmark with 1,544 ad videos (10,962 clips, 22.7 hours) and 5 challenging tasks. Proposed ReAd-R, a Deepseek-R1 styled RL model that reflects on questions and generates answers via reward-driven optimization.

Result: Benchmarked 14 top-tier LLMs on AdsQA. ReAd-R achieved state-of-the-art performance, outperforming strong competitors with long-chain reasoning capabilities by a clear margin.

Conclusion: Ad videos provide a rich, challenging domain for evaluating LLMs’ advanced perception capabilities. The proposed ReAd-R model demonstrates superior performance in understanding complex marketing content and persuasive strategies.

Abstract: Large language models (LLMs) have taken a great step towards AGI. Meanwhile, an increasing number of domain-specific problems such as math and programming boost these general-purpose models to continuously evolve via learning deeper expertise. Now is thus the time to further extend the diversity of specialized applications for knowledgeable LLMs, though collecting high-quality data with unexpected and informative tasks is challenging. In this paper, we propose to use advertisement (ad) videos as a challenging test-bed to probe the ability of LLMs in perceiving beyond the objective physical content of the common visual domain. Our motivation is to take full advantage of the clue-rich and information-dense ad videos’ traits, e.g., marketing logic, persuasive strategies, and audience engagement. Our contribution is three-fold: (1) To our knowledge, this is the first attempt to use ad videos with well-designed tasks to evaluate LLMs. We contribute AdsQA, a challenging ad Video QA benchmark derived from 1,544 ad videos with 10,962 clips, totaling 22.7 hours, providing 5 challenging tasks. (2) We propose ReAd-R, a Deepseek-R1 styled RL model that reflects on questions, and generates answers via reward-driven optimization. (3) We benchmark 14 top-tier LLMs on AdsQA, and our ReAd-R achieves the state of the art, outperforming strong competitors equipped with long-chain reasoning capabilities by a clear margin.

[115] Skeleton-based sign language recognition using a dual-stream spatio-temporal dynamic graph convolutional network

Liangjin Liu, Haoyang Zheng, Pei Zhou

Main category: cs.CV

TL;DR: DSLNet is a dual-reference, dual-stream architecture for isolated sign language recognition that separates gesture shape and motion analysis using wrist-centric and facial-centric coordinate systems, achieving state-of-the-art performance with fewer parameters.

DetailsMotivation: Existing ISLR methods struggle with morphologically similar but semantically distinct gestures due to geometric ambiguity from using single reference frames, which cannot effectively separate hand shape from motion trajectory.

Method: Dual-reference system: wrist-centric frame for view-invariant shape analysis and facial-centric frame for context-aware trajectory modeling. Uses topology-aware graph convolution for shape and Finsler geometry encoder for trajectory, integrated via geometry-driven optimal transport fusion.

Result: Achieved 93.70% on WLASL-100, 89.97% on WLASL-300, and 99.79% on LSA64, setting new state-of-the-art results with significantly fewer parameters than competing models.

Conclusion: Decoupling gesture morphology and trajectory into separate coordinate systems with specialized processing networks effectively resolves geometric ambiguity in sign language recognition, leading to superior performance with improved efficiency.

Abstract: Isolated Sign Language Recognition (ISLR) is challenged by gestures that are morphologically similar yet semantically distinct, a problem rooted in the complex interplay between hand shape and motion trajectory. Existing methods, often relying on a single reference frame, struggle to resolve this geometric ambiguity. This paper introduces Dual-SignLanguageNet (DSLNet), a dual-reference, dual-stream architecture that decouples and models gesture morphology and trajectory in separate, complementary coordinate systems. Our approach utilizes a wrist-centric frame for view-invariant shape analysis and a facial-centric frame for context-aware trajectory modeling. These streams are processed by specialized networks (a topology-aware graph convolution for shape and a Finsler geometry-based encoder for trajectory) and are integrated via a geometry-driven optimal transport fusion mechanism. DSLNet sets a new state of the art, achieving 93.70%, 89.97% and 99.79% accuracy on the challenging WLASL-100, WLASL-300 and LSA64 datasets, respectively, with significantly fewer parameters than competing models.
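
The dual-reference idea is easy to illustrate: hand keypoints can be re-expressed in a wrist-centred frame for shape and relative to a facial anchor for trajectory. The NumPy sketch below is a simplified reading of that preprocessing step; the joint indices and scale normalisation are assumptions, not DSLNet's exact pipeline.

```python
# Minimal sketch of the dual-reference decomposition (shape vs. trajectory streams).
import numpy as np

def dual_reference_frames(hand_kpts, nose_kpt, wrist_idx=0, eps=1e-6):
    """hand_kpts: (T, J, 2) keypoints over T frames; nose_kpt: (T, 2) facial anchor."""
    wrist = hand_kpts[:, wrist_idx:wrist_idx + 1, :]            # (T, 1, 2)
    shape_stream = hand_kpts - wrist                             # wrist-centric, view-invariant shape
    scale = np.linalg.norm(shape_stream, axis=-1).max(axis=1, keepdims=True)
    shape_stream = shape_stream / (scale[..., None] + eps)       # scale-normalised
    traj_stream = wrist[:, 0, :] - nose_kpt                      # face-centric trajectory
    return shape_stream, traj_stream
```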

[116] LADB: Latent Aligned Diffusion Bridges for Semi-Supervised Domain Translation

Xuqin Wang, Tao Wu, Yanfeng Zhang, Lu Liu, Dong Wang, Mingwei Sun, Yongliang Wang, Niclas Zeller, Daniel Cremers

Main category: cs.CV

TL;DR: LADB is a semi-supervised framework that uses partially paired data to bridge domain gaps in diffusion models, enabling sample-to-sample translation without full supervision.

DetailsMotivation: Diffusion models struggle in data-scarce domains where exhaustive retraining or costly paired data are required. Current unpaired methods lack controllability while fully paired approaches need large domain-specific datasets.

Method: Aligns source and target distributions in shared latent space, integrates pretrained source-domain diffusion models with target-domain Latent Aligned Diffusion Model (LADM) trained on partially paired latent representations.

Result: Superior performance in depth-to-image translation under partial supervision. Successfully extended to multi-source translation (depth maps + segmentation masks) and multi-target translation in class-conditioned style transfer.

Conclusion: LADB provides a scalable and versatile solution for real-world domain translation, particularly useful when data annotation is costly or incomplete, balancing fidelity and diversity.

Abstract: Diffusion models excel at generating high-quality outputs but face challenges in data-scarce domains, where exhaustive retraining or costly paired data are often required. To address these limitations, we propose Latent Aligned Diffusion Bridges (LADB), a semi-supervised framework for sample-to-sample translation that effectively bridges domain gaps using partially paired data. By aligning source and target distributions within a shared latent space, LADB seamlessly integrates pretrained source-domain diffusion models with a target-domain Latent Aligned Diffusion Model (LADM), trained on partially paired latent representations. This approach enables deterministic domain mapping without the need for full supervision. Compared to unpaired methods, which often lack controllability, and fully paired approaches that require large, domain-specific datasets, LADB strikes a balance between fidelity and diversity by leveraging a mixture of paired and unpaired latent-target couplings. Our experimental results demonstrate superior performance in depth-to-image translation under partial supervision. Furthermore, we extend LADB to handle multi-source translation (from depth maps and segmentation masks) and multi-target translation in a class-conditioned style transfer task, showcasing its versatility in handling diverse and heterogeneous use cases. Ultimately, we present LADB as a scalable and versatile solution for real-world domain translation, particularly in scenarios where data annotation is costly or incomplete.

[117] FractalPINN-Flow: A Fractal-Inspired Network for Unsupervised Optical Flow Estimation with Total Variation Regularization

Sara Behnamian, Rasoul Khaksarinezhad, Andreas Langer

Main category: cs.CV

TL;DR: FractalPINN-Flow is an unsupervised deep learning framework for optical flow estimation that learns from consecutive grayscale frames without ground truth, using a fractal-inspired recursive encoder-decoder architecture with TV regularization.

DetailsMotivation: To develop an optical flow estimation method that doesn't require ground truth annotations, can handle high-resolution data effectively, and captures both fine details and long-range motion patterns through fractal-inspired architecture.

Method: Uses Fractal Deformation Network (FDN) - a recursive encoder-decoder with skip connections inspired by fractal geometry. Minimizes energy functional with L1/L2 data fidelity terms for brightness constancy and total variation regularization for spatial smoothness.

Result: Produces accurate, smooth, and edge-preserving optical flow fields. Effective for high-resolution data and scenarios with limited annotations, as demonstrated on synthetic and benchmark datasets.

Conclusion: FractalPINN-Flow provides an effective unsupervised approach for optical flow estimation that leverages fractal geometry principles and variational regularization to achieve high-quality results without requiring ground truth data.

Abstract: We present FractalPINN-Flow, an unsupervised deep learning framework for dense optical flow estimation that learns directly from consecutive grayscale frames without requiring ground truth. The architecture centers on the Fractal Deformation Network (FDN), a recursive encoder-decoder inspired by fractal geometry and self-similarity. Unlike traditional CNNs with sequential downsampling, FDN uses repeated encoder-decoder nesting with skip connections to capture both fine-grained details and long-range motion patterns. The training objective is based on a classical variational formulation using total variation (TV) regularization. Specifically, we minimize an energy functional that combines $L^1$ and $L^2$ data fidelity terms to enforce brightness constancy, along with a TV term that promotes spatial smoothness and coherent flow fields. Experiments on synthetic and benchmark datasets show that FractalPINN-Flow produces accurate, smooth, and edge-preserving optical flow fields. The model is especially effective for high-resolution data and scenarios with limited annotations.
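
The unsupervised objective is standard enough to sketch: backward-warp the second frame with the predicted flow, penalise the brightness-constancy residual with combined L1/L2 terms, and add TV smoothness on the flow. The PyTorch sketch below illustrates that energy under assumed weights; it is not the authors' implementation.

```python
# Minimal sketch of an L1+L2 brightness-constancy loss with TV regularization.
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (B,1,H,W) with flow (B,2,H,W) given in pixels (dx, dy)."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(img.device)      # (2,H,W), channel 0 = x
    coords = grid.unsqueeze(0) + flow                                # (B,2,H,W)
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                    # normalise to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)          # (B,H,W,2)
    return F.grid_sample(img, sample_grid, align_corners=True)

def flow_energy(img0, img1, flow, l1_w=1.0, l2_w=1.0, tv_w=0.1):
    residual = warp(img1, flow) - img0                               # brightness constancy
    data = l1_w * residual.abs().mean() + l2_w * residual.pow(2).mean()
    tv = (flow[..., 1:, :] - flow[..., :-1, :]).abs().mean() + \
         (flow[..., :, 1:] - flow[..., :, :-1]).abs().mean()
    return data + tv_w * tv
```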

[118] Multi-Modal Robust Enhancement for Coastal Water Segmentation: A Systematic HSV-Guided Framework

Zhen Tian, Christos Anagnostopoulos, Qiyuan Wang, Zhiwei Gao

Main category: cs.CV

TL;DR: Robust U-Net framework for coastal water segmentation using HSV color space supervision and multi-modal constraints to improve training stability and segmentation quality in diverse maritime environments.

DetailsMotivation: Traditional RGB-based approaches suffer from training instability and poor generalization in coastal water segmentation due to complex spectral characteristics and irregular boundary patterns.

Method: Integrates five components: HSV-guided color supervision, gradient-based coastline optimization, morphological post-processing, sea area cleanup, and connectivity control. Uses HSV color space instead of traditional RGB.

Result: HSV supervision provides highest impact (0.85 influence score), achieves 84% variance reduction in training stability, and shows consistent improvements across multiple evaluation metrics while maintaining computational efficiency.

Conclusion: The proposed Robust U-Net framework significantly enhances coastal water segmentation performance through systematic integration of HSV supervision and multi-modal constraints, offering superior training stability and segmentation quality.

Abstract: Coastal water segmentation from satellite imagery presents unique challenges due to complex spectral characteristics and irregular boundary patterns. Traditional RGB-based approaches often suffer from training instability and poor generalization in diverse maritime environments. This paper introduces a systematic robust enhancement framework, referred to as Robust U-Net, that leverages HSV color space supervision and multi-modal constraints for improved coastal water segmentation. Our approach integrates five synergistic components: HSV-guided color supervision, gradient-based coastline optimization, morphological post-processing, sea area cleanup, and connectivity control. Through comprehensive ablation studies, we demonstrate that HSV supervision provides the highest impact (0.85 influence score), while the complete framework achieves superior training stability (84% variance reduction) and enhanced segmentation quality. Our method shows consistent improvements across multiple evaluation metrics while maintaining computational efficiency. For reproducibility, our training configurations and code are available here: https://github.com/UofgCoastline/ICASSP-2026-Robust-Unet.
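
One plausible reading of "HSV-guided color supervision" is an auxiliary term that compares HSV statistics of the predicted water region against the labelled one, on top of the usual segmentation loss. The PyTorch sketch below shows that idea with a hand-rolled differentiable RGB-to-HSV conversion; it is an assumption-laden illustration, not the authors' exact formulation.

```python
# Sketch of an HSV-based auxiliary supervision term for water segmentation.
import torch

def rgb_to_hsv(rgb, eps=1e-8):
    """rgb: (B,3,H,W) in [0,1] -> hsv in [0,1]."""
    r, g, b = rgb.unbind(dim=1)
    maxc, _ = rgb.max(dim=1)
    minc, _ = rgb.min(dim=1)
    delta = maxc - minc
    h = torch.where(maxc == r, ((g - b) / (delta + eps)) % 6.0,
        torch.where(maxc == g, (b - r) / (delta + eps) + 2.0,
                               (r - g) / (delta + eps) + 4.0)) / 6.0
    h = torch.where(delta < eps, torch.zeros_like(h), h)
    s = delta / (maxc + eps)
    return torch.stack([h, s, maxc], dim=1)

def hsv_consistency(rgb, pred_mask, gt_mask, eps=1e-6):
    """Compare mean HSV of predicted vs. ground-truth water regions."""
    hsv = rgb_to_hsv(rgb)                                    # (B,3,H,W)
    pm, gm = pred_mask.unsqueeze(1), gt_mask.unsqueeze(1)    # soft / hard masks (B,1,H,W)
    pred_mean = (hsv * pm).sum((2, 3)) / (pm.sum((2, 3)) + eps)
    gt_mean = (hsv * gm).sum((2, 3)) / (gm.sum((2, 3)) + eps)
    return (pred_mean - gt_mean).abs().mean()
```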

[119] Computational Imaging for Enhanced Computer Vision

Humera Shaikh, Kaur Jashanpreet

Main category: cs.CV

TL;DR: Survey of computational imaging techniques and their impact on computer vision applications, addressing limitations of conventional imaging in challenging conditions.

DetailsMotivation: Conventional imaging methods often fail in challenging conditions like low light, motion blur, or high dynamic range scenes, limiting computer vision system performance.

Method: Systematic exploration of computational imaging techniques including light field imaging, HDR imaging, deblurring, high-speed imaging, and glare mitigation, and their synergies with core CV tasks.

Result: The survey analyzes relationships between CI methods and their practical contributions to CV applications like object detection, depth estimation, optical flow, face recognition, and keypoint detection.

Conclusion: Highlights emerging opportunities for task-specific adaptive imaging pipelines that improve robustness, accuracy and efficiency in real-world applications like autonomous navigation, surveillance, AR, and robotics.

Abstract: This paper presents a comprehensive survey of computational imaging (CI) techniques and their transformative impact on computer vision (CV) applications. Conventional imaging methods often fail to deliver high-fidelity visual data in challenging conditions, such as low light, motion blur, or high dynamic range scenes, thereby limiting the performance of state-of-the-art CV systems. Computational imaging techniques, including light field imaging, high dynamic range (HDR) imaging, deblurring, high-speed imaging, and glare mitigation, address these limitations by enhancing image acquisition and reconstruction processes. This survey systematically explores the synergies between CI techniques and core CV tasks, including object detection, depth estimation, optical flow, face recognition, and keypoint detection. By analyzing the relationships between CI methods and their practical contributions to CV applications, this work highlights emerging opportunities, challenges, and future research directions. We emphasize the potential for task-specific, adaptive imaging pipelines that improve robustness, accuracy, and efficiency in real-world scenarios, such as autonomous navigation, surveillance, augmented reality, and robotics.

[120] BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion

Sike Xiang, Shuang Chen, Amir Atapour-Abarghouei

Main category: cs.CV

TL;DR: Proposes BcQLM, a lightweight 1.2B parameter multimodal language model with BreezeCLIP encoder for efficient visual question answering that maintains performance while reducing computational costs.

DetailsMotivation: Address deployment challenges of large MLLMs in resource-constrained environments by developing energy-efficient, scalable models for practical applications.

Method: Uses BreezeCLIP as compact vision-language encoder in end-to-end framework, featuring modular design with Q-gated multimodal architecture for efficient understanding.

Result: Achieves comparable performance to standard-size MLLMs while significantly reducing computational costs, validated across multiple datasets.

Conclusion: BcQLM offers promising path toward deployable MLLMs under hardware constraints with extensible design for broader multimodal tasks.

Abstract: As multimodal large language models (MLLMs) advance, their large-scale architectures pose challenges for deployment in resource-constrained environments. In the age of large models, where energy efficiency, computational scalability and environmental sustainability are paramount, the development of lightweight and high-performance models is critical for real-world applications. As such, we propose a lightweight MLLM framework for end-to-end visual question answering. Our proposed approach centres on BreezeCLIP, a compact yet powerful vision-language encoder optimised for efficient multimodal understanding. With only 1.2 billion parameters overall, our model significantly reduces computational cost while achieving performance comparable to standard-size MLLMs. Experiments conducted on multiple datasets further validate its effectiveness in balancing accuracy and efficiency. The modular and extensible design enables generalisation to broader multimodal tasks. The proposed lightweight vision-language framework is denoted as BcQLM (BreezeCLIP-enhanced Q-Gated Multimodal Language Model). It offers a promising path toward deployable MLLMs under practical hardware constraints. The source code is available at https://github.com/thico0224/BcQLM.

[121] CrowdQuery: Density-Guided Query Module for Enhanced 2D and 3D Detection in Crowded Scenes

Marius Dähling, Sebastian Krebs, J. Marius Zöllner

Main category: cs.CV

TL;DR: CrowdQuery (CQ) is a novel method that enhances transformer-based detectors by embedding object density maps into object queries, improving crowd detection in both 2D and 3D without additional data.

DetailsMotivation: Existing crowd detection methods struggle with crowded scenes where objects are densely packed. Current density map definitions are limited to head positions or spatial statistics, lacking consideration of individual bounding box dimensions.

Method: Developed CQ module that predicts and embeds object density maps, then integrates this density information into decoder queries. Created CQ2D and CQ3D architectures that extend density definitions to include bounding box dimensions for both 2D and 3D detection.

Result: Significant performance improvements on STCrowd dataset for both 2D and 3D domains, outperforming most state-of-the-art methods. Further improved performance on CrowdHuman dataset when integrated with existing crowd detectors.

Conclusion: CQ effectively bridges 2D and 3D detection in crowded environments, demonstrating universal applicability and generalizability across different transformer models and datasets without requiring additional data.

Abstract: This paper introduces a novel method for end-to-end crowd detection that leverages object density information to enhance existing transformer-based detectors. We present CrowdQuery (CQ), whose core component is our CQ module that predicts and subsequently embeds an object density map. The embedded density information is then systematically integrated into the decoder. Existing density map definitions typically depend on head positions or object-based spatial statistics. Our method extends these definitions to include individual bounding box dimensions. By incorporating density information into object queries, our method utilizes density-guided queries to improve detection in crowded scenes. CQ is universally applicable to both 2D and 3D detection without requiring additional data. Consequently, we are the first to design a method that effectively bridges 2D and 3D detection in crowded environments. We demonstrate the integration of CQ into both a general 2D and 3D transformer-based object detector, introducing the architectures CQ2D and CQ3D. CQ is not limited to the specific transformer models we selected. Experiments on the STCrowd dataset for both 2D and 3D domains show significant performance improvements compared to the base models, outperforming most state-of-the-art methods. When integrated into a state-of-the-art crowd detector, CQ can further improve performance on the challenging CrowdHuman dataset, demonstrating its generalizability. The code is released at https://github.com/mdaehl/CrowdQuery.
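
To give a feel for density-guided queries, the sketch below predicts a density map from encoder features, embeds it, and adds the embedding to the decoder's object queries. The layer shapes and the simple additive injection are assumptions; CrowdQuery's actual module and density definition (which also encodes bounding-box dimensions) are more involved.

```python
# Loose sketch of a density-guided query module for a transformer detector.
import torch
import torch.nn as nn

class DensityQueryModule(nn.Module):
    def __init__(self, feat_dim=256, num_queries=300):
        super().__init__()
        self.density_head = nn.Conv2d(feat_dim, 1, kernel_size=1)               # density map
        self.embed = nn.Sequential(nn.Conv2d(1, feat_dim, 3, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.to_query = nn.Linear(feat_dim, feat_dim)

    def forward(self, feats, queries):
        """feats: (B,C,H,W) encoder features; queries: (B,Q,C) decoder object queries."""
        density = self.density_head(feats).relu()                                # (B,1,H,W)
        dens_emb = self.pool(self.embed(density)).flatten(1)                     # (B,C)
        queries = queries + self.to_query(dens_emb).unsqueeze(1)                 # broadcast into queries
        return density, queries
```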

[122] ArgoTweak: Towards Self-Updating HD Maps through Structured Priors

Lena Wild, Rafael Valencia, Patric Jensfelt

Main category: cs.CV

TL;DR: ArgoTweak is the first dataset providing realistic map priors, current maps, and sensor data to address the sim2real gap in HD mapping, enabling accurate change detection and integration through fine-grained atomic modifications.

DetailsMotivation: Existing methods rely on synthetic priors due to lack of public datasets with the required triplet of prior maps, current maps, and sensor data, creating inconsistencies and significant sim2real gap.

Method: Employs a bijective mapping framework that breaks down large-scale modifications into fine-grained atomic changes at the map element level, ensuring interpretability and preserving unchanged elements with high fidelity.

Result: Training models on ArgoTweak significantly reduces the sim2real gap compared to synthetic priors, with extensive ablations highlighting the impact of structured priors and detailed change annotations.

Conclusion: ArgoTweak establishes a benchmark for explainable, prior-aided HD mapping and advances scalable, self-improving mapping solutions by providing realistic map priors and detailed change detection capabilities.

Abstract: Reliable integration of prior information is crucial for self-verifying and self-updating HD maps. However, no public dataset includes the required triplet of prior maps, current maps, and sensor data. As a result, existing methods must rely on synthetic priors, which create inconsistencies and lead to a significant sim2real gap. To address this, we introduce ArgoTweak, the first dataset to complete the triplet with realistic map priors. At its core, ArgoTweak employs a bijective mapping framework, breaking down large-scale modifications into fine-grained atomic changes at the map element level, thus ensuring interpretability. This paradigm shift enables accurate change detection and integration while preserving unchanged elements with high fidelity. Experiments show that training models on ArgoTweak significantly reduces the sim2real gap compared to synthetic priors. Extensive ablations further highlight the impact of structured priors and detailed change annotations. By establishing a benchmark for explainable, prior-aided HD mapping, ArgoTweak advances scalable, self-improving mapping solutions. The dataset, baselines, map modification toolbox, and further resources are available at https://kth-rpl.github.io/ArgoTweak/.

[123] An End-to-End Deep Learning Framework for Arsenicosis Diagnosis Using Mobile-Captured Skin Images

Asif Newaz, Asif Ur Rahman Adib, Rajit Sahil, Mashfique Mehzad

Main category: cs.CV

TL;DR: Deep learning framework using mobile phone images achieves 86% accuracy for arsenicosis diagnosis, outperforming CNNs with Transformer models and providing explainable AI visualizations for clinical use.

DetailsMotivation: Arsenicosis from contaminated water is a serious public health issue in Asia, with early skin manifestations often underdiagnosed in rural areas lacking dermatologists. Automated image-based diagnosis can enable early detection and timely interventions.

Method: Curated dataset of 20 classes with 11,000+ images of arsenic-induced and other skin conditions. Benchmarked CNN and Transformer models, integrated LIME and Grad-CAM for interpretability, and developed web-based diagnostic tool.

Result: Transformer models outperformed CNNs, with Swin Transformer achieving 86% accuracy. Visualizations confirmed models focused on lesion-relevant regions. Strong performance on external validation demonstrated generalization capability.

Conclusion: The framework shows deep learning’s potential for non-invasive, accessible, and explainable arsenicosis diagnosis from mobile images, serving as practical diagnostic aid in resource-limited communities for early detection.

Abstract: Background: Arsenicosis is a serious public health concern in South and Southeast Asia, primarily caused by long-term consumption of arsenic-contaminated water. Its early cutaneous manifestations are clinically significant but often underdiagnosed, particularly in rural areas with limited access to dermatologists. Automated, image-based diagnostic solutions can support early detection and timely interventions. Methods: In this study, we propose an end-to-end framework for arsenicosis diagnosis using mobile phone-captured skin images. A dataset comprising 20 classes and over 11,000 images of arsenic-induced and other dermatological conditions was curated. Multiple deep learning architectures, including convolutional neural networks (CNNs) and Transformer-based models, were benchmarked for arsenicosis detection. Model interpretability was integrated via LIME and Grad-CAM, while deployment feasibility was demonstrated through a web-based diagnostic tool. Results: Transformer-based models significantly outperformed CNNs, with the Swin Transformer achieving the best results (86% accuracy). LIME and Grad-CAM visualizations confirmed that the models attended to lesion-relevant regions, increasing clinical transparency and aiding in error analysis. The framework also demonstrated strong performance on external validation samples, confirming its ability to generalize beyond the curated dataset. Conclusion: The proposed framework demonstrates the potential of deep learning for non-invasive, accessible, and explainable diagnosis of arsenicosis from mobile-acquired images. By enabling reliable image-based screening, it can serve as a practical diagnostic aid in rural and resource-limited communities, where access to dermatologists is scarce, thereby supporting early detection and timely intervention.

[124] Quantifying Accuracy of an Event-Based Star Tracker via Earth’s Rotation

Dennis Melamed, Connor Hashemi, Scott McCloskey

Main category: cs.CV

TL;DR: Event-based cameras achieve 18.47 arcsecond accuracy for star tracking using Earth’s rotation as ground truth, demonstrating potential for low-cost, low-latency attitude determination.

DetailsMotivation: Event-based cameras show promise for star tracking but lack accurate ground truth validation in previous studies. This research aims to quantify their accuracy using Earth's predictable rotation as reference.

Method: Static event camera mounted on ground telescope captures night sky events. Earth’s rotation provides known motion reference. Event streams processed for orientation estimates and compared against IERS Earth orientation measurements.

Result: Achieved a root-mean-squared across error of 18.47 arcseconds and an about error of 78.84 arcseconds. Demonstrates that event cameras can provide accurate attitude determination while offering benefits such as sparser data, higher dynamic range, lower energy consumption, and faster update rates.

Conclusion: Event cameras are viable for low-cost, low-latency star tracking applications, with accuracy sufficient for many attitude determination needs while providing computational and power advantages over traditional framing sensors.

Abstract: Event-based cameras (EBCs) are a promising new technology for star tracking-based attitude determination, but prior studies have struggled to determine accurate ground truth for real data. We analyze the accuracy of an EBC star tracking system utilizing the Earth’s motion as the ground truth for comparison. The Earth rotates in a regular way with very small irregularities which are measured to the level of milli-arcseconds. By keeping an event camera static and pointing it through a ground-based telescope at the night sky, we create a system where the only camera motion in the celestial reference frame is that induced by the Earth’s rotation. The resulting event stream is processed to generate estimates of orientation which we compare to the International Earth Rotation and Reference System (IERS) measured orientation of the Earth. The event camera system is able to achieve a root mean squared across error of 18.47 arcseconds and an about error of 78.84 arcseconds. Combined with the other benefits of event cameras over framing sensors (reduced computation due to sparser data streams, higher dynamic range, lower energy consumption, faster update rates), this level of accuracy suggests the utility of event cameras for low-cost and low-latency star tracking. We provide all code and data used to generate our results: https://gitlab.kitware.com/nest-public/telescope_accuracy_quantification.
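
The evaluation idea reduces to a simple comparison: a static ground-based camera rotates with the Earth, so the expected rotation angle over time is known to high precision and residuals against it quantify tracker error. The NumPy sketch below uses a constant sidereal rate as the reference; the paper instead compares against IERS-measured Earth orientation, and the data layout here is an assumption.

```python
# Minimal sketch: RMS error of estimated rotation angles against Earth's sidereal rotation.
import numpy as np

SIDEREAL_RATE_ARCSEC_PER_S = 360.0 * 3600.0 / 86164.0905  # ~15.041 arcsec per second

def rms_pointing_error(timestamps_s, estimated_angles_arcsec):
    """Compare estimated rotation angles (about the celestial pole) to the angles
    implied by Earth's rotation since the first timestamp."""
    t = np.asarray(timestamps_s, dtype=float)
    expected = SIDEREAL_RATE_ARCSEC_PER_S * (t - t[0])
    residual = np.asarray(estimated_angles_arcsec, dtype=float) - expected
    return np.sqrt(np.mean(residual ** 2))
```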

[125] Handling Multiple Hypotheses in Coarse-to-Fine Dense Image Matching

Matthieu Vilain, Rémi Giraud, Yannick Berthoumieu, Guillaume Bourmaud

Main category: cs.CV

TL;DR: BEAMER introduces a beam search strategy for dense image matching that preserves multiple correspondent hypotheses per pixel across scales, improving robustness at depth discontinuities and strong zoom-in scenarios.

DetailsMotivation: Current dense matching methods produce single correspondent hypotheses per pixel, which fails in challenging cases like depth discontinuities or strong zoom-in where neighboring correspondences are widely spread, leading to erroneous matches.

Method: Proposes a beam search strategy to propagate multiple hypotheses at each scale and integrates these multiple hypotheses into cross-attention layers, creating a novel architecture that learns to preserve and propagate multiple correspondences.

Result: BEAMER becomes significantly more robust than state-of-the-art methods, particularly excelling at depth discontinuities and when the target image is a strong zoom-in of the source image.

Conclusion: Predicting multiple correspondent hypotheses per source location using beam search and cross-attention integration provides substantial improvements in dense image matching robustness for challenging scenarios.

Abstract: Dense image matching aims to find a correspondent for every pixel of a source image in a partially overlapping target image. State-of-the-art methods typically rely on a coarse-to-fine mechanism where a single correspondent hypothesis is produced per source location at each scale. In challenging cases – such as at depth discontinuities or when the target image is a strong zoom-in of the source image – the correspondents of neighboring source locations are often widely spread and predicting a single correspondent hypothesis per source location at each scale may lead to erroneous matches. In this paper, we investigate the idea of predicting multiple correspondent hypotheses per source location at each scale instead. We consider a beam search strategy to propagate multiple hypotheses at each scale and propose integrating these multiple hypotheses into cross-attention layers, resulting in a novel dense matching architecture called BEAMER. BEAMER learns to preserve and propagate multiple hypotheses across scales, making it significantly more robust than state-of-the-art methods, especially at depth discontinuities or when the target image is a strong zoom-in of the source image.
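
The multi-hypothesis step itself is simple to illustrate: at a given scale, keep the top-k target locations per source pixel from a correlation volume instead of a single argmax. The PyTorch sketch below shows only that selection; how BEAMER feeds the hypotheses into cross-attention across scales is specific to the paper and not reproduced here.

```python
# Sketch: top-k correspondence hypotheses per source location from a correlation volume.
import torch
import torch.nn.functional as F

def topk_hypotheses(src_feat, tgt_feat, k=4):
    """src_feat, tgt_feat: (B,C,H,W). Returns top-k scores and flat target indices."""
    b, c, h, w = src_feat.shape
    src = F.normalize(src_feat.flatten(2), dim=1)              # (B,C,HW)
    tgt = F.normalize(tgt_feat.flatten(2), dim=1)              # (B,C,HW)
    corr = torch.einsum("bcs,bct->bst", src, tgt)              # (B,HW_src,HW_tgt)
    scores, idx = corr.topk(k, dim=-1)                         # k hypotheses per source pixel
    return scores, idx                                         # (B,HW,k), (B,HW,k)
```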

[126] GeneVA: A Dataset of Human Annotations for Generative Text to Video Artifacts

Jenna Kang, Maria Silva, Patsorn Sangkloy, Kenneth Chen, Niall Williams, Qi Sun

Main category: cs.CV

TL;DR: GeneVA is a large-scale annotated dataset for benchmarking spatio-temporal artifacts in text-to-video generation models, addressing the lack of systematic evaluation tools for video generation quality.

DetailsMotivation: Existing benchmarks focus on image generation, but video generation introduces unique spatio-temporal complexities and artifacts like impossible physics and temporal inconsistency that need systematic evaluation.

Method: Created a large-scale dataset (GeneVA) with rich human annotations focusing on spatio-temporal artifacts in videos generated from natural text prompts.

Result: Provides a comprehensive benchmark for evaluating text-to-video generation models, enabling systematic assessment of artifacts and quality issues.

Conclusion: GeneVA fills a critical gap in video generation evaluation and can assist in benchmarking model performance and improving generative video quality.

Abstract: Recent advances in probabilistic generative models have extended capabilities from static image synthesis to text-driven video generation. However, the inherent randomness of their generation process can lead to unpredictable artifacts, such as impossible physics and temporal inconsistency. Progress in addressing these challenges requires systematic benchmarks, yet existing datasets primarily focus on generative images due to the unique spatio-temporal complexities of videos. To bridge this gap, we introduce GeneVA, a large-scale artifact dataset with rich human annotations that focuses on spatio-temporal artifacts in videos generated from natural text prompts. We hope GeneVA can enable and assist critical applications, such as benchmarking model performance and improving generative video quality.

[127] RewardDance: Reward Scaling in Visual Generation

Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, Yan Zeng, Weilin Huang

Main category: cs.CV

TL;DR: RewardDance introduces a scalable generative reward modeling framework that aligns with VLM architectures, enabling scaling to 26B parameters and solving reward hacking issues in visual generation tasks.

DetailsMotivation: Existing reward models for visual generation have limitations - CLIP-based RMs suffer architectural constraints, Bradley-Terry losses misalign with VLM next-token prediction, and RLHF suffers from reward hacking where models exploit flaws without improving true quality.

Method: RewardDance reformulates reward scoring as the model’s probability of predicting a “yes” token when the generated image outperforms a reference image, creating intrinsic alignment with VLM architectures. This enables scaling across model size (up to 26B parameters) and context (instructions, references, chain-of-thought).

Result: Significantly surpasses state-of-the-art methods in text-to-image, text-to-video, and image-to-video generation. Large-scale RMs maintain high reward variance during RL fine-tuning, proving resistance to hacking and ability to produce diverse, high-quality outputs.

Conclusion: RewardDance successfully addresses fundamental limitations in visual generation reward modeling, enabling scalable reward models that overcome reward hacking and mode collapse problems, with strong performance across multiple visual generation tasks.

Abstract: Reward Models (RMs) are critical for improving generation models via Reinforcement Learning (RL), yet the RM scaling paradigm in visual generation remains largely unexplored. This is primarily due to fundamental limitations in existing approaches: CLIP-based RMs suffer from architectural and input modality constraints, while prevalent Bradley-Terry losses are fundamentally misaligned with the next-token prediction mechanism of Vision-Language Models (VLMs), hindering effective scaling. More critically, the RLHF optimization process is plagued by the reward hacking issue, where models exploit flaws in the reward signal without improving true quality. To address these challenges, we introduce RewardDance, a scalable reward modeling framework that overcomes these barriers through a novel generative reward paradigm. By reformulating the reward score as the model’s probability of predicting a “yes” token, indicating that the generated image outperforms a reference image according to specific criteria, RewardDance intrinsically aligns reward objectives with VLM architectures. This alignment unlocks scaling across two dimensions: (1) Model Scaling: Systematic scaling of RMs up to 26 billion parameters; (2) Context Scaling: Integration of task-specific instructions, reference examples, and chain-of-thought (CoT) reasoning. Extensive experiments demonstrate that RewardDance significantly surpasses state-of-the-art methods in text-to-image, text-to-video, and image-to-video generation. Crucially, we resolve the persistent challenge of “reward hacking”: Our large-scale RMs exhibit and maintain high reward variance during RL fine-tuning, proving their resistance to hacking and ability to produce diverse, high-quality outputs. It greatly relieves the mode collapse problem that plagues smaller models.
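
The generative reward formulation amounts to reading off the probability of a "yes" token after a comparison prompt. The sketch below is heavily hedged: `vlm` and `tokenizer` are hypothetical stand-ins with an assumed calling convention, not a specific released API.

```python
# Hedged sketch of a "probability of yes" generative reward under assumed interfaces.
import torch

@torch.no_grad()
def generative_reward(vlm, tokenizer, prompt_ids, candidate_img, reference_img):
    """Returns P("yes") for the next token after a candidate-vs-reference comparison prompt."""
    logits = vlm(images=(candidate_img, reference_img), input_ids=prompt_ids)  # assumed: (1, vocab)
    probs = torch.softmax(logits[0], dim=-1)
    yes_id = tokenizer.convert_tokens_to_ids("yes")
    return probs[yes_id].item()
```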

[128] SAFT: Shape and Appearance of Fabrics from Template via Differentiable Physical Simulations from Monocular Video

David Stotko, Reinhard Klein

Main category: cs.CV

TL;DR: Novel approach combining 3D geometry reconstruction and appearance estimation for fabrics using only monocular RGB video, with improved reconstruction accuracy and realistic renderings.

DetailsMotivation: To address the challenge of reconstructing 3D dynamic scenes, particularly fabrics, from monocular RGB video while overcoming depth ambiguity issues and achieving high-quality deformations and renderings.

Method: Combines physical simulation of cloth geometry with differentiable rendering, introducing two novel regularization terms to improve reconstruction plausibility by addressing depth ambiguity in monocular video.

Result: Reduced 3D reconstruction error by a factor of 2.64 compared to recent methods, with medium runtime of 30 min per scene. Achieved sufficient motion quality for appearance estimation, recovering sharp details from single monocular RGB video.

Conclusion: The proposed system successfully performs both 3D reconstruction and appearance estimation for fabrics using only monocular video, demonstrating significant improvements in reconstruction accuracy while maintaining practical runtime.

Abstract: The reconstruction of three-dimensional dynamic scenes is a well-established yet challenging task within the domain of computer vision. In this paper, we propose a novel approach that combines the domains of 3D geometry reconstruction and appearance estimation for physically based rendering and present a system that is able to perform both tasks for fabrics, utilizing only a single monocular RGB video sequence as input. In order to obtain realistic and high-quality deformations and renderings, a physical simulation of the cloth geometry and differentiable rendering are employed. In this paper, we introduce two novel regularization terms for the 3D reconstruction task that improve the plausibility of the reconstruction by addressing the depth ambiguity problem in monocular video. In comparison with the most recent methods in the field, we have reduced the error in the 3D reconstruction by a factor of 2.64 while requiring a medium runtime of 30 min per scene. Furthermore, the optimized motion achieves sufficient quality to perform an appearance estimation of the deforming object, recovering sharp details from this single monocular RGB video.

[129] Maximizing Information in Domain-Invariant Representation Improves Transfer Learning

Adrian Shuai Li, Elisa Bertino, Xuan-Hong Dang, Ankush Singla, Yuhai Tu, Mark N Wegman

Main category: cs.CV

TL;DR: MaxDIRep improves domain adaptation by using KL divergence to minimize domain-dependent information, ensuring domain-independent representations retain label-relevant features for better target classification.

DetailsMotivation: Existing methods like DSN use weak orthogonality constraints that can cause label-relevant features to be encoded in domain-dependent rather than domain-independent representations, leading to suboptimal adaptation.

Method: Applies KL divergence constraint to minimize information content in domain-dependent representations, forcing domain-independent representations to retain both domain-invariant and predictive features.

Result: Outperforms existing methods on standard image benchmarks and network intrusion detection tasks, works with pretrained models, and generalizes to non-image classification.

Conclusion: MaxDIRep effectively addresses limitations of previous domain separation approaches by using stronger information constraints to ensure domain-independent representations contain necessary classification features.

Abstract: We propose MaxDIRep, a domain adaptation method that improves the decomposition of data representations into domain-independent and domain-dependent components. Existing methods, such as Domain-Separation Networks (DSN), use a weak orthogonality constraint between these components, which can lead to label-relevant features being partially encoded in the domain-dependent representation (DDRep) rather than the domain-independent representation (DIRep). As a result, information crucial for target-domain classification may be missing from the DIRep. MaxDIRep addresses this issue by applying a Kullback-Leibler (KL) divergence constraint to minimize the information content of the DDRep, thereby encouraging the DIRep to retain features that are both domain-invariant and predictive of target labels. Through geometric analysis and an ablation study on synthetic datasets, we show why DSN’s weaker constraint can lead to suboptimal adaptation. Experiments on standard image benchmarks and a network intrusion detection task demonstrate that MaxDIRep achieves strong performance, works with pretrained models, and generalizes to non-image classification tasks.
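
The core constraint is straightforward to write down: if the domain-dependent representation (DDRep) is parameterised as a Gaussian posterior, penalising its KL divergence to a standard normal squeezes its information content, pushing label-relevant features into the domain-independent representation (DIRep). The PyTorch sketch below shows that term alongside classification and reconstruction losses; the architecture and weightings are assumptions.

```python
# Simplified sketch of a MaxDIRep-style objective: classify from DIRep, reconstruct
# from DIRep + DDRep, and minimise the information carried by DDRep via a KL term.
import torch
import torch.nn.functional as F

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over dimensions, averaged over the batch."""
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=1).mean()

def maxdirep_style_loss(logits, labels, recon, x, ddrep_mu, ddrep_logvar, kl_w=1.0):
    cls = F.cross_entropy(logits, labels)                 # classification from DIRep
    rec = F.mse_loss(recon, x)                            # reconstruction from DIRep + DDRep
    kl = kl_to_standard_normal(ddrep_mu, ddrep_logvar)    # squeeze DDRep's information content
    return cls + rec + kl_w * kl
```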

[130] From Channel Bias to Feature Redundancy: Uncovering the “Less is More” Principle in Few-Shot Learning

Ji Zhang, Xu Luo, Lianli Gao, Difan Zou, Hengtao Shen, Jingkuan Song

Main category: cs.CV

TL;DR: Deep neural networks suffer from channel bias in few-shot learning, causing feature redundancy where most feature dimensions are harmful. Using only 1-5% of discriminative features improves accuracy. Proposed AFIA method uses augmented data to estimate feature importance and mitigate this issue.

DetailsMotivation: Networks fail to adapt to novel tasks under distribution shifts in few-shot settings due to channel bias - rigid emphasis on source task features that misalign with novel task needs, leading to feature redundancy.

Method: Proposed Augmented Feature Importance Adjustment (AFIA), a soft-masking method that estimates feature importance from augmented data to address channel bias and feature redundancy.

Result: Classification accuracy significantly improves by using only 1-5% of most discriminative feature dimensions, revealing majority are harmful. Theoretical analysis confirms redundancy originates from confounding features with high intra-class variance but low inter-class separability.

Conclusion: The ’less is more’ phenomenon characterizes few-shot learning. AFIA provides practical solution and establishes foundational principle for few-shot representation transfer, enabling more robust few-shot learning algorithms.

Abstract: Deep neural networks often fail to adapt representations to novel tasks under distribution shifts, especially when only a few examples are available. This paper identifies a core obstacle behind this failure: channel bias, where networks develop a rigid emphasis on feature dimensions that were discriminative for the source task, but this emphasis is misaligned and fails to adapt to the distinct needs of a novel task. This bias leads to a striking and detrimental consequence: feature redundancy. We demonstrate that for few-shot tasks, classification accuracy is significantly improved by using as few as 1-5% of the most discriminative feature dimensions, revealing that the vast majority are actively harmful. Our theoretical analysis confirms that this redundancy originates from confounding feature dimensions (those with high intra-class variance but low inter-class separability), which are especially problematic in low-data regimes. This “less is more” phenomenon is a defining characteristic of the few-shot setting, diminishing as more samples become available. To address this, we propose a simple yet effective soft-masking method, Augmented Feature Importance Adjustment (AFIA), which estimates feature importance from augmented data to mitigate the issue. By establishing the cohesive link from channel bias to its consequence of extreme feature redundancy, this work provides a foundational principle for few-shot representation transfer and a practical method for developing more robust few-shot learning algorithms.
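
In the spirit of the soft-masking idea, the NumPy sketch below scores each feature dimension by inter-class separability relative to intra-class variance, estimated from (augmented) support features, and converts the score into a soft mask. The exact importance estimator and masking used by AFIA may differ; this is an illustrative reading only.

```python
# Sketch: soft feature-importance mask estimated from augmented support features.
import numpy as np

def soft_feature_mask(support_feats, support_labels, temperature=1.0, eps=1e-8):
    """support_feats: (N, D) features (including augmented copies); support_labels: (N,)."""
    classes = np.unique(support_labels)
    class_means = np.stack([support_feats[support_labels == c].mean(0) for c in classes])
    inter = class_means.var(axis=0)                                     # (D,) between-class variance
    intra = np.mean([support_feats[support_labels == c].var(axis=0) for c in classes], axis=0)
    score = inter / (intra + eps)                                       # separability per dimension
    mask = 1.0 / (1.0 + np.exp(-(score - score.mean()) / (temperature * score.std() + eps)))
    return mask  # (D,) soft mask to multiply into support and query features
```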

[131] Learning Robust Representations via Bidirectional Transition for Visual Reinforcement Learning

Xiaobo Hu, Youfang Lin, Yue Liu, Jinwen Wang, Shuo Wang, Hehe Fan, Kai Lv

Main category: cs.CV

TL;DR: BiT model uses bidirectional prediction (forward and backward) of environmental transitions to extract reliable visual representations for reinforcement learning, showing strong generalization and sample efficiency.

DetailsMotivation: Extracting reliable and generalizable representations from high-dimensional visual observations remains challenging in visual RL. Inspired by human thought processes that predict future and trace history for reliable comprehension.

Method: Introduces Bidirectional Transition (BiT) model that leverages bidirectional prediction of environmental transitions (both forward and backward) to extract reliable representations.

Result: Demonstrates competitive generalization performance and sample efficiency on DeepMind Control suite, with additional validation on robotic manipulation and CARLA simulators showing wide applicability.

Conclusion: Bidirectional prediction of transitions provides an effective approach for extracting reliable visual representations in reinforcement learning, enabling better generalization across different environments.

Abstract: Visual reinforcement learning has proven effective in solving control tasks with high-dimensional observations. However, extracting reliable and generalizable representations from vision-based observations remains a central challenge. Inspired by the human thought process, when the representation extracted from the observation can predict the future and trace history, the representation is reliable and accurate in comprehending the environment. Based on this concept, we introduce a Bidirectional Transition (BiT) model, which leverages the ability to bidirectionally predict environmental transitions both forward and backward to extract reliable representations. Our model demonstrates competitive generalization performance and sample efficiency on two settings of the DeepMind Control suite. Additionally, we utilize robotic manipulation and CARLA simulators to demonstrate the wide applicability of our method.
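
The bidirectional objective can be summarised in a few lines: a forward head predicts the next latent state from the current latent and action, a backward head predicts the previous latent from the next one, and both errors train the encoder. The PyTorch sketch below is a rough rendering under assumed network shapes and an MSE loss, not the paper's exact objective.

```python
# Rough sketch of a bidirectional-transition auxiliary loss for representation learning.
import torch
import torch.nn.functional as F

def bidirectional_transition_loss(encoder, fwd_model, bwd_model, obs_t, obs_tp1, action):
    z_t, z_tp1 = encoder(obs_t), encoder(obs_tp1)
    pred_next = fwd_model(torch.cat([z_t, action], dim=-1))     # forward prediction of z_{t+1}
    pred_prev = bwd_model(torch.cat([z_tp1, action], dim=-1))   # backward prediction of z_t
    return F.mse_loss(pred_next, z_tp1.detach()) + F.mse_loss(pred_prev, z_t.detach())
```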

[132] Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation

Zifu Wan, Pingping Zhang, Yuhao Wang, Silong Yong, Simon Stepputtis, Katia Sycara, Yaqi Xie

Main category: cs.CV

TL;DR: Sigma is a Siamese Mamba network for multi-modal semantic segmentation that achieves global receptive fields with linear complexity using State Space Models, outperforming CNN and ViT approaches on RGB-Thermal and RGB-Depth tasks.

DetailsMotivation: To enhance AI perception in adverse conditions by leveraging complementary modalities like thermal and depth alongside RGB, overcoming limitations of CNNs (local receptive fields) and ViTs (quadratic complexity).

Method: Uses Siamese encoder with Mamba-based fusion mechanism to select essential information from different modalities, plus a decoder for enhanced channel-wise modeling.

Result: Demonstrates superiority on RGB-Thermal and RGB-Depth semantic segmentation tasks, marking the first successful application of SSMs in multi-modal perception.

Conclusion: Sigma provides robust multi-modal segmentation with global receptive fields and linear complexity, advancing perception capabilities in challenging environments.

Abstract: Multi-modal semantic segmentation significantly enhances AI agents’ perception and scene understanding, especially under adverse conditions like low-light or overexposed environments. Leveraging additional modalities (X-modality) like thermal and depth alongside traditional RGB provides complementary information, enabling more robust and reliable prediction. In this work, we introduce Sigma, a Siamese Mamba network for multi-modal semantic segmentation utilizing the advanced Mamba. Unlike conventional methods that rely on CNNs, with their limited local receptive fields, or Vision Transformers (ViTs), which offer global receptive fields at the cost of quadratic complexity, our model achieves global receptive fields with linear complexity. By employing a Siamese encoder and innovating a Mamba-based fusion mechanism, we effectively select essential information from different modalities. A decoder is then developed to enhance the channel-wise modeling ability of the model. Our proposed method is rigorously evaluated on both RGB-Thermal and RGB-Depth semantic segmentation tasks, demonstrating its superiority and marking the first successful application of State Space Models (SSMs) in multi-modal perception tasks. Code is available at https://github.com/zifuwan/Sigma.

[133] PriorCLIP: Visual Prior Guided Vision-Language Model for Remote Sensing Image-Text Retrieval

Jiancheng Pan, Muyuan Ma, Qing Ma, Cong Bai, Shengyong Chen

Main category: cs.CV

TL;DR: PriorCLIP is a visual prior-guided vision-language model that addresses semantic noise and domain shifts in remote sensing image-text retrieval through progressive attention encoders and a two-stage prior representation learning strategy.

DetailsMotivation: Remote sensing image-text retrieval faces challenges from semantic noise and domain shifts in both closed-domain and open-domain scenarios, requiring unbiased representation learning and adaptive vision-language alignment.

Method: Uses Progressive Attention Encoders (Spatial-PAE and Temporal-PAE) for closed-domain retrieval, and a two-stage prior representation learning strategy (pre-training on coarse-grained pairs + fine-tuning on fine-grained pairs) for open-domain retrieval, with cluster-based symmetric contrastive Attribution Loss.

Result: Achieves 4.9% and 4.0% improvements in closed-domain retrieval, and 7.3% and 9.4% improvements in open-domain retrieval on RSICD and RSITMD benchmarks compared to existing methods.

Conclusion: PriorCLIP effectively addresses semantic bias and domain shifts in remote sensing image-text retrieval through visual prior guidance and adaptive alignment strategies, demonstrating significant performance improvements in both closed and open-domain scenarios.

Abstract: Remote sensing image-text retrieval plays a crucial role in remote sensing interpretation, yet remains challenging under both closed-domain and open-domain scenarios due to semantic noise and domain shifts. To address these issues, we propose a visual prior-guided vision-language model, PriorCLIP, which leverages visual priors for unbiased representation learning and adaptive vision-language alignment. In the closed-domain setting, PriorCLIP introduces two Progressive Attention Encoder (PAE) structures: Spatial-PAE constructs a belief matrix with instruction embeddings to filter key features and mitigate semantic bias. At the same time, Temporal-PAE exploits cyclic activation across time steps to enhance text representation. For the open-domain setting, we design a two-stage prior representation learning strategy, consisting of large-scale pre-training on coarse-grained image-text pairs, followed by fine-tuning on fine-grained pairs using vision-instruction, which enables robust retrieval across long-tail concepts and vocabulary shifts. Furthermore, a cluster-based symmetric contrastive Attribution Loss is proposed to constrain inter-class relations and alleviate semantic confusion in the shared embedding space. Extensive experiments on RSICD and RSITMD benchmarks demonstrate that PriorCLIP achieves substantial improvements, outperforming existing methods by 4.9% and 4.0% in closed-domain retrieval, and by 7.3% and 9.4% in open-domain retrieval, respectively.

[134] Vision Transformer with Sparse Scan Prior

Yuguang Zhang, Qihang Fan, Huaibo Huang

Main category: cs.CV

TL;DR: Sparse Scan Self-Attention (S³A) mechanism inspired by human eye’s sparse scanning, reducing computational overhead while maintaining performance in vision transformers.

DetailsMotivation: Transformers have high computational costs due to global modeling, unlike the human eye's efficient sparse scanning mechanism. The goal is to reduce computational load while maintaining performance.

Method: Proposed Sparse Scan Self-Attention (S³A) that predefines Anchors of Interest for each token and uses local attention around these anchors. Built SSViT (Sparse Scan Vision Transformer) based on this mechanism.

Result: SSViT achieves 84.4%/85.7% top-1 accuracy on ImageNet with only 4.4G/18.2G FLOPs, without extra supervision or training data. Also excels in object detection, instance segmentation, semantic segmentation, and shows strong robustness.

Conclusion: The S³A mechanism successfully mimics human eye efficiency, significantly reducing computational costs while achieving state-of-the-art performance across various vision tasks.

Abstract: In recent years, Transformers have achieved remarkable progress in computer vision tasks. However, their global modeling often comes with substantial computational overhead, in stark contrast to the human eye’s efficient information processing. Inspired by the human eye’s sparse scanning mechanism, we propose a Sparse Scan Self-Attention mechanism (S³A). This mechanism predefines a series of Anchors of Interest for each token and employs local attention to efficiently model the spatial information around these anchors, avoiding redundant global modeling and excessive focus on local information. This approach mirrors the human eye’s functionality and significantly reduces the computational load of vision models. Building on S³A, we introduce the Sparse Scan Vision Transformer (SSViT). Extensive experiments demonstrate the outstanding performance of SSViT across a variety of tasks. Specifically, on ImageNet classification, without additional supervision or training data, SSViT achieves top-1 accuracies of 84.4%/85.7% with 4.4G/18.2G FLOPs. SSViT also excels in downstream tasks such as object detection, instance segmentation, and semantic segmentation. Its robustness is further validated across diverse datasets.
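
The core computation can be approximated by restricting each query to a sparse subset of keys gathered around anchor positions. The 1-D toy below uses a fixed stride and window; the paper's anchors live on 2-D feature maps and its exact layout is not reproduced here, so treat this as a sketch of the sparsity pattern only.

```python
import torch

def sparse_anchor_attention(q, k, v, anchor_stride=4, window=2):
    """Toy 1-D sparse-scan attention: every query attends only to keys in small
    windows around evenly spaced anchor positions instead of the full sequence.
    q, k, v: (B, N, D) tensors."""
    B, N, D = q.shape
    anchors = torch.arange(0, N, anchor_stride, device=q.device)
    offsets = torch.arange(-window, window + 1, device=q.device)
    idx = (anchors[:, None] + offsets[None, :]).clamp(0, N - 1).reshape(-1).unique()
    k_s, v_s = k[:, idx], v[:, idx]                    # (B, M, D) sparse keys/values
    attn = torch.softmax(q @ k_s.transpose(1, 2) / D ** 0.5, dim=-1)   # (B, N, M)
    return attn @ v_s                                  # (B, N, D)
```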

[135] Have Large Vision-Language Models Mastered Art History?

Ombretta Strafforello, Derya Soydaner, Michiel Willems, Anne-Sofie Maerten, Stefanie De Winter

Main category: cs.CV

TL;DR: This paper evaluates whether large Vision-Language Models (VLMs) can classify painting styles, authors, and creation dates (tasks traditionally mastered by art historians) using zero-shot classification with three models (CLIP, LLaVA, GPT-4o).

DetailsMotivation: To test if VLMs' multimodal reasoning can address complex art historical challenges that require contextual and stylistic interpretation rather than simple object recognition, which has been a domain of human art experts.

Method: Conducted zero-shot classification experiments using three VLMs (CLIP, LLaVA, GPT-4o) on two artwork image benchmarks, evaluating style, author, and time period classification while analyzing prompt sensitivity and failure cases.

Result: The study provides the first comprehensive analysis of VLMs’ performance on art historical classification tasks, comparing their capabilities to human expertise through misclassification analysis.

Conclusion: The research examines whether current VLMs can effectively reason about the historical and stylistic attributes of paintings, offering insights into their classification patterns and limitations compared to human art historical expertise.

Abstract: The emergence of large Vision-Language Models (VLMs) has established new baselines in image classification across multiple domains. We examine whether their multimodal reasoning can also address a challenge mastered by human experts. Specifically, we test whether VLMs can classify the style, author and creation date of paintings, a domain traditionally mastered by art historians. Artworks pose a unique challenge compared to natural images due to their inherently complex and diverse structures, characterized by variable compositions and styles. This requires a contextual and stylistic interpretation rather than straightforward object recognition. Art historians have long studied the unique aspects of artworks, with style prediction being a crucial component of their discipline. This paper investigates whether large VLMs, which integrate visual and textual data, can effectively reason about the historical and stylistic attributes of paintings. We present the first study of its kind, conducting an in-depth analysis of three VLMs, namely CLIP, LLaVA, and GPT-4o, evaluating their zero-shot classification of art style, author and time period. Using two image benchmarks of artworks, we assess the models' ability to interpret style, evaluate their sensitivity to prompts, and examine failure cases. Additionally, we focus on how these models compare to human art historical expertise by analyzing misclassifications, providing insights into their reasoning and classification patterns.
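
For context, zero-shot style classification with CLIP follows the standard prompt-matching recipe sketched below. The checkpoint name, prompt template, style list, and image path are illustrative placeholders rather than the paper's setup, and LLaVA and GPT-4o would instead be queried through their own chat interfaces.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Zero-shot art-style classification with CLIP (illustrative placeholders).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

styles = ["Impressionism", "Baroque", "Cubism", "Abstract Expressionism"]
prompts = [f"a painting in the style of {s}" for s in styles]
image = Image.open("painting.jpg")  # hypothetical input file

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(styles, probs[0].tolist())))
```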

[136] A Chinese Continuous Sign Language Dataset Based on Complex Environments

Qidan Zhu, Jing Li, Fei Yuan, Jiaojiao Fan, Quan Gan

Main category: cs.CV

TL;DR: New Chinese sign language dataset (CE-CSL) with complex real-world backgrounds and a time-frequency network (TFNet) model that improves continuous sign language recognition performance in challenging environments.

DetailsMotivation: Existing sign language datasets are limited to controlled lab environments with uniform lighting, lacking the diversity and complexity of real-life scenarios, which creates a bottleneck for practical CSLR applications.

Method: Created CE-CSL dataset with 5,988 video clips from daily life scenes featuring 70+ complex backgrounds. Proposed TFNet model that extracts frame-level features and separately processes temporal and spectral information before fusion for robust CSLR.

Result: Significant performance improvements on the CE-CSL dataset, demonstrating effectiveness under complex background conditions. Also achieved highly competitive results on three publicly available CSL datasets.

Conclusion: The CE-CSL dataset addresses the lack of real-world complexity in existing sign language data, and the TFNet model effectively handles complex background challenges, advancing continuous sign language recognition towards practical real-world applications.

Abstract: The current bottleneck in continuous sign language recognition (CSLR) research lies in the fact that most publicly available datasets are limited to laboratory environments or television program recordings, resulting in a single background environment with uniform lighting, which significantly deviates from the diversity and complexity found in real-life scenarios. To address this challenge, we have constructed a new, large-scale dataset for Chinese continuous sign language (CSL) based on complex environments, termed the Complex Environment Chinese Sign Language dataset (CE-CSL). This dataset encompasses 5,988 continuous CSL video clips collected from daily life scenes, featuring more than 70 different complex backgrounds to ensure representativeness and generalization capability. To tackle the impact of complex backgrounds on CSLR performance, we propose a time-frequency network (TFNet) model for continuous sign language recognition. This model extracts frame-level features and then utilizes both temporal and spectral information to separately derive sequence features before fusion, aiming to achieve efficient and accurate CSLR. Experimental results demonstrate that our approach achieves significant performance improvements on the CE-CSL, validating its effectiveness under complex background conditions. Additionally, our proposed method has also yielded highly competitive results when applied to three publicly available CSL datasets.
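
The abstract only states that temporal and spectral sequence features are derived separately and then fused; one minimal reading of such a head is sketched below, with a recurrent branch for temporal context and an FFT-magnitude branch for spectral content. All layer choices and dimensions are assumptions, not TFNet's actual architecture.

```python
import torch
import torch.nn as nn

class TemporalSpectralFusion(nn.Module):
    """Sketch of a time-frequency sequence head over frame-level features (B, T, D)."""
    def __init__(self, d, hidden=256):
        super().__init__()
        self.temporal = nn.GRU(d, hidden, batch_first=True, bidirectional=True)
        self.spectral = nn.Linear(d, 2 * hidden)
        self.fuse = nn.Linear(4 * hidden, hidden)

    def forward(self, x):                        # x: (B, T, D)
        t_feat, _ = self.temporal(x)             # (B, T, 2H) temporal context
        spec = torch.fft.rfft(x, dim=1).abs()    # magnitude spectrum over time
        s_feat = self.spectral(spec.mean(dim=1, keepdim=True)).expand(-1, x.size(1), -1)
        return self.fuse(torch.cat([t_feat, s_feat], dim=-1))   # (B, T, H)
```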

[137] ALOcc: Adaptive Lifting-Based 3D Semantic Occupancy and Cost Volume-Based Flow Predictions

Dubing Chen, Jin Fang, Wencheng Han, Xinjing Cheng, Junbo Yin, Chenzhong Xu, Fahad Shahbaz Khan, Jianbing Shen

Main category: cs.CV

TL;DR: A vision-based framework for 3D semantic occupancy and flow prediction with three key improvements: occlusion-aware adaptive lifting with depth denoising, 3D-2D semantic consistency with joint prototypes, and BEV-centric cost volume for joint prediction.

DetailsMotivation: To improve spatiotemporal scene understanding through robust 3D semantic occupancy and flow prediction, addressing challenges like depth dependency, long-tail class problems, and diverse motion scales.

Method: Uses occlusion-aware adaptive lifting with depth denoising for robust 2D-to-3D feature transformation, enforces 3D-2D semantic consistency via jointly optimized prototypes with confidence-aware sampling, and employs BEV-centric cost volume with hybrid classification-regression supervision.

Result: Achieves new state-of-the-art performance on multiple benchmarks for both semantic occupancy and joint occupancy semantic-flow prediction, with real-time versions outperforming existing methods in both speed and accuracy.

Conclusion: The proposed purely convolutional framework provides superior performance and practical viability, offering a spectrum of efficiency-performance trade-offs for real-world applications.

Abstract: 3D semantic occupancy and flow prediction are fundamental to spatiotemporal scene understanding. This paper proposes a vision-based framework with three targeted improvements. First, we introduce an occlusion-aware adaptive lifting mechanism incorporating depth denoising. This enhances the robustness of 2D-to-3D feature transformation while mitigating reliance on depth priors. Second, we enforce 3D-2D semantic consistency via jointly optimized prototypes, using confidence- and category-aware sampling to address the long-tail classes problem. Third, to streamline joint prediction, we devise a BEV-centric cost volume to explicitly correlate semantic and flow features, supervised by a hybrid classification-regression scheme that handles diverse motion scales. Our purely convolutional architecture establishes new SOTA performance on multiple benchmarks for both semantic occupancy and joint occupancy semantic-flow prediction. We also present a family of models offering a spectrum of efficiency-performance trade-offs. Our real-time version exceeds all existing real-time methods in speed and accuracy, ensuring its practical viability.

[138] Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models

Ce Zhang, Zifu Wan, Zhehan Kan, Martin Q. Ma, Simon Stepputtis, Deva Ramanan, Russ Salakhutdinov, Louis-Philippe Morency, Katia Sycara, Yaqi Xie

Main category: cs.CV

TL;DR: DeGF uses text-to-image generative models to provide self-feedback that helps Large Vision-Language Models reduce hallucinations by verifying and correcting responses through complementary/contrastive decoding.

DetailsMotivation: LVLMs often generate hallucinatory text responses that don't align with visual inputs, limiting their practical real-world applications. The inverse relationship between text-to-image generation and image-conditioned response generation suggests generative models could help mitigate hallucinations.

Method: Self-correcting Decoding with Generative Feedback (DeGF) - a training-free algorithm that generates images from initial LVLM responses, then uses these images as auxiliary visual references to verify and correct the initial responses through complementary or contrastive decoding.

Result: Extensive experiments show DeGF effectively mitigates diverse types of hallucinations and consistently outperforms state-of-the-art methods across six benchmarks.

Conclusion: Text-to-image generative models can provide valuable self-feedback at both response and token levels to help LVLMs reduce hallucinations, making DeGF an effective training-free solution for improving LVLM reliability.

Abstract: While recent Large Vision-Language Models (LVLMs) have shown remarkable performance in multi-modal tasks, they are prone to generating hallucinatory text responses that do not align with the given visual input, which restricts their practical applicability in real-world scenarios. In this work, inspired by the observation that the text-to-image generation process is the inverse of image-conditioned response generation in LVLMs, we explore the potential of leveraging text-to-image generative models to assist in mitigating hallucinations in LVLMs. We discover that generative models can offer valuable self-feedback for mitigating hallucinations at both the response and token levels. Building on this insight, we introduce self-correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process to effectively mitigate hallucinations in LVLMs. Specifically, DeGF generates an image from the initial response produced by LVLMs, which acts as an auxiliary visual reference and provides self-feedback to verify and correct the initial response through complementary or contrastive decoding. Extensive experimental results validate the effectiveness of our approach in mitigating diverse types of hallucinations, consistently surpassing state-of-the-art methods across six benchmarks. Code is available at https://github.com/zhangce01/DeGF.
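
At the token level, the correction amounts to combining two next-token distributions: one conditioned on the real input image and one conditioned on the image regenerated from the initial response. A simplified decoding step is sketched below; the rule DeGF actually uses to switch between complementary and contrastive decoding, and the value of alpha, are assumptions here.

```python
import torch

def degf_step_logits(logits_orig, logits_gen, alpha=1.0, mode="contrastive"):
    """One decoding step of generative-feedback correction (sketch).

    logits_orig: next-token logits conditioned on the real input image.
    logits_gen:  next-token logits conditioned on the image re-generated from
                 the model's initial response.
    """
    if mode == "complementary":     # agreeing evidence: blend the two streams
        return logits_orig + alpha * logits_gen
    # contrastive: push away from tokens supported only by the (possibly
    # hallucinated) self-generated image
    return (1 + alpha) * logits_orig - alpha * logits_gen
```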

[139] GloFinder: AI-empowered QuPath Plugin for WSI-level Glomerular Detection, Visualization, and Curation

Jialin Yue, Tianyuan Yao, Ruining Deng, Siqi Lu, Junlin Guo, Quan Liu, Mengmeng Yin, Juming Xiong, Haichun Yang, Yuankai Huo

Main category: cs.CV

TL;DR: GloFinder is a QuPath plugin that enables single-click automated glomeruli detection from whole slide images using CircleNet and Weighted Circle Fusion ensemble method, making AI-powered kidney pathology analysis accessible to non-programmers like clinicians.

DetailsMotivation: Existing open-source tools for glomeruli detection require programming skills and lack flexibility, hindering accessibility for clinicians who need user-friendly interfaces and adjustable confidence levels.

Method: Uses CircleNet (anchor-free detection framework with circle representations) trained on ~160,000 annotated glomeruli, combined with Weighted Circle Fusion ensemble method to refine predictions through confidence score combination.

Result: Achieves superior performance in glomerular detection with direct visualization and editing capabilities in QuPath’s graphical interface, enabling seamless interaction for clinical use.

Conclusion: GloFinder provides an accessible, powerful tool for nephropathology research and clinical practice by bridging the gap between advanced AI detection and clinician-friendly interfaces.

Abstract: Artificial intelligence (AI) has demonstrated significant success in automating the detection of glomeruli, the key functional units of the kidney, from whole slide images (WSIs) in kidney pathology. However, existing open-source tools are often distributed as source code or Docker containers, requiring advanced programming skills that hinder accessibility for non-programmers, such as clinicians. Additionally, current models are typically trained on a single dataset and lack flexibility in adjusting confidence levels for predictions. To overcome these challenges, we introduce GloFinder, a QuPath plugin designed for single-click automated glomeruli detection across entire WSIs with online editing through the graphical user interface (GUI). GloFinder employs CircleNet, an anchor-free detection framework utilizing circle representations for precise object localization, with models trained on approximately 160,000 manually annotated glomeruli. To further enhance accuracy, the plugin incorporates Weighted Circle Fusion (WCF), an ensemble method that combines confidence scores from multiple CircleNet models to produce refined predictions, achieving superior performance in glomerular detection. GloFinder enables direct visualization and editing of results in QuPath, facilitating seamless interaction for clinicians and providing a powerful tool for nephropathology research and clinical practice. Code and the QuPath plugin are available at https://github.com/hrlblab/GloFinder
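
The ensemble step can be pictured as confidence-weighted averaging of overlapping circle detections from several models, as in the rough sketch below. The greedy grouping rule, the overlap test, and the score aggregation are assumptions; the plugin's actual Weighted Circle Fusion may differ.

```python
import numpy as np

def weighted_circle_fusion(detections):
    """Confidence-weighted fusion of circle detections from several models.

    detections: list over models, each an (N_i, 4) array of (cx, cy, r, score).
    Sketch only: circles are greedily grouped by a crude centre-distance test
    and fused by score-weighted averaging of centre and radius."""
    circles = np.concatenate(detections, axis=0)
    order = circles[:, 3].argsort()[::-1]          # process high-confidence first
    fused, used = [], np.zeros(len(circles), dtype=bool)
    for i in order:
        if used[i]:
            continue
        cx, cy, r, _ = circles[i]
        d = np.hypot(circles[:, 0] - cx, circles[:, 1] - cy)
        overlap = (d < (circles[:, 2] + r) * 0.5) & ~used   # crude overlap test
        group = circles[overlap]
        w = group[:, 3] / group[:, 3].sum()
        fused.append(np.concatenate([(group[:, :3] * w[:, None]).sum(0),
                                     [group[:, 3].mean()]]))
        used |= overlap
    return np.stack(fused)
```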

[140] TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition

Xingsong Ye, Yongkun Du, Yunbo Tao, Zhineng Chen

Main category: cs.CV

TL;DR: TextSSR is a novel pipeline for synthesizing high-quality scene text recognition training data that addresses accuracy, realism, and scalability through region-centric text generation with position-glyph enhancement and character-aware diffusion architecture.

DetailsMotivation: Scene text recognition suffers from limited realistic synthetic training data and difficulty collecting sufficient high-quality real-world data, while existing diffusion-based text generation methods struggle with accurate instance-level text synthesis at scale.

Method: Proposes TextSSR pipeline with region-centric text generation with position-glyph enhancement for accuracy, uses contextual hints for style/appearance realism, and employs character-aware diffusion architecture for precise character-level control without natural language prompts.

Result: Created TextSSR-F dataset with 3.55 million quality-screened text instances. STR models trained on TextSSR-F outperform those on existing synthetic datasets by clear margins on benchmarks, with further improvements when mixed with real-world data.

Conclusion: TextSSR effectively addresses the challenges of text data synthesis by achieving accuracy, realism, and scalability, providing a superior training dataset for scene text recognition models.

Abstract: Scene text recognition (STR) suffers from challenges of either less realistic synthetic training data or the difficulty of collecting sufficient high-quality real-world data, limiting the effectiveness of trained models. Meanwhile, despite producing holistically appealing text images, diffusion-based visual text generation methods struggle to synthesize accurate and realistic instance-level text at scale. To tackle this, we introduce TextSSR: a novel pipeline for Synthesizing Scene Text Recognition training data. TextSSR targets three key synthesizing characteristics: accuracy, realism, and scalability. It achieves accuracy through a proposed region-centric text generation with position-glyph enhancement, ensuring proper character placement. It maintains realism by guiding style and appearance generation using contextual hints from surrounding text or background. This character-aware diffusion architecture enjoys precise character-level control and semantic coherence preservation, without relying on natural language prompts. Therefore, TextSSR supports large-scale generation through combinatorial text permutations. Based on these, we present TextSSR-F, a dataset of 3.55 million quality-screened text instances. Extensive experiments show that STR models trained on TextSSR-F outperform those trained on existing synthetic datasets by clear margins on common benchmarks, and further improvements are observed when mixed with real-world training data. Code is available at https://github.com/YesianRohn/TextSSR.

[141] F-Bench: Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation, Customization, and Restoration

Lu Liu, Huiyu Duan, Qiang Hu, Liu Yang, Chunlei Cai, Tianxiao Ye, Huayu Liu, Xiaoyun Zhang, Guangtao Zhai

Main category: cs.CV

TL;DR: FaceQ is a large-scale database with quality annotations for AI-generated faces, addressing the gap in comprehensive evaluation of face generation, customization, and restoration models.

DetailsMotivation: Current AI-generated faces often fail to meet human preferences due to distortions, unrealistic details, and identity shifts, highlighting the need for a comprehensive quality evaluation framework.

Method: Created FaceQ database with 12,255 images from 29 models across three tasks, annotated with 32,742 mean opinion scores from 180 annotators across multiple quality dimensions.

Result: Established F-Bench benchmark showing that existing quality assessment metrics are ineffective for evaluating authenticity, ID fidelity, and text-image correspondence in AI-generated faces.

Conclusion: FaceQ provides a valuable resource for evaluating AI-generated face quality and reveals limitations of current assessment metrics, enabling better model development and evaluation.

Abstract: Artificial intelligence generative models exhibit remarkable capabilities in content creation, particularly in face image generation, customization, and restoration. However, current AI-generated faces (AIGFs) often fall short of human preferences due to unique distortions, unrealistic details, and unexpected identity shifts, underscoring the need for a comprehensive quality evaluation framework for AIGFs. To address this need, we introduce FaceQ, a large-scale, comprehensive database of AI-generated Face images with fine-grained Quality annotations reflecting human preferences. The FaceQ database comprises 12,255 images generated by 29 models across three tasks: (1) face generation, (2) face customization, and (3) face restoration. It includes 32,742 mean opinion scores (MOSs) from 180 annotators, assessed across multiple dimensions: quality, authenticity, identity (ID) fidelity, and text-image correspondence. Using the FaceQ database, we establish F-Bench, a benchmark for comparing and evaluating face generation, customization, and restoration models, highlighting strengths and weaknesses across various prompts and evaluation dimensions. Additionally, we assess the performance of existing image quality assessment (IQA), face quality assessment (FQA), AI-generated content image quality assessment (AIGCIQA), and preference evaluation metrics, manifesting that these standard metrics are relatively ineffective in evaluating authenticity, ID fidelity, and text-image correspondence. The FaceQ database will be publicly available upon publication.

[142] RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events

Zhenyuan Chen, Chenxi Wang, Feng Zhang

Main category: cs.CV

TL;DR: Introduces RSCC dataset with 62,315 pre/post-disaster image pairs and detailed captions to address lack of temporal data in remote sensing for disaster monitoring.

DetailsMotivation: Existing remote sensing datasets lack temporal image pairs and detailed textual annotations, failing to capture dynamic disaster impacts over time.

Method: Created large-scale RSCC benchmark with 62,315 pre-/post-disaster image pairs spanning multiple disaster types, paired with human-like change captions.

Result: RSCC enables robust training and evaluation of vision-language models for disaster-aware bi-temporal understanding, facilitating detailed disaster-related analysis.

Conclusion: The dataset bridges temporal and semantic gaps in remote sensing data, paving the way for more accurate and interpretable vision-language applications in disaster monitoring.

Abstract: Remote sensing is critical for disaster monitoring, yet existing datasets lack temporal image pairs and detailed textual annotations. While single-snapshot imagery dominates current resources, it fails to capture dynamic disaster impacts over time. To address this gap, we introduce the Remote Sensing Change Caption (RSCC) dataset, a large-scale benchmark comprising 62,315 pre-/post-disaster image pairs (spanning earthquakes, floods, wildfires, and more) paired with rich, human-like change captions. By bridging the temporal and semantic divide in remote sensing data, RSCC enables robust training and evaluation of vision-language models for disaster-aware bi-temporal understanding. Our results highlight RSCC’s ability to facilitate detailed disaster-related analysis, paving the way for more accurate, interpretable, and scalable vision-language applications in remote sensing. Code and dataset are available at https://github.com/Bili-Sakura/RSCC.

[143] UAR-NVC: A Unified AutoRegressive Framework for Memory-Efficient Neural Video Compression

Jia Wang, Xinfeng Zhang, Gai Zhang, Jun Zhu, Lv Tang, Li Zhang

Main category: cs.CV

TL;DR: UAR-NVC is a unified autoregressive framework for neural video compression that reduces memory consumption by processing videos in clips with separate INR models, achieving better performance in resource-constrained environments.

DetailsMotivation: Traditional INR models for video compression suffer from high memory consumption as frame count increases, making them impractical for resource-constrained scenarios. The authors aim to combine the efficiency of traditional frame-by-frame compression with INR benefits.

Method: Proposes UAR-NVC framework that partitions videos into clips, processes each clip with different INR model instances, and includes modules to reduce temporal redundancy between clips. Supports adjustable latencies through variable clip lengths.

Result: Extensive experiments show UAR-NVC significantly improves performance compared to baseline models while adapting well to resource-constrained environments through flexible clip settings.

Conclusion: The unified autoregressive framework successfully addresses memory consumption issues in INR-based video compression while maintaining competitive performance and offering flexible adaptation to different resource constraints.

Abstract: Implicit Neural Representations (INRs) have demonstrated significant potential in video compression by representing videos as neural networks. However, as the number of frames increases, the memory consumption for training and inference increases substantially, posing challenges in resource-constrained scenarios. Inspired by the success of traditional video compression frameworks, which process video frame by frame and can efficiently compress long videos, we adopt this modeling strategy for INRs to decrease memory consumption, while aiming to unify the frameworks from the perspective of timeline-based autoregressive modeling. In this work, we present a novel understanding of INR models from an autoregressive (AR) perspective and introduce a Unified AutoRegressive Framework for memory-efficient Neural Video Compression (UAR-NVC). UAR-NVC integrates timeline-based and INR-based neural video compression under a unified autoregressive paradigm. It partitions videos into several clips and processes each clip using a different INR model instance, leveraging the advantages of both compression frameworks while allowing seamless adaptation to either form. To further reduce temporal redundancy between clips, we design two modules to optimize the initialization, training, and compression of these model parameters. UAR-NVC supports adjustable latencies by varying the clip length. Extensive experimental results demonstrate that UAR-NVC, with its flexible video clip setting, can adapt to resource-constrained environments and significantly improve performance compared to different baseline models. The project page: https://wj-inf.github.io/UAR-NVC-page/
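
Conceptually, encoding reduces to a clip-wise autoregressive loop, as in the sketch below. Here fit_inr is a hypothetical routine that fits one INR instance per clip and returns its parameters, and warm-starting from the previous clip stands in for the paper's inter-clip redundancy-reduction modules.

```python
def uar_nvc_encode(frames, clip_len, fit_inr):
    """Fit one INR instance per clip, warm-started from the previous clip."""
    clip_params, prev = [], None
    for start in range(0, len(frames), clip_len):
        clip = frames[start:start + clip_len]
        prev = fit_inr(clip, init=prev)      # per-clip model instance
        clip_params.append(prev)             # compress/store each clip's parameters
    return clip_params
```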

[144] GNF: Gaussian Neural Fields for Multidimensional Signal Representation and Reconstruction

Abdelaziz Bouzidi, Hamid Laga, Hazem Wannous, Ferdous Sohel

Main category: cs.CV

TL;DR: Gaussian Neural Fields (GNF) replace traditional MLP decoders with a single layer of Gaussian kernels for faster, more compact neural field representations of 2D, 3D, and 5D signals.

DetailsMotivation: Traditional neural fields require wide and deep MLPs that are slow to train and evaluate. Existing acceleration techniques compromise memory efficiency, optimization time, or the continuous nature of neural fields.

Method: GNF uses a compact neural decoder that maps learned feature grids into continuous signals using a single layer of Gaussian kernels defined in high-dimensional feature space, replacing MLP-based decoders.

Result: GNF achieves high accuracy for 2D RGB, 3D geometry, and 5D radiance fields with far fewer parameters. Training takes under 15 seconds for 3D geometry and under 11 minutes for view synthesis, with significantly higher inference throughput.

Conclusion: Gaussian kernels provide an efficient alternative to MLP decoders, enabling compact, highly parallelizable neural field representations that maintain accuracy while dramatically reducing training time and computational requirements.

Abstract: Neural fields have emerged as a powerful framework for representing continuous multidimensional signals such as images and videos, 3D and 4D objects and scenes, and radiance fields. While efficient, achieving high-quality representation requires the use of wide and deep neural networks. These, however, are slow to train and evaluate. Although several acceleration techniques have been proposed, they either trade memory for faster training and/or inference, rely on thousands of fitted primitives with considerable optimization time, or compromise the smooth, continuous nature of neural fields. In this paper, we introduce Gaussian Neural Fields (GNF), a novel compact neural decoder that maps learned feature grids into continuous non-linear signals, such as RGB images, Signed Distance Functions (SDFs), and radiance fields, using a single compact layer of Gaussian kernels defined in a high-dimensional feature space. Our key observation is that neurons in traditional MLPs perform simple computations, usually a dot product followed by an activation function, necessitating wide and deep MLPs or high-resolution feature grids to model complex functions. In this paper, we show that replacing MLP-based decoders with Gaussian kernels whose centers are learned features yields highly accurate representations of 2D (RGB), 3D (geometry), and 5D (radiance fields) signals with just a single layer of such kernels. This representation is highly parallelizable, operates on low-resolution grids, and trains in under 15 seconds for 3D geometry and under 11 minutes for view synthesis. GNF matches the accuracy of deep MLP-based decoders with far fewer parameters and significantly higher inference throughput.
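
The decoder replaces the MLP with a single layer of Gaussian kernels applied to interpolated grid features; a compact reading of that idea is sketched below. The kernel parameterisation (isotropic, log-sigma) and initialisation are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GaussianDecoder(nn.Module):
    """Single-layer Gaussian-kernel decoder over grid features (sketch)."""
    def __init__(self, feat_dim, n_kernels=64, out_dim=3):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_kernels, feat_dim))
        self.log_sigma = nn.Parameter(torch.zeros(n_kernels))
        self.weights = nn.Parameter(torch.randn(n_kernels, out_dim) * 0.01)

    def forward(self, f):                         # f: (..., feat_dim) interpolated grid features
        d2 = ((f.unsqueeze(-2) - self.centers) ** 2).sum(-1)            # (..., K)
        k = torch.exp(-0.5 * d2 / torch.exp(self.log_sigma) ** 2)       # Gaussian responses
        return k @ self.weights                   # (..., out_dim), e.g. RGB or SDF value
```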

[145] Reangle-A-Video: 4D Video Generation as Video-to-Video Translation

Hyeonho Jeong, Suhyeon Lee, Jong Chul Ye

Main category: cs.CV

TL;DR: Reangle-A-Video is a unified framework that generates synchronized multi-view videos from a single input video using video-to-video translation with diffusion priors, avoiding the need for large 4D datasets.

DetailsMotivation: To create a more efficient approach for multi-view video generation that doesn't require training on large-scale 4D datasets like mainstream methods, leveraging existing image and video diffusion models instead.

Method: Two-stage approach: 1) Multi-view motion learning through self-supervised fine-tuning of image-to-video diffusion transformer to distill view-invariant motion from warped videos; 2) Multi-view consistent image-to-image translation using DUSt3R for cross-view consistency guidance to generate consistent starting images from the first frame.

Result: The method surpasses existing approaches in both static view transport and dynamic camera control tasks, establishing a new state-of-the-art solution for multi-view video generation.

Conclusion: Reangle-A-Video provides an effective alternative to traditional 4D dataset training methods, demonstrating superior performance while being more resource-efficient through its video-to-video translation framework.

Abstract: We introduce Reangle-A-Video, a unified framework for generating synchronized multi-view videos from a single input video. Unlike mainstream approaches that train multi-view video diffusion models on large-scale 4D datasets, our method reframes the multi-view video generation task as video-to-videos translation, leveraging publicly available image and video diffusion priors. In essence, Reangle-A-Video operates in two stages. (1) Multi-View Motion Learning: An image-to-video diffusion transformer is synchronously fine-tuned in a self-supervised manner to distill view-invariant motion from a set of warped videos. (2) Multi-View Consistent Image-to-Images Translation: The first frame of the input video is warped and inpainted into various camera perspectives under an inference-time cross-view consistency guidance using DUSt3R, generating multi-view consistent starting images. Extensive experiments on static view transport and dynamic camera control show that Reangle-A-Video surpasses existing methods, establishing a new solution for multi-view video generation. We will publicly release our code and data. Project page: https://hyeonho99.github.io/reangle-a-video/

[146] LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation

Yang Zhou, Shiyu Zhao, Yuxiao Chen, Zhenting Wang, Can Jin, Dimitris N. Metaxas

Main category: cs.CV

TL;DR: LED enhances open-vocabulary object detection by fusing LLM hidden states into detectors using zero-initialized cross-attention adapters, achieving significant performance gains with minimal computational overhead.

DetailsMotivation: Existing methods using synthetic training data from foundation models introduce bias and overfit to specific prompts. The paper explores direct fusion of LLM hidden states into detectors as an under-explored alternative.

Method: Proposes LED (LLM Enhanced Open-Vocabulary Object Detection) with zero-initialized cross-attention adapter to efficiently fuse knowledge from LLM decoder layers into object detectors. Focuses on adapting early LLM layers which encode rich spatial semantics.

Result: With Swin-T vision encoder, Qwen2-0.5B + LED improves GroundingDINO by 3.82% on OmniLabel with only 8.7% extra GFLOPs. Larger vision backbone pushes improvement to 6.22%. Extensive ablations validate design choices.

Conclusion: Intermediate LLM layers encode valuable spatial semantics for visual grounding. LED provides an efficient and effective approach for enhancing open-vocabulary object detection through direct LLM knowledge fusion.

Abstract: Large foundation models trained on large-scale vision-language data can boost Open-Vocabulary Object Detection (OVD) via synthetic training data, yet the hand-crafted pipelines often introduce bias and overfit to specific prompts. We sidestep this issue by directly fusing hidden states from Large Language Models (LLMs) into detectors, an avenue surprisingly under-explored. This paper presents a systematic method to enhance visual grounding by utilizing decoder layers of the LLM of an MLLM. We introduce a zero-initialized cross-attention adapter to enable efficient knowledge fusion from LLMs to object detectors, a new approach called LED (LLM Enhanced Open-Vocabulary Object Detection). We find that intermediate LLM layers already encode rich spatial semantics; adapting only the early layers yields most of the gain. With Swin-T as the vision encoder, Qwen2-0.5B + LED lifts GroundingDINO by 3.82% on OmniLabel at just 8.7% extra GFLOPs, and a larger vision backbone pushes the improvement to 6.22%. Extensive ablations on adapter variants, LLM scales and fusion depths further corroborate our design.
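
A zero-initialized cross-attention adapter can be read as a residual branch whose output projection starts at zero, so the detector is unchanged at initialisation and gradually learns to pull in LLM hidden states. The sketch below is an assumed minimal form (single attention layer, no gating or normalisation), not LED's actual module.

```python
import torch
import torch.nn as nn

class ZeroInitCrossAttnAdapter(nn.Module):
    """Cross-attention adapter fusing LLM hidden states into detector features (sketch)."""
    def __init__(self, det_dim, llm_dim, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(det_dim, n_heads, kdim=llm_dim,
                                          vdim=llm_dim, batch_first=True)
        self.out = nn.Linear(det_dim, det_dim)
        nn.init.zeros_(self.out.weight)   # zero-init: the adapter is a no-op at the start
        nn.init.zeros_(self.out.bias)

    def forward(self, det_tokens, llm_hidden):
        # det_tokens: (B, N, det_dim); llm_hidden: (B, L, llm_dim) from an LLM decoder layer
        fused, _ = self.attn(det_tokens, llm_hidden, llm_hidden)
        return det_tokens + self.out(fused)   # residual injection of LLM knowledge
```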

[147] Towards properties of adversarial image perturbations

Egor Kuznetsov, Kirill Aistov, Maxim Koroteev

Main category: cs.CV

TL;DR: Adversarial perturbations that significantly increase the VMAF image quality metric while maintaining acceptable PSNR and subjective quality, showing discrepancies between metric values and human perception.

DetailsMotivation: To investigate how adversarial perturbations can artificially inflate VMAF scores while maintaining image quality, and to reveal discrepancies between objective metrics and subjective judgments.

Method: Used stochastic gradient approach with direct VMAF optimization in PyTorch, analyzed perturbation structure through Fourier power spectrum computations, and studied perturbations under different PSNR constraints.

Result: Found that moderate brightness variations (~10 pixel units) can increase VMAF by ~60% without noticeable subjective quality degradation, and perturbations show linear dependence on image brightness.

Conclusion: Demonstrates significant vulnerabilities in VMAF metric to adversarial attacks and highlights the gap between objective metrics and human perceptual quality assessment.

Abstract: Using a stochastic gradient approach, we study the properties of adversarial perturbations resulting in noticeable growth of the VMAF image quality metric. The structure of the perturbations is investigated depending on the acceptable PSNR values and based on the Fourier power spectrum computations for the perturbations. It is demonstrated that a moderate variation of image brightness (~10 pixel units) in a restricted region of an image can result in VMAF growth by ~60%. Unlike some other methods demonstrating similar VMAF growth, the subjective quality of an image remains almost unchanged. It is also shown that the adversarial perturbations may demonstrate approximately linear dependence of perturbation amplitudes on the image brightness. The perturbations are studied based on the direct VMAF optimization in PyTorch. The significant discrepancies between the metric values and subjective judgements are also demonstrated when image restoration from noise is carried out using the same direct VMAF optimization.
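
The attack pattern described here is essentially projected gradient ascent on a differentiable quality metric, sketched below. A PyTorch-differentiable VMAF implementation is assumed (passed in as metric_fn) and not provided, and enforcing the PSNR constraint by rescaling the perturbation energy is one of several possible projections.

```python
import torch

def adversarial_metric_perturbation(img, metric_fn, steps=200, lr=1e-2, min_psnr=40.0):
    """Gradient-ascent perturbation that inflates a quality metric (sketch).

    metric_fn(ref, dist) must be differentiable; img is assumed to lie in [0, 1].
    The PSNR constraint PSNR >= min_psnr is equivalent to MSE <= 10**(-min_psnr/10)."""
    max_mse = 10.0 ** (-min_psnr / 10.0)
    delta = torch.zeros_like(img, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -metric_fn(img, (img + delta).clamp(0, 1))   # maximise the metric
        loss.backward()
        opt.step()
        with torch.no_grad():                               # project back onto the PSNR ball
            mse = delta.pow(2).mean()
            if mse > max_mse:
                delta.mul_(torch.sqrt(max_mse / mse))
    return (img + delta).clamp(0, 1).detach()
```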

[148] CamC2V: Context-aware Controllable Video Generation

Luis Denninger, Sina Mokhtarzadeh Azar, Juergen Gall

Main category: cs.CV

TL;DR: CamC2V is a context-to-video model that integrates multiple image conditions with 3D constraints and camera control to improve video generation quality and coherence while maintaining faithful scene representation.

DetailsMotivation: Existing image-to-video models animate static images without extending beyond provided context, and adding camera constraints often degrades visual quality, limiting their applicability for tasks requiring accurate scene representation.

Method: Proposes CamC2V that integrates multiple image conditions as context with 3D constraints alongside camera control to enrich both global semantics and fine-grained visual details, emphasizing temporal awareness for effective context representation.

Result: Comprehensive study on RealEstate10K dataset demonstrates improvements in visual quality and camera controllability compared to existing approaches.

Conclusion: The proposed CamC2V model enables more coherent and context-aware video generation with better camera control while maintaining high visual quality, addressing limitations of current image-to-video diffusion models.

Abstract: Recently, image-to-video (I2V) diffusion models have demonstrated impressive scene understanding and generative quality, incorporating image conditions to guide generation. However, these models primarily animate static images without extending beyond their provided context. Introducing additional constraints, such as camera trajectories, can enhance diversity but often degrade visual quality, limiting their applicability for tasks requiring faithful scene representation. We propose CamC2V, a context-to-video (C2V) model that integrates multiple image conditions as context with 3D constraints alongside camera control to enrich both global semantics and fine-grained visual details. This enables more coherent and context-aware video generation. Moreover, we motivate the necessity of temporal awareness for an effective context representation. Our comprehensive study on the RealEstate10K dataset demonstrates improvements in visual quality and camera controllability. We will publish our code upon acceptance.

[149] TerraMind: Large-Scale Generative Multimodality for Earth Observation

Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, Rahul Ramachandran, Paolo Fraccaro, Thomas Brunschwiler, Gabriele Cavallaro, Juan Bernabe-Moreno, Nicolas Longépé

Main category: cs.CV

TL;DR: TerraMind is the first any-to-any generative multimodal foundation model for Earth observation that uses dual-scale token and pixel representations, achieves SOTA performance, and enables zero-shot/few-shot applications with novel Thinking-in-Modalities capability.

DetailsMotivation: To create a comprehensive multimodal foundation model for Earth observation that can handle multiple geospatial modalities and enable various downstream applications through zero-shot and few-shot learning.

Method: Dual-scale pretraining approach combining token-level (contextual information) and pixel-level (spatial nuances) representations across nine geospatial modalities. Introduces Thinking-in-Modalities (TiM) for generating artificial data during finetuning and inference.

Result: Achieves beyond state-of-the-art performance on community-standard benchmarks like PANGAEA. Enables zero-shot and few-shot applications for Earth observation tasks.

Conclusion: TerraMind represents a significant advancement in Earth observation AI, providing an open-source foundation model with dual-scale representations and novel TiM capability that outperforms existing methods and supports diverse applications.

Abstract: We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind’s dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces “Thinking-in-Modalities” (TiM) – the capability of generating additional artificial data during finetuning and inference to improve the model output – and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.

[150] Physics-Driven Local-Whole Elastic Deformation Modeling for Point Cloud Representation Learning

Zhongyu Chen, Rong Zhao, Xie Han, Xindong Guo, Song Wang, Zherui Qiao

Main category: cs.CV

TL;DR: A physics-driven approach is integrated with data-driven methods to enhance point cloud representation by modeling relationships between local features and global structure through elastic deformation modeling.

DetailsMotivation: Existing point cloud learning methods focus only on spatial distribution and overlook the relationship between local information and global structure, limiting accuracy. Real-world object deformation shows that local changes propagate to affect the whole structure.

Method: Dual-task encoder-decoder framework combining data-driven implicit fields with physics-driven elastic deformation. Uses physics-based loss functions to predict localized deformation and capture correspondence between local changes and global shape variations.

Result: Outperforms existing approaches in object classification and segmentation tasks.

Conclusion: Incorporating physics-driven mechanisms effectively compensates for data-driven limitations, enhances generalization and interpretability, and improves point cloud representation accuracy for downstream tasks.

Abstract: Existing point cloud representation learning methods primarily rely on data-driven strategies to extract geometric information from large amounts of scattered data. However, most methods focus solely on the spatial distribution features of point clouds while overlooking the relationship between local information and the whole structure, which limits the accuracy of point cloud representation. Local information reflects the fine-grained variations of an object, while the whole structure is determined by the interaction and combination of these local features, collectively defining the object’s shape. In the real world, objects undergo deformation under external forces, and this deformation gradually affects the whole structure through the propagation of forces from local regions, thereby altering the object’s geometric features. Therefore, the appropriate introduction of a physics-driven mechanism can effectively compensate for the limitations of data-driven methods in structural modeling and significantly enhance the generalization and interpretability of point cloud representations in downstream tasks such as understanding and recognition. Inspired by this, we incorporate a physics-driven mechanism into the data-driven method to learn fine-grained features in point clouds and model the structural relationship between local regions and the whole shape. Specifically, we design a dual-task encoder-decoder framework that combines the geometric modeling capability of data-driven implicit fields with physics-driven elastic deformation. Through the integration of physics-based loss functions, the framework is guided to predict localized deformation and explicitly capture the correspondence between local structural changes and whole shape variations. Experimental results show that our method outperforms existing approaches in object classification and segmentation, demonstrating its effectiveness.

[151] Rethinking Random Masking in Self-Distillation on ViT

Jihyeon Seong, Hyunkyung Han

Main category: cs.CV

TL;DR: Random masking applied only to student’s global view in DINO framework improves attention maps and downstream performance while preserving clean supervision from teacher and local views.

DetailsMotivation: Random masking in self-distillation frameworks may eliminate critical semantic information, so researchers want to develop more informed masking strategies that preserve essential information while maintaining training efficiency.

Method: Apply random masking exclusively to student’s global view while preserving student’s local views and teacher’s global view in original unmasked forms, leveraging DINO’s multi-view augmentation scheme.

Result: Random masking under asymmetric setup yields more robust and fine-grained attention maps and enhances downstream performance on mini-ImageNet dataset using DINO-Tiny.

Conclusion: Asymmetric random masking strategy in self-distillation frameworks can effectively improve model robustness and performance while maintaining clean supervision signals.

Abstract: Vision Transformers (ViTs) have demonstrated remarkable performance across a wide range of vision tasks. In particular, self-distillation frameworks such as DINO have contributed significantly to these advances. Within such frameworks, random masking is often utilized to improve training efficiency and introduce regularization. However, recent studies have raised concerns that indiscriminate random masking may inadvertently eliminate critical semantic information, motivating the development of more informed masking strategies. In this study, we explore the role of random masking in the self-distillation setting, focusing on the DINO framework. Specifically, we apply random masking exclusively to the student’s global view, while preserving the student’s local views and the teacher’s global view in their original, unmasked forms. This design leverages DINO’s multi-view augmentation scheme to retain clean supervision while inducing robustness through masked inputs. We evaluate our approach using DINO-Tiny on the mini-ImageNet dataset and show that random masking under this asymmetric setup yields more robust and fine-grained attention maps, ultimately enhancing downstream performance.
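
The asymmetric masking itself is a small operation applied to the student's global crop only, as sketched below; the mask ratio, the use of a zero mask token, and where in the ViT the masking is applied are assumptions rather than the paper's exact configuration.

```python
import torch

def mask_student_global_tokens(patch_tokens, mask_ratio=0.4, mask_token=None):
    """Randomly mask patch tokens of the student's global view only (sketch).
    The teacher's global view and the student's local views stay unmasked.
    patch_tokens: (B, N, D) patch embeddings before the ViT blocks."""
    B, N, D = patch_tokens.shape
    if mask_token is None:
        mask_token = torch.zeros(D, device=patch_tokens.device)
    n_mask = int(N * mask_ratio)
    scores = torch.rand(B, N, device=patch_tokens.device)
    idx = scores.argsort(dim=1)[:, :n_mask]          # random positions per image
    mask = torch.zeros(B, N, dtype=torch.bool, device=patch_tokens.device)
    mask.scatter_(1, idx, True)
    out = patch_tokens.clone()
    out[mask] = mask_token                           # replace masked patches
    return out, mask
```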

[152] GenFlow: Interactive Modular System for Image Generation

Duc-Hung Nguyen, Huu-Phuc Huynh, Minh-Triet Tran, Trung-Nghia Le

Main category: cs.CV

TL;DR: GenFlow is a modular framework with node-based editor and NLP assistant that makes advanced generative art accessible to all skill levels by simplifying workflow creation and reducing technical barriers.

DetailsMotivation: Generative art has untapped potential due to technical expertise requirements for advanced architectural concepts and computational workflows, creating accessibility barriers for users.

Method: Developed GenFlow framework featuring a node-based editor for customization and an intelligent NLP-powered assistant to transform complex workflow creation into intuitive experience, with automated deployment processes.

Result: User study demonstrated optimized workflows, reduced task completion times, and enhanced user understanding through intuitive interface and adaptive features.

Conclusion: GenFlow is a groundbreaking solution that redefines accessibility and efficiency in generative art by making cutting-edge tools available to everyone regardless of technical skill level.

Abstract: Generative art unlocks boundless creative possibilities, yet its full potential remains untapped due to the technical expertise required for advanced architectural concepts and computational workflows. To bridge this gap, we present GenFlow, a novel modular framework that empowers users of all skill levels to generate images with precision and ease. Featuring a node-based editor for seamless customization and an intelligent assistant powered by natural language processing, GenFlow transforms the complexity of workflow creation into an intuitive and accessible experience. By automating deployment processes and minimizing technical barriers, our framework makes cutting-edge generative art tools available to everyone. A user study demonstrated GenFlow’s ability to optimize workflows, reduce task completion times, and enhance user understanding through its intuitive interface and adaptive features. These results position GenFlow as a groundbreaking solution that redefines accessibility and efficiency in the realm of generative art.

[153] SGDFuse: SAM-Guided Diffusion for High-Fidelity Infrared and Visible Image Fusion

Xiaoyang Zhang, Jinjiang Li, Guodong Fan, Yakun Ju, Linwei Fan, Jun Liu, Alex C. Kot

Main category: cs.CV

TL;DR: SGDFuse is a novel infrared and visible image fusion method that uses SAM-generated semantic masks to guide a conditional diffusion model, achieving state-of-the-art performance with explicit semantic directionality and high fidelity.

DetailsMotivation: Existing infrared and visible image fusion methods often fail to preserve key targets due to lack of deep semantic understanding and introduce artifacts/detail loss, compromising both image quality and downstream task performance.

Method: Two-stage conditional diffusion model guided by SAM: 1) preliminary fusion of multi-modal features, 2) uses SAM semantic masks with preliminary fused image as condition to drive diffusion model’s coarse-to-fine denoising generation.

Result: Extensive experiments show SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations, with strong adaptability to downstream tasks.

Conclusion: SGDFuse provides a powerful solution to core challenges in image fusion by ensuring explicit semantic directionality and high fidelity through SAM-guided conditional diffusion modeling.

Abstract: Infrared and visible image fusion (IVIF) aims to combine the thermal radiation information from infrared images with the rich texture details from visible images to enhance perceptual capabilities for downstream visual tasks. However, existing methods often fail to preserve key targets due to a lack of deep semantic understanding of the scene, while the fusion process itself can also introduce artifacts and detail loss, severely compromising both image quality and task performance. To address these issues, this paper proposes SGDFuse, a conditional diffusion model guided by the Segment Anything Model (SAM), to achieve high-fidelity and semantically-aware image fusion. The core of our method is to utilize high-quality semantic masks generated by SAM as explicit priors to guide the optimization of the fusion process via a conditional diffusion model. Specifically, the framework operates in a two-stage process: it first performs a preliminary fusion of multi-modal features, and then utilizes the semantic masks from SAM jointly with the preliminary fused image as a condition to drive the diffusion model’s coarse-to-fine denoising generation. This ensures the fusion process not only has explicit semantic directionality but also guarantees the high fidelity of the final result. Extensive experiments demonstrate that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations, as well as in its adaptability to downstream tasks, providing a powerful solution to the core challenges in image fusion. The code of SGDFuse is available at https://github.com/boshizhang123/SGDFuse.
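
The two-stage conditioning can be sketched as follows (a toy illustration under assumed interfaces, not the SGDFuse model): stage one produces a preliminary fusion, and stage two runs a reverse denoising loop conditioned on the SAM mask stacked with that preliminary result. The max-fusion rule, the blending "denoiser", and the step schedule are placeholders.

```python
import numpy as np

def preliminary_fuse(ir, vis):
    """Stage 1 placeholder: per-pixel max fusion of infrared and visible images."""
    return np.maximum(ir, vis)

def denoise_step(x_t, step, total_steps, condition):
    """Stage 2 placeholder denoiser: one reverse step conditioned on the stacked
    [SAM mask, preliminary fusion]. A real model is a learned network; here we
    simply blend the latent toward the conditioning signal, more strongly as the
    reverse process nears its end."""
    guidance = condition.mean(axis=0)
    alpha = (step + 1) / total_steps
    return (1.0 - alpha) * x_t + alpha * guidance

rng = np.random.default_rng(0)
ir, vis = rng.random((64, 64)), rng.random((64, 64))
sam_mask = (rng.random((64, 64)) > 0.5).astype(float)   # stand-in for a SAM semantic mask

pre_fused = preliminary_fuse(ir, vis)
condition = np.stack([sam_mask, pre_fused])              # explicit semantic prior + coarse fusion
x = rng.normal(size=(64, 64))                            # start the reverse process from noise
total_steps = 10
for step in range(total_steps):
    x = denoise_step(x, step, total_steps, condition)
```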

[154] MSNav: Zero-Shot Vision-and-Language Navigation with Dynamic Memory and LLM Spatial Reasoning

Chenghao Liu, Zhimu Zhou, Jiachen Zhang, Minghao Zhang, Songfang Huang, Huiling Duan

Main category: cs.CV

TL;DR: MSNav is a novel framework that addresses VLN challenges by integrating memory, spatial reasoning, and decision modules, achieving state-of-the-art performance on navigation benchmarks.

DetailsMotivation: Current VLN approaches using single LLMs suffer from poor spatial reasoning, weak cross-modal grounding, and memory overload in long-horizon tasks, requiring a more systematic solution.

Method: Proposes MSNav framework with three modules: Memory Module for dynamic map memory with selective pruning, Spatial Module for spatial reasoning and object relationships, and Decision Module for LLM-based path planning. Also introduces I-O-S dataset and fine-tunes Qwen3-4B into Qwen-Sp model.

Result: Achieves state-of-the-art performance on R2R and REVERIE datasets with significant improvements in Success Rate (SR) and Success weighted by Path Length (SPL). Qwen-Sp outperforms commercial LLMs in object list extraction with higher F1 and NDCG scores.

Conclusion: MSNav successfully addresses critical VLN vulnerabilities through its synergistic three-module architecture, transforming fragile inference into robust integrated intelligence for vision-and-language navigation tasks.

Abstract: Vision-and-Language Navigation (VLN) requires an agent to interpret natural language instructions and navigate complex environments. Current approaches often adopt a “black-box” paradigm, where a single Large Language Model (LLM) makes end-to-end decisions. However, this paradigm is plagued by critical vulnerabilities, including poor spatial reasoning, weak cross-modal grounding, and memory overload in long-horizon tasks. To systematically address these issues, we propose Memory Spatial Navigation (MSNav), a framework that fuses three modules into a synergistic architecture, which transforms fragile inference into a robust, integrated intelligence. MSNav integrates three modules: Memory Module, a dynamic map memory module that tackles memory overload through selective node pruning, enhancing long-range exploration; Spatial Module, a module for spatial reasoning and object relationship inference that improves endpoint recognition; and Decision Module, a module using LLM-based path planning to execute robust actions. To power the Spatial Module, we also introduce an Instruction-Object-Space (I-O-S) dataset and fine-tune the Qwen3-4B model into Qwen-Spatial (Qwen-Sp), which outperforms leading commercial LLMs in object list extraction, achieving higher F1 and NDCG scores on the I-O-S test set. Extensive experiments on the Room-to-Room (R2R) and REVERIE datasets demonstrate MSNav’s state-of-the-art performance with significant improvements in Success Rate (SR) and Success weighted by Path Length (SPL).

[155] GCAV: A Global Concept Activation Vector Framework for Cross-Layer Consistency in Interpretability

Zhenghao He, Sanchit Sinha, Guangzhi Xiong, Aidong Zhang

Main category: cs.CV

TL;DR: GCAV unifies CAVs across layers into a single consistent representation using contrastive learning and attention fusion, reducing TCAV variance and improving concept consistency.

DetailsMotivation: Existing CAVs computed independently at different layers show inconsistencies, making cross-layer comparisons unreliable for interpreting neural networks.

Method: Proposes Global Concept Activation Vector (GCAV) framework using contrastive learning to align concept representations across layers and attention-based fusion to create globally integrated CAVs.

Result: Significantly reduces variance in TCAV scores, preserves concept relevance, improves concept localization, and enhances robustness against adversarial perturbations.

Conclusion: GCAV provides a more comprehensive and interpretable understanding of how deep learning models encode human-defined concepts by integrating cross-layer information coherently.

Abstract: Concept Activation Vectors (CAVs) provide a powerful approach for interpreting deep neural networks by quantifying their sensitivity to human-defined concepts. However, when computed independently at different layers, CAVs often exhibit inconsistencies, making cross-layer comparisons unreliable. To address this issue, we propose the Global Concept Activation Vector (GCAV), a novel framework that unifies CAVs into a single, semantically consistent representation. Our method leverages contrastive learning to align concept representations across layers and employs an attention-based fusion mechanism to construct a globally integrated CAV. By doing so, our method significantly reduces the variance in TCAV scores while preserving concept relevance, ensuring more stable and reliable concept attributions. To evaluate the effectiveness of GCAV, we introduce Testing with Global Concept Activation Vectors (TGCAV) as a method to apply TCAV to GCAV-based representations. We conduct extensive experiments on multiple deep neural networks, demonstrating that our method effectively mitigates concept inconsistency across layers, enhances concept localization, and improves robustness against adversarial perturbations. By integrating cross-layer information into a coherent framework, our method offers a more comprehensive and interpretable understanding of how deep learning models encode human-defined concepts. Code and models are available at https://github.com/Zhenghao-He/GCAV.
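
A toy sketch of attention-based fusion of per-layer CAVs into a single global vector (illustrative only; the contrastive alignment stage and GCAV's actual attention parameterization are not reproduced, and the query vector and dimensions are assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_layer_cavs(per_layer_cavs, query):
    """Fuse per-layer concept activation vectors (assumed projected to a shared
    dimension d) into one global CAV, weighting layers by attention scores
    against a query vector."""
    V = np.stack(per_layer_cavs)                  # (num_layers, d)
    scores = V @ query / np.sqrt(V.shape[1])      # one attention logit per layer
    weights = softmax(scores)
    gcav = weights @ V                            # attention-weighted combination
    return gcav / np.linalg.norm(gcav), weights

rng = np.random.default_rng(0)
layer_cavs = [rng.normal(size=128) for _ in range(4)]   # CAVs from four layers
query = rng.normal(size=128)                            # stands in for a learned query
gcav, layer_weights = fuse_layer_cavs(layer_cavs, query)
```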

[156] Bidirectional Sparse Attention for Faster Video Diffusion Training

Chenlu Zhan, Wen Li, Chuyu Shen, Jun Zhang, Suhui Wu, Hao Zhang

Main category: cs.CV

TL;DR: BSA framework uses bidirectional sparse attention to dramatically speed up video DiT training by dynamically sparsifying both queries and key-value pairs, reducing FLOPs by 20x while maintaining generative quality.

DetailsMotivation: Video diffusion Transformer models face computational bottlenecks due to quadratic complexity of full attention, making high-resolution, long-duration video generation prohibitively expensive in training and inference.

Method: Proposes Bidirectional Sparse Attention (BSA) framework that dynamically sparsifies both queries (via semantic similarity selection) and key-value pairs (via statistical dynamic thresholding) within 3D full attention.

Result: Achieves up to 20x FLOPs reduction and 17.79x faster attention training while preserving or even surpassing the generative quality of full attention across long sequences.

Conclusion: BSA successfully overcomes computational bottlenecks in video DiTs by efficiently handling attention sparsity, enabling faster training and inference without compromising on generation quality.

Abstract: Video diffusion Transformer (DiT) models excel in generative quality but hit major computational bottlenecks when producing high-resolution, long-duration videos. The quadratic complexity of full attention leads to prohibitively high training and inference costs. Full attention inefficiency stems from two key challenges: excessive computation due to the inherent sparsity of Queries and Key-Value pairs, and redundant computation as fixed sparse patterns fail to leverage DiT’s dynamic attention. To overcome these limitations, we propose a Bidirectional Sparse Attention (BSA) framework for faster video DiT training, the first to dynamically sparsify both Queries and Key-Value pairs within 3D full attention, thereby substantially improving training and inference efficiency. BSA addresses these issues through two key components. Query sparsity is optimized by selecting the most informative query tokens via semantic similarity together with a dynamic spatial-time training strategy, while KV sparsity is achieved by computing a statistical dynamic threshold to retain only the most salient KV blocks for computation. Extensive experiments demonstrate that BSA significantly accelerates DiT training across long sequences, reducing FLOPs by up to 20x and achieving 17.79x faster attention training, while preserving or even surpassing the generative quality of full attention.
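
A minimal sketch of the two sparsification steps described above: keep a subset of queries and a subset of KV blocks, then run dense attention only over what survives. The query-scoring rule (similarity to the mean query) and the KV-block criterion (mean key norm against its block average) are simplified stand-ins for the paper's semantic-similarity selection and statistical dynamic threshold.

```python
import numpy as np

def bidirectional_sparse_attention(Q, K, V, q_keep=0.5, kv_block=16):
    """Illustrative sparse attention over (n, d) matrices: prune queries and KV
    blocks before the dense attention product."""
    n, d = Q.shape

    # Query sparsity: keep the queries most similar to the mean query
    # (a stand-in for semantic-similarity-based selection).
    q_scores = Q @ Q.mean(axis=0)
    q_idx = np.sort(np.argsort(q_scores)[-int(q_keep * n):])

    # KV sparsity: keep blocks whose mean key norm clears a dynamic,
    # statistics-based threshold (here: the mean over blocks).
    starts = range(0, n, kv_block)
    block_norms = np.array([np.linalg.norm(K[s:s + kv_block], axis=1).mean() for s in starts])
    keep_blocks = block_norms >= block_norms.mean()
    kv_idx = np.concatenate([np.arange(s, min(s + kv_block, n))
                             for s, keep in zip(starts, keep_blocks) if keep])

    # Dense attention only over the retained queries and KV entries.
    logits = Q[q_idx] @ K[kv_idx].T / np.sqrt(d)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = np.zeros_like(Q)
    out[q_idx] = weights @ V[kv_idx]              # unselected query positions stay zero here
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(64, 32)) for _ in range(3))
out = bidirectional_sparse_attention(Q, K, V)
```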

[157] PrediTree: A Multi-Temporal Sub-meter Dataset of Multi-Spectral Imagery Aligned With Canopy Height Maps

Hiyam Debary, Mustansar Fiaz, Levente Klein

Main category: cs.CV

TL;DR: PrediTree is the first open-source dataset for tree height prediction at sub-meter resolution, combining high-resolution LiDAR canopy height maps with multi-temporal multi-spectral imagery across diverse French forests.

DetailsMotivation: Address the critical gap in forest monitoring capabilities by enabling training of deep learning methods that can predict tree growth based on multiple past observations.

Method: Encoder-decoder framework using multi-temporal multi-spectral imagery and relative time differences to predict canopy height. U-Net architecture trained on the PrediTree dataset.

Result: U-Net achieved the lowest masked mean squared error of 11.78%, outperforming ResNet-50 by ~12% and RGB-only experiments by ~30%.

Conclusion: PrediTree dataset enables effective tree height prediction and is publicly available with processing and training codebases.

Abstract: We present PrediTree, the first comprehensive open-source dataset designed for training and evaluating tree height prediction models at sub-meter resolution. This dataset combines very high-resolution (0.5m) LiDAR-derived canopy height maps, spatially aligned with multi-temporal and multi-spectral imagery, across diverse forest ecosystems in France, totaling 3,141,568 images. PrediTree addresses a critical gap in forest monitoring capabilities by enabling the training of deep learning methods that can predict tree growth based on multiple past observations. To make use of this PrediTree dataset, we propose an encoder-decoder framework that takes the multi-temporal multi-spectral imagery together with the relative time differences, in years, between the canopy height map timestamp (target) and each image acquisition date, and predicts the canopy height. The conducted experiments demonstrate that a U-Net architecture trained on the PrediTree dataset provides the lowest masked mean squared error of 11.78%, outperforming the next-best architecture, ResNet-50, by around 12%, and cutting the error of the same experiments on fewer bands (red, green, blue only) by around 30%. This dataset is publicly available on https://huggingface.co/datasets/hiyam-d/PrediTree, and both processing and training codebases are available on GitHub.
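
The masked error can be illustrated with a small sketch: the squared error is averaged only over valid canopy-height pixels. The percentage normalization shown (relative to the mean squared target height) is an assumption for illustration, not the paper's exact definition of the metric.

```python
import numpy as np

def masked_mse(pred, target, mask):
    """Mean squared error restricted to valid (masked-in) pixels."""
    return float(((pred - target) ** 2)[mask].mean())

rng = np.random.default_rng(0)
target = rng.uniform(0.0, 30.0, size=(256, 256))        # canopy height in metres
pred = target + rng.normal(scale=2.0, size=target.shape)
valid = rng.random(target.shape) > 0.1                   # e.g. exclude no-data pixels

mse = masked_mse(pred, target, valid)
# One plausible way to express it as a percentage (an assumption, not the
# paper's definition): relative to the mean squared target height.
relative_mse = 100.0 * mse / float((target[valid] ** 2).mean())
print(f"masked MSE: {mse:.2f} m^2, relative: {relative_mse:.2f}%")
```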

[158] PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt Rewriting

Linqing Wang, Ximing Xing, Yiji Cheng, Zhiyuan Zhao, Jiale Tao, Qixun Wang, Ruihuang Li, Comi Chen, Xin Li, Mingrui Wu, Xinchi Deng, Chunyu Wang, Qinglin Lu

Main category: cs.CV

TL;DR: PromptEnhancer is a universal prompt rewriting framework that enhances text-to-image models by generating more precise prompts through reinforcement learning with a specialized reward model, improving image-text alignment without modifying model weights.

DetailsMotivation: Text-to-image diffusion models often fail to faithfully render complex user prompts, leading to mismatches between user intent and generated output, particularly in attribute binding, negation, and compositional relationships.

Method: A Chain-of-Thought rewriter trained through reinforcement learning guided by AlignEvaluator - a dedicated reward model providing explicit feedback based on 24 key points derived from common T2I failure modes.

Result: Extensive experiments on HunyuanImage 2.1 model show significant improvements in image-text alignment across various semantic and compositional challenges.

Conclusion: PromptEnhancer effectively enhances any pretrained T2I model without weight modifications and introduces a new human preference benchmark for future research.

Abstract: Recent advancements in text-to-image (T2I) diffusion models have demonstrated remarkable capabilities in generating high-fidelity images. However, these models often struggle to faithfully render complex user prompts, particularly in aspects like attribute binding, negation, and compositional relationships. This leads to a significant mismatch between user intent and the generated output. To address this challenge, we introduce PromptEnhancer, a novel and universal prompt rewriting framework that enhances any pretrained T2I model without requiring modifications to its weights. Unlike prior methods that rely on model-specific fine-tuning or implicit reward signals like image-reward scores, our framework decouples the rewriter from the generator. We achieve this by training a Chain-of-Thought (CoT) rewriter through reinforcement learning, guided by a dedicated reward model we term the AlignEvaluator. The AlignEvaluator is trained to provide explicit and fine-grained feedback based on a systematic taxonomy of 24 key points, which are derived from a comprehensive analysis of common T2I failure modes. By optimizing the CoT rewriter to maximize the reward from our AlignEvaluator, our framework learns to generate prompts that are more precisely interpreted by T2I models. Extensive experiments on the HunyuanImage 2.1 model demonstrate that PromptEnhancer significantly improves image-text alignment across a wide range of semantic and compositional challenges. Furthermore, we introduce a new, high-quality human preference benchmark to facilitate future research in this direction.

[159] Delta Velocity Rectified Flow for Text-to-Image Editing

Gaspard Beaudouin, Minghan Li, Jaeyeon Kim, Sung-Hoon Yoon, Mengyu Wang

Main category: cs.CV

TL;DR: DVRF is a novel inversion-free text-to-image editing framework that uses velocity field discrepancy modeling and time-dependent shifts to improve editing quality without architectural changes.

DetailsMotivation: To address over-smoothing artifacts in prior distillation sampling approaches for text-to-image editing and provide a principled theoretical framework that bridges score-based diffusion and velocity-based optimization methods.

Method: Distillation-based approach that explicitly models source-target velocity field discrepancies, introduces time-dependent shift terms to push noisy latents toward target trajectory, and operates within rectified flow models without inversion.

Result: Superior editing quality, fidelity, and controllability compared to previous methods, with no architectural modifications required, making it efficient and broadly applicable.

Conclusion: DVRF successfully bridges theoretical gaps between score-based diffusion and rectified-flow optimization, provides theoretical interpretation for existing methods, and delivers state-of-the-art performance in text-to-image editing tasks.

Abstract: We propose Delta Velocity Rectified Flow (DVRF), a novel inversion-free, path-aware editing framework within rectified flow models for text-to-image editing. DVRF is a distillation-based method that explicitly models the discrepancy between the source and target velocity fields in order to mitigate over-smoothing artifacts rampant in prior distillation sampling approaches. We further introduce a time-dependent shift term to push noisy latents closer to the target trajectory, enhancing the alignment with the target distribution. We theoretically demonstrate that when this shift is disabled, DVRF reduces to Delta Denoising Score, thereby bridging score-based diffusion optimization and velocity-based rectified-flow optimization. Moreover, when the shift term follows a linear schedule under rectified-flow dynamics, DVRF generalizes the Inversion-free method FlowEdit and provides a principled theoretical interpretation for it. Experimental results indicate that DVRF achieves superior editing quality, fidelity, and controllability while requiring no architectural modifications, making it efficient and broadly applicable to text-to-image editing tasks. Code is available at https://github.com/Harvard-AI-and-Robotics-Lab/DeltaVelocityRectifiedFlow.
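
A heavily simplified sketch of the central quantity described above: the difference between target- and source-conditioned velocity predictions, plus a time-dependent shift toward the target trajectory. The pseudo velocity model, learning rate, linear-in-t shift, and time schedule are all assumptions, not the paper's formulation.

```python
import numpy as np

def velocity(latent, t, prompt_seed):
    """Stand-in for a rectified-flow model's velocity prediction v(z_t, t, prompt);
    a real model is a learned network, seeded noise is used here only for illustration."""
    rng = np.random.default_rng(prompt_seed)
    direction = rng.normal(size=latent.shape)
    return direction - 0.1 * t * latent

def dvrf_step(latent, t, src_seed, tgt_seed, lr=0.1, shift_scale=0.05):
    """One illustrative editing step: move along the delta between target- and
    source-conditioned velocities, plus a time-dependent shift toward the target."""
    delta_v = velocity(latent, t, tgt_seed) - velocity(latent, t, src_seed)
    shift = shift_scale * t * velocity(latent, t, tgt_seed)   # linear-in-t shift (assumption)
    return latent + lr * (delta_v + shift)

z = np.zeros(16)                          # toy latent to be edited
for t in np.linspace(1.0, 0.0, 10):       # illustrative time schedule
    z = dvrf_step(z, t, src_seed=1, tgt_seed=2)
```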

[160] BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models

Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, Shanghang Zhang

Main category: cs.CV

TL;DR: BranchGRPO reduces computational costs and improves training stability for image/video generative models by introducing branch sampling, tree-based advantage estimation, and pruning strategies.

DetailsMotivation: Existing GRPO methods face high computational costs from on-policy rollouts and excessive SDE sampling steps, as well as training instability due to sparse rewards.

Method: Introduces branch sampling policy for SDE sampling process, shares computation across common prefixes, prunes low-reward paths and redundant depths, and uses tree-based advantage estimator with dense process-level rewards.

Result: Improves alignment scores by 16% over strong baselines while cutting training time by 50% in image and video preference alignment experiments.

Conclusion: BranchGRPO substantially lowers computational costs while maintaining or improving exploration diversity, making it an efficient solution for preference alignment in generative models.

Abstract: Recent advancements in aligning image and video generative models via GRPO have achieved remarkable gains in enhancing human preference alignment. However, these methods still face high computational costs from on-policy rollouts and excessive SDE sampling steps, as well as training instability due to sparse rewards. In this paper, we propose BranchGRPO, a novel method that introduces a branch sampling policy that updates the SDE sampling process. By sharing computation across common prefixes and pruning low-reward paths and redundant depths, BranchGRPO substantially lowers the per-update compute cost while maintaining or improving exploration diversity. This work makes three main contributions: (1) a branch sampling scheme that reduces rollout and training cost; (2) a tree-based advantage estimator incorporating dense process-level rewards; and (3) pruning strategies exploiting path and depth redundancy to accelerate convergence and boost performance. Experiments on image and video preference alignment show that BranchGRPO improves alignment scores by 16% over strong baselines, while cutting training time by 50%.
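
To illustrate branch sampling with pruning and a tree-based advantage in the simplest possible form (the data structures, toy reward, and pruning threshold are assumptions; the GRPO policy update itself is omitted):

```python
import numpy as np

def grow_tree(prefix, depth, branch_factor, rng):
    """Grow a rollout tree from a shared prefix; leaves get a toy terminal reward
    and inner nodes store the mean reward of their subtree."""
    if depth == 0:
        return {"prefix": prefix, "reward": float(rng.random()), "children": []}
    children = [grow_tree(prefix + [int(rng.integers(10))], depth - 1, branch_factor, rng)
                for _ in range(branch_factor)]
    return {"prefix": prefix,
            "reward": float(np.mean([c["reward"] for c in children])),
            "children": children}

def prune_low_reward(node, threshold):
    """Drop children whose subtree reward falls below the threshold."""
    node["children"] = [c for c in node["children"] if c["reward"] >= threshold]
    for child in node["children"]:
        prune_low_reward(child, threshold)
    return node

rng = np.random.default_rng(0)
tree = grow_tree(prefix=[], depth=3, branch_factor=2, rng=rng)
tree = prune_low_reward(tree, threshold=tree["reward"])      # prune below the root's mean reward

# Tree-based advantage (illustrative): a surviving branch's reward relative to its siblings.
sibling_mean = np.mean([c["reward"] for c in tree["children"]])
advantages = [c["reward"] - sibling_mean for c in tree["children"]]
```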

[161] RetinaGuard: Obfuscating Retinal Age in Fundus Images for Biometric Privacy Preserving

Zhengquan Luo, Chi Liu, Dongfu Xiao, Zhen Yu, Yueye Wang, Tianqing Zhu

Main category: cs.CV

TL;DR: RetinaGuard is a privacy-enhancing framework that protects retinal age information in fundus images while maintaining diagnostic utility through adversarial masking and knowledge distillation.

DetailsMotivation: AI-derived biomarkers like retinal age from medical images pose significant privacy risks, as unauthorized use could lead to bioinformation leakage and privacy breaches.

Method: Feature-level generative adversarial masking mechanism to obscure retinal age, combined with multiple-to-one knowledge distillation using retinal foundation model and diverse surrogate age encoders for universal defense.

Result: Successfully obfuscates retinal age prediction with minimal impact on image quality and pathological feature representation.

Conclusion: RetinaGuard provides effective privacy protection for retinal age data and can be extended to protect other medical image-derived biomarkers.

Abstract: The integration of AI with medical images enables the extraction of implicit image-derived biomarkers for a precise health assessment. Recently, retinal age, a biomarker predicted from fundus images, is a proven predictor of systemic disease risks, behavioral patterns, aging trajectory and even mortality. However, the capability to infer such sensitive biometric data raises significant privacy risks, where unauthorized use of fundus images could lead to bioinformation leakage, breaching individual privacy. In response, we formulate a new research problem of biometric privacy associated with medical images and propose RetinaGuard, a novel privacy-enhancing framework that employs a feature-level generative adversarial masking mechanism to obscure retinal age while preserving image visual quality and disease diagnostic utility. The framework further utilizes a novel multiple-to-one knowledge distillation strategy incorporating a retinal foundation model and diverse surrogate age encoders to enable a universal defense against black-box age prediction models. Comprehensive evaluations confirm that RetinaGuard successfully obfuscates retinal age prediction with minimal impact on image quality and pathological feature representation. RetinaGuard is also flexible for extension to other medical image-derived biomarkers.

[162] Detection of trade in products derived from threatened species using machine learning and a smartphone

Ritwik Kulkarni, WU Hanqin, Enrico Di Minin

Main category: cs.CV

TL;DR: Machine learning models developed to automatically detect illegal wildlife products (elephant ivory, pangolin scales, tiger parts) in images with 84.2% overall accuracy, deployed via smartphone app for real-time monitoring.

DetailsMotivation: Unsustainable wildlife trade is a major biodiversity threat, increasingly moving to digital marketplaces. Automated detection methods are needed to handle the large volume of digital content and identify wildlife products like ivory.

Method: Developed machine learning-based object recognition models using images of illegally sold/confiscated wildlife products. Tested various training strategies and loss functions, creating both species-specific models and a combined model for all three species.

Result: Best model achieved 84.2% overall accuracy with species-specific accuracies: 71.1% (elephant), 90.2% (pangolin), 93.5% (tiger). Smartphone app implementation showed 91.3% accuracy for real-time detection.

Conclusion: The method is effective for monitoring both online and physical wildlife trade, with practical deployment through smartphone applications for stakeholders like law enforcement agencies.

Abstract: Unsustainable trade in wildlife is a major threat to biodiversity and is now increasingly prevalent in digital marketplaces and social media. With the sheer volume of digital content, the need for automated methods to detect wildlife trade listings is growing. These methods are especially needed for the automatic identification of wildlife products, such as ivory. We developed machine learning-based object recognition models that can identify wildlife products within images and highlight them. The data consists of images of elephant, pangolin, and tiger products that were identified as being sold illegally or that were confiscated by authorities. Specifically, the wildlife products included elephant ivory and skins, pangolin scales, and claws (raw and crafted), and tiger skins and bones. We investigated various combinations of training strategies and two loss functions to identify the best model to use in the automatic detection of these wildlife products. Models were trained for each species while also developing a single model to identify products from all three species. The best model showed an overall accuracy of 84.2% with accuracies of 71.1%, 90.2% and 93.5% in detecting products derived from elephants, pangolins, and tigers, respectively. We further demonstrate that the machine learning model can be made easily available to stakeholders, such as government authorities and law enforcement agencies, by developing a smartphone-based application that had an overall accuracy of 91.3%. The application can be used in real time to capture images and help identify potentially prohibited products of target species. Thus, the proposed method is not only applicable for monitoring trade on the web but can also be used, for example, in physical markets for monitoring wildlife trade.

[163] Hybrid Swin Attention Networks for Simultaneously Low-Dose PET and CT Denoising

Yichao Liu, Hengzhi Xue, YueYang Teng

Main category: cs.CV

TL;DR: HSANet is a novel hybrid network for LDCT/PET denoising that uses Efficient Global Attention modules and hybrid upsampling to improve image quality while keeping model size small for practical clinical deployment.

DetailsMotivation: Low-dose CT and PET imaging reduce radiation exposure but introduce noise and artifacts that compromise diagnostic accuracy, creating a need for effective denoising methods.

Method: Hybrid Swin Attention Network (HSANet) with Efficient Global Attention modules for spatial/channel interaction and hybrid upsampling module to prevent overfitting to noise.

Result: HSANet achieves superior denoising performance compared to existing methods while maintaining lightweight model size suitable for standard GPU deployment.

Conclusion: The approach is highly practical for real-world clinical applications, providing effective denoising without requiring specialized hardware.

Abstract: Low-dose computed tomography (LDCT) and positron emission tomography (PET) have emerged as safer alternatives to conventional imaging modalities by significantly reducing radiation exposure. However, this reduction often results in increased noise and artifacts, which can compromise diagnostic accuracy. Consequently, denoising for LDCT/PET has become a vital area of research aimed at enhancing image quality while maintaining radiation safety. In this study, we introduce a novel Hybrid Swin Attention Network (HSANet), which incorporates Efficient Global Attention (EGA) modules and a hybrid upsampling module. The EGA modules enhance both spatial and channel-wise interaction, improving the network’s capacity to capture relevant features, while the hybrid upsampling module mitigates the risk of overfitting to noise. We validate the proposed approach using a publicly available LDCT/PET dataset. Experimental results demonstrate that HSANet achieves superior denoising performance compared to existing methods, while maintaining a lightweight model size suitable for deployment on GPUs with standard memory configurations. This makes our approach highly practical for real-world clinical applications.

[164] VIM-GS: Visual-Inertial Monocular Gaussian Splatting via Object-level Guidance in Large Scenes

Shengkai Zhang, Yuhe Liu, Guanjun Wu, Jianhua He, Xinggang Wang, Mozi Chen, Kezhong Liu

Main category: cs.CV

TL;DR: VIM-GS is a Gaussian Splatting framework that generates accurate depth from monocular images for novel-view synthesis in large scenes by combining sparse SfM depth with coarse foundation model depth.

DetailsMotivation: Traditional Gaussian Splatting requires accurate depth from RGB-D/stereo cameras with limited range, while monocular images lack depth guidance. Large foundation models for monocular depth estimation suffer from inconsistency, inaccuracy for distant scenes, and texture ambiguity.

Method: Leverages sparse but accurate depth from visual-inertial SfM to refine dense but coarse depth from large foundation models. Uses object-segmented depth propagation algorithm and dynamic depth refinement module to handle sparse inputs and dynamic objects.

Result: Superior rendering quality demonstrated on public and customized datasets for large scenes.

Conclusion: VIM-GS effectively bridges the gap between sparse SfM depth and dense foundation model depth, enabling high-quality Gaussian Splatting rendering from monocular images in large-scale environments.

Abstract: VIM-GS is a Gaussian Splatting (GS) framework using monocular images for novel-view synthesis (NVS) in large scenes. GS typically requires accurate depth to initiate Gaussian ellipsoids using RGB-D/stereo cameras. Their limited depth sensing range makes it difficult for GS to work in large scenes. Monocular images, however, lack depth to guide the learning and lead to inferior NVS results. Although large foundation models (LFMs) for monocular depth estimation are available, they suffer from cross-frame inconsistency, inaccuracy for distant scenes, and ambiguity in deceptive texture cues. This paper aims to generate dense, accurate depth images from monocular RGB inputs for high-definition GS rendering. The key idea is to leverage the accurate but sparse depth from visual-inertial Structure-from-Motion (SfM) to refine the dense but coarse depth from LFMs. To bridge the sparse input and dense output, we propose an object-segmented depth propagation algorithm that renders the depth of pixels of structured objects. Then we develop a dynamic depth refinement module to handle the degraded SfM depth of dynamic objects and refine the coarse LFM depth. Experiments using public and customized datasets demonstrate the superior rendering quality of VIM-GS in large scenes.
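
One common way to combine sparse, accurate SfM depth with dense but coarse monocular depth is a per-image least-squares scale-and-offset fit, sketched below. This is an assumption for illustration and is simpler than the object-segmented propagation and dynamic refinement modules described above.

```python
import numpy as np

def align_dense_to_sparse(dense_depth, sparse_depth, sparse_mask):
    """Fit sparse ≈ a * dense + b on the pixels that have SfM depth, then apply
    the fit to the whole dense map so it matches the SfM (metric) scale."""
    d = dense_depth[sparse_mask]
    s = sparse_depth[sparse_mask]
    A = np.stack([d, np.ones_like(d)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, s, rcond=None)
    return a * dense_depth + b

rng = np.random.default_rng(0)
true_depth = rng.uniform(1.0, 50.0, size=(120, 160))
dense_lfm = 0.6 * true_depth + 2.0 + rng.normal(scale=0.5, size=true_depth.shape)  # coarse, biased
sparse_mask = rng.random(true_depth.shape) < 0.01        # ~1% of pixels carry SfM depth
sparse_sfm = np.where(sparse_mask, true_depth, 0.0)      # accurate where available

refined = align_dense_to_sparse(dense_lfm, sparse_sfm, sparse_mask)
```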

[165] P3-SAM: Native 3D Part Segmentation

Changfeng Ma, Yang Li, Xinhao Yan, Jiachen Xu, Yunhan Yang, Chunshi Wang, Zibo Zhao, Yanwen Guo, Zhuo Chen, Chunchao Guo

Main category: cs.CV

TL;DR: P3-SAM is a native 3D point-promptable part segmentation model that automates 3D object segmentation into components, achieving state-of-the-art performance on complex objects.

DetailsMotivation: Current 3D part segmentation methods have poor robustness with complex objects and lack full automation, limiting their practical applications in 3D understanding and model reuse.

Method: Inspired by SAM, P3-SAM uses a feature extractor, multiple segmentation heads, and an IoU predictor for interactive segmentation, plus an algorithm for automatic mask selection and merging.

Result: The model achieves precise segmentation results and strong robustness on complex objects, attaining state-of-the-art performance when trained on a dataset of 3.7 million models.

Conclusion: P3-SAM provides a fully automated solution for 3D part segmentation that handles complex objects effectively and outperforms existing methods.

Abstract: Segmenting 3D assets into their constituent parts is crucial for enhancing 3D understanding, facilitating model reuse, and supporting various applications such as part generation. However, current methods face limitations such as poor robustness when dealing with complex objects and cannot fully automate the process. In this paper, we propose a native 3D point-promptable part segmentation model termed P3-SAM, designed to fully automate the segmentation of any 3D objects into components. Inspired by SAM, P3-SAM consists of a feature extractor, multiple segmentation heads, and an IoU predictor, enabling interactive segmentation for users. We also propose an algorithm to automatically select and merge masks predicted by our model for part instance segmentation. Our model is trained on a newly built dataset containing nearly 3.7 million models with reasonable segmentation labels. Comparisons show that our method achieves precise segmentation results and strong robustness on any complex objects, attaining state-of-the-art performance. Our code will be released soon.
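
The automatic mask selection and merging step can be illustrated with a greedy, IoU-predictor-ranked selection over candidate point masks (an NMS-style stand-in; the overlap threshold and candidate generation are assumptions, not the paper's algorithm):

```python
import numpy as np

def mask_overlap(a, b):
    """Intersection-over-union between two boolean point masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def select_and_merge(masks, predicted_iou, overlap_thresh=0.5):
    """Greedy selection: visit masks in order of predicted IoU and keep those
    that do not heavily overlap an already selected mask."""
    order = np.argsort(predicted_iou)[::-1]
    selected = []
    for i in order:
        if all(mask_overlap(masks[i], masks[j]) < overlap_thresh for j in selected):
            selected.append(i)
    return selected

rng = np.random.default_rng(0)
num_points = 2048                                          # points on a toy 3D object
masks = [rng.random(num_points) > 0.7 for _ in range(6)]   # candidate part masks
predicted_iou = rng.random(6)                              # stand-in for the IoU predictor head
kept = select_and_merge(masks, predicted_iou)
```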

[166] Moment- and Power-Spectrum-Based Gaussianity Regularization for Text-to-Image Models

Jisung Hwang, Jaihoon Kim, Minhyuk Sung

Main category: cs.CV

TL;DR: Novel regularization loss that enforces standard Gaussian distribution in latent space of text-to-image models, combining moment-based spatial regularization with power spectrum-based spectral regularization for improved downstream optimization tasks.

DetailsMotivation: To facilitate downstream tasks involving optimization in the latent space of text-to-image models by ensuring samples align with a standard Gaussian distribution, which enables better performance in applications like test-time reward alignment.

Method: Treat high-dimensional sample elements as 1D standard Gaussian variables, define composite loss combining moment-based regularization (spatial domain) and power spectrum-based regularization (spectral domain) using analytically known expected values. Apply losses to randomly permuted inputs for permutation invariance.

Result: Outperforms previous Gaussianity regularization methods, effectively prevents reward hacking, accelerates convergence, and shows superior performance in generative modeling for test-time reward alignment (enhancing aesthetics and text alignment).

Conclusion: The proposed unified framework for Gaussianity regularization provides an efficient and effective approach that encompasses existing methods while offering improved performance and computational efficiency for latent space optimization in text-to-image models.

Abstract: We propose a novel regularization loss that enforces standard Gaussianity, encouraging samples to align with a standard Gaussian distribution. This facilitates a range of downstream tasks involving optimization in the latent space of text-to-image models. We treat elements of a high-dimensional sample as one-dimensional standard Gaussian variables and define a composite loss that combines moment-based regularization in the spatial domain with power spectrum-based regularization in the spectral domain. Since the expected values of moments and power spectrum distributions are analytically known, the loss promotes conformity to these properties. To ensure permutation invariance, the losses are applied to randomly permuted inputs. Notably, existing Gaussianity-based regularizations fall within our unified framework: some correspond to moment losses of specific orders, while the previous covariance-matching loss is equivalent to our spectral loss but incurs higher time complexity due to its spatial-domain computation. We showcase the application of our regularization in generative modeling for test-time reward alignment with a text-to-image model, specifically to enhance aesthetics and text alignment. Our regularization outperforms previous Gaussianity regularization, effectively prevents reward hacking and accelerates convergence.
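
A minimal sketch of a composite Gaussianity loss of the kind described: a moment term matching the first four moments of N(0, 1) in the spatial domain and a spectral term matching the flat expected power spectrum of white Gaussian noise, applied to a randomly permuted flattened sample. The moment orders, equal weighting, and use of a single sample (rather than a batch average) are illustrative assumptions.

```python
import numpy as np

def gaussianity_loss(sample, rng):
    """Penalty for deviating from a standard Gaussian, applied to a randomly
    permuted, flattened sample.

    Moment term: match the first four moments of N(0, 1), i.e. 0, 1, 0, 3.
    Spectral term: match the flat expected power spectrum of white Gaussian
    noise, E[|FFT(x)_k|^2 / n] = 1. For a single sample the periodogram is
    noisy; averaging over a batch would reduce the variance of this term.
    """
    x = rng.permutation(sample.ravel())               # permutation invariance
    target_moments = {1: 0.0, 2: 1.0, 3: 0.0, 4: 3.0}
    moment_loss = sum((np.mean(x ** k) - m) ** 2 for k, m in target_moments.items())

    power = np.abs(np.fft.rfft(x)) ** 2 / x.size      # periodogram of the permuted sample
    spectral_loss = np.mean((power - 1.0) ** 2)
    return float(moment_loss + spectral_loss)

rng = np.random.default_rng(0)
good = rng.standard_normal(4096)                      # already close to N(0, 1)
bad = 0.5 * rng.standard_normal(4096) + 1.0           # shifted and shrunk latent
print(gaussianity_loss(good, rng), gaussianity_loss(bad, rng))
```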

[167] TextlessRAG: End-to-End Visual Document RAG by Speech Without Text

Peijin Xie, Shun Qian, Bingquan Liu, Dexin Wang, Lin Sun, Xiangzheng Zhang

Main category: cs.CV

TL;DR: TextlessRAG is the first end-to-end framework for speech-based question answering over document images that eliminates ASR, TTS and OCR, using a fully textless pipeline with layout-aware reranking.

DetailsMotivation: Document images contain rich knowledge and spoken queries offer flexible application scenarios, but no prior work has explored speech-based question answering over visual document images.

Method: End-to-end framework that directly interprets speech, retrieves relevant visual knowledge, and generates answers without ASR, TTS or OCR. Includes layout-aware reranking mechanism to refine retrieval.

Result: Experiments show substantial improvements in both efficiency and accuracy compared to prior methods.

Conclusion: The framework advances speech-document QA and includes release of the first bilingual speech-document RAG dataset with Chinese and English voice queries paired with multimodal document content.

Abstract: Document images encapsulate a wealth of knowledge, while the portability of spoken queries enables broader and flexible application scenarios. Yet, no prior work has explored knowledge base question answering over visual document images with queries provided directly in speech. We propose TextlessRAG, the first end-to-end framework for speech-based question answering over large-scale document images. Unlike prior methods, TextlessRAG eliminates ASR, TTS and OCR, directly interpreting speech, retrieving relevant visual knowledge, and generating answers in a fully textless pipeline. To further boost performance, we integrate a layout-aware reranking mechanism to refine retrieval. Experiments demonstrate substantial improvements in both efficiency and accuracy. To advance research in this direction, we also release the first bilingual speech–document RAG dataset, featuring Chinese and English voice queries paired with multimodal document content. Both the dataset and our pipeline will be made available at the repository: https://github.com/xiepeijinhit-hue/textlessrag

[168] Nearest Neighbor Projection Removal Adversarial Training

Himanshu Singh, A. V. Subramanyam, Shivank Rajput, Mohan Kankanhalli

Main category: cs.CV

TL;DR: Novel adversarial training framework that reduces inter-class feature overlap by projecting out inter-class dependencies from feature space, improving robustness and generalization.

DetailsMotivation: Standard adversarial training fails to address inter-class feature overlap, which is a significant contributor to adversarial vulnerability in deep neural networks.

Method: Identify nearest inter-class neighbors for each adversarial sample and remove projections onto these neighbors to enforce feature separability, with theoretical analysis showing reduced Lipschitz constant and Rademacher complexity.

Result: Strong performance competitive with leading adversarial training techniques on CIFAR-10, CIFAR-100, and SVHN benchmarks, achieving improvements in both robust and clean accuracy.

Conclusion: Explicitly addressing inter-class feature proximity is crucial for enhancing adversarial robustness in deep neural networks.

Abstract: Deep neural networks have exhibited impressive performance in image classification tasks but remain vulnerable to adversarial examples. Standard adversarial training enhances robustness but typically fails to explicitly address inter-class feature overlap, a significant contributor to adversarial susceptibility. In this work, we introduce a novel adversarial training framework that actively mitigates inter-class proximity by projecting out inter-class dependencies from adversarial and clean samples in the feature space. Specifically, our approach first identifies the nearest inter-class neighbors for each adversarial sample and subsequently removes projections onto these neighbors to enforce stronger feature separability. Theoretically, we demonstrate that our proposed logits correction reduces the Lipschitz constant of neural networks, thereby lowering the Rademacher complexity, which directly contributes to improved generalization and robustness. Extensive experiments across standard benchmarks including CIFAR-10, CIFAR-100, and SVHN show that our method demonstrates strong performance that is competitive with leading adversarial training techniques, highlighting significant achievements in both robust and clean accuracy. Our findings reveal the importance of addressing inter-class feature proximity explicitly to bolster adversarial robustness in DNNs.
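
The core projection-removal step can be sketched as below (the feature space, neighbor search, and batch-wise application are simplified assumptions; the paper additionally ties this to a logits correction and an adversarial training loop not shown here):

```python
import numpy as np

def remove_inter_class_projection(features, labels):
    """For each feature, subtract its projection onto its nearest feature from a
    different class, encouraging inter-class separability in feature space."""
    out = features.copy()
    for i, (f, y) in enumerate(zip(features, labels)):
        other = features[labels != y]                      # candidates from other classes
        if other.shape[0] == 0:
            continue
        nearest = other[np.argmin(np.linalg.norm(other - f, axis=1))]
        n = nearest / (np.linalg.norm(nearest) + 1e-8)     # unit direction of the neighbor
        out[i] = f - (f @ n) * n                           # remove the component along it
    return out

rng = np.random.default_rng(0)
feats = rng.normal(size=(32, 64))            # e.g. penultimate-layer features of one batch
labels = rng.integers(0, 10, size=32)
separated = remove_inter_class_projection(feats, labels)
```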

cs.AI

[169] Learning-Based Planning for Improving Science Return of Earth Observation Satellites

Abigail Breitfeld, Alberto Candela, Juan Delfa, Akseli Kangaslahti, Itai Zilberstein, Steve Chien, David Wettergreen

Main category: cs.AI

TL;DR: Learning-based dynamic targeting methods for Earth observation satellites outperform traditional heuristic approaches, with reinforcement learning achieving 13.7% better performance and imitation learning 10.0% better in optimizing data collection.

DetailsMotivation: Earth observing satellites have limitations in orbital flexibility, sensor field of view, and resource allocation. They need to optimize data collection by focusing on the most informative measurements, which requires intelligent targeting strategies.

Method: Two learning-based approaches: reinforcement learning and imitation learning, building on a dynamic programming solution to plan sampling location sequences. These methods use satellite resources and lookahead instrument data to intelligently reconfigure primary instruments.

Result: Both learning methods significantly outperform existing heuristic methods - imitation learning performs 10.0% better than the best heuristic, while reinforcement learning performs 13.7% better. Both can be trained effectively with relatively small amounts of data.

Conclusion: Learning-based approaches, particularly reinforcement learning, provide superior performance for dynamic targeting in Earth observation satellites, enabling more efficient and informative data collection compared to conventional methods.

Abstract: Earth observing satellites are powerful tools for collecting scientific information about our planet, however they have limitations: they cannot easily deviate from their orbital trajectories, their sensors have a limited field of view, and pointing and operating these sensors can take a large amount of the spacecraft’s resources. It is important for these satellites to optimize the data they collect and include only the most important or informative measurements. Dynamic targeting is an emerging concept in which satellite resources and data from a lookahead instrument are used to intelligently reconfigure and point a primary instrument. Simulation studies have shown that dynamic targeting increases the amount of scientific information gathered versus conventional sampling strategies. In this work, we present two different learning-based approaches to dynamic targeting, using reinforcement and imitation learning, respectively. These learning methods build on a dynamic programming solution to plan a sequence of sampling locations. We evaluate our approaches against existing heuristic methods for dynamic targeting, showing the benefits of using learning for this application. Imitation learning performs on average 10.0% better than the best heuristic method, while reinforcement learning performs on average 13.7% better. We also show that both learning methods can be trained effectively with relatively small amounts of data.

[170] EnvX: Agentize Everything with Agentic AI

Linyao Chen, Zimian Peng, Yingxuan Yang, Yikun Wang, Wenzheng Tom Tang, Hiroki H. Kobayashi, Weinan Zhang

Main category: cs.AI

TL;DR: EnvX is an AI framework that transforms GitHub repositories into intelligent agents capable of natural language interaction and collaboration, automating software reuse through TODO-guided initialization, human-aligned automation, and A2A protocols.

DetailsMotivation: Current software reuse from open-source repositories is manual, error-prone, and disconnected, requiring developers to navigate documentation, understand APIs, and write integration code, creating significant barriers to efficient software component utilization.

Method: EnvX uses Agentic AI to agentize repositories through a three-phase process: TODO-guided environment initialization (setting up dependencies and validation), human-aligned agentic automation (autonomous task performance), and Agent-to-Agent (A2A) protocol for multi-agent collaboration, combining LLMs with structured tool integration.

Result: EnvX achieves 74.07% execution completion rate and 51.85% task pass rate on GitTaskBench benchmark with 18 repositories across image processing, speech recognition, document analysis, and video manipulation domains, outperforming existing frameworks and enabling multi-repository collaboration.

Conclusion: EnvX represents a paradigm shift from treating repositories as passive code resources to intelligent, interactive agents, significantly improving accessibility and collaboration within the open-source ecosystem by automating the entire process of understanding, initializing, and operationalizing repository functionality.

Abstract: The widespread availability of open-source repositories has led to a vast collection of reusable software components, yet their utilization remains manual, error-prone, and disconnected. Developers must navigate documentation, understand APIs, and write integration code, creating significant barriers to efficient software reuse. To address this, we present EnvX, a framework that leverages Agentic AI to agentize GitHub repositories, transforming them into intelligent, autonomous agents capable of natural language interaction and inter-agent collaboration. Unlike existing approaches that treat repositories as static code resources, EnvX reimagines them as active agents through a three-phase process: (1) TODO-guided environment initialization, which sets up the necessary dependencies, data, and validation datasets; (2) human-aligned agentic automation, allowing repository-specific agents to autonomously perform real-world tasks; and (3) Agent-to-Agent (A2A) protocol, enabling multiple agents to collaborate. By combining large language model capabilities with structured tool integration, EnvX automates not just code generation, but the entire process of understanding, initializing, and operationalizing repository functionality. We evaluate EnvX on the GitTaskBench benchmark, using 18 repositories across domains such as image processing, speech recognition, document analysis, and video manipulation. Our results show that EnvX achieves a 74.07% execution completion rate and 51.85% task pass rate, outperforming existing frameworks. Case studies further demonstrate EnvX’s ability to enable multi-repository collaboration via the A2A protocol. This work marks a shift from treating repositories as passive code resources to intelligent, interactive agents, fostering greater accessibility and collaboration within the open-source ecosystem.

[171] Trust Semantics Distillation for Collaborator Selection via Memory-Augmented Agentic AI

Botao Zhu, Jeslyn Wang, Dusit Niyato, Xianbin Wang

Main category: cs.AI

TL;DR: A trust evaluation model using AI-driven teacher-student architecture to efficiently select collaborators for computing tasks, reducing evaluation time and improving accuracy.

DetailsMotivation: Traditional independent trust evaluation by each task owner causes significant overhead due to frequent data exchange, complex reasoning, and dynamic changes, leading to deteriorated trust assessment.

Method: Proposes a 2TSD model with LAM-driven teacher-student architecture. Teacher agent on server collects multidimensional trust data, extracts task-specific trust semantics, and performs matching analysis. Student agents receive distilled trust semantics for rapid collaborator selection.

Result: Experimental results show reduced collaborator evaluation time, decreased device resource consumption, and improved accuracy of collaborator selection.

Conclusion: The 2TSD model effectively addresses trust evaluation challenges in collaborative computing by leveraging AI-driven semantics distillation, enabling efficient and accurate collaborator selection with reduced overhead.

Abstract: Accurate trustworthiness evaluation of potential collaborating devices is essential for the effective execution of complex computing tasks. This evaluation process involves collecting diverse trust-related data from potential collaborators, including historical performance and available resources, for collaborator selection. However, when each task owner independently assesses all collaborators’ trustworthiness, frequent data exchange, complex reasoning, and dynamic situation changes can result in significant overhead and deteriorated trust evaluation. To overcome these challenges, we propose a task-specific trust semantics distillation (2TSD) model based on a large AI model (LAM)-driven teacher-student agent architecture. The teacher agent is deployed on a server with powerful computational capabilities and an augmented memory module dedicated to multidimensional trust-related data collection, task-specific trust semantics extraction, and task-collaborator matching analysis. Upon receiving task-specific requests from device-side student agents, the teacher agent transfers the trust semantics of potential collaborators to the student agents, enabling rapid and accurate collaborator selection. Experimental results demonstrate that the proposed 2TSD model can reduce collaborator evaluation time, decrease device resource consumption, and improve the accuracy of collaborator selection.

[172] Exploratory Retrieval-Augmented Planning For Continual Embodied Instruction Following

Minjong Yoo, Jinwoo Jang, Wei-jin Park, Honguk Woo

Main category: cs.AI

TL;DR: ExRAP framework enhances LLM-based embodied agents for continual instruction following in dynamic environments through exploration-augmented planning and memory-based query evaluation.

DetailsMotivation: To address the challenge of embodied agents performing continual instruction following tasks in dynamic, non-stationary environments where traditional LLMs struggle with grounding task planning in time-varying contexts.

Method: Proposes Exploratory Retrieval-Augmented Planning (ExRAP) framework that combines information-based exploration with LLM-based planning, uses environmental context memory, and implements temporal consistency refinement for query evaluation.

Result: Demonstrates robustness across various embodied instruction following scenarios in VirtualHome, ALFRED, and CARLA, consistently outperforming state-of-the-art LLM-based approaches in goal success rate and execution efficiency.

Conclusion: ExRAP effectively balances environmental context memory validity with exploration load, addresses knowledge decay in memory, and significantly improves embodied agent performance in dynamic environments.

Abstract: This study presents an Exploratory Retrieval-Augmented Planning (ExRAP) framework, designed to tackle continual instruction following tasks of embodied agents in dynamic, non-stationary environments. The framework enhances Large Language Models’ (LLMs) embodied reasoning capabilities by efficiently exploring the physical environment and establishing the environmental context memory, thereby effectively grounding the task planning process in time-varying environment contexts. In ExRAP, given multiple continual instruction following tasks, each instruction is decomposed into queries on the environmental context memory and task executions conditioned on the query results. To efficiently handle these multiple tasks that are performed continuously and simultaneously, we implement an exploration-integrated task planning scheme by incorporating information-based exploration into the LLM-based planning process. Combined with memory-augmented query evaluation, this integrated scheme not only allows for a better balance between the validity of the environmental context memory and the load of environment exploration, but also improves overall task performance. Furthermore, we devise a temporal consistency refinement scheme for query evaluation to address the inherent decay of knowledge in the memory. Through experiments with VirtualHome, ALFRED, and CARLA, our approach demonstrates robustness against a variety of embodied instruction following scenarios involving different instruction scales and types, and non-stationarity degrees, and it consistently outperforms other state-of-the-art LLM-based task planning approaches in terms of both goal success rate and execution efficiency.
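
A toy illustration of knowledge decay in an environmental context memory: each entry carries a confidence that decays with the time since it was last observed, so stale facts can trigger re-exploration before planning relies on them. The exponential half-life form and key naming are assumptions, not ExRAP's exact scheme.

```python
import math
import time

class ContextMemory:
    """Toy environmental context memory with exponentially decaying confidence."""

    def __init__(self, half_life_s=60.0):
        self.half_life_s = half_life_s
        self.entries = {}                       # key -> (value, timestamp)

    def write(self, key, value, now=None):
        self.entries[key] = (value, now if now is not None else time.time())

    def query(self, key, now=None):
        if key not in self.entries:
            return None, 0.0
        value, t = self.entries[key]
        age = (now if now is not None else time.time()) - t
        confidence = math.exp(-math.log(2.0) * age / self.half_life_s)
        return value, confidence

memory = ContextMemory(half_life_s=60.0)
memory.write("kitchen.apple_on_table", True, now=0.0)
value, conf = memory.query("kitchen.apple_on_table", now=120.0)
# conf is about 0.25 after two half-lives; a low-confidence answer could trigger
# re-exploration of the kitchen before the planner acts on this fact.
```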

[173] Real-world Music Plagiarism Detection With Music Segment Transcription System

Seonghyeon Go

Main category: cs.AI

TL;DR: A music plagiarism detection system combining MIR technologies to extract musically meaningful segments and compute similarity scores across different formats, with promising results and a publicly available dataset.

DetailsMotivation: Growing interest in music intellectual property protection due to advances in Music Information Retrieval technology and increased accessibility of music generation/distribution.

Method: Developed music segment transcription system that extracts meaningful segments from audio, computes similarity scores based on multiple musical features through comprehensive musical analysis.

Result: Demonstrated promising results in music plagiarism detection experiments, applicable to real-world music scenarios. Created and publicly released Similar Music Pair (SMP) dataset for musical similarity research.

Conclusion: The proposed system effectively detects music plagiarism across different formats using MIR technologies and provides a valuable dataset for future research in musical similarity analysis.

Abstract: As a result of continuous advances in Music Information Retrieval (MIR) technology, generating and distributing music has become more diverse and accessible. In this context, interest in music intellectual property protection is increasing to safeguard individual music copyrights. In this work, we propose a system for detecting music plagiarism by combining various MIR technologies. We developed a music segment transcription system that extracts musically meaningful segments from audio recordings to detect plagiarism across different musical formats. With this system, we compute similarity scores based on multiple musical features that can be evaluated through comprehensive musical analysis. Our approach demonstrated promising results in music plagiarism detection experiments, and the proposed method can be applied to real-world music scenarios. We also collected a Similar Music Pair (SMP) dataset for musical similarity research using real-world cases. The dataset is publicly available.
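
A minimal sketch of combining per-feature similarities between two extracted segments into a single score (the feature names, vector dimensions, weights, and flagging threshold are illustrative assumptions, not the paper's features or calibration):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def segment_similarity(seg_a, seg_b, weights=None):
    """Combine per-feature similarities of two musically meaningful segments
    into one weighted score."""
    weights = weights or {"melody": 0.5, "chroma": 0.3, "rhythm": 0.2}
    return sum(w * cosine(seg_a[k], seg_b[k]) for k, w in weights.items())

rng = np.random.default_rng(0)
segment_a = {k: rng.random(12) for k in ("melody", "chroma", "rhythm")}
segment_b = {k: v + 0.05 * rng.random(12) for k, v in segment_a.items()}   # near-duplicate
score = segment_similarity(segment_a, segment_b)
flagged = score > 0.9        # decision threshold is an assumption
```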

[174] Narrative-Guided Reinforcement Learning: A Platform for Studying Language Model Influence on Decision Making

Anup Tuladhar, Araz Minhas, Adam Kirton, Eli Kinney-Lang

Main category: cs.AI

TL;DR: A dual-system platform combining RL and language models to study how narrative frameworks influence AI decision-making in a configurable gridworld environment.

DetailsMotivation: To bridge the gap between AI's decision-making capabilities and narrative reasoning, exploring how narrative elements can shape reward-based learning.

Method: Dual-system architecture with RL policy for action suggestions and language model processing through narrative frameworks, implemented in configurable gridworld with modular design for controlled testing.

Result: Preliminary implementation providing foundation for studying narrative effects on decision-making and interactions between optimization-based learning and symbolic reasoning.

Conclusion: The platform enables initial experimentation with narrative elements while maintaining consistent environment structures, facilitating future research on narrative-AI interactions.

Abstract: We present a preliminary experimental platform that explores how narrative elements might shape AI decision-making by combining reinforcement learning (RL) with language model reasoning. While AI systems can now both make decisions and engage in narrative reasoning, these capabilities have mostly been studied separately. Our platform attempts to bridge this gap using a dual-system architecture to examine how narrative frameworks could influence reward-based learning. The system comprises a reinforcement learning policy that suggests actions based on past experience, and a language model that processes these suggestions through different narrative frameworks to guide decisions. This setup enables initial experimentation with narrative elements while maintaining consistent environment and reward structures. We implement this architecture in a configurable gridworld environment, where agents receive both policy suggestions and information about their surroundings. The platform’s modular design facilitates controlled testing of environmental complexity, narrative parameters, and the interaction between reinforcement learning and narrative-based decisions. Our logging system captures basic decision metrics, from RL policy values to language model reasoning to action selection patterns. While preliminary, this implementation provides a foundation for studying how different narrative frameworks might affect reward-based decisions and exploring potential interactions between optimization-based learning and symbolic reasoning in AI systems.

[175] Leveraging AI Agents for Autonomous Networks: A Reference Architecture and Empirical Studies

Binghan Wu, Shoufeng Wang, Yunxin Liu, Ya-Qin Zhang, Joseph Sifakis, Ye Ouyang

Main category: cs.AI

TL;DR: Implementation of cognitive autonomous network agent architecture achieving 6% higher throughput and 67% BLER reduction in 5G RAN through real-time MCS optimization.

DetailsMotivation: Bridging the gap between architectural theory and operational reality for Level 4 autonomous networks that require self-configuring, self-healing, and self-optimizing capabilities.

Method: Implemented Joseph Sifakis’s AN Agent reference architecture with coordinated proactive-reactive runtimes using hybrid knowledge representation, tested through RAN Link Adaptation Agent case study.

Result: Achieved sub-10 ms real-time control in 5G NR sub-6 GHz, 6% higher downlink throughput than OLLA algorithms, and 67% BLER reduction for ultra-reliable services.

Conclusion: The framework demonstrates transformative potential in overcoming traditional autonomy barriers and advancing critical L4-enabling capabilities for next-generation networks.

Abstract: The evolution toward Level 4 (L4) Autonomous Networks (AN) represents a strategic inflection point in telecommunications, where networks must transcend reactive automation to achieve genuine cognitive capabilities–fulfilling TM Forum’s vision of self-configuring, self-healing, and self-optimizing systems that deliver zero-wait, zero-touch, and zero-fault services. This work bridges the gap between architectural theory and operational reality by implementing Joseph Sifakis’s AN Agent reference architecture in a functional cognitive system, deploying coordinated proactive-reactive runtimes driven by hybrid knowledge representation. Through an empirical case study of a Radio Access Network (RAN) Link Adaptation (LA) Agent, we validate this framework’s transformative potential: demonstrating sub-10 ms real-time control in 5G NR sub-6 GHz while achieving 6% higher downlink throughput than Outer Loop Link Adaptation (OLLA) algorithms and 67% Block Error Rate (BLER) reduction for ultra-reliable services through dynamic Modulation and Coding Scheme (MCS) optimization. These improvements confirm the architecture’s viability in overcoming traditional autonomy barriers and advancing critical L4-enabling capabilities toward next-generation objectives.

[176] Co-Investigator AI: The Rise of Agentic AI for Smarter, Trustworthy AML Compliance Narratives

Prathamesh Vasudeo Naik, Naresh Kumar Dintakurthi, Zhanghao Hu, Yue Wang, Robby Qiu

Main category: cs.AI

TL;DR: Co-Investigator AI is an agentic framework that automates Suspicious Activity Report generation with improved speed and accuracy while maintaining regulatory compliance and human oversight.

DetailsMotivation: Traditional SAR generation is costly and hard to scale, while LLMs suffer from factual hallucination, limited crime typology alignment, and poor explainability in compliance-critical domains.

Method: Agentic framework with specialized agents for planning, crime type detection, external intelligence gathering, compliance validation, dynamic memory management, AI-Privacy Guard, and real-time validation using Agent-as-a-Judge paradigm.

Result: Produces SARs significantly faster and with greater accuracy than traditional methods, streamlines drafting, aligns narratives with regulatory expectations, and enables compliance teams to focus on higher-order analytical work.

Conclusion: Marks the beginning of a new era in compliance reporting, bringing AI agent benefits to regulatory processes for scalable, reliable, and transparent SAR generation with human-in-the-loop collaboration.

Abstract: Generating regulatorily compliant Suspicious Activity Report (SAR) remains a high-cost, low-scalability bottleneck in Anti-Money Laundering (AML) workflows. While large language models (LLMs) offer promising fluency, they suffer from factual hallucination, limited crime typology alignment, and poor explainability – posing unacceptable risks in compliance-critical domains. This paper introduces Co-Investigator AI, an agentic framework optimized to produce Suspicious Activity Reports (SARs) significantly faster and with greater accuracy than traditional methods. Drawing inspiration from recent advances in autonomous agent architectures, such as the AI Co-Scientist, our approach integrates specialized agents for planning, crime type detection, external intelligence gathering, and compliance validation. The system features dynamic memory management, an AI-Privacy Guard layer for sensitive data handling, and a real-time validation agent employing the Agent-as-a-Judge paradigm to ensure continuous narrative quality assurance. Human investigators remain firmly in the loop, empowered to review and refine drafts in a collaborative workflow that blends AI efficiency with domain expertise. We demonstrate the versatility of Co-Investigator AI across a range of complex financial crime scenarios, highlighting its ability to streamline SAR drafting, align narratives with regulatory expectations, and enable compliance teams to focus on higher-order analytical work. This approach marks the beginning of a new era in compliance reporting – bringing the transformative benefits of AI agents to the core of regulatory processes and paving the way for scalable, reliable, and transparent SAR generation.

[177] TCPO: Thought-Centric Preference Optimization for Effective Embodied Decision-making

Kechen Jiao, Zhirui Fang, Jiahao Liu, Bei Li, Qifan Wang, Xinyu Liu, Junhao Ruan, Zhongjian Qiao, Yifan Zhu, Yaxin Xu, Jingang Wang, Xiu Li

Main category: cs.AI

TL;DR: TCPO method improves embodied AI decision-making by aligning intermediate reasoning processes with stepwise preference optimization and action consistency constraints, achieving 26.67% success rate in ALFWorld.

DetailsMotivation: Vision language models struggle with sluggish responses and hallucinations in dynamic embodied tasks, and existing post-SFT methods suffer from sparse rewards, poor consistency, and model degradation.

Method: Proposes Thought-Centric Preference Optimization (TCPO) with stepwise preference-based optimization to transform sparse rewards into richer step samples, plus Action Policy Consistency Constraint (APC) for output consistency.

Result: Achieves 26.67% average success rate in ALFWorld environment, representing a 6% improvement over RL4VLM baseline, while mitigating model degradation issues.

Conclusion: TCPO demonstrates the effectiveness of integrating preference-based learning with chain-of-thought processes for enhancing embodied decision-making capabilities in vision-language models.

Abstract: Using effective generalization capabilities of vision language models (VLMs) in context-specific dynamic tasks for embodied artificial intelligence remains a significant challenge. Although supervised fine-tuned models can better align with the real physical world, they still exhibit sluggish responses and hallucination issues in dynamically changing environments, necessitating further alignment. Existing post-SFT methods, reliant on reinforcement learning and chain-of-thought (CoT) approaches, are constrained by sparse rewards and action-only optimization, resulting in low sample efficiency, poor consistency, and model degradation. To address these issues, this paper proposes Thought-Centric Preference Optimization (TCPO) for effective embodied decision-making. Specifically, TCPO introduces a stepwise preference-based optimization approach, transforming sparse reward signals into richer step sample pairs. It emphasizes the alignment of the model’s intermediate reasoning process, mitigating the problem of model degradation. Moreover, by incorporating Action Policy Consistency Constraint (APC), it further imposes consistency constraints on the model output. Experiments in the ALFWorld environment demonstrate an average success rate of 26.67%, achieving a 6% improvement over RL4VLM and validating the effectiveness of our approach in mitigating model degradation after fine-tuning. These results highlight the potential of integrating preference-based learning techniques with CoT processes to enhance the decision-making capabilities of vision-language models in embodied agents.
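
The abstract specifies a stepwise preference objective plus an action-consistency term but gives no equations; the sketch below shows one plausible form, pairing a DPO-style loss over preferred versus rejected reasoning steps with a KL penalty standing in for the Action Policy Consistency Constraint. The coefficients `beta` and `lam`, the tensor shapes, and the loss form itself are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of a step-level preference loss in the spirit of TCPO: a
# DPO-style term over preferred vs. rejected intermediate reasoning steps plus
# a KL penalty standing in for the Action Policy Consistency Constraint.
# beta and lam are illustrative, not the paper's values.
import torch
import torch.nn.functional as F

def stepwise_preference_loss(logp_chosen, logp_rejected,
                             ref_logp_chosen, ref_logp_rejected,
                             action_logits, old_action_logits,
                             beta=0.1, lam=0.05):
    # Preference term over reasoning steps (chosen vs. rejected thought).
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    pref_loss = -F.logsigmoid(margin).mean()
    # Consistency term: keep the action distribution close to the previous policy.
    consistency = F.kl_div(F.log_softmax(action_logits, dim=-1),
                           F.softmax(old_action_logits, dim=-1),
                           reduction="batchmean")
    return pref_loss + lam * consistency

# Toy usage with random per-step log-probabilities and action logits.
b = 4
loss = stepwise_preference_loss(torch.randn(b), torch.randn(b),
                                torch.randn(b), torch.randn(b),
                                torch.randn(b, 6), torch.randn(b, 6))
print(loss.item())
```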

[178] Adaptive Monitoring and Real-World Evaluation of Agentic AI Systems

Manish Shukla

Main category: cs.AI

TL;DR: This paper introduces an Adaptive Multi-Dimensional Monitoring (AMDM) algorithm for agentic AI systems, addressing the gap in human-centered evaluations and providing empirical evidence for improved anomaly detection performance.

DetailsMotivation: Current evaluations of agentic AI systems predominantly focus on technical metrics (83% of papers) while neglecting human-centered and economic considerations (only 30%). The authors aim to address this imbalance by developing a comprehensive monitoring framework.

Method: The authors formalize an AMDM algorithm that normalizes heterogeneous metrics, applies per-axis exponentially weighted moving-average thresholds, and performs joint anomaly detection via Mahalanobis distance. They conduct simulations and real-world experiments to validate their approach.

Result: AMDM significantly improves performance: reduces anomaly-detection latency from 12.3s to 5.6s on simulated goal drift, and cuts false-positive rates from 4.5% to 0.9% compared to static thresholds. The paper includes comparison tables, ROC/PR curves, and case study reanalysis.

Conclusion: The AMDM algorithm provides an effective solution for multi-dimensional monitoring of agentic AI systems, bridging the gap between technical capabilities and human-centered evaluations. The authors provide code, data, and reproducibility materials to facilitate adoption and replication.

Abstract: Agentic artificial intelligence (AI) – multi-agent systems that combine large language models with external tools and autonomous planning – are rapidly transitioning from research laboratories into high-stakes domains. Our earlier “Basic” paper introduced a five-axis framework and proposed preliminary metrics such as goal drift and harm reduction but did not provide an algorithmic instantiation or empirical evidence. This “Advanced” sequel fills that gap. First, we revisit recent benchmarks and industrial deployments to show that technical metrics still dominate evaluations: a systematic review of 84 papers from 2023–2025 found that 83% report capability metrics while only 30% consider human-centred or economic axes [2]. Second, we formalise an Adaptive Multi-Dimensional Monitoring (AMDM) algorithm that normalises heterogeneous metrics, applies per-axis exponentially weighted moving-average thresholds and performs joint anomaly detection via the Mahalanobis distance. Third, we conduct simulations and real-world experiments. AMDM cuts anomaly-detection latency from 12.3 s to 5.6 s on simulated goal drift and reduces false-positive rates from 4.5% to 0.9% compared with static thresholds. We present a comparison table and ROC/PR curves, and we reanalyse case studies to surface missing metrics. Code, data and a reproducibility checklist accompany this paper to facilitate replication. The code supporting this work is available at https://github.com/Manishms18/Adaptive-Multi-Dimensional-Monitoring.
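
The abstract names the three AMDM steps (metric normalization, per-axis EWMA thresholds, joint anomaly detection via Mahalanobis distance) without pseudocode; the following is a minimal sketch of that loop under assumed smoothing factors and thresholds, not the authors' released implementation (their code is in the linked repository).

```python
# Minimal sketch of an AMDM-style monitor, assuming the three steps named in
# the abstract: normalize heterogeneous metrics, track per-axis EWMA statistics,
# and flag joint anomalies via Mahalanobis distance. All constants illustrative.
import numpy as np

class AMDMSketch:
    def __init__(self, n_axes, alpha=0.2, axis_sigma=3.0, joint_threshold=4.0):
        self.alpha = alpha                      # EWMA smoothing factor
        self.axis_sigma = axis_sigma            # per-axis deviation threshold
        self.joint_threshold = joint_threshold  # Mahalanobis distance threshold
        self.mean = np.zeros(n_axes)            # running EWMA mean per axis
        self.var = np.ones(n_axes)              # running EWMA variance per axis
        self.history = []                       # kept to estimate the covariance

    def update(self, raw_metrics):
        x = np.asarray(raw_metrics, dtype=float)
        # 1) Normalize each axis against its running statistics (z-score).
        z = (x - self.mean) / np.sqrt(self.var + 1e-8)
        # 2) Per-axis EWMA thresholding.
        axis_alarms = np.abs(z) > self.axis_sigma
        # 3) Joint anomaly detection via Mahalanobis distance over recent history.
        self.history.append(x)
        joint_alarm = False
        if len(self.history) > 2 * len(x):
            hist = np.array(self.history)
            cov = np.cov(hist, rowvar=False) + 1e-6 * np.eye(len(x))
            diff = x - hist.mean(axis=0)
            d = float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
            joint_alarm = d > self.joint_threshold
        # Update the EWMA statistics after scoring the current sample.
        self.mean = (1 - self.alpha) * self.mean + self.alpha * x
        self.var = (1 - self.alpha) * self.var + self.alpha * (x - self.mean) ** 2
        return axis_alarms, joint_alarm
```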

[179] No-Knowledge Alarms for Misaligned LLMs-as-Judges

Andrés Corrada-Emmanuel

Main category: cs.AI

TL;DR: Using logical consistency between disagreeing LLM judges to detect misaligned evaluation systems without ground truth knowledge.

DetailsMotivation: Addressing the infinite monitoring chain problem when using LLMs to evaluate other LLMs without knowing ground truth or trusting experts completely.

Method: Formalizing logical consistency between disagreeing LLM judges as a Linear Programming problem in integer response count space to compute possible evaluation outcomes.

Result: Development of no-knowledge alarms that can detect with no false positives when at least one member of an LLM judge ensemble violates specified grading ability requirements.

Conclusion: Logical consistency analysis provides a reliable method to monitor LLM judges without requiring ground truth knowledge, enabling detection of misaligned evaluation systems.

Abstract: If we use LLMs as judges to evaluate the complex decisions of other LLMs, who or what monitors the judges? Infinite monitoring chains are inevitable whenever we do not know the ground truth of the decisions by experts and we do not want to trust them. One way to ameliorate our evaluation uncertainty is to exploit the use of logical consistency between disagreeing experts. By observing how LLM judges agree and disagree while grading other LLMs, we can compute the only possible evaluations of their grading ability. For example, if two LLM judges disagree on which tasks a third one completed correctly, they cannot both be 100% correct in their judgments. This logic can be formalized as a Linear Programming problem in the space of integer response counts for any finite test. We use it here to develop no-knowledge alarms for misaligned LLM judges. The alarms can detect, with no false positives, that at least one member or more of an ensemble of judges are violating a user specified grading ability requirement.
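
The abstract's two-judge example can be written down directly: on every item where two judges disagree, at least one of them is wrong, whatever the unknown ground truth. The sketch below checks feasibility for that special case; the paper formalizes the general ensemble version as a Linear Programming problem over integer response counts. The 90% accuracy requirement in the usage example is illustrative.

```python
# Minimal sketch of a "no-knowledge alarm" for two binary LLM judges. If the
# required grading accuracy is logically incompatible with the observed
# disagreements, the alarm fires with no false positives.
def no_knowledge_alarm(n_items, n_disagreements, required_accuracy):
    # Each judge may make at most this many errors and still meet the requirement.
    max_errors_per_judge = (1.0 - required_accuracy) * n_items
    # Every disagreement forces at least one error somewhere in the pair,
    # so the pair's combined error budget must cover all disagreements.
    feasible = n_disagreements <= 2 * max_errors_per_judge
    return not feasible  # True => at least one judge violates the requirement

# Example: 100 graded items, the judges disagree on 25, and we require 90%
# accuracy. The pair's combined budget is 20 errors, but 25 errors are
# logically forced, so the alarm fires despite never seeing the ground truth.
print(no_knowledge_alarm(100, 25, 0.90))  # True
```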

[180] Automatic Failure Attribution and Critical Step Prediction Method for Multi-Agent Systems Based on Causal Inference

Guoqing Ma, Jia Zhu, Hanghui Guo, Weijie Shi, Jiawei Shen, Jingjiang Liu, Yidan Liang

Main category: cs.AI

TL;DR: A novel failure attribution framework for multi-agent systems using multi-granularity causal inference, achieving 36.2% step-level accuracy and boosting task success rates by 22.4% through automated optimizations.

DetailsMotivation: Current diagnostic tools for multi-agent systems rely on statistical correlations and perform poorly (less than 15% accuracy), creating a critical gap in failure attribution that hampers practical deployment of MAS.

Method: Two key contributions: 1) performance causal inversion principle that models performance dependencies by reversing data flow in execution logs with Shapley values for agent-level blame assignment; 2) CDC-MAS causal discovery algorithm that handles non-stationary MAS interaction data to identify critical failure steps.

Result: Significant performance leap on Who&When and TRAIL benchmarks - achieves up to 36.2% step-level accuracy (vs <15% for state-of-the-art). Generated optimizations boost overall task success rates by average of 22.4%.

Conclusion: Provides a principled and effective solution for debugging complex agent interactions, enabling more reliable and interpretable multi-agent systems through causal inference-based failure attribution and automated optimization.

Abstract: Multi-agent systems (MAS) are critical for automating complex tasks, yet their practical deployment is severely hampered by the challenge of failure attribution. Current diagnostic tools, which rely on statistical correlations, are fundamentally inadequate; on challenging benchmarks like Who&When, state-of-the-art methods achieve less than 15% accuracy in locating the root-cause step of a failure. To address this critical gap, we introduce the first failure attribution framework for MAS grounded in multi-granularity causal inference. Our approach makes two key technical contributions: (1) a performance causal inversion principle, which correctly models performance dependencies by reversing the data flow in execution logs, combined with Shapley values to accurately assign agent-level blame; (2) a novel causal discovery algorithm, CDC-MAS, that robustly identifies critical failure steps by tackling the non-stationary nature of MAS interaction data. The framework’s attribution results directly fuel an automated optimization loop, generating targeted suggestions whose efficacy is validated via counterfactual simulations. Evaluations on the Who&When and TRAIL benchmarks demonstrate a significant leap in performance. Our method achieves up to 36.2% step-level accuracy. Crucially, the generated optimizations boost overall task success rates by an average of 22.4%. This work provides a principled and effective solution for debugging complex agent interactions, paving the way for more reliable and interpretable multi-agent systems.
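
The abstract states that Shapley values assign agent-level blame but does not give the value function; the sketch below computes exact Shapley values over agent subsets for a hypothetical performance function, which stands in for whatever the framework derives from the inverted execution logs.

```python
# Minimal sketch of Shapley-value blame assignment over agents in a multi-agent
# run. The value function (task performance with a subset of agents "fixed") is
# a hypothetical stand-in, not the paper's construction.
from itertools import combinations
from math import factorial

def shapley_blame(agents, value):
    """value(frozenset_of_agents) -> task performance with that subset fixed."""
    n = len(agents)
    phi = {a: 0.0 for a in agents}
    for a in agents:
        others = [b for b in agents if b != a]
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[a] += weight * (value(s | {a}) - value(s))
    return phi

# Hypothetical value function: the "planner" contributes most of the success.
perf = {frozenset(): 0.1, frozenset({"planner"}): 0.6, frozenset({"coder"}): 0.2,
        frozenset({"planner", "coder"}): 0.8}
print(shapley_blame(["planner", "coder"], lambda s: perf[s]))
```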

[181] One Model, Two Minds: A Context-Gated Graph Learner that Recreates Human Biases

Shalima Binta Manir, Tim Oates

Main category: cs.AI

TL;DR: A dual-process Theory of Mind framework combining fast graph-based reasoning (System 1) with slower meta-adaptive learning (System 2) using context gates, validated on false-belief tasks and cognitive bias replication.

DetailsMotivation: To bridge AI and cognitive theory by creating systems that exhibit human-like social cognition and adaptive decision-making through dual-process reasoning inspired by cognitive science.

Method: Integrates graph convolutional networks (GCNs) for fast habitual reasoning (System 1) with meta-learning techniques for context-sensitive deliberative reasoning (System 2), using learned context gate mechanism to balance both systems.

Result: The model closely mirrors human adaptive behavior, achieves robust generalization to unseen contexts, and successfully replicates cognitive biases including anchoring, cognitive-load fatigue, framing effects, and priming effects.

Conclusion: This work provides a framework for AI systems with nuanced human-like social cognition and demonstrates how dual-process theories can be effectively implemented in artificial intelligence systems.

Abstract: We introduce a novel Theory of Mind (ToM) framework inspired by dual-process theories from cognitive science, integrating a fast, habitual graph-based reasoning system (System 1), implemented via graph convolutional networks (GCNs), and a slower, context-sensitive meta-adaptive learning system (System 2), driven by meta-learning techniques. Our model dynamically balances intuitive and deliberative reasoning through a learned context gate mechanism. We validate our architecture on canonical false-belief tasks and systematically explore its capacity to replicate hallmark cognitive biases associated with dual-process theory, including anchoring, cognitive-load fatigue, framing effects, and priming effects. Experimental results demonstrate that our dual-process approach closely mirrors human adaptive behavior, achieves robust generalization to unseen contexts, and elucidates cognitive mechanisms underlying reasoning biases. This work bridges artificial intelligence and cognitive theory, paving the way for AI systems exhibiting nuanced, human-like social cognition and adaptive decision-making capabilities.
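
The learned context gate is only described verbally; below is a minimal sketch, assuming a sigmoid gate that blends System 1 (GCN) and System 2 (meta-learner) action logits. The layer sizes and gating formula are illustrative, not the authors' architecture.

```python
# Minimal sketch of a learned context gate blending fast (System 1) and slow
# (System 2) policies; the random tensors stand in for the two subsystems.
import torch
import torch.nn as nn

class ContextGatedPolicy(nn.Module):
    def __init__(self, context_dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(context_dim, 1), nn.Sigmoid())

    def forward(self, context, system1_logits, system2_logits):
        g = self.gate(context)   # (batch, 1): 1 -> rely on habitual System 1
        return g * system1_logits + (1 - g) * system2_logits

policy = ContextGatedPolicy(context_dim=16)
ctx = torch.randn(2, 16)
blended = policy(ctx, torch.randn(2, 4), torch.randn(2, 4))
print(blended.shape)  # torch.Size([2, 4])
```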

[182] The More You Automate, the Less You See: Hidden Pitfalls of AI Scientist Systems

Ziming Luo, Atoosa Kasirzadeh, Nihar B. Shah

Main category: cs.AI

TL;DR: AI scientist systems have potential but their internal workflows need scrutiny to prevent flaws that undermine research integrity. This paper identifies four failure modes and shows they exist in current systems, recommending trace logs and code submission for better detection.

DetailsMotivation: AI scientist systems can accelerate scientific discovery but their internal workflows haven't been closely examined, risking flaws that could compromise research integrity and trustworthiness.

Method: Designed controlled experiments to isolate four potential failure modes (inappropriate benchmark selection, data leakage, metric misuse, post-hoc selection bias) and assessed two prominent open-source AI scientist systems.

Result: Found presence of several failures across a spectrum of severity that can be easily overlooked in practice. Demonstrated that access to trace logs and code enables far more effective failure detection than examining final papers alone.

Conclusion: Journals and conferences should mandate submission of trace logs and code alongside AI-generated research papers to ensure transparency, accountability, and reproducibility.

Abstract: AI scientist systems, capable of autonomously executing the full research workflow from hypothesis generation and experimentation to paper writing, hold significant potential for accelerating scientific discovery. However, the internal workflow of these systems have not been closely examined. This lack of scrutiny poses a risk of introducing flaws that could undermine the integrity, reliability, and trustworthiness of their research outputs. In this paper, we identify four potential failure modes in contemporary AI scientist systems: inappropriate benchmark selection, data leakage, metric misuse, and post-hoc selection bias. To examine these risks, we design controlled experiments that isolate each failure mode while addressing challenges unique to evaluating AI scientist systems. Our assessment of two prominent open-source AI scientist systems reveals the presence of several failures, across a spectrum of severity, which can be easily overlooked in practice. Finally, we demonstrate that access to trace logs and code from the full automated workflow enables far more effective detection of such failures than examining the final paper alone. We thus recommend journals and conferences evaluating AI-generated research to mandate submission of these artifacts alongside the paper to ensure transparency, accountability, and reproducibility.

[183] Depth-Bounded Epistemic Planning

Thomas Bolander, Alessandro Burigana, Marco Montali

Main category: cs.AI

TL;DR: A novel epistemic planning algorithm that limits reasoning depth to bound b, using canonical b-bisimulation contraction for efficient state space reduction and planning.

DetailsMotivation: To address the computational complexity of epistemic planning by limiting the depth of higher-order knowledge reasoning, making planning more efficient while maintaining completeness.

Method: Iteratively increment reasoning depth bound b, use canonical b-bisimulation contraction to create minimal unique models, and implement in DAEDALUS planner with state visitation checks.

Result: Algorithm runs in (b+1)-EXPTIME complexity, shows soundness and completeness, and demonstrates performance improvements over EFP 2.0 planner on standard benchmarks.

Conclusion: Limiting reasoning depth with b-bisimulation contraction enables efficient epistemic planning while maintaining theoretical guarantees, with practical performance gains demonstrated.

Abstract: We propose a novel algorithm for epistemic planning based on dynamic epistemic logic (DEL). The novelty is that we limit the depth of reasoning of the planning agent to an upper bound b, meaning that the planning agent can only reason about higher-order knowledge to at most (modal) depth b. We then compute a plan requiring the lowest reasoning depth by iteratively incrementing the value of b. The algorithm relies at its core on a new type of “canonical” b-bisimulation contraction that guarantees unique minimal models by construction. This yields smaller states wrt. standard bisimulation contractions, and enables to efficiently check for visited states. We show soundness and completeness of our planning algorithm, under suitable bounds on reasoning depth, and that, for a bound b, it runs in (b+1)-EXPTIME. We implement the algorithm in a novel epistemic planner, DAEDALUS, and compare it to the EFP 2.0 planner on several benchmarks from the literature, showing effective performance improvements.

[184] Associative Knowledge Graphs for Efficient Sequence Storage and Retrieval

Przemysław Stokłosa, Janusz A. Starzyk, Paweł Raif, Adrian Horzyk, Marcin Kowalik

Main category: cs.AI

TL;DR: Novel method using Sequential Structural Associative Knowledge Graphs (SSAKGs) for efficient sequence storage and retrieval with high memory capacity and context-based accuracy.

DetailsMotivation: Address challenges in storing and retrieving sequences for applications like anomaly detection, behavior prediction, and genetic information analysis using associative knowledge graphs.

Method: Developed SSAKGs that encode sequences as transitive tournaments with nodes representing objects and edges defining order. Created four ordering algorithms: Simple Sort, Node Ordering, Enhanced Node Ordering, and Weighted Edges Node Ordering. Evaluated on synthetic and real-world datasets including NLTK sentences and miRNA sequences.

Result: SSAKGs exhibited quadratic growth in memory capacity relative to graph size. Achieved high performance in precision, sensitivity, and specificity metrics for sequence retrieval.

Conclusion: The approach offers key advantages including no training requirements, flexible context-based reconstruction, and high efficiency in sparse memory graphs, with broad applications in computational neuroscience and bioinformatics.

Abstract: The paper addresses challenges in storing and retrieving sequences in contexts like anomaly detection, behavior prediction, and genetic information analysis. Associative Knowledge Graphs (AKGs) offer a promising approach by leveraging sparse graph structures to encode sequences. The objective was to develop a method for sequence storage and retrieval using AKGs that maintain high memory capacity and context-based retrieval accuracy while introducing algorithms for efficient element ordering. The study utilized Sequential Structural Associative Knowledge Graphs (SSAKGs). These graphs encode sequences as transitive tournaments with nodes representing objects and edges defining the order. Four ordering algorithms were developed and tested: Simple Sort, Node Ordering, Enhanced Node Ordering, and Weighted Edges Node Ordering. The evaluation was conducted on synthetic datasets consisting of random sequences of varying lengths and distributions, and real-world datasets, including sentence-based sequences from the NLTK library and miRNA sequences mapped symbolically with a window-based approach. Metrics such as precision, sensitivity, and specificity were employed to assess performance. SSAKGs exhibited quadratic growth in memory capacity relative to graph size. This study introduces a novel structural approach for sequence storage and retrieval. Key advantages include no training requirements, flexible context-based reconstruction, and high efficiency in sparse memory graphs. With broad applications in computational neuroscience and bioinformatics, the approach offers scalable solutions for sequence-based memory tasks.
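
The abstract describes sequences stored as transitive tournaments and reconstructed from a context set; the sketch below illustrates that idea with a simplified ordering rule (sort context nodes by out-degree within the context), which is a stand-in for the four ordering algorithms evaluated in the paper.

```python
# Minimal sketch of storing a sequence as a transitive tournament (nodes are
# objects, edges encode "comes before") and recovering the order of a context.
from itertools import combinations

def store_sequence(edges, sequence):
    # Add an edge from every element to every later element.
    for i, j in combinations(range(len(sequence)), 2):
        edges.add((sequence[i], sequence[j]))

def retrieve_order(edges, context):
    # A node that precedes more context members has more outgoing context edges.
    out_deg = {c: sum((c, d) in edges for d in context if d != c) for c in context}
    return sorted(context, key=lambda c: -out_deg[c])

edges = set()
store_sequence(edges, ["wake", "brew", "drink", "wash", "leave"])
print(retrieve_order(edges, {"drink", "wake", "wash"}))  # ['wake', 'drink', 'wash']
```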

[185] Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research

Xiang Liu, Penglei Sun, Shuyan Chen, Longhan Zhang, Peijie Dong, Huajie You, Yongqi Zhang, Chang Yan, Xiaowen Chu, Tong-yi Zhang

Main category: cs.AI

TL;DR: A knowledge-enhanced system for perovskite solar cells research that includes a knowledge graph from 1,517 papers, two datasets with Q&A pairs and reasoning problems, and two specialized LLMs that outperform existing models.

DetailsMotivation: The exponential growth in perovskite solar cell research publications creates an urgent need for efficient knowledge management and reasoning systems in this domain.

Method: Developed Perovskite-KG knowledge graph (23,789 entities, 22,272 relationships from 1,517 papers), created Perovskite-Chat (55,101 Q&A pairs) and Perovskite-Reasoning (2,217 materials science problems) datasets, and introduced two specialized LLMs for knowledge assistance and scientific reasoning.

Result: The system significantly outperforms existing models in both domain-specific knowledge retrieval and scientific reasoning tasks.

Conclusion: Provides researchers with effective tools for literature review, experimental design, and complex problem-solving in perovskite solar cell research.

Abstract: The rapid advancement of perovskite solar cells (PSCs) has led to an exponential growth in research publications, creating an urgent need for efficient knowledge management and reasoning systems in this domain. We present a comprehensive knowledge-enhanced system for PSCs that integrates three key components. First, we develop Perovskite-KG, a domain-specific knowledge graph constructed from 1,517 research papers, containing 23,789 entities and 22,272 relationships. Second, we create two complementary datasets: Perovskite-Chat, comprising 55,101 high-quality question-answer pairs generated through a novel multi-agent framework, and Perovskite-Reasoning, containing 2,217 carefully curated materials science problems. Third, we introduce two specialized large language models: Perovskite-Chat-LLM for domain-specific knowledge assistance and Perovskite-Reasoning-LLM for scientific reasoning tasks. Experimental results demonstrate that our system significantly outperforms existing models in both domain-specific knowledge retrieval and scientific reasoning tasks, providing researchers with effective tools for literature review, experimental design, and complex problem-solving in PSC research.

[186] Meta-Semantics Augmented Few-Shot Relational Learning

Han Wu, Jie Yin

Main category: cs.AI

TL;DR: PromptMeta framework integrates meta-semantics with relational information for few-shot learning on knowledge graphs, using meta-semantic prompts and learnable fusion tokens to enable effective knowledge transfer to rare relations.

DetailsMotivation: Existing few-shot relational learning methods focus primarily on specific relational information but overlook the rich semantics inherent in knowledge graphs, creating a critical gap in leveraging comprehensive KG semantics.

Method: Proposes PromptMeta framework with: 1) Meta-Semantic Prompt pool that learns high-level meta-semantics for knowledge transfer, and 2) Learnable fusion token that dynamically combines meta-semantics with task-specific relational information. Both components are optimized jointly within meta-learning.

Result: Extensive experiments on two real-world KG datasets demonstrate PromptMeta’s effectiveness in adapting to new relations with limited data.

Conclusion: The framework successfully integrates meta-semantics with relational information, enabling effective few-shot relational learning and adaptation to rare and newly emerging relations in knowledge graphs.

Abstract: Few-shot relational learning on knowledge graph (KGs) aims to perform reasoning over relations with only a few training examples. While existing methods have primarily focused on leveraging specific relational information, rich semantics inherent in KGs have been largely overlooked. To address this critical gap, we propose a novel prompted meta-learning (PromptMeta) framework that seamlessly integrates meta-semantics with relational information for few-shot relational learning. PromptMeta has two key innovations: (1) a Meta-Semantic Prompt (MSP) pool that learns and consolidates high-level meta-semantics, enabling effective knowledge transfer and adaptation to rare and newly emerging relations; and (2) a learnable fusion token that dynamically combines meta-semantics with task-specific relational information tailored to different few-shot tasks. Both components are optimized jointly with model parameters within a meta-learning framework. Extensive experiments and analyses on two real-world KG datasets demonstrate the effectiveness of PromptMeta in adapting to new relations with limited data.

[187] Context-Driven Knowledge Graph Completion with Semantic-Aware Relational Message Passing

Siyuan Li, Yan Wen, Ruitong Liu, Te Sun, Ruihao Zhou, Jingyi Kang, Yunjia Wu

Main category: cs.AI

TL;DR: Proposes semantic-aware relational message passing with Top-K neighbor selection to improve Knowledge Graph Completion by focusing on semantically relevant edges and reducing noise from indiscriminate aggregation.

DetailsMotivation: Traditional node-based message passing in knowledge graphs introduces noise and suffers from information dilution/over-smoothing by aggregating all neighboring edges indiscriminately, missing crucial semantic context for link prediction.

Method: Introduces semantic-aware Top-K neighbor selection that evaluates semantic relevance between central nodes and incident edges in a shared latent space, then fuses selected edge information with node representation using multi-head attention aggregator.

Result: Extensive experiments show superior performance compared to existing approaches on several established benchmarks.

Conclusion: The proposed semantic-aware relational message passing framework effectively captures and propagates contextually relevant information while mitigating interference from irrelevant data, improving Knowledge Graph Completion performance.

Abstract: Semantic context surrounding a triplet $(h, r, t)$ is crucial for Knowledge Graph Completion (KGC), providing vital cues for prediction. However, traditional node-based message passing mechanisms, when applied to knowledge graphs, often introduce noise and suffer from information dilution or over-smoothing by indiscriminately aggregating information from all neighboring edges. To address this challenge, we propose a semantic-aware relational message passing. A core innovation of this framework is the introduction of a semantic-aware Top-K neighbor selection strategy. Specifically, this strategy first evaluates the semantic relevance between a central node and its incident edges within a shared latent space, selecting only the Top-K most pertinent ones. Subsequently, information from these selected edges is effectively fused with the central node’s own representation using a multi-head attention aggregator to generate a semantically focused node message. In this manner, our model not only leverages the structure and features of edges within the knowledge graph but also more accurately captures and propagates the contextual information most relevant to the specific link prediction task, thereby effectively mitigating interference from irrelevant information. Extensive experiments demonstrate that our method achieves superior performance compared to existing approaches on several established benchmarks.
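
The two steps named in the abstract (score incident edges against the central node in a shared latent space, keep the Top-K, then fuse with multi-head attention) can be sketched compactly. Embedding sizes, K, and the dot-product scoring rule below are illustrative assumptions rather than the authors' exact design.

```python
# Minimal sketch of semantic-aware Top-K neighbor selection followed by a
# multi-head attention aggregation over the selected edge embeddings.
import torch
import torch.nn as nn

class TopKSemanticAggregator(nn.Module):
    def __init__(self, dim, k=4, heads=4):
        super().__init__()
        self.k = k
        self.proj_node = nn.Linear(dim, dim)   # shared latent space for scoring
        self.proj_edge = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, node_emb, edge_embs):
        # node_emb: (dim,), edge_embs: (n_edges, dim)
        scores = self.proj_edge(edge_embs) @ self.proj_node(node_emb)  # (n_edges,)
        k = min(self.k, edge_embs.size(0))
        top_idx = scores.topk(k).indices
        selected = edge_embs[top_idx].unsqueeze(0)        # (1, k, dim)
        query = node_emb.view(1, 1, -1)                   # (1, 1, dim)
        fused, _ = self.attn(query, selected, selected)   # attend over Top-K edges
        return fused.squeeze(0).squeeze(0)                # (dim,)

agg = TopKSemanticAggregator(dim=32, k=3)
msg = agg(torch.randn(32), torch.randn(10, 32))
print(msg.shape)  # torch.Size([32])
```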

[188] Scaling LLM Planning: NL2FLOW for Parametric Problem Generation and Rigorous Evaluation

Jungkoo Kang

Main category: cs.AI

TL;DR: NL2Flow is an automated pipeline for generating and evaluating workflow planning problems to address the scarcity of scalable evaluation data for LLM planning and reasoning.

DetailsMotivation: Progress in LLM planning and reasoning is hindered by lack of scalable evaluation data. Robust workflow composition is critical for effective agent performance.

Method: NL2Flow generates problems parametrically in structured intermediate representation, translates them to natural language and PDDL. Evaluates open-source LLMs on 2296 low-difficulty problems, with neuro-symbolic integration approach.

Result: Best model achieved 86% success in valid plans and 69% in optimal plans. Translating natural language to structured JSON before symbolic planning significantly improved success rates.

Conclusion: Understanding error sources in LLM reasoning is crucial as systems scale to complex tasks. Neuro-symbolic integration shows benefits, and problem characteristics’ influence depends on model and prompt design.

Abstract: Robust workflow composition is critical for effective agent performance, yet progress in Large Language Model (LLM) planning and reasoning is hindered by a scarcity of scalable evaluation data. This work introduces NL2Flow, a fully automated pipeline for generating and evaluating workflow planning problems. NL2Flow generates problems parametrically in a structured intermediate representation, translating them into both natural language and formal PDDL. I evaluate several open-source, instruct-tuned LLMs on a dataset of 2296 low-difficulty problems generated by NL2Flow. Results demonstrate that the best-performing model achieved 86% success in generating valid plans and 69% in generating optimal plans (for solvable problems). Regression analysis shows that the influence of problem characteristics on plan generation is contingent on both model and prompt design. Importantly, translating natural language problems into a structured JSON representation prior to symbolic planning significantly improved success rates, suggesting a benefit from neuro-symbolic integration. These findings underscore the importance of understanding error sources within LLM reasoning as systems scale to more complex tasks. As LLM reasoning scales to increasingly complex problems, understanding the shifting bottlenecks and sources of error within these systems will be crucial.
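
The reported gain comes from translating natural-language problems into a structured JSON intermediate before symbolic planning; the sketch below renders a toy intermediate representation into a PDDL problem string. The predicate names and the tiny workflow domain are hypothetical placeholders, not the benchmark's actual schema.

```python
# Minimal sketch of a structured intermediate representation rendered both as
# JSON (for the LLM) and as a PDDL problem string (for a symbolic planner).
import json

problem = {
    "objects": ["fetch_data", "clean_data", "train_model"],
    "available": ["fetch_data", "clean_data", "train_model"],
    "goal_done": ["train_model"],
}

def to_pddl(p):
    objs = " ".join(p["objects"])
    init = " ".join(f"(available {a})" for a in p["available"])
    goal = " ".join(f"(done {g})" for g in p["goal_done"])
    return (f"(define (problem workflow)\n  (:domain flow)\n"
            f"  (:objects {objs} - action)\n"
            f"  (:init {init})\n  (:goal (and {goal})))")

print(json.dumps(problem, indent=2))  # the structured JSON handed to the LLM
print(to_pddl(problem))               # the symbolic form handed to a PDDL solver
```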

[189] Working with AI: Measuring the Applicability of Generative AI to Occupations

Kiran Tomlinson, Sonia Jaffe, Will Wang, Scott Counts, Siddharth Suri

Main category: cs.AI

TL;DR: Analysis of 200k AI conversations shows generative AI most commonly assists with information gathering and writing tasks, with highest applicability in knowledge work occupations like computer/mathematical fields and office support.

DetailsMotivation: Understand the economic impact of generative AI by analyzing how people use AI for work activities and which occupations are most affected.

Method: Analyzed 200k anonymized conversations from Microsoft Bing Copilot users, classified work activities, measured task success and scope, and computed AI applicability scores for occupations.

Result: Highest AI applicability scores found in knowledge work occupations (computer/mathematical, office/administrative support) and sales occupations. Information provision, writing, teaching, and advising were the most common successful AI activities.

Conclusion: Generative AI shows strongest applicability in information-intensive occupations, with real-world usage patterns differing from previous occupational impact predictions, highlighting the need for empirical analysis of AI’s economic effects.

Abstract: Given the rapid adoption of generative AI and its potential to impact a wide range of tasks, understanding the effects of AI on the economy is one of society’s most important questions. In this work, we take a step toward that goal by analyzing the work activities people do with AI, how successfully and broadly those activities are done, and combine that with data on what occupations do those activities. We analyze a dataset of 200k anonymized and privacy-scrubbed conversations between users and Microsoft Bing Copilot, a publicly available generative AI system. We find the most common work activities people seek AI assistance for involve gathering information and writing, while the most common activities that AI itself is performing are providing information and assistance, writing, teaching, and advising. Combining these activity classifications with measurements of task success and scope of impact, we compute an AI applicability score for each occupation. We find the highest AI applicability scores for knowledge work occupation groups such as computer and mathematical, and office and administrative support, as well as occupations such as sales whose work activities involve providing and communicating information. Additionally, we characterize the types of work activities performed most successfully, how wage and education correlate with AI applicability, and how real-world usage compares to predictions of occupational AI impact.

[190] Understanding visual attention beehind bee-inspired UAV navigation

Pranav Rajbhandari, Abhi Veda, Matthew Garratt, Mandyam Srinivasan, Sridhar Ravi

Main category: cs.AI

TL;DR: RL agents trained with optic flow navigation show attention patterns similar to honeybees, focusing on flow discontinuities and large magnitudes to avoid obstacles and center in tunnels.

DetailsMotivation: Bio-inspired UAV navigation using honeybee optic flow capabilities for obstacle avoidance with limited sensory input.

Method: Train Reinforcement Learning agents to navigate cluttered tunnels using only optic flow as sensory input, then analyze their attention patterns.

Result: Agents primarily attend to optic flow discontinuities and large magnitude regions, resembling honeybee navigation behavior by avoiding obstacles and maintaining centered position.

Conclusion: This attention pattern is consistent across trained agents and could inform development of simple explicit control laws for physical UAVs.

Abstract: Bio-inspired design is often used in autonomous UAV navigation due to the capacity of biological systems for flight and obstacle avoidance despite limited sensory and computational capabilities. In particular, honeybees mainly use the sensory input of optic flow, the apparent motion of objects in their visual field, to navigate cluttered environments. In our work, we train a Reinforcement Learning agent to navigate a tunnel with obstacles using only optic flow as sensory input. We inspect the attention patterns of trained agents to determine the regions of optic flow on which they primarily base their motor decisions. We find that agents trained in this way pay most attention to regions of discontinuity in optic flow, as well as regions with large optic flow magnitude. The trained agents appear to navigate a cluttered tunnel by avoiding the obstacles that produce large optic flow, while maintaining a centered position in their environment, which resembles the behavior seen in flying insects. This pattern persists across independently trained agents, which suggests that this could be a good strategy for developing a simple explicit control law for physical UAVs.

[191] Bridging the Gap in Ophthalmic AI: MM-Retinal-Reason Dataset and OphthaReason Model toward Dynamic Multimodal Reasoning

Ruiqi Wu, Yuang Yao, Tengfei Ma, Chenran Zhang, Na Su, Tao Zhou, Geng Chen, Wen Fan, Yi Zhou

Main category: cs.AI

TL;DR: Introduces MM-Retinal-Reason dataset and OphthaReason model for ophthalmology-specific multimodal reasoning with uncertainty-aware dynamic thinking method.

DetailsMotivation: Existing medical multimodal models focus only on basic visual feature matching, but real clinical diagnosis requires integrating heterogeneous clinical information with multimodal imaging data.

Method: Proposes Uncertainty-Aware Dynamic Thinking (UADT) that estimates sample-level uncertainty via entropy and dynamically modulates exploration depth using shaped advantage mechanism.

Result: Achieves state-of-the-art performance, outperforming general-purpose MLLMs by 24.92%, medical MLLMs by 15.00%, RL-based medical MLLMs by 21.20%, and ophthalmic MLLMs by 17.66%.

Conclusion: The proposed approach successfully bridges the gap between basic and complex clinical reasoning in ophthalmology, demonstrating superior performance across all benchmark comparisons.

Abstract: Multimodal large language models (MLLMs) have recently demonstrated remarkable reasoning abilities with reinforcement learning paradigm. Although several multimodal reasoning models have been explored in the medical domain, most of them focus exclusively on basic reasoning, which refers to shallow inference based on visual feature matching. However, real-world clinical diagnosis extends beyond basic reasoning, demanding reasoning processes that integrate heterogeneous clinical information (such as chief complaints and medical history) with multimodal medical imaging data. To bridge this gap, we introduce MM-Retinal-Reason, the first ophthalmic multimodal dataset with the full spectrum of perception and reasoning. It encompasses both basic reasoning tasks and complex reasoning tasks, aiming to enhance visual-centric fundamental reasoning capabilities and emulate realistic clinical thinking patterns. Building upon MM-Retinal-Reason, we propose OphthaReason, the first ophthalmology-specific multimodal reasoning model with step-by-step reasoning traces. To enable flexible adaptation to both basic and complex reasoning tasks, we specifically design a novel method called Uncertainty-Aware Dynamic Thinking (UADT), which estimates sample-level uncertainty via entropy and dynamically modulates the model’s exploration depth using a shaped advantage mechanism. Comprehensive experiments demonstrate that our model achieves state-of-the-art performance on both basic and complex reasoning tasks, outperforming general-purpose MLLMs, medical MLLMs, RL-based medical MLLMs, and ophthalmic MLLMs by at least 24.92%, 15.00%, 21.20%, and 17.66%. Project Page: \href{https://github.com/lxirich/OphthaReason}{link}.
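
UADT is summarized as estimating sample-level uncertainty via entropy and modulating exploration through a shaped advantage, but the exact shaping function is not given in the abstract; the sketch below amplifies the advantage with normalized mean token entropy as an assumed rule, purely for illustration.

```python
# Minimal sketch of an entropy-shaped advantage: more uncertain samples get a
# stronger update, standing in for UADT's dynamic exploration depth.
import torch
import torch.nn.functional as F

def entropy_shaped_advantage(token_logits, advantage, gamma=0.5):
    # token_logits: (seq_len, vocab); advantage: scalar tensor for the sample.
    probs = F.softmax(token_logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(-1).mean()  # sample uncertainty
    norm = entropy / torch.log(torch.tensor(float(token_logits.size(-1))))
    return advantage * (1.0 + gamma * norm)  # more uncertain -> stronger update

adv = entropy_shaped_advantage(torch.randn(12, 1000), torch.tensor(0.8))
print(adv.item())
```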

[192] Murphys Laws of AI Alignment: Why the Gap Always Wins

Madhava Gaikwad

Main category: cs.AI

TL;DR: Formal impossibility result for RLHF showing irreducible performance gap in misspecified environments without calibration oracle

DetailsMotivation: To understand the fundamental limitations of reinforcement learning from human feedback (RLHF) and explain observed alignment failures

Method: Information-theoretic proof establishing tight lower bounds, with empirical illustrations and analysis of alignment regularities

Result: Any RLHF-style learner suffers an irreducible Ω(γ) performance gap in misspecified environments with bounded query budgets unless a calibration oracle is available

Conclusion: Murphys Gap represents both a diagnostic limit of RLHF and a guide for future work on calibration and causal preference checks

Abstract: We prove a formal impossibility result for reinforcement learning from human feedback (RLHF). In misspecified environments with bounded query budgets, any RLHF-style learner suffers an irreducible performance gap Omega(gamma) unless it has access to a calibration oracle. We give tight lower bounds via an information-theoretic proof and show that a minimal calibration oracle suffices to eliminate the gap. Small-scale empirical illustrations and a catalogue of alignment regularities (Murphy’s Laws) indicate that many observed alignment failures are consistent with this structural mechanism. Our results position Murphys Gap as both a diagnostic limit of RLHF and a guide for future work on calibration and causal preference checks.

Quinten Steenhuis

Main category: cs.AI

TL;DR: FETCH classifier for legal issue classification achieves 97.37% accuracy using hybrid LLM/ML ensemble and automatic follow-up questions, outperforming GPT-5 at lower cost.

DetailsMotivation: Millions seeking legal help face consequences from misdirection - missed deadlines, abuse, housing loss, custody issues while waiting for proper legal assistance.

Method: Hybrid LLM/ML ensemble classification method with automatic generation of follow-up questions to enrich initial problem narratives, using 419 real-world queries dataset.

Result: Achieved classification accuracy (hits@2) of 97.37% using inexpensive models, exceeding state-of-the-art GPT-5 model performance.

Conclusion: Approach significantly reduces cost of guiding legal system users to appropriate resources while maintaining high accuracy, promising for legal aid applications.

Abstract: Each year millions of people seek help for their legal problems by calling a legal aid program hotline, walking into a legal aid office, or using a lawyer referral service. The first step to match them to the right help is to identify the legal problem the applicant is experiencing. Misdirection has consequences. Applicants may miss a deadline, experience physical abuse, lose housing or lose custody of children while waiting to connect to the right legal help. We introduce and evaluate the FETCH classifier for legal issue classification and describe two methods for improving accuracy: a hybrid LLM/ML ensemble classification method, and the automatic generation of follow-up questions to enrich the initial problem narrative. We employ a novel data set of 419 real-world queries to a nonprofit lawyer referral service. Ultimately, we show classification accuracy (hits@2) of 97.37% using a mix of inexpensive models, exceeding the performance of the current state-of-the-art GPT-5 model. Our approach shows promise in significantly reducing the cost of guiding users of the legal system to the right resource for their problem while achieving high accuracy.
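
The hits@2 metric reported above, and one possible way a hybrid LLM/ML ensemble could produce a ranked list, are sketched below; the weighting scheme and label names are illustrative assumptions, not the FETCH implementation.

```python
# Minimal sketch of hits@k scoring plus a simple weighted blend of ML
# classifier probabilities and LLM votes into a ranked label list.
def hits_at_k(ranked_labels, true_label, k=2):
    return true_label in ranked_labels[:k]

def ensemble_ranking(ml_scores, llm_votes, weight=0.5):
    # ml_scores: {label: probability}; llm_votes: {label: vote share}
    labels = set(ml_scores) | set(llm_votes)
    blended = {c: weight * ml_scores.get(c, 0.0)
                  + (1 - weight) * llm_votes.get(c, 0.0) for c in labels}
    return sorted(blended, key=blended.get, reverse=True)

ranked = ensemble_ranking({"housing": 0.55, "family": 0.30, "consumer": 0.15},
                          {"family": 0.7, "housing": 0.3})
print(ranked, hits_at_k(ranked, "family"))  # ['family', 'housing', 'consumer'] True
```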

[194] BlendedNet: A Blended Wing Body Aircraft Dataset and Surrogate Model for Aerodynamic Predictions

Nicholas Sung, Steven Spreizer, Mohamed Elrefaie, Kaira Samuel, Matthew C. Jones, Faez Ahmed

Main category: cs.AI

TL;DR: BlendedNet is a large public dataset of 999 blended wing body geometries with aerodynamic simulations, plus an end-to-end surrogate model for pointwise aerodynamic prediction.

DetailsMotivation: Address data scarcity for unconventional aircraft configurations like blended wing bodies and enable research on data-driven surrogate modeling for aerodynamic design.

Method: Created dataset by sampling geometric design parameters and flight conditions, generating 8830 RANS simulations. Developed surrogate framework with PointNet regressor to predict geometric parameters from point clouds, then FiLM network conditioned on parameters and flight conditions to predict pointwise coefficients.

Result: Generated comprehensive dataset with detailed surface quantities. Surrogate model shows low errors in surface predictions across diverse blended wing body configurations.

Conclusion: BlendedNet successfully provides valuable public data resource and demonstrates effective end-to-end surrogate modeling approach for aerodynamic prediction of unconventional aircraft configurations.

Abstract: BlendedNet is a publicly available aerodynamic dataset of 999 blended wing body (BWB) geometries. Each geometry is simulated across about nine flight conditions, yielding 8830 converged RANS cases with the Spalart-Allmaras model and 9 to 14 million cells per case. The dataset is generated by sampling geometric design parameters and flight conditions, and includes detailed pointwise surface quantities needed to study lift and drag. We also introduce an end-to-end surrogate framework for pointwise aerodynamic prediction. The pipeline first uses a permutation-invariant PointNet regressor to predict geometric parameters from sampled surface point clouds, then conditions a Feature-wise Linear Modulation (FiLM) network on the predicted parameters and flight conditions to predict pointwise coefficients Cp, Cfx, and Cfz. Experiments show low errors in surface predictions across diverse BWBs. BlendedNet addresses data scarcity for unconventional configurations and enables research on data-driven surrogate modeling for aerodynamic design.
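
The surrogate's second stage conditions per-point features on geometry parameters and flight conditions via FiLM; a minimal sketch of such a FiLM-conditioned pointwise predictor is shown below, with illustrative layer sizes rather than the authors' architecture.

```python
# Minimal sketch of a FiLM-conditioned predictor of pointwise coefficients:
# per-point features are scaled and shifted by vectors produced from the
# conditioning inputs (geometry parameters plus flight conditions).
import torch
import torch.nn as nn

class FiLMSurrogate(nn.Module):
    def __init__(self, point_dim=3, cond_dim=8, hidden=64, n_outputs=3):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(point_dim, hidden), nn.ReLU())
        self.film = nn.Linear(cond_dim, 2 * hidden)   # produces (gamma, beta)
        self.head = nn.Linear(hidden, n_outputs)      # e.g. Cp, Cfx, Cfz per point

    def forward(self, points, conditions):
        h = self.encode(points)                        # (n_points, hidden)
        gamma, beta = self.film(conditions).chunk(2, dim=-1)
        h = gamma * h + beta                           # feature-wise modulation
        return self.head(torch.relu(h))

model = FiLMSurrogate()
coeffs = model(torch.randn(1000, 3), torch.randn(8))
print(coeffs.shape)  # torch.Size([1000, 3])
```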

[195] Towards explainable decision support using hybrid neural models for logistic terminal automation

Riccardo D’Elia, Alberto Termine, Francesco Flammini

Main category: cs.AI

TL;DR: A novel interpretable-by-design neural system dynamics framework that combines deep learning with interpretability techniques to maintain explainability and causal reliability in transportation logistics modeling.

DetailsMotivation: Deep learning in system dynamics modeling improves scalability and predictive accuracy but sacrifices explainability and causal reliability, which are critical for decision-making in transportation logistics.

Method: Hybrid approach integrating deep learning with Concept-Based Interpretability, Mechanistic Interpretability, and Causal Machine Learning to create neural network models with semantically meaningful variables.

Result: The framework enables construction of neural network models that retain causal grounding and transparency while operating on actionable variables, applied to real-world multimodal logistic terminal case studies.

Conclusion: Neuro-symbolic methods can bridge the gap between black-box predictive models and the need for explainable critical decision support in complex cyber-physical systems enabled by industrial IoT.

Abstract: The integration of Deep Learning (DL) in System Dynamics (SD) modeling for transportation logistics offers significant advantages in scalability and predictive accuracy. However, these gains are often offset by the loss of explainability and causal reliability $-$ key requirements in critical decision-making systems. This paper presents a novel framework for interpretable-by-design neural system dynamics modeling that synergizes DL with techniques from Concept-Based Interpretability, Mechanistic Interpretability, and Causal Machine Learning. The proposed hybrid approach enables the construction of neural network models that operate on semantically meaningful and actionable variables, while retaining the causal grounding and transparency typical of traditional SD models. The framework is conceived to be applied to real-world case-studies from the EU-funded project AutoMoTIF, focusing on data-driven decision support, automation, and optimization of multimodal logistic terminals. We aim at showing how neuro-symbolic methods can bridge the gap between black-box predictive models and the need for critical decision support in complex dynamical environments within cyber-physical systems enabled by the industrial Internet-of-Things.

[196] HiPhO: How Far Are (M)LLMs from Humans in the Latest High School Physics Olympiad Benchmark?

Fangchen Yu, Haiyuan Wan, Qianjia Cheng, Yuchen Zhang, Jiacheng Chen, Fujun Han, Yulun Wu, Junchi Yao, Ruilizhen Hu, Ning Ding, Yu Cheng, Tao Chen, Lei Bai, Dongzhan Zhou, Yun Luo, Ganqu Cui, Peng Ye

Main category: cs.AI

TL;DR: HiPhO is the first benchmark for high school physics Olympiads that enables direct comparison between AI models and human contestants using official grading schemes and medal thresholds.

DetailsMotivation: Existing physics benchmarks lack systematic coverage of real-world physics competitions like Olympiads and don't allow direct performance comparison with human contestants.

Method: Compiled 13 latest Olympiad exams (2024-2025) with mixed modalities, adopted official marking schemes for fine-grained grading, and assigned medals based on official thresholds to compare models with human performance.

Result: Open-source MLLMs mostly remain at/below bronze level; open-source LLMs show occasional golds; closed-source reasoning MLLMs achieve 6-12 gold medals; most models still have significant gap from full marks.

Conclusion: Substantial performance gap exists between open-source models and top students, closed-source reasoning models show strong physical reasoning, and significant room for improvement remains. HiPhO provides rigorous human-aligned evaluation for advancing multimodal physical reasoning.

Abstract: Recently, the physical capabilities of (M)LLMs have garnered increasing attention. However, existing benchmarks for physics suffer from two major gaps: they neither provide systematic and up-to-date coverage of real-world physics competitions such as physics Olympiads, nor enable direct performance comparison with humans. To bridge these gaps, we present HiPhO, the first benchmark dedicated to high school physics Olympiads with human-aligned evaluation. Specifically, HiPhO highlights three key innovations. (1) Comprehensive Data: It compiles 13 latest Olympiad exams from 2024-2025, spanning both international and regional competitions, and covering mixed modalities that encompass problems spanning text-only to diagram-based. (2) Professional Evaluation: We adopt official marking schemes to perform fine-grained grading at both the answer and step level, fully aligned with human examiners to ensure high-quality and domain-specific evaluation. (3) Comparison with Human Contestants: We assign gold, silver, and bronze medals to models based on official medal thresholds, thereby enabling direct comparison between (M)LLMs and human contestants. Our large-scale evaluation of 30 state-of-the-art (M)LLMs shows that: across 13 exams, open-source MLLMs mostly remain at or below the bronze level; open-source LLMs show promising progress with occasional golds; closed-source reasoning MLLMs can achieve 6 to 12 gold medals; and most models still have a significant gap from full marks. These results highlight a substantial performance gap between open-source models and top students, the strong physical reasoning capabilities of closed-source reasoning models, and the fact that there is still significant room for improvement. HiPhO, as a rigorous, human-aligned, and Olympiad-focused benchmark for advancing multimodal physical reasoning, is open-source and available at https://github.com/SciYu/HiPhO.
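
The human-aligned comparison boils down to mapping a model's exam score onto the official medal cutoffs; a tiny sketch follows, with placeholder thresholds since the real cutoffs come from each exam's official statistics.

```python
# Minimal sketch of threshold-based medal assignment; cutoff scores are
# illustrative placeholders, not the official HiPhO thresholds.
def assign_medal(score, gold=32.0, silver=24.0, bronze=16.0):
    if score >= gold:
        return "gold"
    if score >= silver:
        return "silver"
    if score >= bronze:
        return "bronze"
    return "no medal"

print([assign_medal(s) for s in (35.5, 27.0, 10.0)])  # ['gold', 'silver', 'no medal']
```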

cs.SD

[197] LALM-Eval: An Open-Source Toolkit for Holistic Evaluation of Large Audio Language Models

Sidharth Surapaneni, Hoang Nguyen, Jash Mehta, Aman Tiwari, Oluwanifemi Bamgbose, Akshay Kalkunte, Sai Rajeswar, Sathwik Tejaswi Madhusudhan

Main category: cs.SD

TL;DR: LALM-Eval is a new efficient evaluation framework for Large Audio Language Models that addresses speed, reproducibility, and coverage limitations of existing toolkits, enabling large-scale assessments and revealing significant gaps in current models.

DetailsMotivation: Current LALM evaluation frameworks suffer from slow processing, inconsistent prompting that hurts reproducibility, and narrow task coverage that misses important audio reasoning capabilities, limiting fair comparison and systematic assessment.

Method: Developed LALM-Eval with optimized batch processing and parallel execution for speed improvements, standardized prompting protocols for fair comparisons, and introduced two new evaluation categories: LLM-Adaptive Diarization for temporal understanding and Spoken Language Reasoning for complex cognitive tasks.
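
To make the speed-oriented design concrete, below is a minimal, hedged sketch of a batched, thread-parallel evaluation loop with a standardized prompt template. The `model.generate` call, the field names, and the batch/worker sizes are placeholders and not LALM-Eval's actual API.

```python
# Minimal sketch of a batched, parallel evaluation loop in the spirit of the
# toolkit described above. All names (model.generate, the example fields) are
# hypothetical stand-ins; they are not LALM-Eval's real interface.
from concurrent.futures import ThreadPoolExecutor

PROMPT_TEMPLATE = "Listen to the audio and answer: {question}"  # standardized prompt

def evaluate_batch(model, batch):
    """Score one batch of (audio, question, reference) examples."""
    prompts = [PROMPT_TEMPLATE.format(question=ex["question"]) for ex in batch]
    outputs = model.generate(audio=[ex["audio"] for ex in batch], prompts=prompts)
    return [out.strip() == ex["reference"] for out, ex in zip(outputs, batch)]

def evaluate_task(model, examples, batch_size=8, workers=4):
    """Split a task into batches and score them in parallel threads."""
    batches = [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda b: evaluate_batch(model, b), batches)
    flat = [r for batch in results for r in batch]
    return sum(flat) / len(flat)  # task-level accuracy
```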

Result: Achieved up to 127% speedup over existing toolkits, enabling previously impractical large-scale evaluations. Evaluation across 380+ tasks revealed significant gaps in current LALMs, particularly in temporal understanding and complex spoken language reasoning. Also found a lack of standardization in instruction modality across audio benchmarks, causing performance differences of up to 9.5 absolute points on complex instruction-following tasks.

Conclusion: LALM-Eval provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development by addressing critical evaluation challenges and enabling comprehensive model assessment.

Abstract: Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging due to inefficient toolkits that limit fair comparison and systematic assessment. Current frameworks suffer from three critical issues: slow processing that bottlenecks large-scale studies, inconsistent prompting that hurts reproducibility, and narrow task coverage that misses important audio reasoning capabilities. We introduce LALM-Eval, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 127% over existing toolkits through optimized batch processing and parallel execution, enabling large-scale evaluations previously impractical. We provide standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios. Additionally, we introduce two new evaluation categories: LLM-Adaptive Diarization for temporal audio understanding and Spoken Language Reasoning for complex audio-based cognitive tasks. Through evaluation across 380+ tasks, we reveal significant gaps in current LALMs, particularly in temporal understanding and complex spoken language reasoning tasks. Our findings also highlight a lack of standardization in instruction modality across audio benchmarks, which can lead to performance differences of up to 9.5 absolute points on challenging complex instruction-following downstream tasks. LALM-Eval provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development.

[198] Segment Transformer: AI-Generated Music Detection via Music Structural Analysis

Yumin Kim, Seonghyeon Go

Main category: cs.SD

TL;DR: The paper proposes a transformer-based framework for detecting AI-generated music by analyzing structural patterns in music segments, achieving high accuracy on both short and long audio clips.

DetailsMotivation: Address copyright concerns and ownership ambiguity in AI-generated music by developing accurate detection methods to distinguish between AI-created and human-composed music.

Method: Integrated various pre-trained models (SSL models and audio effect encoder) into a transformer-based framework for short audio clips, and developed a segment transformer for long audio that analyzes inter-segment relationships.
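
The segment-level idea can be sketched as follows: frame-level features from a pre-trained model are pooled into per-segment embeddings, and a standard transformer encoder then models inter-segment relationships before a clip-level decision. The dimensions, pooling choice, and classification head below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SegmentTransformer(nn.Module):
    """Toy segment-level classifier: mean-pool frame features per segment,
    then model inter-segment relationships with a transformer encoder.
    All sizes are illustrative, not the authors' configuration."""
    def __init__(self, feat_dim=768, n_heads=8, n_layers=4, n_classes=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.cls_head = nn.Linear(feat_dim, n_classes)

    def forward(self, frame_feats, segment_len):
        # frame_feats: (batch, n_frames, feat_dim) from a pre-trained SSL model
        b, t, d = frame_feats.shape
        n_seg = t // segment_len
        segs = frame_feats[:, : n_seg * segment_len].reshape(b, n_seg, segment_len, d)
        seg_emb = segs.mean(dim=2)                   # one embedding per segment
        seg_emb = self.encoder(seg_emb)              # inter-segment relationships
        return self.cls_head(seg_emb.mean(dim=1))    # clip-level AI/human logits
```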

Result: Achieved high accuracy in both short-audio and full-audio detection experiments using FakeMusicCaps and SONICS datasets.

Conclusion: Integrating segment-level musical features with long-range temporal analysis effectively enhances the performance and robustness of AI-generated music detection systems.

Abstract: Audio and music generation systems have been remarkably developed in the music information retrieval (MIR) research field. The advancement of these technologies raises copyright concerns, as ownership and authorship of AI-generated music (AIGM) remain unclear. Also, it can be difficult to clearly determine whether a piece was generated by AI or composed by humans. To address these challenges, we aim to improve the accuracy of AIGM detection by analyzing the structural patterns of music segments. Specifically, to extract musical features from short audio clips, we integrated various pre-trained models, including self-supervised learning (SSL) models and an audio effect encoder, each within our suggested transformer-based framework. Furthermore, for long audio, we developed a segment transformer that divides music into segments and learns inter-segment relationships. We used the FakeMusicCaps and SONICS datasets, achieving high accuracy in both the short-audio and full-audio detection experiments. These findings suggest that integrating segment-level musical features into long-range temporal analysis can effectively enhance both the performance and robustness of AIGM detection systems.

[199] LatentVoiceGrad: Nonparallel Voice Conversion with Latent Diffusion/Flow-Matching Models

Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Yuto Kondo

Main category: cs.SD

TL;DR: Improved VoiceGrad voice conversion using latent diffusion models and flow matching to enhance audio quality and speed up conversion process.

DetailsMotivation: Original VoiceGrad had challenges with audio quality and slow conversion speed compared to modern VC methods.

Method: Introduced latent diffusion models with reverse diffusion in autoencoder bottleneck, and proposed flow matching as alternative to diffusion for faster conversion.

Result: Experimental results showed enhanced speech quality and accelerated conversion compared to original VoiceGrad.

Conclusion: The improved VoiceGrad with latent diffusion and flow matching successfully addresses quality and speed limitations of the original approach.

Abstract: Previously, we introduced VoiceGrad, a nonparallel voice conversion (VC) technique enabling mel-spectrogram conversion from source to target speakers using a score-based diffusion model. The concept involves training a score network to predict the gradient of the log density of mel-spectrograms from various speakers. VC is executed by iteratively adjusting an input mel-spectrogram until it resembles the target speaker’s. However, challenges persist: audio quality needs improvement, and conversion is slower compared to modern VC methods designed to operate at very high speeds. To address these, we introduce latent diffusion models into VoiceGrad, proposing an improved version with reverse diffusion in the autoencoder bottleneck. Additionally, we propose using a flow matching model as an alternative to the diffusion model to further speed up the conversion process without compromising the conversion quality. Experimental results show enhanced speech quality and accelerated conversion compared to the original.

[200] Behind the Scenes: Mechanistic Interpretability of LoRA-adapted Whisper for Speech Emotion Recognition

Yujian Ma, Jinqiu Sang, Ruizhe Li

Main category: cs.SD

TL;DR: First systematic study of LoRA mechanisms in Whisper encoder for speech emotion recognition, revealing delayed specialization and forward-backward matrix dynamics.

DetailsMotivation: Understand the underlying mechanisms of Low-Rank Adaptation (LoRA) in speech tasks, as it remains poorly understood despite being a popular parameter-efficient fine-tuning method for large pre-trained speech models like Whisper.

Method: Used analytical tools including layer contribution probing, logit-lens inspection, and representational similarity analysis via SVD and CKA to study LoRA adaptation in Whisper encoder for speech emotion recognition.
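
One of the tools named above, centered kernel alignment, is compact enough to reproduce. The sketch below computes linear CKA between two activation matrices, for instance a Whisper encoder layer before and after LoRA adaptation; the random arrays are placeholders for real activations.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two representation matrices.
    X, Y: (n_samples, dim) activations of the same inputs at two layers/models."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

# Example: compare a layer's activations before vs. after LoRA adaptation
base = np.random.randn(200, 512)               # stand-in for base-model activations
adapted = base + 0.1 * np.random.randn(200, 512)
print(linear_cka(base, adapted))               # close to 1.0 -> representations barely moved
```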

Result: Revealed two key mechanisms: delayed specialization process (preserving general features in early layers before consolidating task-specific information) and forward alignment, backward differentiation dynamic between LoRA’s matrices.

Conclusion: Provides empirical insights and mechanistic understanding for designing efficient and interpretable adaptation strategies in large speech models, clarifying how LoRA reshapes encoder hierarchies.

Abstract: Large pre-trained speech models such as Whisper offer strong generalization but pose significant challenges for resource-efficient adaptation. Low-Rank Adaptation (LoRA) has become a popular parameter-efficient fine-tuning method, yet its underlying mechanisms in speech tasks remain poorly understood. In this work, we conduct the first systematic mechanistic interpretability study of LoRA within the Whisper encoder for speech emotion recognition (SER). Using a suite of analytical tools, including layer contribution probing, logit-lens inspection, and representational similarity via singular value decomposition (SVD) and centered kernel alignment (CKA), we reveal two key mechanisms: a delayed specialization process that preserves general features in early layers before consolidating task-specific information, and a forward alignment, backward differentiation dynamic between LoRA’s matrices. Our findings clarify how LoRA reshapes encoder hierarchies, providing both empirical insights and a deeper mechanistic understanding for designing efficient and interpretable adaptation strategies in large speech models.

[201] PianoVAM: A Multimodal Piano Performance Dataset

Yonghyun Kim, Junhyung Park, Joonhyung Bae, Kirak Kim, Taegyun Kwon, Alexander Lerch, Juhan Nam

Main category: cs.SD

TL;DR: PianoVAM is a comprehensive multimodal piano performance dataset containing videos, audio, MIDI, hand landmarks, fingering labels, and metadata recorded from amateur pianists’ daily practice sessions using a Disklavier piano.

DetailsMotivation: The multimodal nature of music performance has driven interest in data beyond audio in music information retrieval, necessitating comprehensive datasets that capture various performance aspects.

Method: Dataset recorded using Disklavier piano capturing audio and MIDI from amateur pianists during daily practice, with synchronized top-view videos. Hand landmarks extracted using pretrained hand pose estimation model, and fingering labels generated via semi-automated annotation algorithm.

Result: Created PianoVAM dataset with aligned multimodal data. Presented benchmarking results for both audio-only and audio-visual piano transcription using the dataset.

Conclusion: The dataset enables various applications in music information retrieval and provides a foundation for multimodal piano performance analysis, with potential for additional research applications beyond transcription tasks.

Abstract: The multimodal nature of music performance has driven increasing interest in data beyond the audio domain within the music information retrieval (MIR) community. This paper introduces PianoVAM, a comprehensive piano performance dataset that includes videos, audio, MIDI, hand landmarks, fingering labels, and rich metadata. The dataset was recorded using a Disklavier piano, capturing audio and MIDI from amateur pianists during their daily practice sessions, alongside synchronized top-view videos in realistic and varied performance conditions. Hand landmarks and fingering labels were extracted using a pretrained hand pose estimation model and a semi-automated fingering annotation algorithm. We discuss the challenges encountered during data collection and the alignment process across different modalities. Additionally, we describe our fingering annotation method based on hand landmarks extracted from videos. Finally, we present benchmarking results for both audio-only and audio-visual piano transcription using the PianoVAM dataset and discuss additional potential applications.

[202] Explainability of CNN Based Classification Models for Acoustic Signal

Zubair Faruqui, Mackenzie S. McIntire, Rahul Dubey, Jay McEntee

Main category: cs.SD

TL;DR: This paper applies multiple XAI techniques to interpret a CNN model’s predictions on bird vocalizations, achieving 94.8% accuracy and showing that combined XAI methods provide more complete insights than individual techniques alone.

DetailsMotivation: XAI has been underutilized in bioacoustics despite its importance for interpreting complex deep learning models. The researchers wanted to explore how different XAI techniques can help understand model decisions in analyzing bird vocalizations with geographic variations.

Method: Converted audio recordings into spectrogram images, trained a deep CNN for classification, and applied both model-agnostic (LIME, SHAP) and model-specific (DeepLIFT, Grad-CAM) XAI techniques to interpret the model’s predictions.
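
Of the model-specific techniques listed, Grad-CAM is small enough to sketch: the gradient of the class score weights the target convolutional layer's feature maps, which are then passed through a ReLU and upsampled over the spectrogram. The code below is a generic PyTorch implementation, not the authors' exact setup.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, spectrogram, class_idx):
    """Minimal Grad-CAM for a CNN spectrogram classifier.
    spectrogram: (1, C, H, W) input tensor; target_layer: a conv module in model."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        logits = model(spectrogram)                  # (1, n_classes)
        model.zero_grad()
        logits[0, class_idx].backward()              # gradient of the chosen class score
        fmap, grad = feats[0], grads[0]              # (1, C, H', W')
        weights = grad.mean(dim=(2, 3), keepdim=True)
        cam = F.relu((weights * fmap).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=spectrogram.shape[-2:], mode="bilinear",
                            align_corners=False)
        return (cam / (cam.max() + 1e-8)).squeeze().detach()  # heatmap in [0, 1]
    finally:
        h1.remove()
        h2.remove()
```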

Result: Achieved 94.8% classification accuracy and found that different XAI techniques produced complementary explanations that together provided more complete and interpretable insights into the model’s decision-making process.

Conclusion: Using a combination of XAI techniques improves trust and interpretability in bioacoustic analysis, and this approach has broader applicability across different domain-specific tasks beyond acoustics.

Abstract: Explainable Artificial Intelligence (XAI) has emerged as a critical tool for interpreting the predictions of complex deep learning models. While XAI has been increasingly applied in various domains within acoustics, its use in bioacoustics, which involves analyzing audio signals from living organisms, remains relatively underexplored. In this paper, we investigate the vocalizations of a bird species with strong geographic variation throughout its range in North America. Audio recordings were converted into spectrogram images and used to train a deep Convolutional Neural Network (CNN) for classification, achieving an accuracy of 94.8%. To interpret the model’s predictions, we applied both model-agnostic (LIME, SHAP) and model-specific (DeepLIFT, Grad-CAM) XAI techniques. These techniques produced different but complementary explanations, and when their explanations were considered together, they provided more complete and interpretable insights into the model’s decision-making. This work highlights the importance of using a combination of XAI techniques to improve trust and interpretability, not only in broader acoustic signal analysis but also in other domain-specific tasks.

[203] Neural-Enhanced Dynamic Range Compression Inversion: A Hybrid Approach for Restoring Audio Dynamics

Haoran Sun, Dominique Fourer, Hichem Maaref

Main category: cs.SD

TL;DR: Hybrid approach combining model-based DRC inversion with neural networks for robust parameter estimation and audio restoration, outperforming state-of-the-art methods.

DetailsMotivation: Existing DRC inversion methods either overlook key parameters or rely on precise parameter values that are challenging to estimate accurately, limiting their effectiveness in restoring original audio dynamics.

Method: Combines model-based DRC inversion with tailored neural network architectures (classification and regression) integrated into a framework to simultaneously estimate DRC parameters and reconstruct the original signal.

Result: Experimental evaluations on various music and speech datasets confirm the approach’s effectiveness and robustness, outperforming several state-of-the-art techniques.

Conclusion: The hybrid neural-network and model-based approach provides a robust solution for DRC inversion that addresses limitations of existing methods, enabling better audio restoration and dynamic range recovery.

Abstract: Dynamic Range Compression (DRC) is a widely used audio effect that adjusts signal dynamics for applications in music production, broadcasting, and speech processing. Inverting DRC is of broad importance for restoring the original dynamics, enabling remixing, and enhancing the overall audio quality. Existing DRC inversion methods either overlook key parameters or rely on precise parameter values, which can be challenging to estimate accurately. To address this limitation, we introduce a hybrid approach that combines model-based DRC inversion with neural networks to achieve robust DRC parameter estimation and audio restoration simultaneously. Our method uses tailored neural network architectures (classification and regression), which are then integrated into a model-based inversion framework to reconstruct the original signal. Experimental evaluations on various music and speech datasets confirm the effectiveness and robustness of our approach, outperforming several state-of-the-art techniques.

[204] QR-VC: Leveraging Quantization Residuals for Linear Disentanglement in Zero-Shot Voice Conversion

Youngjun Sim, Jinsung Yoon, Wooyeol Jeong, Young-Joo Suh

Main category: cs.SD

TL;DR: Novel voice conversion method that uses quantization residuals to preserve phonetic and prosodic details while removing speaker identity, achieving superior intelligibility and speaker similarity.

DetailsMotivation: Existing zero-shot voice conversion methods using self-supervised features with K-means quantization remove speaker identity but also eliminate fine-grained phonetic and prosodic variations, degrading speech quality.

Method: Proposes a Linear Disentangler module that fully utilizes quantization residuals by leveraging temporal speech properties, using only K-means quantization and linear projections for simple yet effective disentanglement without complex architectures.
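
A minimal sketch of the quantization-residual idea: K-means assigns each SSL frame to a centroid (the "content" stream), the residual keeps the fine detail that quantization would otherwise discard, and a linear projection stands in for the Linear Disentangler. The feature dimensions and the untrained projection are assumptions, not the paper's components.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
ssl_feats = rng.normal(size=(1000, 256))            # frame-level SSL features (stand-in)

# Quantize: each frame snaps to its nearest centroid (content); keep the residual.
kmeans = KMeans(n_clusters=64, n_init=10, random_state=0).fit(ssl_feats)
content = kmeans.cluster_centers_[kmeans.labels_]   # speaker-removed "content" stream
residual = ssl_feats - content                       # fine phonetic/prosodic detail

# A linear projection standing in for the learned Linear Disentangler.
W = rng.normal(size=(256, 256)) * 0.01
detail_stream = residual @ W                         # detail to re-inject at synthesis
print(content.shape, detail_stream.shape)
```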

Result: Outperforms existing methods in both subjective and objective metrics, achieving superior intelligibility, speaker similarity, and improved prosody preservation.

Conclusion: The approach demonstrates that effectively utilizing quantization residuals enables high-fidelity voice conversion with simple linear operations, highlighting the importance of preserving fine-grained speech details.

Abstract: Zero-shot voice conversion is a technique that alters the speaker identity of an input speech to match a target speaker using only a single reference utterance, without requiring additional training. Recent approaches extensively utilize self-supervised learning features with K-means quantization to extract high-quality content representations while removing speaker identity. However, this quantization process also eliminates fine-grained phonetic and prosodic variations, degrading intelligibility and prosody preservation. While prior works have primarily focused on quantized representations, quantization residuals remain underutilized and deserve further exploration. In this paper, we introduce a novel approach that fully utilizes quantization residuals by leveraging temporal properties of speech components. This facilitates the disentanglement of speaker identity and the recovery of phonetic and prosodic details lost during quantization. By applying only K-means quantization and linear projections, our method achieves simple yet effective disentanglement, without requiring complex architectures or explicit supervision. This allows for high-fidelity voice conversion trained solely with reconstruction losses. Experiments show that the proposed model outperforms existing methods across both subjective and objective metrics. It achieves superior intelligibility and speaker similarity, along with improved prosody preservation, highlighting the impact of our Linear Disentangler module.

cs.LG

[205] Revisiting Deepfake Detection: Chronological Continual Learning and the Limits of Generalization

Federico Fontana, Anxhelo Diko, Romeo Lanzino, Marco Raoul Marini, Bachir Kaddar, Gian Luca Foresti, Luigi Cinque

Main category: cs.LG

TL;DR: This paper reframes deepfake detection as a continual learning problem, proposing an efficient framework that adapts to new manipulation techniques while retaining knowledge of past generators, with extensive testing showing fast adaptation but limited future generalization.

DetailsMotivation: Address the challenge of frequent and expensive retraining required by non-continual learning methods for deepfake detection as new generation technologies rapidly evolve.

Method: Propose a continual learning framework that simulates real-world chronological evolution of deepfake technologies over 7 years, using lightweight visual backbones for real-time performance and introducing two novel metrics (C-AUC and FWT-AUC).
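
The two metrics can be sketched by analogy with standard continual-learning metrics, given a matrix of AUC scores in which entry (i, j) is the AUC on generator j after training up to generator i. The exact definitions in the paper may differ, so the code below is an assumption-labelled illustration only.

```python
import numpy as np

def continual_metrics(auc_matrix):
    """auc_matrix[i, j] = AUC on generator j after training up to generator i.
    C-AUC averages performance on already-seen generators; FWT-AUC averages
    performance on not-yet-seen ones. These definitions mirror standard
    continual-learning metrics and are an assumption, not the paper's exact ones."""
    R = np.asarray(auc_matrix)
    T = R.shape[0]
    c_auc = np.mean([R[i, j] for i in range(T) for j in range(i + 1)])
    fwt_auc = np.mean([R[i, j] for i in range(T - 1) for j in range(i + 1, T)])
    return c_auc, fwt_auc

R = np.array([[0.95, 0.52, 0.50],
              [0.93, 0.94, 0.51],    # adapts fast and retains the past...
              [0.92, 0.93, 0.95]])   # ...but unseen generators stay near-random
print(continual_metrics(R))
```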

Result: The framework achieves efficient adaptation (155x faster than full retraining) and robust retention of historical knowledge, but generalization to future generators without additional training remains near-random (FWT-AUC ≈ 0.5) due to unique generator characteristics.

Conclusion: Current approaches can efficiently adapt to new deepfake technologies but struggle with future generalization, leading to the proposal of the Non-Universal Deepfake Distribution Hypothesis that each generator leaves a unique imprint.

Abstract: The rapid evolution of deepfake generation technologies poses critical challenges for detection systems, as non-continual learning methods demand frequent and expensive retraining. We reframe deepfake detection (DFD) as a Continual Learning (CL) problem, proposing an efficient framework that incrementally adapts to emerging visual manipulation techniques while retaining knowledge of past generators. Our framework, unlike prior approaches that rely on unreal simulation sequences, simulates the real-world chronological evolution of deepfake technologies in extended periods across 7 years. Simultaneously, our framework builds upon lightweight visual backbones to allow for the real-time performance of DFD systems. Additionally, we contribute two novel metrics: Continual AUC (C-AUC) for historical performance and Forward Transfer AUC (FWT-AUC) for future generalization. Through extensive experimentation (over 600 simulations), we empirically demonstrate that while efficient adaptation (+155 times faster than full retraining) and robust retention of historical knowledge is possible, the generalization of current approaches to future generators without additional training remains near-random (FWT-AUC ≈ 0.5) due to the unique imprint characterizing each existing generator. Such observations are the foundation of our newly proposed Non-Universal Deepfake Distribution Hypothesis. Code will be released upon acceptance.

[206] How Far Are We from True Unlearnability?

Kai Ye, Liangcai Su, Chenxiong Qian

Main category: cs.LG

TL;DR: The paper investigates why existing unlearnable examples (UEs) fail to maintain cross-task unlearnability, particularly in multi-task scenarios like Taskonomy. It analyzes the optimization process and loss landscape differences between clean and poisoned models, proposes Sharpness-Aware Learnability (SAL) to quantify unlearnability, and introduces Unlearnable Distance (UD) to measure data unlearnability.

DetailsMotivation: Existing unlearnable methods generate examples that compromise training availability, but they unexpectedly fail to maintain unlearnability across multiple tasks, raising questions about the true effectiveness of current unlearnable techniques.

Method: The authors analyze model optimization differences between clean and poisoned models, examine loss landscapes, propose Sharpness-Aware Learnability (SAL) to quantify parameter unlearnability, and develop Unlearnable Distance (UD) to measure data unlearnability based on SAL distributions.

Result: The study finds that only part of critical parameter optimization paths show significant differences between clean and poisoned models, revealing a close relationship between loss landscape and unlearnability. The proposed UD metric enables benchmarking of existing unlearnable methods.

Conclusion: Current unlearnable methods have limitations in achieving true cross-task unlearnability. The proposed SAL and UD metrics provide new ways to quantify and measure unlearnability, helping to understand the capability boundaries of existing unlearnable techniques.

Abstract: High-quality data plays an indispensable role in the era of large models, but the use of unauthorized data for model training greatly damages the interests of data owners. To overcome this threat, several unlearnable methods have been proposed, which generate unlearnable examples (UEs) by compromising the training availability of data. Clearly, due to unknown training purposes and the powerful representation learning capabilities of existing models, these data are expected to be unlearnable for models across multiple tasks, i.e., they will not help improve the model’s performance. However, unexpectedly, we find that on the multi-task dataset Taskonomy, UEs still perform well in tasks such as semantic segmentation, failing to exhibit cross-task unlearnability. This phenomenon leads us to question: How far are we from attaining truly unlearnable examples? We attempt to answer this question from the perspective of model optimization. To this end, we observe the difference in the convergence process between clean and poisoned models using a simple model architecture. Subsequently, from the loss landscape we find that only a part of the critical parameter optimization paths show significant differences, implying a close relationship between the loss landscape and unlearnability. Consequently, we employ the loss landscape to explain the underlying reasons for UEs and propose Sharpness-Aware Learnability (SAL) to quantify the unlearnability of parameters based on this explanation. Furthermore, we propose an Unlearnable Distance (UD) to measure the unlearnability of data based on the SAL distribution of parameters in clean and poisoned models. Finally, we conduct benchmark tests on mainstream unlearnable methods using the proposed UD, aiming to promote community awareness of the capability boundaries of existing unlearnable methods.

[207] JEL: A Novel Model Linking Knowledge Graph entities to News Mentions

Michael Kishelev, Pranab Bhadani, Wanying Ding, Vinay Chaudhri

Main category: cs.LG

TL;DR: JEL is a computationally efficient end-to-end multi-neural network entity linking model that outperforms current state-of-the-art models for linking text mentions to knowledge graph entities.

DetailsMotivation: Entity linking is crucial for connecting unstructured text with knowledge graphs, enabling access to curated data. JPMorgan spends over $2M annually on external vendor costs for news analytics, with 25 teams seeking solutions. Efficient EL bridges news text with knowledge graphs to facilitate daily work.

Method: Novel computationally efficient end-to-end multi-neural network based entity linking model architecture.

Result: Beats current state-of-the-art entity linking models in performance.

Conclusion: JEL provides an effective solution for entity linking tasks, particularly valuable for financial news analytics applications where connecting text mentions to knowledge graph entities is essential.

Abstract: We present JEL, a novel computationally efficient end-to-end multi-neural network based entity linking model, which beats the current state-of-the-art model. Knowledge Graphs have emerged as a compelling abstraction for capturing critical relationships among the entities of interest and integrating data from multiple heterogeneous sources. A core problem in leveraging a knowledge graph is linking its entities to the mentions (e.g., people, company names) that are encountered in textual sources (e.g., news, blogs, etc.) correctly, since there are thousands of entities to consider for each mention. This task of linking mentions and entities is referred to as Entity Linking (EL). It is a fundamental task in natural language processing and is beneficial in various use cases, such as building a News Analytics platform. News Analytics, in JPMorgan, is an essential task that benefits multiple groups across the firm. According to a survey conducted by the Innovation Digital team, around 25 teams across the firm are actively looking for news analytics solutions, and more than $2 million is being spent annually on external vendor costs. Entity linking is critical for bridging unstructured news text with knowledge graphs, enabling users to access vast amounts of curated data in a knowledge graph and dramatically facilitating their daily work.

[208] Performance Assessment Strategies for Generative AI Applications in Healthcare

Victor Garcia, Mariia Sidulova, Aldo Badano

Main category: cs.LG

TL;DR: Current quantitative benchmarks for evaluating GenAI in healthcare have limitations including overfitting and lack of generalizability, prompting interest in human-expert and computational evaluation strategies.

DetailsMotivation: GenAI represents an emerging paradigm in medical AI, but current evaluation methods using quantitative benchmarks suffer from train-to-test overfitting and poor generalizability across different clinical tasks and data distributions.

Method: The paper discusses state-of-the-art methodologies including evaluation strategies that leverage human expertise and utilize cost-effective computational models as evaluators, moving beyond traditional quantitative benchmarks.

Result: The analysis reveals limitations in current benchmark approaches and highlights the need for more comprehensive evaluation frameworks that consider clinical context and real-world implementation variability.

Conclusion: Effective assessment of GenAI applications in healthcare requires a comprehensive understanding of clinical tasks and awareness of performance variability in actual clinical environments, with emerging interest in human-expert and computational evaluation approaches.

Abstract: Generative artificial intelligence (GenAI) represents an emerging paradigm within artificial intelligence, with applications throughout the medical enterprise. Assessing GenAI applications necessitates a comprehensive understanding of the clinical task and awareness of the variability in performance when implemented in actual clinical environments. Presently, a prevalent method for evaluating the performance of generative models relies on quantitative benchmarks. Such benchmarks have limitations and may suffer from train-to-the-test overfitting, optimizing performance for a specified test set at the cost of generalizability across other tasks and data distributions. Evaluation strategies leveraging human expertise and utilizing cost-effective computational models as evaluators are gaining interest. We discuss current state-of-the-art methodologies for assessing the performance of GenAI applications in healthcare and medical devices.

[209] Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing

Jeffrey Amico, Gabriel Passamani Andrade, John Donaghy, Ben Fielding, Tristin Forbus, Harry Grieve, Semih Kara, Jari Kolehmainen, Yihua Lou, Christopher Nies, Edward Phillip Flores Nuño, Diogo Ortega, Shikhar Rastogi, Austin Virts, Matthew J. Wright

Main category: cs.LG

TL;DR: SAPO is a decentralized RL algorithm for post-training language models that enables scalable learning across heterogeneous compute networks without central bottlenecks.

DetailsMotivation: Traditional RL post-training for LMs requires significant parallelization with high technical challenges and costs. Current approaches face bottlenecks in scaling due to latency, memory, and reliability issues.

Method: Swarm sAmpling Policy Optimization (SAPO) - a fully decentralized and asynchronous algorithm where nodes manage their own policies while sharing rollouts across a network. No assumptions about latency, model homogeneity, or hardware required.

Result: Achieved cumulative reward gains of up to 94% in controlled experiments. Successfully tested on a network with thousands of nodes running on diverse hardware and models during an open-source demo.

Conclusion: SAPO effectively addresses scaling challenges in RL post-training by enabling decentralized learning across heterogeneous networks, allowing knowledge propagation and bootstrapping while avoiding common bottlenecks.

Abstract: Post-training language models (LMs) with reinforcement learning (RL) can enhance their complex reasoning capabilities without supervised fine-tuning, as demonstrated by DeepSeek-R1-Zero. However, effectively utilizing RL for LMs requires significant parallelization to scale up inference, which introduces non-trivial technical challenges (e.g. latency, memory, and reliability) alongside ever-growing financial costs. We present Swarm sAmpling Policy Optimization (SAPO), a fully decentralized and asynchronous RL post-training algorithm. SAPO is designed for decentralized networks of heterogeneous compute nodes, where each node manages its own policy model(s) while “sharing” rollouts with others in the network; no explicit assumptions about latency, model homogeneity, or hardware are required and nodes can operate in silo if desired. As a result, the algorithm avoids common bottlenecks in scaling RL post-training while also allowing (and even encouraging) new possibilities. By sampling rollouts “shared” across the network, it enables “Aha moments” to propagate, thereby bootstrapping the learning process. In this paper we show SAPO achieved cumulative reward gains of up to 94% in controlled experiments. We also share insights from tests on a network with thousands of nodes contributed by Gensyn community members running the algorithm on diverse hardware and models during an open-source demo.

[210] Hammer and Anvil: A Principled Defense Against Backdoors in Federated Learning

Lucas Fenaux, Zheng Wang, Jacob Yan, Nathan Chung, Florian Kerschbaum

Main category: cs.LG

TL;DR: New adaptive backdoor attack breaks existing FL defenses with just 1-2 malicious clients, while proposed Hammer and Anvil defense with Krum+ successfully counters these attacks.

DetailsMotivation: Federated Learning is vulnerable to backdoor attacks from malicious clients, and existing defenses fail against adaptive attackers who can bypass current protection mechanisms.

Method: Developed a new adaptive adversary that outperforms existing attackers, then created Hammer and Anvil - a principled defense combining two orthogonal defense approaches to create robust protection.

Result: The new adaptive attack breaks state-of-the-art defenses with only 1-2 malicious clients out of 20, while the Krum+ defense successfully counters both the new adaptive adversary and existing attacks.
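
Krum+ builds on the classic Krum aggregation rule; a minimal sketch of vanilla Krum is shown below (select the client update with the smallest summed squared distance to its closest neighbours). The second, orthogonal defense that the paper combines with it is not reproduced here.

```python
import numpy as np

def krum(updates, n_malicious):
    """Standard Krum aggregation: pick the client update whose summed squared
    distance to its n - f - 2 closest neighbours is smallest (f = n_malicious)."""
    updates = np.asarray(updates)                     # (n_clients, n_params)
    n = len(updates)
    k = n - n_malicious - 2                           # neighbours to score against
    dists = np.linalg.norm(updates[:, None] - updates[None, :], axis=-1) ** 2
    scores = []
    for i in range(n):
        nearest = np.sort(np.delete(dists[i], i))[:k]  # drop self-distance, keep k closest
        scores.append(nearest.sum())
    return updates[int(np.argmin(scores))]

# 18 honest clients near the true update, 2 backdoored outliers
honest = np.random.normal(0.0, 0.1, size=(18, 100))
malicious = np.random.normal(5.0, 0.1, size=(2, 100))
agg = krum(np.vstack([honest, malicious]), n_malicious=2)
print(np.abs(agg).mean())   # stays close to the honest updates
```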

Conclusion: The proposed Hammer and Anvil framework provides effective defense against sophisticated backdoor attacks in Federated Learning, demonstrating the need for combined orthogonal defense strategies to combat adaptive adversaries.

Abstract: Federated Learning is a distributed learning technique in which multiple clients cooperate to train a machine learning model. Distributed settings facilitate backdoor attacks by malicious clients, who can embed malicious behaviors into the model during their participation in the training process. These malicious behaviors are activated during inference by a specific trigger. No defense against backdoor attacks has stood the test of time, especially against adaptive attackers, a powerful but not fully explored category of attackers. In this work, we first devise a new adaptive adversary that surpasses existing adversaries in capabilities, yielding attacks that only require one or two malicious clients out of 20 to break existing state-of-the-art defenses. Then, we present Hammer and Anvil, a principled defense approach that combines two defenses orthogonal in their underlying principle to produce a combined defense that, given the right set of parameters, must succeed against any attack. We show that our best combined defense, Krum+, is successful against our new adaptive adversary and state-of-the-art attacks.

[211] Domain Knowledge is Power: Leveraging Physiological Priors for Self Supervised Representation Learning in Electrocardiography

Nooshin Maghsoodi, Sarah Nassar, Paul F R Wilson, Minh Nguyen Nhat To, Sophia Mannina, Shamel Addas, Stephanie Sibley, David Maslove, Purang Abolmaesumi, Parvin Mousavi

Main category: cs.LG

TL;DR: PhysioCLR is a physiology-aware contrastive learning framework that incorporates ECG domain knowledge to improve arrhythmia classification by learning clinically meaningful representations from unlabeled ECG data.

DetailsMotivation: AI-based ECG analysis is limited by scarce labeled data. Self-supervised learning can leverage large unlabeled datasets, but existing methods lack integration of ECG physiological knowledge, hindering clinical relevance and generalizability.

Method: PhysioCLR integrates ECG physiological similarity cues into contrastive learning, uses ECG-specific augmentations that preserve category information, and employs a hybrid loss function to refine learned representations during pretraining.
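
To illustrate how physiological similarity cues can enter a contrastive objective, the sketch below weights an InfoNCE-style loss by a precomputed similarity matrix, so clinically similar ECGs are pulled together more strongly. This is a generic stand-in, not PhysioCLR's actual hybrid loss.

```python
import torch
import torch.nn.functional as F

def physio_weighted_contrastive(z, physio_sim, temperature=0.1):
    """Similarity-weighted contrastive loss (illustrative only).
    z: (batch, dim) embeddings; physio_sim: (batch, batch) similarities in [0, 1]."""
    z = F.normalize(z, dim=1)
    logits = z @ z.t() / temperature                         # pairwise cosine / temperature
    off_diag = ~torch.eye(len(z), dtype=torch.bool, device=z.device)
    log_prob = logits - torch.logsumexp(
        logits.masked_fill(~off_diag, float("-inf")), dim=1, keepdim=True)
    weights = physio_sim * off_diag.float()                  # zero out self-pairs
    weights = weights / weights.sum(dim=1, keepdim=True).clamp(min=1e-8)
    return -(weights * log_prob).sum(dim=1).mean()

# toy usage: 8 ECG embeddings and a symmetric physiological-similarity matrix
z = torch.randn(8, 128)
sim = torch.rand(8, 8)
loss = physio_weighted_contrastive(z, (sim + sim.t()) / 2)
```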

Result: Evaluated on Chapman, Georgia, and private ICU datasets, PhysioCLR achieved 12% higher mean AUROC compared to the strongest baseline, demonstrating robust cross-dataset generalization for multilabel and binary ECG classification tasks.

Conclusion: Embedding physiological knowledge into contrastive learning enables learning of clinically meaningful and transferable ECG features, showing promise for more effective and label-efficient ECG diagnostics through physiology-informed SSL.

Abstract: Objective: Electrocardiograms (ECGs) play a crucial role in diagnosing heart conditions; however, the effectiveness of artificial intelligence (AI)-based ECG analysis is often hindered by the limited availability of labeled data. Self-supervised learning (SSL) can address this by leveraging large-scale unlabeled data. We introduce PhysioCLR (Physiology-aware Contrastive Learning Representation for ECG), a physiology-aware contrastive learning framework that incorporates domain-specific priors to enhance the generalizability and clinical relevance of ECG-based arrhythmia classification. Methods: During pretraining, PhysioCLR learns to bring together embeddings of samples that share similar clinically relevant features while pushing apart those that are dissimilar. Unlike existing methods, our method integrates ECG physiological similarity cues into contrastive learning, promoting the learning of clinically meaningful representations. Additionally, we introduce ECG-specific augmentations that preserve the ECG category post augmentation and propose a hybrid loss function to further refine the quality of learned representations. Results: We evaluate PhysioCLR on two public ECG datasets, Chapman and Georgia, for multilabel ECG diagnoses, as well as a private ICU dataset labeled for binary classification. Across the Chapman, Georgia, and private cohorts, PhysioCLR boosts the mean AUROC by 12% relative to the strongest baseline, underscoring its robust cross-dataset generalization. Conclusion: By embedding physiological knowledge into contrastive learning, PhysioCLR enables the model to learn clinically meaningful and transferable ECG features. Significance: PhysioCLR demonstrates the potential of physiology-informed SSL to offer a promising path toward more effective and label-efficient ECG diagnostics.

[212] Optimization Methods and Software for Federated Learning

Konstantin Burlachenko

Main category: cs.LG

TL;DR: This thesis addresses five key challenges in Federated Learning (FL) - data/device heterogeneity, communication issues, and privacy concerns - by proposing novel approaches that bridge theoretical advancements with practical implementations.

DetailsMotivation: FL operates in decentralized, less controlled environments compared to data centers, creating unique challenges that require solutions beyond traditional machine learning approaches. The inclusion of FL in the US National AI Research and Development Strategic Plan further highlights its importance.

Method: The research identifies five core FL challenges and develops novel approaches to address them, focusing on translating theoretical methods into practical implementations while also adapting practical insights back into algorithm design.

Result: The work advances FL algorithms and systems by creating a bidirectional bridge between theory and practice, enabling more efficient real-world implementations and uncovering new dimensions of algorithms through practical perspectives.

Conclusion: This thesis provides a comprehensive guide for researchers to navigate the complexities of FL implementation, demonstrating that practical considerations can inform and enrich theoretical algorithm design, ultimately advancing the field of federated learning.

Abstract: Federated Learning (FL) is a novel, multidisciplinary Machine Learning paradigm where multiple clients, such as mobile devices, collaborate to solve machine learning problems. Initially introduced in Konečný et al. (2016a,b); McMahan et al. (2017), FL has gained further attention through its inclusion in the National AI Research and Development Strategic Plan (2023 Update) of the United States (Science and on Artificial Intelligence, 2023). The FL training process is inherently decentralized and often takes place in less controlled settings compared to data centers, posing unique challenges distinct from those in fully controlled environments. In this thesis, we identify five key challenges in Federated Learning and propose novel approaches to address them. These challenges arise from the heterogeneity of data and devices, communication issues, and privacy concerns for clients in FL training. Moreover, even well-established theoretical advances in FL require diverse forms of practical implementation to enhance their real-world applicability. Our contributions advance FL algorithms and systems, bridging theoretical advancements and practical implementations. More broadly, our work serves as a guide for researchers navigating the complexities of translating theoretical methods into efficient real-world implementations and software. Additionally, it offers insights into the reverse process of adapting practical implementation aspects back into theoretical algorithm design. This reverse process is particularly intriguing, as the practical perspective compels us to examine the underlying mechanics and flexibilities of algorithms more deeply, often uncovering new dimensions of the algorithms under study.

[213] In-Context Learning Enhanced Credibility Transformer

Kishan Padayachy, Ronald Richman, Salvatore Scognamiglio, Mario V. Wüthrich

Main category: cs.LG

TL;DR: The paper extends the Credibility Transformer with in-context learning using similar instances to enhance CLS token representations and improve predictive accuracy, particularly for handling unseen categorical features.

DetailsMotivation: To improve model learning and predictive performance by incorporating additional context from similar instances, enabling better generalization to new data with previously unseen feature levels.

Method: Augments the Credibility Transformer architecture with an in-context learning mechanism that uses a context batch of similar instances to enhance CLS token representations through additional information and fine-tuning.
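
The context-batch step can be approximated with a simple nearest-neighbour retrieval over encoded instances, as sketched below; the retrieval metric, context size, and random features are placeholders rather than the paper's choices.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Sketch of the in-context step: for each new instance, retrieve a context batch
# of the most similar training instances and hand both to the transformer.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(5000, 32))          # encoded tabular features (stand-in)
new_feats = rng.normal(size=(10, 32))              # instances to be priced/predicted

nn_index = NearestNeighbors(n_neighbors=16).fit(train_feats)
_, idx = nn_index.kneighbors(new_feats)            # (10, 16) indices of similar risks
context_batches = train_feats[idx]                 # fed alongside each new instance
print(context_batches.shape)
```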

Result: Empirical verification shows that in-context learning enhances predictive accuracy by adapting to similar risk patterns and allows generalization to new instances with previously unseen categorical feature levels.

Conclusion: The proposed in-context learning paradigm successfully extends the Credibility Transformer, improving both predictive performance and generalization capabilities to handle novel data scenarios.

Abstract: The starting point of our network architecture is the Credibility Transformer which extends the classical Transformer architecture by a credibility mechanism to improve model learning and predictive performance. This Credibility Transformer learns credibilitized CLS tokens that serve as learned representations of the original input features. In this paper we present a new paradigm that augments this architecture by an in-context learning mechanism, i.e., we increase the information set by a context batch consisting of similar instances. This allows the model to enhance the CLS token representations of the instances by additional in-context information and fine-tuning. We empirically verify that this in-context learning enhances predictive accuracy by adapting to similar risk patterns. Moreover, this in-context learning also allows the model to generalize to new instances which, e.g., have feature levels in the categorical covariates that have not been present when the model was trained – for a relevant example, think of a new vehicle model which has just been developed by a car manufacturer.

[214] torchmil: A PyTorch-based library for deep Multiple Instance Learning

Francisco M. Castro-Macías, Francisco J. Sáez-Maldonado, Pablo Morales-Álvarez, Rafael Molina

Main category: cs.LG

TL;DR: torchmil is an open-source PyTorch library that provides standardized tools for Multiple Instance Learning (MIL) to address reproducibility and accessibility issues in the field.

DetailsMotivation: The field of deep MIL methods lacks standardized tools for model development, evaluation, and comparison, which hinders reproducibility and accessibility despite growing interest.

Method: Developed torchmil, an open-source Python library built on PyTorch that offers a unified, modular, and extensible framework with basic building blocks for MIL models, standardized data format, benchmark datasets, and comprehensive documentation.
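
As a flavour of the kind of building block such a library standardizes, here is a generic attention-based MIL pooling module in plain PyTorch (in the style of Ilse et al., 2018). It is illustrative only and does not show torchmil's actual API.

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Generic attention-based MIL pooling: score each instance, aggregate the
    bag with the attention weights, then classify the bag-level embedding."""
    def __init__(self, in_dim=512, hidden=128, n_classes=2):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, bag):
        # bag: (n_instances, in_dim), e.g., patch embeddings of one whole-slide image
        a = torch.softmax(self.attn(bag), dim=0)     # (n_instances, 1) attention weights
        z = (a * bag).sum(dim=0)                     # bag-level embedding
        return self.classifier(z), a.squeeze(-1)     # bag logits + instance weights

model = AttentionMIL()
logits, weights = model(torch.randn(50, 512))
```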

Result: Created a complete library with standardized tools, curated benchmark datasets, models, and comprehensive documentation/tutorials to support both practitioners and researchers.

Conclusion: torchmil aims to accelerate progress in MIL and lower the entry barrier for new users by providing standardized, accessible tools for the community.

Abstract: Multiple Instance Learning (MIL) is a powerful framework for weakly supervised learning, particularly useful when fine-grained annotations are unavailable. Despite growing interest in deep MIL methods, the field lacks standardized tools for model development, evaluation, and comparison, which hinders reproducibility and accessibility. To address this, we present torchmil, an open-source Python library built on PyTorch. torchmil offers a unified, modular, and extensible framework, featuring basic building blocks for MIL models, a standardized data format, and a curated collection of benchmark datasets and models. The library includes comprehensive documentation and tutorials to support both practitioners and researchers. torchmil aims to accelerate progress in MIL and lower the entry barrier for new users. Available at https://torchmil.readthedocs.io.

[215] From Limited Data to Rare-event Prediction: LLM-powered Feature Engineering and Multi-model Learning in Venture Capital

Mihir Kumar, Aaron Ontoyin Yin, Zakari Salifu, Kelvin Amoaba, Afriyie Kwesi Samuel, Fuat Alican, Yigit Ihlamur

Main category: cs.LG

TL;DR: Framework combining LLMs with a multi-model ML architecture for predicting rare, high-impact outcomes like startup success in VC, achieving precision 9.8X-11.1X that of a random-classifier baseline.

DetailsMotivation: To address the challenge of predicting rare events with limited and noisy early-stage data in Venture Capital, where investors need both predictive accuracy and interpretability for reliable decision-making.

Method: Integrates LLM-powered feature engineering to extract signals from unstructured data, then uses layered ensemble of models (XGBoost, Random Forest, Linear Regression) to produce continuous success likelihood estimates that are thresholded for binary rare-event prediction.
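
A minimal sketch of the layered-ensemble-plus-threshold idea is shown below, with scikit-learn's GradientBoostingRegressor standing in for XGBoost and random arrays standing in for LLM-derived features; in practice the meta-model would be fit on held-out predictions rather than in-sample ones.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))                       # placeholder LLM-derived features
y = (rng.random(2000) < 0.02).astype(float)           # rare positive outcomes (~2%)

# Base layer: regressors produce continuous success-likelihood estimates.
base_models = [GradientBoostingRegressor(random_state=0),          # XGBoost stand-in
               RandomForestRegressor(n_estimators=200, random_state=0)]
base_preds = np.column_stack([m.fit(X, y).predict(X) for m in base_models])

# Meta layer: linear blending, then threshold into a binary rare-event call.
meta = LinearRegression().fit(base_preds, y)
scores = meta.predict(base_preds)                      # continuous likelihood
threshold = np.quantile(scores, 0.98)                  # flag only the top 2%
predicted_success = scores >= threshold
```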

Result: Strong performance, with precision between 9.8X and 11.1X that of a random-classifier baseline across three independent test subsets. Feature sensitivity analysis shows the startup's category list (15.6% of predictive influence) and the number of founders as key predictors.

Conclusion: The framework successfully combines black-box model strength with interpretability for rare-event prediction, demonstrating practical value in VC startup evaluation with clear feature importance insights.

Abstract: This paper presents a framework for predicting rare, high-impact outcomes by integrating large language models (LLMs) with a multi-model machine learning (ML) architecture. The approach combines the predictive strength of black-box models with the interpretability required for reliable decision-making. We use LLM-powered feature engineering to extract and synthesize complex signals from unstructured data, which are then processed within a layered ensemble of models including XGBoost, Random Forest, and Linear Regression. The ensemble first produces a continuous estimate of success likelihood, which is then thresholded to produce a binary rare-event prediction. We apply this framework to the domain of Venture Capital (VC), where investors must evaluate startups with limited and noisy early-stage data. The empirical results show strong performance: the model achieves precision between 9.8X and 11.1X the random classifier baseline in three independent test subsets. Feature sensitivity analysis further reveals interpretable success drivers: the startup’s category list accounts for 15.6% of predictive influence, followed by the number of founders, while education level and domain expertise contribute smaller yet consistent effects.

[216] MMM-fair: An Interactive Toolkit for Exploring and Operationalizing Multi-Fairness Trade-offs

Swati Swati, Arjun Roy, Emmanouil Panagiotou, Eirini Ntoutsi

Main category: cs.LG

TL;DR: mmm-fair is an open-source toolkit that addresses multi-dimensional fairness in AI classification through boosting-based ensemble approaches, enabling flexible multi-objective optimization with a no-code interface and LLM-powered explanations.

DetailsMotivation: Growing regulatory and societal demands for equitable AI, combined with limited support in existing toolkits for exploring multi-dimensional fairness and trade-offs, particularly intersectional biases that are often missed by current methods.

Method: Leverages boosting-based ensemble approaches that dynamically optimize model weights to jointly minimize classification errors and diverse fairness violations, enabling flexible multi-objective optimization with custom fairness constraint definition.

Result: The system empowers users to deploy context-specific fair models while reliably uncovering intersectional biases, offering interactive Pareto exploration for model selection and deployment-ready models.

Conclusion: mmm-fair uniquely combines comprehensive multi-attribute fairness capabilities, multi-objective optimization, no-code interface, and advanced features in a single open-source toolkit, addressing gaps in existing fairness tools.

Abstract: Fairness-aware classification requires balancing performance and fairness, often intensified by intersectional biases. Conflicting fairness definitions further complicate the task, making it difficult to identify universally fair solutions. Despite growing regulatory and societal demands for equitable AI, popular toolkits offer limited support for exploring multi-dimensional fairness and related trade-offs. To address this, we present mmm-fair, an open-source toolkit leveraging boosting-based ensemble approaches that dynamically optimizes model weights to jointly minimize classification errors and diverse fairness violations, enabling flexible multi-objective optimization. The system empowers users to deploy models that align with their context-specific needs while reliably uncovering intersectional biases often missed by state-of-the-art methods. In a nutshell, mmm-fair uniquely combines in-depth multi-attribute fairness, multi-objective optimization, a no-code, chat-based interface, LLM-powered explanations, interactive Pareto exploration for model selection, custom fairness constraint definition, and deployment-ready models in a single open-source toolkit, a combination rarely found in existing fairness tools. Demo walkthrough available at: https://youtu.be/_rcpjlXFqkw.

[217] Machine Learning with Multitype Protected Attributes: Intersectional Fairness through Regularisation

Ho Ming Lee, Katrien Antonio, Benjamin Avanzi, Lorenzo Marchi, Rui Zhou

Main category: cs.LG

TL;DR: A distance covariance regularization framework for achieving fairness in regression and classification tasks with multiple protected attributes, addressing fairness gerrymandering and handling both linear and nonlinear dependencies.

DetailsMotivation: Existing fairness methods focus on binary classification and single protected attributes, but real-world applications require fairness in regression tasks and multiple simultaneous protected attributes (e.g., gender, ethnicity, age) while addressing intersectional subgroup disparities.

Method: Proposed distance covariance regularization to mitigate association between predictions and protected attributes. Extended framework with two multivariate dependence measures: joint distance covariance (JdCov) and novel concatenated distance covariance (CCdCov) to handle multiple attributes and prevent fairness gerrymandering.
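
The core penalty is straightforward to sketch: compute the sample distance covariance between model predictions and the (possibly concatenated) protected attributes, and add it to the task loss. The code below is a generic implementation of that statistic, not the paper's JdCov/CCdCov code.

```python
import torch

def distance_covariance(x, y):
    """Sample distance covariance between predictions x (n,) or (n, p) and
    protected attributes y (n, q); zero in the population limit iff independent.
    Differentiable, so it can serve as a fairness penalty during training."""
    x = x.reshape(len(x), -1).float()
    y = y.reshape(len(y), -1).float()
    a = torch.cdist(x, x)                               # pairwise distance matrices
    b = torch.cdist(y, y)
    A = a - a.mean(0, keepdim=True) - a.mean(1, keepdim=True) + a.mean()
    B = b - b.mean(0, keepdim=True) - b.mean(1, keepdim=True) + b.mean()
    return (A * B).mean().clamp(min=0).sqrt()

# Usage idea: total_loss = task_loss + lam * distance_covariance(preds, protected_attrs),
# where protected_attrs stacks e.g. gender, ethnicity, and age columns.
```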

Result: Framework effectively addresses fairness gerrymandering in both regression and classification tasks with various protected attribute types. Applied successfully to COMPAS recidivism dataset and large motor insurance claims dataset.

Conclusion: The distance covariance regularization framework provides a comprehensive solution for achieving demographic parity fairness across multiple protected attributes in both regression and classification settings, effectively handling intersectional subgroup disparities and various attribute types.

Abstract: Ensuring equitable treatment (fairness) across protected attributes (such as gender or ethnicity) is a critical issue in machine learning. Most existing literature focuses on binary classification, but achieving fairness in regression tasks-such as insurance pricing or hiring score assessments-is equally important. Moreover, anti-discrimination laws also apply to continuous attributes, such as age, for which many existing methods are not applicable. In practice, multiple protected attributes can exist simultaneously; however, methods targeting fairness across several attributes often overlook so-called “fairness gerrymandering”, thereby ignoring disparities among intersectional subgroups (e.g., African-American women or Hispanic men). In this paper, we propose a distance covariance regularisation framework that mitigates the association between model predictions and protected attributes, in line with the fairness definition of demographic parity, and that captures both linear and nonlinear dependencies. To enhance applicability in the presence of multiple protected attributes, we extend our framework by incorporating two multivariate dependence measures based on distance covariance: the previously proposed joint distance covariance (JdCov) and our novel concatenated distance covariance (CCdCov), which effectively address fairness gerrymandering in both regression and classification tasks involving protected attributes of various types. We discuss and illustrate how to calibrate regularisation strength, including a method based on Jensen-Shannon divergence, which quantifies dissimilarities in prediction distributions across groups. We apply our framework to the COMPAS recidivism dataset and a large motor insurance claims dataset.

[218] MARLINE: Multi-Source Mapping Transfer Learning for Non-Stationary Environments

Honghui Du, Leandro Minku, Huiyu Zhou

Main category: cs.LG

TL;DR: MARLINE is a novel approach that enables knowledge transfer from multiple data sources in non-stationary environments even when source and target concepts don’t match, using projection and ensemble methods to improve prediction accuracy.

DetailsMotivation: Existing approaches assume at least one source model represents a concept similar to the target, which may not hold in real-world scenarios. Concept drift in online learning impacts predictive performance of data stream mining systems.

Method: Projects target concept to the space of each source concept, enabling multiple source sub-classifiers to contribute towards prediction as part of an ensemble.

Result: Experiments on synthetic and real-world datasets show MARLINE was more accurate than several state-of-the-art data stream learning approaches.

Conclusion: MARLINE effectively tackles concept drift by leveraging knowledge from multiple sources even when concepts don’t match, demonstrating superior performance over existing methods.

Abstract: Concept drift is a major problem in online learning due to its impact on the predictive performance of data stream mining systems. Recent studies have started exploring data streams from different sources as a strategy to tackle concept drift in a given target domain. These approaches make the assumption that at least one of the source models represents a concept similar to the target concept, which may not hold in many real-world scenarios. In this paper, we propose a novel approach called Multi-source mApping with tRansfer LearnIng for Non-stationary Environments (MARLINE). MARLINE can benefit from knowledge from multiple data sources in non-stationary environments even when source and target concepts do not match. This is achieved by projecting the target concept to the space of each source concept, enabling multiple source sub-classifiers to contribute towards the prediction of the target concept as part of an ensemble. Experiments on several synthetic and real-world datasets show that MARLINE was more accurate than several state-of-the-art data stream learning approaches.

[219] The Domain Mixed Unit: A New Neural Arithmetic Layer

Paul Curry

Main category: cs.LG

TL;DR: The Domain Mixed Unit (DMU) is a novel neural arithmetic unit that learns to mix log-space and linear-space representations using a single parameter gate, achieving state-of-the-art performance on arithmetic generalization tasks.

DetailsMotivation: To create a neural arithmetic unit that can effectively generalize arithmetic operations by combining different mathematical representations, addressing limitations of existing units in handling operations like multiplication and division.

Method: The DMU uses a single parameter gate to mix between log-space and linear-space representations, with specialized initializations for addition/multiplication and subtraction/division operations.

Result: Achieved state-of-the-art performance on the NALM Benchmark with the highest percentage solved over all seeds for multiplication and division tasks.

Conclusion: The DMU successfully demonstrates improved generalization capabilities for arithmetic operations and will be contributed to the open-source NALM benchmark community.

Abstract: The Domain Mixed Unit (DMU) is a new neural arithmetic unit that learns a single parameter gate that mixes between log-space and linear-space representations while performing either addition (DMU add) or subtraction (DMU sub). Two initializations are proposed for the DMU: one covering addition and multiplication, and another covering subtraction and division. The DMU achieves state-of-the-art performance on the NALM Benchmark, a dataset designed to test the ability of neural arithmetic units to generalize arithmetic operations, specifically performing with the highest percentage solved over all seeds on multiplication and division. The DMU will be submitted as a pull request to the open-source NALM benchmark, and its code is available on GitHub at https://github.com/marict?tab=repositories
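The summary only says that a single gate mixes log-space and linear-space computation, so the snippet below is one plausible reading rather than the authors' implementation: the linear branch adds or subtracts the inputs, the log branch adds or subtracts their log-magnitudes (i.e., multiplies or divides), and a sigmoid gate interpolates between the two domains.

```python
import numpy as np

def dmu(x1, x2, gate_logit, subtract=False, eps=1e-8):
    """One plausible Domain Mixed Unit forward pass (illustrative assumption).

    linear branch : x1 + x2                      (or x1 - x2)
    log branch    : exp(log|x1| +/- log|x2|)  -> |x1| * |x2|  (or |x1| / |x2|)
    A single sigmoid gate mixes the two domains.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logit))          # scalar gate in (0, 1)
    sign = -1.0 if subtract else 1.0
    linear = x1 + sign * x2
    log_space = np.exp(np.log(np.abs(x1) + eps) + sign * np.log(np.abs(x2) + eps))
    return g * linear + (1.0 - g) * log_space

# gate_logit >> 0 recovers addition/subtraction; gate_logit << 0 recovers
# multiplication/division of magnitudes, matching the two proposed initializations.
print(dmu(np.array([3.0]), np.array([4.0]), gate_logit=8.0))    # ~7  (addition)
print(dmu(np.array([3.0]), np.array([4.0]), gate_logit=-8.0))   # ~12 (multiplication)
```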

[220] Multi-Label Transfer Learning in Non-Stationary Data Streams

Honghui Du, Leandro Minku, Aonghus Lawlor, Huiyu Zhou

Main category: cs.LG

TL;DR: Two novel transfer learning methods for multi-label data streams that leverage knowledge transfer between labels to improve adaptation in non-stationary environments.

DetailsMotivation: Label concepts in multi-label data streams often experience drift, and transferring knowledge between related labels can accelerate adaptation, but research on multi-label transfer learning for data streams remains limited.

Method: Proposed two methods: BR-MARLENE (leverages knowledge from different labels in both source and target streams) and BRPW-MARLENE (explicitly models and transfers pairwise label dependencies to enhance learning performance).

Result: Comprehensive experiments show both methods outperform state-of-the-art multi-label stream approaches in non-stationary environments.

Conclusion: The methods demonstrate effectiveness of inter-label knowledge transfer for improved predictive performance in multi-label data streams with concept drift.

Abstract: Label concepts in multi-label data streams often experience drift in non-stationary environments, either independently or in relation to other labels. Transferring knowledge between related labels can accelerate adaptation, yet research on multi-label transfer learning for data streams remains limited. To address this, we propose two novel transfer learning methods: BR-MARLENE leverages knowledge from different labels in both source and target streams for multi-label classification; BRPW-MARLENE builds on this by explicitly modelling and transferring pairwise label dependencies to enhance learning performance. Comprehensive experiments show that both methods outperform state-of-the-art multi-label stream approaches in non-stationary environments, demonstrating the effectiveness of inter-label knowledge transfer for improved predictive performance.

[221] Selective Induction Heads: How Transformers Select Causal Structures In Context

Francesco D’Angelo, Francesco Croce, Nicolas Flammarion

Main category: cs.LG

TL;DR: Transformers can dynamically handle varying causal structures through Selective Induction Heads, enabling them to identify correct context-dependent token relationships instead of relying on fixed Markov chains.

DetailsMotivation: Existing approaches use fixed Markov chains to study induction heads, but this fails to capture the dynamic, context-dependent causal structures found in natural languages where token relationships change with context.

Method: Proposed a framework using interleaved Markov chains with different lags while keeping transition probabilities fixed. Constructed a 3-layer transformer to implement Selective Induction Heads that can identify the correct causal structure in-context.

Result: Transformers learn to predict next tokens by identifying the correct lag and copying corresponding tokens from the past. The mechanism asymptotically converges to the maximum likelihood solution.

Conclusion: The work advances understanding of how transformers select causal structures, providing new insights into their functioning and interpretability through the discovery of Selective Induction Heads.

Abstract: Transformers have exhibited exceptional capabilities in sequence modeling tasks, leveraging self-attention and in-context learning. Critical to this success are induction heads, attention circuits that enable copying tokens based on their previous occurrences. In this work, we introduce a novel framework that showcases transformers’ ability to dynamically handle causal structures. Existing works rely on Markov Chains to study the formation of induction heads, revealing how transformers capture causal dependencies and learn transition probabilities in-context. However, they rely on a fixed causal structure that fails to capture the complexity of natural languages, where the relationship between tokens dynamically changes with context. To this end, our framework varies the causal structure through interleaved Markov chains with different lags while keeping the transition probabilities fixed. This setting unveils the formation of Selective Induction Heads, a new circuit that endows transformers with the ability to select the correct causal structure in-context. We empirically demonstrate that transformers learn this mechanism to predict the next token by identifying the correct lag and copying the corresponding token from the past. We provide a detailed construction of a 3-layer transformer to implement the selective induction head, and a theoretical analysis proving that this mechanism asymptotically converges to the maximum likelihood solution. Our findings advance the understanding of how transformers select causal structures, providing new insights into their functioning and interpretability.
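To make the data-generating setup concrete, the toy generator below produces sequences whose next token depends on the token `lag` steps back, with the lag varying across sequences while the transition matrix stays fixed; vocabulary size and the transition matrix are illustrative choices, not the paper's exact configuration.

```python
import numpy as np

def lagged_markov_sequence(length, lag, P, rng):
    """Sequence where x[t] ~ P[x[t - lag]]; the first `lag` tokens are uniform."""
    V = P.shape[0]
    x = list(rng.integers(V, size=lag))
    for t in range(lag, length):
        x.append(rng.choice(V, p=P[x[t - lag]]))
    return np.array(x)

rng = np.random.default_rng(0)
V = 4
P = rng.dirichlet(np.ones(V), size=V)        # fixed transition probabilities
for lag in (1, 2, 3):                        # the causal structure varies per sequence
    seq = lagged_markov_sequence(32, lag, P, rng)
    print(lag, seq[:12])
# A selective induction head must infer `lag` from context and predict by copying
# information from position t - lag.
```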

[222] ArtifactGen: Benchmarking WGAN-GP vs Diffusion for Label-Aware EEG Artifact Synthesis

Hritik Arasu, Faisal R Jahangiri

Main category: cs.LG

TL;DR: This paper compares two generative models (WGAN-GP and diffusion model) for synthesizing realistic EEG artifact segments to address the high cost of manual labeling, finding that WGAN-GP achieves better spectral alignment but both models show weak class-conditional recovery.

DetailsMotivation: EEG artifacts are costly to label at scale but confound automated analysis, so the researchers want to explore whether modern generative models can synthesize realistic, label-aware artifact segments for augmentation and stress-testing.

Method: Used TUH EEG Artifact corpus with subject-wise splits and fixed-length windows. Compared conditional WGAN-GP with projection discriminator to 1D denoising diffusion model with classifier-free guidance. Evaluated along fidelity (spectral metrics, covariance, autocorrelation), specificity (class-conditional recovery), and utility (augmentation effects).

Result: WGAN-GP achieved closer spectral alignment and lower MMD to real data. Both models exhibited weak class-conditional recovery, limiting immediate augmentation gains. The researchers released a reproducible pipeline for future work.

Conclusion: While WGAN-GP shows better performance in spectral metrics, both generative models need stronger conditioning and better coverage for effective EEG artifact synthesis. The released pipeline establishes a baseline and identifies actionable failure modes for future research.

Abstract: Artifacts in electroencephalography (EEG) – muscle, eye movement, electrode, chewing, and shiver – confound automated analysis yet are costly to label at scale. We study whether modern generative models can synthesize realistic, label-aware artifact segments suitable for augmentation and stress-testing. Using the TUH EEG Artifact (TUAR) corpus, we curate subject-wise splits and fixed-length multi-channel windows (e.g., 250 samples) with preprocessing tailored to each model (per-window min-max for adversarial training; per-recording/channel $z$-score for diffusion). We compare a conditional WGAN-GP with a projection discriminator to a 1D denoising diffusion model with classifier-free guidance, and evaluate along three axes: (i) fidelity via Welch band-power deltas ($\Delta\delta,\ \Delta\theta,\ \Delta\alpha,\ \Delta\beta$), channel-covariance Frobenius distance, autocorrelation $L_2$, and distributional metrics (MMD/PRD); (ii) specificity via class-conditional recovery with lightweight $k$NN/classifiers; and (iii) utility via augmentation effects on artifact recognition. In our setting, WGAN-GP achieves closer spectral alignment and lower MMD to real data, while both models exhibit weak class-conditional recovery, limiting immediate augmentation gains and revealing opportunities for stronger conditioning and coverage. We release a reproducible pipeline – data manifests, training configurations, and evaluation scripts – to establish a baseline for EEG artifact synthesis and to surface actionable failure modes for future work.
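One of the fidelity metrics, the Welch band-power delta, can be sketched in a few lines with SciPy; band edges, sampling rate, and the averaging scheme here are illustrative assumptions rather than the paper's exact normalisation.

```python
import numpy as np
from scipy.signal import welch

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def band_powers(x, fs=250, nperseg=250):
    """Average Welch band power per EEG band (x: [channels, samples])."""
    f, pxx = welch(x, fs=fs, nperseg=nperseg, axis=-1)
    return {name: pxx[..., (f >= lo) & (f < hi)].mean() for name, (lo, hi) in BANDS.items()}

def band_power_delta(real, synthetic, fs=250):
    """Per-band absolute difference between real and synthetic windows."""
    r, s = band_powers(real, fs), band_powers(synthetic, fs)
    return {name: abs(r[name] - s[name]) for name in BANDS}

rng = np.random.default_rng(0)
real = rng.normal(size=(19, 250))       # e.g. a 19-channel, 1 s window at 250 Hz
fake = rng.normal(size=(19, 250))
print(band_power_delta(real, fake))
```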

[223] Rollout-LaSDI: Enhancing the long-term accuracy of Latent Space Dynamics

Robert Stephany, Youngsoo Choi

Main category: cs.LG

TL;DR: A new approach for long-term accurate reduced-order modeling of PDEs using high-order finite-difference and rollout training loss.

DetailsMotivation: Traditional reduced-order models (ROMs) for PDEs degrade in accuracy over long time horizons despite being computationally efficient.

Method: Introduces a flexible high-order finite-difference scheme and a Rollout loss function to train ROMs for accurate predictions over arbitrary time horizons.

Result: Demonstrated effectiveness on the 2D Burgers equation, showing improved long-term predictive performance.

Conclusion: The proposed method enables ROMs to maintain accuracy over extended time periods while preserving computational efficiency.

Abstract: Solving complex partial differential equations is vital in the physical sciences, but often requires computationally expensive numerical methods. Reduced-order models (ROMs) address this by exploiting dimensionality reduction to create fast approximations. While modern ROMs can solve parameterized families of PDEs, their predictive power degrades over long time horizons. We address this by (1) introducing a flexible, high-order, yet inexpensive finite-difference scheme and (2) proposing a Rollout loss that trains ROMs to make accurate predictions over arbitrary time horizons. We demonstrate our approach on the 2D Burgers equation.
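The rollout idea, penalising errors of multi-step composed predictions instead of a single step, can be sketched as below in PyTorch; the high-order finite-difference scheme and the surrounding LaSDI machinery are not reproduced, and the toy model is purely illustrative.

```python
import torch
import torch.nn as nn

def rollout_loss(step_model: nn.Module, z_traj: torch.Tensor, horizon: int) -> torch.Tensor:
    """Average MSE of multi-step latent predictions.

    z_traj : [batch, T, latent_dim] ground-truth latent trajectory.
    step_model maps z_t -> z_{t+1}; predictions are composed for `horizon` steps.
    """
    loss, count = 0.0, 0
    T = z_traj.shape[1]
    for t0 in range(T - horizon):
        z = z_traj[:, t0]
        for k in range(1, horizon + 1):
            z = step_model(z)                         # compose the one-step model
            loss = loss + ((z - z_traj[:, t0 + k]) ** 2).mean()
            count += 1
    return loss / count

# Minimal usage with a toy one-step latent model:
model = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 8))
z = torch.randn(16, 20, 8)
print(rollout_loss(model, z, horizon=4).item())
```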

[224] Prescribe-then-Select: Adaptive Policy Selection for Contextual Stochastic Optimization

Caio de Prospero Iglesias, Kimberly Villalobos Carballo, Dimitris Bertsimas

Main category: cs.LG

TL;DR: A modular framework called Prescribe-then-Select (PS) that first builds a library of feasible candidate policies and then learns a meta-policy to select the best policy for given covariates in contextual stochastic optimization problems.

DetailsMotivation: In contextual stochastic optimization, multiple candidate policies from different modeling paradigms show heterogeneous performance across covariate space with no single policy uniformly dominating others, requiring an adaptive selection approach.

Method: Proposes PS framework: 1) constructs library of feasible candidate policies, 2) learns meta-policy using ensembles of Optimal Policy Trees trained via cross-validation to select best policy for observed covariates.

Result: PS consistently outperforms the best single policy in heterogeneous regimes across two benchmark problems (newsvendor and shipment planning) and converges to the dominant policy when heterogeneity is absent.

Conclusion: The Prescribe-then-Select framework provides an effective data-driven approach for policy selection in contextual stochastic optimization problems with heterogeneous performance across covariates.

Abstract: We address the problem of policy selection in contextual stochastic optimization (CSO), where covariates are available as contextual information and decisions must satisfy hard feasibility constraints. In many CSO settings, multiple candidate policies–arising from different modeling paradigms–exhibit heterogeneous performance across the covariate space, with no single policy uniformly dominating. We propose Prescribe-then-Select (PS), a modular framework that first constructs a library of feasible candidate policies and then learns a meta-policy to select the best policy for the observed covariates. We implement the meta-policy using ensembles of Optimal Policy Trees trained via cross-validation on the training set, making policy choice entirely data-driven. Across two benchmark CSO problems–single-stage newsvendor and two-stage shipment planning–PS consistently outperforms the best single policy in heterogeneous regimes of the covariate space and converges to the dominant policy when such heterogeneity is absent. All the code to reproduce the results can be found at https://anonymous.4open.science/r/Prescribe-then-Select-TMLR.
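A stripped-down version of the select step: label each training point with whichever candidate policy achieved the lowest realized cost out-of-fold, then fit a tree that routes new covariates to a policy. The paper uses ensembles of Optimal Policy Trees; the plain scikit-learn decision tree and the toy cost functions below are stand-ins.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def fit_meta_policy(X, cost_fns, policies, n_splits=5, seed=0):
    """cost_fns[j](x_batch) -> realized cost of policy j per sample (lower is better)."""
    n, k = len(X), len(policies)
    best = np.empty(n, dtype=int)
    for _, val_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        costs = np.stack([cost_fns[j](X[val_idx]) for j in range(k)], axis=1)
        best[val_idx] = costs.argmin(axis=1)          # out-of-fold best policy per sample
    meta = DecisionTreeClassifier(max_depth=4, random_state=seed).fit(X, best)
    return lambda x: policies[int(meta.predict(x.reshape(1, -1))[0])](x)

# Toy example: two hypothetical constant order policies, each better in one region.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
policies = [lambda x: 10.0, lambda x: 20.0]
cost_fns = [lambda xb: np.abs(10.0 - (18 * (xb[:, 0] > 0) + 2)),
            lambda xb: np.abs(20.0 - (18 * (xb[:, 0] > 0) + 2))]
select = fit_meta_policy(X, cost_fns, policies)
print(select(np.array([0.5, 0.0])), select(np.array([-0.5, 0.0])))
```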

[225] Sketched Gaussian Mechanism for Private Federated Learning

Qiaobo Li, Zhijie Chen, Arindam Banerjee

Main category: cs.LG

TL;DR: Sketched Gaussian Mechanism (SGM) combines gradient compression sketching with Gaussian noise for federated learning, providing stronger privacy guarantees than isolated approaches while maintaining communication efficiency.

DetailsMotivation: Address the dual challenges of communication cost and privacy in federated learning by jointly analyzing sketching and Gaussian mechanisms rather than treating them separately.

Method: Propose SGM that integrates sketching (dimension b) with Gaussian noise, using Rényi-DP analysis to prove privacy scales as 1/√b, enabling stronger privacy with same noise budget.

Result: SGM provides significantly stronger privacy guarantees than original Gaussian mechanism, with optimization convergence showing only logarithmic dependence on parameter count d.

Conclusion: SGM-based FL outperforms non-sketching private variants in some settings, and adaptive server optimization further improves performance while maintaining privacy guarantees.

Abstract: Communication cost and privacy are two major considerations in federated learning (FL). For communication cost, gradient compression by sketching the clients’ transmitted model updates is often used for reducing per-round communication. For privacy, the Gaussian mechanism (GM), which consists of clipping updates and adding Gaussian noise, is commonly used to guarantee client-level differential privacy. Existing literature on private FL analyzes privacy of sketching and GM in an isolated manner, illustrating that sketching provides privacy determined by the sketching dimension and that GM has to supply any additional desired privacy. In this paper, we introduce the Sketched Gaussian Mechanism (SGM), which directly combines sketching and the Gaussian mechanism for privacy. Using Rényi-DP tools, we present a joint analysis of SGM’s overall privacy guarantee, which is significantly more flexible and sharper compared to isolated analysis of sketching and GM privacy. In particular, we prove that the privacy level of SGM for a fixed noise magnitude is proportional to $1/\sqrt{b}$, where $b$ is the sketching dimension, indicating that (for moderate $b$) SGM can provide much stronger privacy guarantees than the original GM under the same noise budget. We demonstrate the application of SGM to FL with either gradient descent or adaptive server optimizers, and establish theoretical results on optimization convergence, which exhibits only a logarithmic dependence on the number of parameters $d$. Experimental results confirm that at the same privacy level, SGM-based FL is at least competitive with non-sketching private FL variants and outperforms them in some settings. Moreover, using adaptive optimization at the server improves empirical performance while maintaining the privacy guarantees.
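A minimal sketch of the mechanism as described: clip the client update, project it with a random sketch of dimension b, and add Gaussian noise in the sketched space. The sketch matrix, clipping norm, and noise scale are illustrative placeholders, and the Rényi-DP accounting that calibrates the noise is not shown.

```python
import numpy as np

def sketched_gaussian_mechanism(update, b, clip_norm=1.0, noise_std=0.5, seed=0):
    """Return a privatized, b-dimensional sketch of a d-dimensional client update."""
    rng = np.random.default_rng(seed)
    update = update * min(1.0, clip_norm / (np.linalg.norm(update) + 1e-12))  # clip
    S = rng.normal(size=(b, update.shape[0])) / np.sqrt(b)   # Gaussian (JL-style) sketch
    return S @ update + noise_std * rng.normal(size=b)       # noise in sketched space

d, b = 10_000, 256                                   # sketching dimension b << d
g = np.random.default_rng(1).normal(size=d)
print(sketched_gaussian_mechanism(g, b).shape)       # (256,)
# Per the paper, for a fixed noise magnitude the resulting privacy level scales
# like 1/sqrt(b), i.e. sketching and noising jointly give a sharper guarantee
# than analyzing the Gaussian mechanism in isolation.
```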

[226] Ensemble Distribution Distillation for Self-Supervised Human Activity Recognition

Matthew Nolan, Lina Yao, Robert Davidson

Main category: cs.LG

TL;DR: Self-supervised Ensemble Distribution Distillation framework for Human Activity Recognition that improves accuracy, uncertainty estimation, and robustness against adversarial attacks without increasing inference complexity.

DetailsMotivation: Address challenges in Human Activity Recognition including data requirements, reliability, and robustness issues that persist despite deep learning advancements.

Method: Uses Ensemble Distribution Distillation within a self-supervised learning framework with unlabeled data and partially supervised training strategy, plus innovative data augmentation techniques designed specifically for HAR.

Result: Shows increased predictive accuracy, robust uncertainty estimates, and substantial improvements in robustness against adversarial perturbations on multiple public datasets.

Conclusion: The proposed self-supervised EDD framework significantly improves reliability in real-world HAR scenarios while maintaining computational efficiency at inference time.

Abstract: Human Activity Recognition (HAR) has seen significant advancements with the adoption of deep learning techniques, yet challenges remain in terms of data requirements, reliability and robustness. This paper explores a novel application of Ensemble Distribution Distillation (EDD) within a self-supervised learning framework for HAR aimed at overcoming these challenges. By leveraging unlabeled data and a partially supervised training strategy, our approach yields an increase in predictive accuracy, robust estimates of uncertainty, and substantial increases in robustness against adversarial perturbation; thereby significantly improving reliability in real-world scenarios without increasing computational complexity at inference. We demonstrate this with an evaluation on several publicly available datasets. The contributions of this work include the development of a self-supervised EDD framework, an innovative data augmentation technique designed for HAR, and empirical validation of the proposed method’s effectiveness in increasing robustness and reliability.

[227] Strategies for Improving Communication Efficiency in Distributed and Federated Learning: Compression, Local Training, and Personalization

Kai Yi

Main category: cs.LG

TL;DR: This dissertation develops communication-efficient strategies for distributed and federated learning through model compression, adaptive local training with personalization, and privacy-preserving pruning frameworks.

DetailsMotivation: Communication overhead is a major bottleneck in distributed and federated learning paradigms that preserve privacy while training models across decentralized data sources.

Method: Proposes a unified framework for biased/unbiased compression operators, adaptive local training with personalization (Scafflix), privacy-preserving pruning frameworks (Cohort-Squeeze), and symmetric post-training pruning (SymWanda).

Result: Achieves superior performance under both IID and non-IID settings, reduces communication costs while maintaining accuracy, and enhances robustness under high sparsity without retraining.

Conclusion: Extensive experiments demonstrate favorable trade-offs among accuracy, convergence, and communication, providing theoretical and practical insights for scalable, efficient distributed learning.

Abstract: Distributed and federated learning are essential paradigms for training models across decentralized data sources while preserving privacy, yet communication overhead remains a major bottleneck. This dissertation explores strategies to improve communication efficiency, focusing on model compression, local training, and personalization. We establish a unified framework for biased and unbiased compression operators with convergence guarantees, then propose adaptive local training strategies that incorporate personalization to accelerate convergence and mitigate client drift. In particular, Scafflix balances global and personalized objectives, achieving superior performance under both IID and non-IID settings. We further introduce privacy-preserving pruning frameworks that optimize sparsity while minimizing communication costs, with Cohort-Squeeze leveraging hierarchical aggregation to reduce cross-device overhead. Finally, SymWanda, a symmetric post-training pruning method, enhances robustness under high sparsity and maintains accuracy without retraining. Extensive experiments on benchmarks and large-scale language models demonstrate favorable trade-offs among accuracy, convergence, and communication, offering theoretical and practical insights for scalable, efficient distributed learning.
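As an example of the biased compression operators the unified framework covers, a standard top-k sparsifier (keep the k largest-magnitude coordinates) looks like this; it is a textbook operator used for illustration, not a method introduced by the dissertation.

```python
import numpy as np

def top_k(update, k):
    """Biased top-k compressor: keep the k largest-magnitude coordinates, zero the rest."""
    out = np.zeros_like(update)
    idx = np.argpartition(np.abs(update), -k)[-k:]
    out[idx] = update[idx]
    return out

g = np.random.default_rng(0).normal(size=1_000)
compressed = top_k(g, k=50)                 # ~95% fewer non-zeros to communicate
print(np.count_nonzero(compressed), np.linalg.norm(g - compressed) < np.linalg.norm(g))
```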

[228] The CRITICAL Records Integrated Standardization Pipeline (CRISP): End-to-End Processing of Large-scale Multi-institutional OMOP CDM Data

Xiaolong Luo, Michael Lingzhi Li

Main category: cs.LG

TL;DR: CRISP is a data processing pipeline that transforms raw multi-institutional critical care EHR data (CRITICAL dataset) into ML-ready datasets with standardized terminologies, quality management, and efficient processing.

DetailsMotivation: Existing critical care datasets lack the scale, diversity, and longitudinal perspective needed for generalizable AI models. The CRITICAL dataset provides extensive multi-institutional data but requires sophisticated preprocessing due to heterogeneous collection practices and vocabulary usage.

Method: CRISP systematically processes OMOP CDM data through: 1) transparent data quality management with audit trails, 2) cross-vocabulary mapping to SNOMED-CT standards with deduplication and unit standardization, 3) modular architecture with parallel optimization for efficient processing, and 4) comprehensive baseline model benchmarks.

Result: The pipeline enables complete dataset processing in <1 day on standard hardware, provides ML-ready datasets with standardized terminologies, and establishes reproducible performance standards through baseline benchmarks.

Conclusion: CRISP democratizes access to large-scale multi-institutional critical care data by saving researchers months of preprocessing effort, enabling them to focus on advancing clinical AI research with standardized, high-quality datasets.

Abstract: While existing critical care EHR datasets such as MIMIC and eICU have enabled significant advances in clinical AI research, the CRITICAL dataset opens new frontiers by providing extensive scale and diversity – containing 1.95 billion records from 371,365 patients across four geographically diverse CTSA institutions. CRITICAL’s unique strength lies in capturing full-spectrum patient journeys, including pre-ICU, ICU, and post-ICU encounters across both inpatient and outpatient settings. This multi-institutional, longitudinal perspective creates transformative opportunities for developing generalizable predictive models and advancing health equity research. However, the richness of this multi-site resource introduces substantial complexity in data harmonization, with heterogeneous collection practices and diverse vocabulary usage patterns requiring sophisticated preprocessing approaches. We present CRISP to unlock the full potential of this valuable resource. CRISP systematically transforms raw Observational Medical Outcomes Partnership Common Data Model data into ML-ready datasets through: (1) transparent data quality management with comprehensive audit trails, (2) cross-vocabulary mapping of heterogeneous medical terminologies to unified SNOMED-CT standards, with deduplication and unit standardization, (3) modular architecture with parallel optimization enabling complete dataset processing in $<$1 day even on standard computing hardware, and (4) comprehensive baseline model benchmarks spanning multiple clinical prediction tasks to establish reproducible performance standards. By providing processing pipeline, baseline implementations, and detailed transformation documentation, CRISP saves researchers months of preprocessing effort and democratizes access to large-scale multi-institutional critical care data, enabling them to focus on advancing clinical AI.

[229] Mitigating Catastrophic Forgetting in Large Language Models with Forgetting-aware Pruning

Wei Huang, Anda Cheng, Yinggui Wang

Main category: cs.LG

TL;DR: FAPM is a novel pruning metric that uses task vector overlap with pre-trained parameters to quantify catastrophic forgetting, achieving only 0.25% forgetting while maintaining 99.67% downstream task accuracy.

DetailsMotivation: Large language models suffer from catastrophic forgetting during fine-tuning, which degrades their performance on original capabilities while adapting to new tasks.

Method: Proposes Forgetting-Aware Pruning Metric (FAPM) that uses the ratio of task vectors to pre-trained parameters as a pruning criterion, requiring no training modifications or auxiliary data.

Result: Extensive experiments across 8 datasets show FAPM limits catastrophic forgetting to just 0.25% while maintaining 99.67% accuracy on downstream tasks.

Conclusion: FAPM effectively balances catastrophic forgetting and downstream performance through a novel pruning approach based on task vector analysis.

Abstract: Recent advancements in large language models (LLMs) have shown impressive capabilities in various downstream tasks but typically face Catastrophic Forgetting (CF) during fine-tuning. In this paper, we propose the Forgetting-Aware Pruning Metric (FAPM), a novel pruning-based approach to balance CF and downstream task performance. Our investigation reveals that the degree to which task vectors (i.e., the subtraction of pre-trained weights from the weights fine-tuned on downstream tasks) overlap with pre-trained model parameters is a critical factor for CF. Based on this finding, FAPM employs the ratio of the task vector to pre-trained model parameters as a metric to quantify CF, integrating this measure into the pruning criteria. Importantly, FAPM does not necessitate modifications to the training process or model architecture, nor does it require any auxiliary data. We conducted extensive experiments across eight datasets, covering natural language inference, General Q&A, Medical Q&A, Math Q&A, reading comprehension, and cloze tests. The results demonstrate that FAPM limits CF to just 0.25% while maintaining 99.67% accuracy on downstream tasks. We provide the code to reproduce our results.
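The metric is easy to state: score each task-vector entry by its magnitude relative to the corresponding pre-trained parameter. The sketch below computes that ratio and reverts the highest-ratio entries to their pre-trained values; the pruning direction, threshold, and layer handling are assumptions, not the authors' released code.

```python
import numpy as np

def fapm_prune(w_pre, w_ft, prune_frac=0.1, eps=1e-12):
    """Forgetting-aware pruning (one plausible reading of the metric).

    score = |task vector| / |pre-trained weight|; the highest-scoring task-vector
    entries (assumed most likely to interfere with pre-trained knowledge) are
    reset to their pre-trained values.
    """
    task_vec = w_ft - w_pre
    score = np.abs(task_vec) / (np.abs(w_pre) + eps)
    cutoff = np.quantile(score, 1.0 - prune_frac)
    keep = score < cutoff
    return w_pre + task_vec * keep          # pruned entries fall back to w_pre

rng = np.random.default_rng(0)
w_pre = rng.normal(size=10_000)
w_ft = w_pre + 0.05 * rng.normal(size=10_000)
w_merged = fapm_prune(w_pre, w_ft, prune_frac=0.1)
print(np.mean(w_merged == w_pre))           # ~0.1 of weights reverted
```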

[230] Interpretable Physics Reasoning and Performance Taxonomy in Vision-Language Models

Pranav Pawar, Kavish Shah, Akshat Bhalani, Komal Kasat, Dev Mittal, Hadi Gala, Deepali Patil, Nikita Raichada, Monali Deshmukh

Main category: cs.LG

TL;DR: A new framework evaluates Vision-Language Models’ understanding of 2D physics across 400+ problems in four domains, showing model scale correlates with reasoning ability but models struggle with abstract spatial reasoning.

DetailsMotivation: As VLMs become more sophisticated, their grasp of fundamental scientific principles like physics remains underexplored, requiring rigorous evaluation methods.

Method: Developed a novel framework with pragmatic scenario generator creating diverse testbed across Projectile Motion, Collision Dynamics, Mechanics, and Fluid Dynamics domains.

Result: Strong correlation between model scale and reasoning ability observed, with top model Qwen2.5-VL-7B achieving 0.815 overall score. Models excel at formulaic problems but struggle with abstract spatial reasoning domains.

Conclusion: The framework democratizes study of scientific reasoning in VLMs and provides deeper insights into their capabilities and limitations, particularly highlighting challenges in spatial reasoning.

Abstract: As Vision-Language Models (VLMs) grow in sophistication, their ability to perform reasoning is coming under increasing supervision. While they excel at many tasks, their grasp of fundamental scientific principles, such as physics, remains an underexplored frontier. To reflect the advancements in these capabilities, we introduce a novel and accessible framework designed to rigorously evaluate VLMs on their understanding of 2D physics. Our framework features a pragmatic scenario generator that creates a diverse testbed of over 400 problems across four core domains: Projectile Motion, Collision Dynamics, Mechanics, and Fluid Dynamics. Through comprehensive evaluation of four state-of-the-art VLMs, we demonstrate a strong correlation between model scale and reasoning ability, with our top-performing model, Qwen2.5-VL-7B, achieving an overall score of 0.815. We find that while models excel at formulaic problems, they struggle significantly with domains requiring abstract spatial reasoning. By designing this framework, we aim to democratize the study of scientific reasoning in VLMs and foster deeper insights into their capabilities and limitations.
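For intuition about the benchmark's problem style, a tiny projectile-motion scenario generator might look like the following; the framework's actual templates, the other three domains, and any rendering to images are not reproduced.

```python
import numpy as np

def projectile_scenario(rng, g=9.81):
    """Generate one projectile-motion question with its analytic ground truth."""
    v0 = rng.uniform(5, 50)                         # launch speed, m/s
    theta = rng.uniform(15, 75)                     # launch angle, degrees
    th = np.radians(theta)
    answer = {
        "time_of_flight_s": 2 * v0 * np.sin(th) / g,
        "range_m": v0 ** 2 * np.sin(2 * th) / g,
        "max_height_m": (v0 * np.sin(th)) ** 2 / (2 * g),
    }
    question = (f"A projectile is launched at {v0:.1f} m/s and {theta:.0f} degrees "
                f"above the horizontal. Find its time of flight, range, and maximum height.")
    return question, answer

rng = np.random.default_rng(0)
q, a = projectile_scenario(rng)
print(q)
print({k: round(v, 2) for k, v in a.items()})
```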

[231] Adaptive Rainfall Forecasting from Multiple Geographical Models Using Matrix Profile and Ensemble Learning

Dung T. Tran, Huyen Ngoc Huyen, Hong Nguyen, Xuan-Vu Phan, Nam-Phong Nguyen

Main category: cs.LG

TL;DR: MPWE framework combines matrix profile analysis with weighted ensemble methods to improve rainfall forecasting accuracy and stability across Vietnam’s diverse river basins.

DetailsMotivation: Accurate rainfall forecasting is crucial for flood management and disaster preparedness in Vietnam, but challenging due to diverse climatic conditions and geographical variability across river basins.

Method: Matrix Profile-based Weighted Ensemble (MPWE) - a regime-switching framework that dynamically captures covariant dependencies among multiple geographical model forecasts with redundancy-aware weighting to balance model contributions.

Result: MPWE consistently achieves lower mean and standard deviation of prediction errors compared to geographical models and ensemble baselines across eight major basins and multiple forecast horizons (1 hour to 84 hours).

Conclusion: The proposed MPWE framework demonstrates both improved accuracy and stability in rainfall forecasting, making it valuable for flood management and disaster preparedness applications in Vietnam.

Abstract: Rainfall forecasting in Vietnam is highly challenging due to its diverse climatic conditions and strong geographical variability across river basins, yet accurate and reliable forecasts are vital for flood management, hydropower operation, and disaster preparedness. In this work, we propose a Matrix Profile-based Weighted Ensemble (MPWE), a regime-switching framework that dynamically captures covariant dependencies among multiple geographical model forecasts while incorporating redundancy-aware weighting to balance contributions across models. We evaluate MPWE using rainfall forecasts from eight major basins in Vietnam, spanning five forecast horizons (1 hour and accumulated rainfall over 12, 24, 48, 72, and 84 hours). Experimental results show that MPWE consistently achieves lower mean and standard deviation of prediction errors compared to geographical models and ensemble baselines, demonstrating both improved accuracy and stability across basins and horizons.

[232] FoQuS: A Forgetting-Quality Coreset Selection Framework for Automatic Modulation Recognition

Yao Lu, Chunfeng Sun, Dongwei Xu, Yun Lin, Qi Xuan, Guan Gui

Main category: cs.LG

TL;DR: FoQuS is a coreset selection method that reduces training overhead for Automatic Modulation Recognition by selecting only 1%-30% of data while maintaining accuracy.

DetailsMotivation: Deep learning AMR models require massive labeled data, making repeated training and hyperparameter tuning time-consuming and energy-intensive.

Method: Records the prediction trajectory of each sample during full-dataset training and constructs three importance metrics based on training dynamics to select the most informative samples.

Result: Maintains high recognition accuracy and good cross-architecture generalization using only 1%-30% of original data across multiple AMR datasets.

Conclusion: FoQuS effectively reduces training overhead while preserving model performance, making AMR development more efficient.

Abstract: Deep learning-based Automatic Modulation Recognition (AMR) models have made significant progress with the support of large-scale labeled data. However, when developing new models or performing hyperparameter tuning, the time and energy consumption associated with repeated training using massive amounts of data are often unbearable. To address the above challenges, we propose FoQuS, which approximates the effect of full training by selecting a coreset from the original dataset, thereby significantly reducing training overhead. Specifically, FoQuS records the prediction trajectory of each sample during full-dataset training and constructs three importance metrics based on training dynamics. Experiments show that FoQuS can maintain high recognition accuracy and good cross-architecture generalization on multiple AMR datasets using only 1%-30% of the original data.
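The trajectory-based selection can be illustrated with a classic forgetting-count score: record each sample's per-epoch correctness during full-dataset training, count correct-to-incorrect transitions, and keep the most-forgotten fraction. The other two importance metrics and the exact selection rule are not specified in the summary, so this is only a schematic.

```python
import numpy as np

def forgetting_counts(correct_traj):
    """correct_traj: [n_samples, n_epochs] boolean matrix of per-epoch correctness."""
    flips = correct_traj[:, :-1] & ~correct_traj[:, 1:]     # correct -> incorrect
    return flips.sum(axis=1)

def select_coreset(correct_traj, frac=0.1):
    """Keep the `frac` most-forgotten (hence most informative) samples."""
    scores = forgetting_counts(correct_traj)
    k = max(1, int(frac * len(scores)))
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
traj = rng.random(size=(5_000, 20)) > 0.3        # toy per-epoch correctness trajectories
idx = select_coreset(traj, frac=0.1)
print(len(idx), forgetting_counts(traj)[idx][:5])
```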

[233] EvolKV: Evolutionary KV Cache Compression for LLM Inference

Bohan Yu, Yekun Chai

Main category: cs.LG

TL;DR: EvolKV is an adaptive framework for layer-wise KV cache compression that uses evolutionary search to optimize memory efficiency and task performance, outperforming heuristic methods across 11 tasks and achieving better performance than full cache with only 1.5% budget.

DetailsMotivation: Existing KV cache compression methods rely on heuristics like uniform allocation or static eviction policies, which ignore layer-specific feature patterns and task performance interactions, leading to degraded generalization.

Method: Reformulates cache allocation as multi-objective optimization problem and leverages evolutionary search to dynamically configure layer budgets while directly maximizing downstream performance.

Result: Outperforms all baseline methods across 11 tasks, surpasses heuristic baselines by up to 7 percentage points on GSM8K, and achieves superior performance over full KV cache on code completion with only 1.5% of original budget.

Conclusion: Demonstrates untapped potential in learned compression strategies for KV cache budget allocation, showing that adaptive layer-wise optimization can significantly improve both memory efficiency and task performance.

Abstract: Existing key-value (KV) cache compression methods typically rely on heuristics, such as uniform cache allocation across layers or static eviction policies, however, they ignore the critical interplays among layer-specific feature patterns and task performance, which can lead to degraded generalization. In this paper, we propose EvolKV, an adaptive framework for layer-wise, task-driven KV cache compression that jointly optimizes the memory efficiency and task performance. By reformulating cache allocation as a multi-objective optimization problem, EvolKV leverages evolutionary search to dynamically configure layer budgets while directly maximizing downstream performance. Extensive experiments on 11 tasks demonstrate that our approach outperforms all baseline methods across a wide range of KV cache budgets on long-context tasks and surpasses heuristic baselines by up to 7 percentage points on GSM8K. Notably, EvolKV achieves superior performance over the full KV cache setting on code completion while utilizing only 1.5% of the original budget, suggesting the untapped potential in learned compression strategies for KV cache budget allocation.
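The search loop reduces to mutate-and-select over per-layer budgets under a total cache constraint, scored by a black-box evaluation of downstream performance. The fitness function below is a toy placeholder; in the real system it would run the model on held-out tasks with the candidate allocation.

```python
import numpy as np

def evolve_layer_budgets(n_layers, total_budget, fitness, generations=50,
                         pop_size=16, seed=0):
    """Evolutionary search for a per-layer KV-cache allocation (simplified sketch)."""
    rng = np.random.default_rng(seed)

    def normalize(x):                        # enforce the total budget constraint
        x = np.clip(x, 1, None)
        return x / x.sum() * total_budget

    pop = [normalize(rng.uniform(0.5, 1.5, size=n_layers)) for _ in range(pop_size)]
    for _ in range(generations):
        elite = sorted(pop, key=fitness, reverse=True)[: pop_size // 4]
        children = [normalize(p * rng.normal(1.0, 0.1, size=n_layers))
                    for p in elite for _ in range(3)]        # mutate the elite
        pop = elite + children
    return max(pop, key=fitness)

# Toy fitness: pretend deeper layers benefit more from cache (placeholder only).
weights = np.linspace(0.5, 1.5, 32)
best = evolve_layer_budgets(32, total_budget=4096,
                            fitness=lambda b: float(np.sum(weights * np.log1p(b))))
print(best.round(1))
```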

[234] Accelerating Reinforcement Learning Algorithms Convergence using Pre-trained Large Language Models as Tutors With Advice Reusing

Lukas Toral, Teddy Lazebnik

Main category: cs.LG

TL;DR: LLM tutoring accelerates RL convergence while maintaining performance, with advice reuse further reducing training time but causing less stable convergence.

DetailsMotivation: RL algorithms require long training in complex environments, and existing acceleration techniques are domain-specific and require expert knowledge.

Method: Used student-teacher architecture with pre-trained LLMs (Llama, Vicuna, DeepSeek) as tutors for RL algorithms (DQN, PPO, A2C) across environments (Blackjack, Snake, Connect Four) with 54 configurations, exploring advice reuse.

Result: LLM tutoring significantly accelerates RL convergence while maintaining comparable optimal performance. Advice reuse further improves training duration but results in less stable convergence dynamics.

Conclusion: LLM tutoring generally improves RL convergence, with effectiveness depending on specific task, RL algorithm, and LLM model combination.

Abstract: Reinforcement Learning (RL) algorithms often require long training to become useful, especially in complex environments with sparse rewards. While techniques like reward shaping and curriculum learning exist to accelerate training, these are often extremely specific and require the developer’s professionalism and dedicated expertise in the problem’s domain. Tackling this challenge, in this study, we explore the effectiveness of pre-trained Large Language Models (LLMs) as tutors in a student-teacher architecture with RL algorithms, hypothesizing that LLM-generated guidance allows for faster convergence. In particular, we explore the effectiveness of reusing the LLM’s advice on the RL’s convergence dynamics. Through an extensive empirical examination, which included 54 configurations, varying the RL algorithm (DQN, PPO, A2C), LLM tutor (Llama, Vicuna, DeepSeek), and environment (Blackjack, Snake, Connect Four), our results demonstrate that LLM tutoring significantly accelerates RL convergence while maintaining comparable optimal performance. Furthermore, the advice reuse mechanism shows a further improvement in training duration but also results in less stable convergence dynamics. Our findings suggest that LLM tutoring generally improves convergence, and its effectiveness is sensitive to the specific task, RL algorithm, and LLM model combination.

[235] Accelerating Mixture-of-Expert Inference with Adaptive Expert Split Mechanism

Jiaming Yan, Jianchun Liu, Hongli Xu, Liusheng Huang

Main category: cs.LG

TL;DR: MoEpic is an efficient MoE inference system that vertically splits experts into top/bottom segments to improve cache hit rates and reduce GPU memory usage, achieving significant latency reduction and cost savings.

DetailsMotivation: MoE architectures for large language models require heavy GPU memory, and existing offloading approaches suffer from poor cache hit rates and high expert loading latency during inference.

Method: Proposes expert vertical splitting into top/bottom segments, caches top segments of hot experts, predicts and prefetches next layer experts, and uses divide-and-conquer algorithm for adaptive cache configuration.

Result: Saves about half of GPU cost while reducing inference latency by 37.51%-65.73% compared to baselines on popular MoE LLMs.

Conclusion: MoEpic effectively addresses GPU memory constraints in MoE inference through expert segmentation and intelligent caching, delivering significant performance improvements and cost savings.

Abstract: Mixture-of-Experts (MoE) has emerged as a promising architecture for modern large language models (LLMs). However, massive parameters impose heavy GPU memory (i.e., VRAM) demands, hindering the widespread adoption of MoE LLMs. Offloading the expert parameters to CPU RAM offers an effective way to alleviate the VRAM requirements for MoE inference. Existing approaches typically cache a small subset of experts in VRAM and dynamically prefetch experts from RAM during inference, leading to significant degradation in inference speed due to the poor cache hit rate and substantial expert loading latency. In this work, we propose MoEpic, an efficient MoE inference system with a novel expert split mechanism. Specifically, each expert is vertically divided into two segments: top and bottom. MoEpic caches the top segment of hot experts, so that more experts will be stored under the limited VRAM budget, thereby improving the cache hit rate. During each layer’s inference, MoEpic predicts and prefetches the activated experts for the next layer. Since the top segments of cached experts are exempt from fetching, the loading time is reduced, which allows efficient transfer-computation overlap. Nevertheless, the performance of MoEpic critically depends on the cache configuration (i.e., each layer’s VRAM budget and expert split ratio). To this end, we propose a divide-and-conquer algorithm based on fixed-point iteration for adaptive cache configuration. Extensive experiments on popular MoE LLMs demonstrate that MoEpic can save about half of the GPU cost, while lowering the inference latency by about 37.51%-65.73% compared to the baselines.

[236] Generative Data Refinement: Just Ask for Better Data

Minqi Jiang, João G. M. Araújo, Will Ellsworth, Sian Gooding, Edward Grefenstette

Main category: cs.LG

TL;DR: GDR framework uses generative models to transform datasets with undesirable content into refined training data, addressing data scarcity while mitigating privacy and safety risks.

DetailsMotivation: Training data is becoming scarce as web data growth can't keep up with model training needs, and user-generated content contains private/unsafe content that can't be directly used.

Method: Generative Data Refinement (GDR) uses pretrained generative models to conditionally transform problematic datasets, generating synthetic data that maintains original diversity while removing undesirable content.

Result: GDR outperforms industry-grade anonymization solutions, enables detoxification of unsafe datasets, and preserves web-scale dataset diversity without complex prompting.

Conclusion: GDR provides a simple yet effective solution to scale training data supply for frontier models by safely leveraging user-generated content through generative refinement.

Abstract: For a fixed parameter size, the capabilities of large models are primarily determined by the quality and quantity of its training data. Consequently, training datasets now grow faster than the rate at which new data is indexed on the web, leading to projected data exhaustion over the next decade. Much more data exists as user-generated content that is not publicly indexed, but incorporating such data comes with considerable risks, such as leaking private information and other undesirable content. We introduce a framework, Generative Data Refinement (GDR), for using pretrained generative models to transform a dataset with undesirable content into a refined dataset that is more suitable for training. Our experiments show that GDR can outperform industry-grade solutions for dataset anonymization, as well as enable direct detoxification of highly unsafe datasets. Moreover, we show that by generating synthetic data that is conditioned on each example in the real dataset, GDR’s refined outputs naturally match the diversity of web scale datasets, and thereby avoid the often challenging task of generating diverse synthetic data via model prompting. The simplicity and effectiveness of GDR make it a powerful tool for scaling up the total stock of training data for frontier models.

[237] Prediction Loss Guided Decision-Focused Learning

Haeun Jeon, Hyunglip Bae, Chanyeong Kim, Yongjae Lee, Woo Chang Kim

Main category: cs.LG

TL;DR: A novel gradient perturbation method that combines prediction and decision loss gradients to improve decision-focused learning stability and performance.

DetailsMotivation: Traditional prediction-focused learning (PFL) optimizes prediction accuracy but ignores downstream decision quality, while decision-focused learning (DFL) directly optimizes decisions but suffers from unstable convergence due to flat-and-sharp loss landscapes.

Method: Proposes perturbing the decision loss gradient using the prediction loss gradient with a sigmoid-like decaying parameter, creating an update direction that guides DFL training without additional training requirements.

Result: The method achieves lower regret with more stable training across three stochastic optimization problems, outperforming both PFL and vanilla DFL baselines in challenging scenarios.

Conclusion: The gradient perturbation approach effectively combines the stability of PFL with the decision quality focus of DFL, providing a practical solution for decision-making under uncertainty with theoretical convergence guarantees.

Abstract: Decision-making under uncertainty is often considered in two stages: predicting the unknown parameters, and then optimizing decisions based on predictions. While traditional prediction-focused learning (PFL) treats these two stages separately, decision-focused learning (DFL) trains the predictive model by directly optimizing the decision quality in an end-to-end manner. However, despite using exact or well-approximated gradients, vanilla DFL often suffers from unstable convergence due to its flat-and-sharp loss landscapes. In contrast, PFL yields more stable optimization, but overlooks the downstream decision quality. To address this, we propose a simple yet effective approach: perturbing the decision loss gradient using the prediction loss gradient to construct an update direction. Our method requires no additional training and can be integrated with any DFL solvers. Using the sigmoid-like decaying parameter, we let the prediction loss gradient guide the decision loss gradient to train a predictive model that optimizes decision quality. Also, we provide a theoretical convergence guarantee to Pareto stationary point under mild assumptions. Empirically, we demonstrate our method across three stochastic optimization problems, showing promising results compared to other baselines. We validate that our approach achieves lower regret with more stable training, even in situations where either PFL or DFL struggles.
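The update direction combines the two gradients with a weight on the prediction-loss term that decays over training; a minimal sketch with an assumed sigmoid-style schedule is shown below.

```python
import numpy as np

def guidance_weight(step, total_steps, k=10.0):
    """Sigmoid-like decay: ~1 early in training, ~0 near the end (assumed schedule)."""
    return 1.0 / (1.0 + np.exp(k * (step / total_steps - 0.5)))

def guided_update(grad_decision, grad_prediction, step, total_steps):
    """Perturb the decision-loss gradient with the prediction-loss gradient."""
    beta = guidance_weight(step, total_steps)
    return grad_decision + beta * grad_prediction

g_dec, g_pred = np.array([0.2, -1.0]), np.array([1.0, 1.0])
for step in (0, 500, 1000):
    print(step, guided_update(g_dec, g_pred, step, total_steps=1000))
# Early steps follow the stabler prediction-loss signal; later steps are dominated
# by the decision loss, which directly targets downstream decision quality.
```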

[238] Efficient Decoding Methods for Language Models on Encrypted Data

Matan Avitan, Moran Baruch, Nir Drucker, Itamar Zimerman, Yoav Goldberg

Main category: cs.LG

TL;DR: Cutmax enables efficient homomorphic encryption for LLM text generation, reducing latency 24x-35x while maintaining privacy.

DetailsMotivation: Privacy concerns with processing sensitive data on untrusted servers using LLMs, and the computational bottleneck of non-polynomial decoding methods under homomorphic encryption.

Method: Introduces cutmax (HE-friendly argmax algorithm) and first HE-compatible nucleus sampling method, both polynomial and differentiable for efficient encrypted inference.

Result: Achieves 24x-35x latency reduction over baselines on realistic LLM outputs while providing provable privacy guarantees.

Conclusion: Cutmax and HE-compatible sampling enable practical privacy-preserving text generation with strong theoretical convergence properties and significant performance improvements.

Abstract: Large language models (LLMs) power modern AI applications, but processing sensitive data on untrusted servers raises privacy concerns. Homomorphic encryption (HE) enables computation on encrypted data for secure inference. However, neural text generation requires decoding methods like argmax and sampling, which are non-polynomial and thus computationally expensive under encryption, creating a significant performance bottleneck. We introduce cutmax, an HE-friendly argmax algorithm that reduces ciphertext operations compared to prior methods, enabling practical greedy decoding under encryption. We also propose the first HE-compatible nucleus (top-p) sampling method, leveraging cutmax for efficient stochastic decoding with provable privacy guarantees. Both techniques are polynomial, supporting efficient inference in privacy-preserving settings. Moreover, their differentiability facilitates gradient-based sequence-level optimization as a polynomial alternative to straight-through estimators. We further provide strong theoretical guarantees for cutmax, proving it converges globally to a unique two-level fixed point, independent of the input values beyond the identity of the maximizer, which explains its rapid convergence in just a few iterations. Evaluations on realistic LLM outputs show latency reductions of 24x-35x over baselines, advancing secure text generation.

[239] AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning

Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, Wei He, Yiwen Ding, Guanyu Li, Zehui Chen, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Tao Gui, Zuxuan Wu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang

Main category: cs.LG

TL;DR: AgentGym-RL is a new reinforcement learning framework for training LLM agents from scratch without supervised fine-tuning, featuring modular architecture and supporting diverse real-world scenarios with a novel ScalingInter-RL training approach.

DetailsMotivation: The community lacks a unified interactive RL framework to train autonomous LLM agents from scratch across diverse environments without relying on supervised fine-tuning, which limits agent development and knowledge acquisition through environmental interaction.

Method: Introduces AgentGym-RL framework with modular architecture and ScalingInter-RL training approach that balances exploration-exploitation by gradually increasing interaction horizons to encourage diverse problem-solving strategies while maintaining stable RL optimization.

Result: Extensive experiments show the framework is stable and effective, with agents matching or surpassing commercial models on 27 tasks across diverse environments.

Conclusion: The framework successfully bridges the gap in autonomous agent training, provides key insights, and will be open-sourced to empower development of next-generation intelligent agents through reinforcement learning.

Abstract: Developing autonomous LLM agents capable of making a series of intelligent decisions to solve complex, real-world tasks is a fast-evolving frontier. Like human cognitive development, agents are expected to acquire knowledge and skills through exploration and interaction with the environment. Despite advances, the community still lacks a unified, interactive reinforcement learning (RL) framework that can effectively train such agents from scratch – without relying on supervised fine-tuning (SFT) – across diverse and realistic environments. To bridge this gap, we introduce AgentGym-RL, a new framework to train LLM agents for multi-turn interactive decision-making through RL. The framework features a modular and decoupled architecture, ensuring high flexibility and extensibility. It encompasses a wide variety of real-world scenarios, and supports mainstream RL algorithms. Furthermore, we propose ScalingInter-RL, a training approach designed for exploration-exploitation balance and stable RL optimization. In early stages, it emphasizes exploitation by restricting the number of interactions, and gradually shifts towards exploration with larger horizons to encourage diverse problem-solving strategies. In this way, the agent develops more diverse behaviors and is less prone to collapse under long horizons. We perform extensive experiments to validate the stability and effectiveness of both the AgentGym-RL framework and the ScalingInter-RL approach. Our agents match or surpass commercial models on 27 tasks across diverse environments. We offer key insights and will open-source the complete AgentGym-RL framework – including code and datasets – to empower the research community in developing the next generation of intelligent agents.

[240] Rethinking the Backbone in Class Imbalanced Federated Source Free Domain Adaptation: The Utility of Vision Foundation Models

Kosuke Kihara, Junki Mori, Taiki Miyagawa, Akinori F. Ebihara

Main category: cs.LG

TL;DR: Replacing FFREEDA backbone with frozen vision foundation models improves accuracy and reduces costs in federated learning with class imbalances.

DetailsMotivation: Address class imbalances in both source and target domains, plus label shifts between domains and among clients in federated learning scenarios.

Method: Replace FFREEDA backbone with frozen vision foundation model (VFM) instead of complex aggregation and domain adaptation methods.

Result: VFMs effectively mitigate domain gaps, class imbalances, and non-IID-ness among target clients with improved accuracy.

Conclusion: Strong feature extractors (VFMs) are more important than complex adaptation or FL methods for real-world federated learning success.

Abstract: Federated Learning (FL) offers a framework for training models collaboratively while preserving data privacy of each client. Recently, research has focused on Federated Source-Free Domain Adaptation (FFREEDA), a more realistic scenario wherein client-held target domain data remains unlabeled, and the server can access source domain data only during pre-training. We extend this framework to a more complex and realistic setting: Class Imbalanced FFREEDA (CI-FFREEDA), which takes into account class imbalances in both the source and target domains, as well as label shifts between source and target and among target clients. The replication of existing methods in our experimental setup lead us to rethink the focus from enhancing aggregation and domain adaptation methods to improving the feature extractors within the network itself. We propose replacing the FFREEDA backbone with a frozen vision foundation model (VFM), thereby improving overall accuracy without extensive parameter tuning and reducing computational and communication costs in federated learning. Our experimental results demonstrate that VFMs effectively mitigate the effects of domain gaps, class imbalances, and even non-IID-ness among target clients, suggesting that strong feature extractors, not complex adaptation or FL methods, are key to success in the real-world FL.

[241] Two Sides of the Same Optimization Coin: Model Degradation and Representation Collapse in Graph Foundation Models

Xunkai Li, Daohan Su, Sicheng Liu, Ru Zhang, Rong-Hua Li, Guoren Wang

Main category: cs.LG

TL;DR: MoT addresses domain generalization conflicts in graph foundation models by introducing Information Tinker and Regularization Tinker components to prevent model degradation and representation collapse, achieving state-of-the-art performance across multiple domains.

DetailsMotivation: Graph foundation models suffer from domain generalization conflicts that cause imperceptible pitfalls - model degradation where encoders fail to capture input diversity, and representation collapse where embeddings lose semantic separability due to narrow representation subspaces.

Method: Proposes MoT (Mixture-of-Tinkers) with: (1) Information Tinker using edge-wise semantic fusion and mixture-of-codebooks with domain-aware routing to improve information capacity, (2) Regularization Tinker with two additional regularizations to improve gradient supervision.

Result: Experiments on 22 datasets across 6 domains show MoT achieves significant improvements in supervised, few-shot, and zero-shot scenarios compared to state-of-the-art baselines.

Conclusion: MoT effectively addresses information bottleneck and regularization deficit challenges in graph foundation models, adheres to scaling laws, and provides a flexible architecture with controllable model scale for cross-domain generalization.

Abstract: Graph foundation models, inspired by the success of LLMs, are designed to learn the optimal embedding from multi-domain TAGs for the downstream cross-task generalization capability. During our investigation, graph VQ-MAE stands out among the increasingly diverse landscape of GFM architectures. This is attributed to its ability to jointly encode topology and textual attributes from multiple domains into discrete embedding spaces with clear semantic boundaries. Despite its potential, domain generalization conflicts cause imperceptible pitfalls. In this paper, we instantiate two of them, and they are just like two sides of the same GFM optimization coin - Side 1 Model Degradation: The encoder and codebook fail to capture the diversity of inputs; Side 2 Representation Collapse: The hidden embedding and codebook vector fail to preserve semantic separability due to constraints from narrow representation subspaces. These two pitfalls (sides) collectively impair the decoder and generate the low-quality reconstructed supervision, causing the GFM optimization dilemma during pre-training (coin). Through empirical investigation, we attribute the above challenges to Information Bottleneck and Regularization Deficit. To address them, we propose MoT (Mixture-of-Tinkers) - (1) Information Tinker for Two Pitfalls, which utilizes an edge-wise semantic fusion strategy and a mixture-of-codebooks with domain-aware routing to improve information capacity. (2) Regularization Tinker for Optimization Coin, which utilizes two additional regularizations to further improve gradient supervision in our proposed Information Tinker. Notably, as a flexible architecture, MoT adheres to the scaling laws of GFM, offering a controllable model scale. Compared to SOTA baselines, experiments on 22 datasets across 6 domains demonstrate that MoT achieves significant improvements in supervised, few-shot, and zero-shot scenarios.

[242] Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics

Dikshant Sagar, Kaiwen Yu, Alejandro Yankelevich, Jianming Bian, Pierre Baldi

Main category: cs.LG

TL;DR: Vision Language Models (VLMs) based on fine-tuned LLaMa 3.2 outperform CNNs in neutrino interaction classification from pixelated detector data, offering better performance, interpretability, and multimodal reasoning capabilities.

DetailsMotivation: To explore the application of Vision Language Models (VLMs) for identifying neutrino interactions in high-energy physics experiments, leveraging their multimodal reasoning capabilities beyond traditional CNN approaches.

Method: Fine-tuned a variant of LLaMa 3.2 VLM and benchmarked it against state-of-the-art CNN architectures similar to those used in NOvA and DUNE experiments, evaluating both classification performance and interpretability.

Result: VLMs outperformed CNNs in classifying electron and muon neutrino events, while providing greater flexibility for integrating auxiliary information and more interpretable, reasoning-based predictions.

Conclusion: VLMs show strong potential as a general-purpose backbone for physics event classification due to their high performance, interpretability, and generalizability, opening new avenues for multimodal reasoning in experimental neutrino physics.

Abstract: Recent advances in Large Language Models (LLMs) have demonstrated their remarkable capacity to process and reason over structured and unstructured data modalities beyond natural language. In this work, we explore the applications of Vision Language Models (VLMs), specifically a fine-tuned variant of LLaMa 3.2, to the task of identifying neutrino interactions in pixelated detector data from high-energy physics (HEP) experiments. We benchmark this model against a state-of-the-art convolutional neural network (CNN) architecture, similar to those used in the NOvA and DUNE experiments, which have achieved high efficiency and purity in classifying electron and muon neutrino events. Our evaluation considers both the classification performance and interpretability of the model predictions. We find that VLMs can outperform CNNs, while also providing greater flexibility in integrating auxiliary textual or semantic information and offering more interpretable, reasoning-based predictions. This work highlights the potential of VLMs as a general-purpose backbone for physics event classification, due to their high performance, interpretability, and generalizability, which opens new avenues for integrating multimodal reasoning in experimental neutrino physics.

[243] Merge-of-Thought Distillation

Zhanming Shen, Zeyu Qin, Zenan Huang, Hao Chen, Jiaqi Hu, Yihong Zhuang, Guoshan Lu, Gang Chen, Junbo Zhao

Main category: cs.LG

TL;DR: Merge-of-Thought Distillation (MoT) is a lightweight framework that efficiently distills reasoning capabilities from multiple teachers into compact students by alternating teacher-specific fine-tuning and weight-space merging, achieving superior performance with minimal data.

DetailsMotivation: Traditional reasoning distillation assumes a single oracle teacher, but practical scenarios offer multiple candidate teachers and growing CoT corpora. Different students have different 'best teachers,' and even the same student's optimal teacher varies across datasets.

Method: MoT alternates between teacher-specific supervised fine-tuning branches and weight-space merging of resulting student variants, unifying multiple teachers’ reasoning abilities while overcoming supervision conflicts.

Result: On competition math benchmarks with only ~200 high-quality CoT samples, MoT applied to Qwen3-14B student surpasses strong models including DEEPSEEK-R1, QWEN3-30B-A3B, QWEN3-32B, and OPENAI-O1, demonstrating substantial gains and outperforming single-teacher distillation and naive multi-teacher union.

Conclusion: MoT provides a simple, scalable route to efficiently distill long CoT capabilities from diverse teachers into compact students, showing robustness, reducing catastrophic forgetting, improving general reasoning beyond mathematics, and even cultivating better teachers through consensus-filtered reasoning features.

Abstract: Efficient reasoning distillation for long chain-of-thought (CoT) models is increasingly constrained by the assumption of a single oracle teacher, despite practical availability of multiple candidate teachers and growing CoT corpora. We revisit teacher selection and observe that different students have different “best teachers,” and even for the same student the best teacher can vary across datasets. Therefore, to unify multiple teachers’ reasoning abilities into a single student while overcoming conflicts among the teachers’ supervision, we propose Merge-of-Thought Distillation (MoT), a lightweight framework that alternates between teacher-specific supervised fine-tuning branches and weight-space merging of the resulting student variants. On competition math benchmarks, using only about 200 high-quality CoT samples, applying MoT to a Qwen3-14B student surpasses strong models including DEEPSEEK-R1, QWEN3-30B-A3B, QWEN3-32B, and OPENAI-O1, demonstrating substantial gains. In addition, MoT consistently outperforms the best single-teacher distillation and the naive multi-teacher union, raises the performance ceiling while mitigating overfitting, and shows robustness to distribution-shifted and peer-level teachers. Moreover, MoT reduces catastrophic forgetting, improves general reasoning beyond mathematics, and even cultivates a better teacher, indicating that consensus-filtered reasoning features transfer broadly. These results position MoT as a simple, scalable route to efficiently distilling long CoT capabilities from diverse teachers into compact students.
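
The weight-space merging step at the heart of MoT is easy to sketch: fine-tune a copy of the student on each teacher's CoT data, then average the resulting checkpoints. The sketch below assumes identical architectures and uses uniform averaging as one simple merge rule; `finetune_on` and `teacher_datasets` are placeholders, not the paper's code.

```python
import copy

def merge_students(state_dicts, weights=None):
    """Average fine-tuned student checkpoints in weight space (uniform by default)."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = copy.deepcopy(state_dicts[0])
    for key in merged:
        if merged[key].is_floating_point():   # leave integer buffers untouched
            merged[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
    return merged

def mot_round(student, teacher_datasets, finetune_on):
    """One MoT round: a teacher-specific SFT branch per teacher, then a weight-space merge."""
    variants = []
    for data in teacher_datasets:
        branch = copy.deepcopy(student)
        finetune_on(branch, data)             # placeholder for the supervised fine-tuning branch
        variants.append(branch.state_dict())
    student.load_state_dict(merge_students(variants))
    return student
```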

[244] An Interpretable Deep Learning Model for General Insurance Pricing

Patrick J. Laub, Tu Pho, Bernard Wong

Main category: cs.LG

TL;DR: ANAM is an interpretable deep learning model for insurance pricing that uses dedicated neural networks per covariate and pairwise interactions, maintaining predictive power while ensuring transparency through architectural constraints like sparsity, smoothness, and monotonicity.

DetailsMotivation: To bridge the gap between traditional actuarial methods' interpretability and neural networks' predictive power in insurance pricing, providing fully transparent and interpretable results while retaining strong predictive capabilities.

Method: Assigns dedicated neural networks (or subnetworks) to each individual covariate and pairwise interaction term, implementing architectural constraints for interpretability (sparsity) and practical requirements (smoothness, monotonicity) in insurance applications.

Result: Outperforms traditional actuarial and state-of-the-art machine learning methods in most cases on both synthetic and real insurance datasets while offering complete transparency in internal logic.

Conclusion: The Actuarial Neural Additive Model successfully combines strong predictive power with complete interpretability, making it suitable for insurance pricing applications where both accuracy and transparency are essential.

Abstract: This paper introduces the Actuarial Neural Additive Model, an inherently interpretable deep learning model for general insurance pricing that offers fully transparent and interpretable results while retaining the strong predictive power of neural networks. This model assigns a dedicated neural network (or subnetwork) to each individual covariate and pairwise interaction term to independently learn its impact on the modeled output while implementing various architectural constraints to allow for essential interpretability (e.g. sparsity) and practical requirements (e.g. smoothness, monotonicity) in insurance applications. The development of our model is grounded in a solid foundation, where we establish a concrete definition of interpretability within the insurance context, complemented by a rigorous mathematical framework. Comparisons in terms of prediction accuracy are made with traditional actuarial and state-of-the-art machine learning methods using both synthetic and real insurance datasets. The results show that the proposed model outperforms other methods in most cases while offering complete transparency in its internal logic, underscoring the strong interpretability and predictive capability.
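
The additive structure described above (one subnetwork per covariate, prediction as the sum of their outputs) is what makes each term inspectable. A minimal sketch of that backbone, omitting the pairwise-interaction subnetworks and the sparsity/smoothness/monotonicity constraints the paper adds:

```python
import torch
import torch.nn as nn

class AdditiveModel(nn.Module):
    """One small subnetwork per covariate; the prediction is the sum of their outputs."""
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.subnets = nn.ModuleList(
            nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(n_features)
        )
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):                     # x: (batch, n_features)
        terms = [net(x[:, [j]]) for j, net in enumerate(self.subnets)]
        return self.bias + torch.cat(terms, dim=1).sum(dim=1)  # each term is individually inspectable

model = AdditiveModel(n_features=8)
# For claim frequency/severity, the output would typically go through an exponential
# link and be trained with a Poisson or gamma deviance loss.
```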

[245] SHAining on Process Mining: Explaining Event Log Characteristics Impact on Algorithms

Andrea Maldonado, Christian M. M. Frey, Sai Anirudh Aryasomayajula, Ludwig Zellner, Stephan A. Fahrenkrog-Petersen, Thomas Seidl

Main category: cs.LG

TL;DR: SHAining is a novel approach that quantifies how different event log characteristics individually impact process mining algorithm performance metrics, addressing the gap in systematic analysis of characteristic effects.

DetailsMotivation: Existing process mining evaluations lack systematic analysis of how individual event log characteristics affect algorithm performance, and prior work often overlooks the associational effects of co-occurring characteristics in real-world logs.

Method: Developed SHAining to quantify marginal contributions of event log characteristics to algorithm metrics. Analyzed over 22,000 event logs covering diverse characteristics, using process discovery as downstream task to evaluate metrics like fitness, precision, and complexity.

Result: The approach uncovers which event log characteristics most significantly affect algorithms across various metrics and provides insights into how characteristic values correlate with their impact on algorithm robustness.

Conclusion: SHAining provides the first systematic quantification of event log characteristic contributions to process mining algorithm performance, offering valuable insights into algorithm robustness and characteristic impact assessment.

Abstract: Process mining aims to extract and analyze insights from event logs, yet algorithm metric results vary widely depending on structural event log characteristics. Existing work often evaluates algorithms on a fixed set of real-world event logs but lacks a systematic analysis of how event log characteristics impact algorithms individually. Moreover, since event logs are generated from processes, where characteristics co-occur, we focus on associational rather than causal effects to assess how strongly each overlapping characteristic affects evaluation metrics, without assuming isolated causal effects, a factor often neglected by prior work. We introduce SHAining, the first approach to quantify the marginal contribution of varying event log characteristics to process mining algorithms’ metrics. Using process discovery as a downstream task, we analyze over 22,000 event logs covering a wide span of characteristics to uncover which characteristics affect algorithms the most across metrics (e.g., fitness, precision, complexity). Furthermore, we offer novel insights about how the value of event log characteristics correlates with their contributed impact, assessing the algorithm’s robustness.
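
SHAining quantifies marginal contributions in the spirit of Shapley values. The generic Monte Carlo permutation estimator below illustrates the idea; the `metric` callback (run the discovery algorithm on logs generated with a given subset of varied characteristics and return a score) is a placeholder, not the paper's exact procedure.

```python
import random

def marginal_contributions(characteristics, metric, n_permutations=200):
    """Monte Carlo estimate of each characteristic's marginal contribution to `metric`."""
    contrib = {c: 0.0 for c in characteristics}
    for _ in range(n_permutations):
        order = random.sample(characteristics, len(characteristics))
        active, prev = [], metric([])          # baseline score with no characteristic varied
        for c in order:
            active.append(c)
            score = metric(active)
            contrib[c] += score - prev         # marginal gain of adding characteristic c
            prev = score
    return {c: v / n_permutations for c, v in contrib.items()}
```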

[246] Modified Loss of Momentum Gradient Descent: Fine-Grained Analysis

Matias D. Cattaneo, Boris Shigida

Main category: cs.LG

TL;DR: Analysis of Polyak heavy-ball momentum gradient descent showing it’s equivalent to plain gradient descent with a modified loss on attractive manifolds, with arbitrary precision approximation bounds and combinatorial polynomial structures.

DetailsMotivation: To understand the mathematical structure and behavior of gradient descent with Polyak heavy-ball momentum, particularly its relationship to plain gradient descent and the combinatorial properties hidden within the algorithm.

Method: Proved that on exponentially attractive invariant manifolds, HB is exactly plain GD with modified loss for small step sizes. Conducted fine-grained combinatorial analysis of memoryless approximations, derived continuous modified equations with rigorous bounds, and analyzed both full-batch and mini-batch variants.

Result: Established global approximation bounds O(h^R) for any finite order R≥2, discovered a rich family of polynomials including Eulerian and Narayana polynomials, and derived the principal flow that approximates the HB dynamics.

Conclusion: The analysis provides new insights into heavy-ball momentum features and establishes a framework for similar analysis of other optimization algorithms, with rigorous approximation guarantees and combinatorial structures revealed.

Abstract: We analyze gradient descent with Polyak heavy-ball momentum (HB) whose fixed momentum parameter $\beta \in (0, 1)$ provides exponential decay of memory. Building on Kovachki and Stuart (2021), we prove that on an exponentially attractive invariant manifold the algorithm is exactly plain gradient descent with a modified loss, provided that the step size $h$ is small enough. Although the modified loss does not admit a closed-form expression, we describe it with arbitrary precision and prove global (finite “time” horizon) approximation bounds $O(h^{R})$ for any finite order $R \geq 2$. We then conduct a fine-grained analysis of the combinatorics underlying the memoryless approximations of HB, in particular, finding a rich family of polynomials in $\beta$ hidden inside which contains Eulerian and Narayana polynomials. We derive continuous modified equations of arbitrary approximation order (with rigorous bounds) and the principal flow that approximates the HB dynamics, generalizing Rosca et al. (2023). Approximation theorems cover both full-batch and mini-batch HB. Our theoretical results shed new light on the main features of gradient descent with heavy-ball momentum, and outline a road-map for similar analysis of other optimization algorithms.
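
For reference, the heavy-ball recursion analyzed here is $x_{k+1} = x_k - h \nabla f(x_k) + \beta (x_k - x_{k-1})$. A toy numpy sketch on a quadratic loss (the modified loss itself has no closed form and is not computed here):

```python
import numpy as np

def heavy_ball(grad, x0, h=0.01, beta=0.9, steps=1000):
    """Polyak heavy-ball: x_{k+1} = x_k - h * grad(x_k) + beta * (x_k - x_{k-1})."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(steps):
        x_next = x - h * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Toy quadratic loss f(x) = 0.5 * x^T A x, so grad(x) = A @ x.
A = np.diag([1.0, 10.0])
x_min = heavy_ball(lambda x: A @ x, x0=np.array([5.0, 5.0]))
```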

[247] Heart Disease Prediction: A Comparative Study of Optimisers Performance in Deep Neural Networks

Chisom Chibuike, Adeyinka Ogunsanya

Main category: cs.LG

TL;DR: Comparison of 10 optimizers for heart disease prediction using MLP, finding RMSProp most effective with balanced performance across metrics like precision (0.765), recall (0.827), and AUC (0.841).

DetailsMotivation: Lack of systematic approach in optimizer selection for deep learning models, need to understand trade-offs between convergence speed and stability.

Method: Evaluated 10 optimizers training a Multi-layer Perceptron on heart disease dataset with consistent training paradigm, measuring convergence speed, stability, AUC, precision, and recall.

Result: Adagrad and Adadelta were more stable but slower to converge. RMSProp performed best overall with fast training time and strong metrics (precision: 0.765, recall: 0.827, AUC: 0.841), though not the most stable.

Conclusion: RMSProp is recommended for heart disease prediction tasks due to balanced performance. Systematic optimizer evaluation should be adopted to improve scientific rigor and model performance in deep learning.

Abstract: Optimization has been an important factor and topic of interest in training deep learning models, yet less attention has been given to how we select the optimizers we use to train these models. Hence, there is a need to dive deeper into how we select the optimizers we use for training and the metrics that determine this selection. In this work, we compare the performance of 10 different optimizers in training a simple Multi-layer Perceptron model using a heart disease dataset from Kaggle. We set up a consistent training paradigm and evaluate the optimizers based on metrics such as convergence speed and stability. We also include other machine learning evaluation metrics such as AUC, Precision, and Recall, which are central to classification problems. Our results show that there are trade-offs between convergence speed and stability, as optimizers like Adagrad and Adadelta, which are more stable, took longer to converge. Across all our metrics, we identified RMSProp as the most effective optimizer for this heart disease prediction task because it offered a balanced performance across key metrics. It achieved a precision of 0.765, a recall of 0.827, and an AUC of 0.841, along with faster training time. However, it was not the most stable. We recommend that, in less compute-constrained environments, this method of choosing optimizers through a thorough evaluation should be adopted to increase the scientific rigor and performance of training deep learning models.
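
A study like this boils down to holding everything fixed except the optimizer. A minimal PyTorch sketch of that loop (layer sizes, learning rates, and the binary-classification setup are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

def make_mlp(n_features):
    return nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                         nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))

OPTIMIZERS = {
    "SGD":      lambda params: torch.optim.SGD(params, lr=1e-2),
    "RMSprop":  lambda params: torch.optim.RMSprop(params, lr=1e-3),
    "Adam":     lambda params: torch.optim.Adam(params, lr=1e-3),
    "Adagrad":  lambda params: torch.optim.Adagrad(params, lr=1e-2),
    "Adadelta": lambda params: torch.optim.Adadelta(params, lr=1.0),
}

def train(name, X, y, epochs=50):
    """X: (n, n_features) float tensor; y: (n,) float tensor of 0/1 labels."""
    model, loss_fn = make_mlp(X.shape[1]), nn.BCEWithLogitsLoss()
    opt = OPTIMIZERS[name](model.parameters())
    for _ in range(epochs):                   # identical training paradigm for every optimizer
        opt.zero_grad()
        loss = loss_fn(model(X).squeeze(1), y)
        loss.backward()
        opt.step()
    return loss.item()
```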

[248] Variational Rank Reduction Autoencoders for Generative

Alicia Tierz, Jad Mounayer, Beatriz Moya, Francisco Chinesta

Main category: cs.LG

TL;DR: Hybrid framework combining VRRAE and DeepONet for efficient generative thermal design with continuous latent representations and accurate gradient prediction.

DetailsMotivation: Address high computational costs of thermal simulations and limitations of conventional generative models like AEs/VAEs that produce unstructured latent spaces with discontinuities.

Method: VRRAE uses truncated SVD for continuous, interpretable latent representations, then DeepONet uses this encoding in branch network with spatial coordinates in trunk network to predict temperature gradients.

Result: Enhanced geometric reconstruction quality, improved gradient prediction accuracy, and substantial inference efficiency gains compared to traditional numerical solvers.

Conclusion: Structured latent representations are crucial for operator learning, and combining generative models with operator networks shows great potential for thermal design and engineering applications.

Abstract: Generative thermal design for complex geometries is fundamental in many areas of engineering, yet it faces two main challenges: the high computational cost of high-fidelity simulations and the limitations of conventional generative models. Approaches such as autoencoders (AEs) and variational autoencoders (VAEs) often produce unstructured latent spaces with discontinuities, which restricts their capacity to explore designs and generate physically consistent solutions. To address these limitations, we propose a hybrid framework that combines Variational Rank-Reduction Autoencoders (VRRAEs) with Deep Operator Networks (DeepONets). The VRRAE introduces a truncated SVD within the latent space, leading to continuous, interpretable, and well-structured representations that mitigate posterior collapse and improve geometric reconstruction. The DeepONet then exploits this compact latent encoding in its branch network, together with spatial coordinates in the trunk network, to predict temperature gradients efficiently and accurately. This hybrid approach not only enhances the quality of generated geometries and the accuracy of gradient prediction, but also provides a substantial advantage in inference efficiency compared to traditional numerical solvers. Overall, the study underscores the importance of structured latent representations for operator learning and highlights the potential of combining generative models and operator networks in thermal design and broader engineering applications.
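
The DeepONet half of the pipeline combines a branch network (fed the VRRAE latent code) with a trunk network (fed the query coordinates) via an inner product. A minimal sketch with illustrative layer sizes:

```python
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    def __init__(self, latent_dim, coord_dim=2, p=64):
        super().__init__()
        # Branch: encodes the geometry via the VRRAE latent code.
        self.branch = nn.Sequential(nn.Linear(latent_dim, 128), nn.Tanh(), nn.Linear(128, p))
        # Trunk: encodes the query coordinates where the field is evaluated.
        self.trunk = nn.Sequential(nn.Linear(coord_dim, 128), nn.Tanh(), nn.Linear(128, p))

    def forward(self, z, coords):
        # z: (B, latent_dim), coords: (B, N, coord_dim) -> predicted field values (B, N)
        b = self.branch(z)                        # (B, p)
        t = self.trunk(coords)                    # (B, N, p)
        return torch.einsum("bp,bnp->bn", b, t)   # inner product of branch and trunk bases
```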

[249] Data Skeleton Learning: Scalable Active Clustering with Sparse Graph Structures

Wen-Bo Xie, Xun Fu, Bin Chen, Yan-Li Lee, Tao Deng, Tian Zou, Xin Wang, Zhen Liu, Jaideep Srivastava

Main category: cs.LG

TL;DR: A graph-based active clustering algorithm that uses two sparse graphs to achieve efficient and scalable pairwise constraint-based clustering with minimal user annotations.

DetailsMotivation: To address the need for efficient and scalable pairwise constraint-based active clustering for large-scale data processing in applications like data mining, knowledge annotation, and AI model pre-training, with goals to reduce computational costs, enhance constraint impact, and minimize memory usage.

Method: Proposes a graph-based active clustering algorithm that utilizes two sparse graphs: one for representing data relationships (data skeleton) and another for updating this skeleton. The graphs work together to refine connected subgraphs and create nested clusters.

Result: Empirical analysis shows the algorithm achieves more accurate clustering with dramatically fewer user-provided constraints, outperforms counterparts in computational performance and scalability, and maintains robustness across various distance metrics.

Conclusion: The proposed graph-based approach successfully addresses efficiency and scalability challenges in active clustering, delivering superior performance with reduced computational costs and annotation requirements while maintaining accuracy and robustness.

Abstract: In this work, we focus on the efficiency and scalability of pairwise constraint-based active clustering, crucial for processing large-scale data in applications such as data mining, knowledge annotation, and AI model pre-training. Our goals are threefold: (1) to reduce computational costs for iterative clustering updates; (2) to enhance the impact of user-provided constraints to minimize annotation requirements for precise clustering; and (3) to cut down memory usage in practical deployments. To achieve these aims, we propose a graph-based active clustering algorithm that utilizes two sparse graphs: one for representing relationships between data (our proposed data skeleton) and another for updating this data skeleton. These two graphs work in concert, enabling the refinement of connected subgraphs within the data skeleton to create nested clusters. Our empirical analysis confirms that the proposed algorithm consistently facilitates more accurate clustering with dramatically less input of user-provided constraints, and outperforms its counterparts in terms of computational performance and scalability, while maintaining robustness across various distance metrics.

[250] MAESTRO: Multi-modal Adaptive Ensemble for Spectro-Temporal Robust Optimization

Hong Liu

Main category: cs.LG

TL;DR: MAESTRO is a multi-modal ensemble model for robust influenza forecasting that fuses surveillance, web search, and weather data using spectro-temporal architecture and achieves state-of-the-art performance with R-square 0.956.

DetailsMotivation: Timely and robust influenza incidence forecasting is critical for public health decision-making, requiring adaptive fusion of multi-modal data sources for accurate predictions.

Method: Decomposes time series into seasonal/trend components, processes through hybrid pipeline with Transformer encoders, Mamba state-space model, multi-scale convolutions, frequency analysis, and cross-channel attention for multi-modal fusion, ending with sequence-to-sequence forecasting.

Result: Achieved state-of-the-art R-square of 0.956 on 11+ years of Hong Kong influenza data, with extensive ablations confirming significant contributions from multi-modal fusion and spectro-temporal components.

Conclusion: Presents a powerful unified framework demonstrating critical synergy of advanced spectro-temporal modeling and multi-modal data fusion for robust epidemiological forecasting, with publicly available modular pipeline.

Abstract: Timely and robust influenza incidence forecasting is critical for public health decision-making. To address this, we present MAESTRO, a Multi-modal Adaptive Ensemble for Spectro-Temporal Robust Optimization. MAESTRO achieves robustness by adaptively fusing multi-modal inputs (surveillance, web search trends, and meteorological data) and leveraging a comprehensive spectro-temporal architecture. The model first decomposes time series into seasonal and trend components. These are then processed through a hybrid feature enhancement pipeline combining Transformer-based encoders, a Mamba state-space model for long-range dependencies, multi-scale temporal convolutions, and a frequency-domain analysis module. A cross-channel attention mechanism further integrates information across the different data modalities. Finally, a temporal projection head performs sequence-to-sequence forecasting, with an optional estimator to quantify prediction uncertainty. Evaluated on over 11 years of Hong Kong influenza data (excluding the COVID-19 period), MAESTRO shows strong competitive performance, demonstrating a superior model fit and relative accuracy, achieving a state-of-the-art R-square of 0.956. Extensive ablations confirm the significant contributions of both multi-modal fusion and the spectro-temporal components. Our modular and reproducible pipeline is made publicly available to facilitate deployment and extension to other regions and pathogens, and it presents a powerful, unified framework that demonstrates the critical synergy of advanced spectro-temporal modeling and multi-modal data fusion for robust epidemiological forecasting.
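
The pipeline's first stage, splitting each series into seasonal and trend components, can be approximated with a simple centered moving average; the exact decomposition module used by MAESTRO is not specified here, so this is only an illustrative stand-in:

```python
import numpy as np

def decompose(series, window=52):
    """Split a weekly series into a trend (centered moving average) and the remainder."""
    pad = window // 2
    padded = np.pad(series, pad, mode="edge")
    trend = np.convolve(padded, np.ones(window) / window, mode="same")[pad:pad + len(series)]
    return trend, series - trend      # the remainder carries the seasonal + irregular signal

# Synthetic stand-in for ~11 years of weekly influenza incidence.
weekly_ili = np.sin(np.linspace(0, 22 * np.pi, 11 * 52)) + 0.1 * np.random.randn(11 * 52)
trend, seasonal = decompose(weekly_ili)
```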

[251] Interpretability as Alignment: Making Internal Understanding a Design Principle

Aadit Sengupta, Pratinav Seth, Vinay Kumar Sankarapu

Main category: cs.LG

TL;DR: Interpretability should be a core design principle for AI alignment, not just a diagnostic tool, to ensure transparent and trustworthy systems that reliably align with human values.

DetailsMotivation: Large neural models deployed in high-stakes settings raise concerns about reliable alignment with human values, requiring internal transparency through interpretability.

Method: Advocates for mechanistic interpretability approaches (circuit tracing, activation patching) over post-hoc methods (LIME, SHAP) to provide causal insight into internal failures and misaligned reasoning.

Result: Identifies that mechanistic interpretability can reveal deceptive or misaligned reasoning that behavioral methods like RLHF, red teaming, or Constitutional AI may overlook.

Conclusion: Progress on safe and trustworthy AI depends on making interpretability a first-class objective in AI R&D to ensure systems are auditable, transparent, and aligned with human intent.

Abstract: Large neural models are increasingly deployed in high-stakes settings, raising concerns about whether their behavior reliably aligns with human values. Interpretability provides a route to internal transparency by revealing the computations that drive outputs. We argue that interpretability, especially mechanistic approaches, should be treated as a design principle for alignment, not an auxiliary diagnostic tool. Post-hoc methods such as LIME or SHAP offer intuitive but correlational explanations, while mechanistic techniques like circuit tracing or activation patching yield causal insight into internal failures, including deceptive or misaligned reasoning that behavioral methods like RLHF, red teaming, or Constitutional AI may overlook. Despite these advantages, interpretability faces challenges of scalability, epistemic uncertainty, and mismatches between learned representations and human concepts. Our position is that progress on safe and trustworthy AI will depend on making interpretability a first-class objective of AI research and development, ensuring that systems are not only effective but also auditable, transparent, and aligned with human intent.
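
Activation patching, one of the mechanistic techniques named above, can be implemented with forward hooks: cache an intermediate activation on a clean input, then overwrite that activation during a corrupted run and observe how the output shifts. A generic PyTorch sketch (`model`, `layer`, and the inputs are placeholders):

```python
import torch

def activation_patch(model, layer, clean_input, corrupted_input):
    """Replace `layer`'s activation on the corrupted run with its clean-run value."""
    cache = {}

    def save_hook(module, inputs, output):
        cache["clean"] = output.detach()

    def patch_hook(module, inputs, output):
        return cache["clean"]                 # returning a value replaces the module's output

    h = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        baseline = model(clean_input)
    h.remove()

    h = layer.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched = model(corrupted_input)
    h.remove()
    return baseline, patched                  # compare to attribute the layer's causal effect
```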

[252] Classification of 24-hour movement behaviors from wrist-worn accelerometer data: from handcrafted features to deep learning techniques

Alireza Sameh, Mehrdad Rostami, Mourad Oussalah, Vahid Farrahi

Main category: cs.LG

TL;DR: DL algorithms using raw acceleration signals achieved 80-85% accuracy for 24-hour movement behavior classification, slightly outperforming classical ML methods (70-81% accuracy) using handcrafted features.

DetailsMotivation: To compare the performance of deep learning vs classical machine learning algorithms for classifying 24-hour movement behaviors (sleep, sedentary, LPA, MVPA) from wrist-worn accelerometer data.

Method: Used data from 151 adults wearing wrist accelerometers. Compared 4 DL algorithms (LSTM, BiLSTM, GRU, 1D-CNN) using raw signals and handcrafted features against 6 classical ML algorithms (RF, SVM, XGBoost, LR, ANN, DT) using 104 handcrafted features on 10-second windows.

Result: DL methods with raw signals achieved 80-85% accuracy, while both DL and classical ML with handcrafted features achieved 70-81% accuracy. Higher confusion between MVPA and LPA categories compared to sleep/sedentary.

Conclusion: DL methods with raw acceleration signals only slightly outperform classical ML with handcrafted features for movement behavior classification, suggesting limited advantage of raw signal processing over feature engineering.

Abstract: Purpose: We compared the performance of deep learning (DL) and classical machine learning (ML) algorithms for the classification of 24-hour movement behavior into sleep, sedentary, light intensity physical activity (LPA), and moderate-to-vigorous intensity physical activity (MVPA). Methods: Open-access data from 151 adults wearing a wrist-worn accelerometer (Axivity-AX3) was used. Participants were randomly divided into training, validation, and test sets (121, 15, and 15 participants, respectively). Raw acceleration signals were segmented into non-overlapping 10-second windows, and then a total of 104 handcrafted features were extracted. Four DL algorithms, namely Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BiLSTM), Gated Recurrent Units (GRU), and One-Dimensional Convolutional Neural Network (1D-CNN), were trained using raw acceleration signals and with handcrafted features extracted from these signals to predict 24-hour movement behavior categories. The handcrafted features were also used to train classical ML algorithms, namely Random Forest (RF), Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), Logistic Regression (LR), Artificial Neural Network (ANN), and Decision Tree (DT), for classifying 24-hour movement behavior intensities. Results: LSTM, BiLSTM, and GRU showed an overall accuracy of approximately 85% when trained with raw acceleration signals, and 1D-CNN an overall accuracy of approximately 80%. When trained on handcrafted features, the overall accuracy for both DL and classical ML algorithms ranged from 70% to 81%. Overall, there was higher confusion in the classification of MVPA and LPA, compared to the sleep and sedentary categories. Conclusion: DL methods with raw acceleration signals had only slightly better performance in predicting 24-hour movement behavior intensities, compared to when DL and classical ML were trained with handcrafted features.
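
The preprocessing and sequence-model setup described above reduce to windowing the raw tri-axial signal and feeding windows to a recurrent classifier. A minimal sketch, assuming a 100 Hz sampling rate (a common AX3 configuration, not stated in the summary):

```python
import numpy as np
import torch
import torch.nn as nn

FS = 100                              # assumed sampling rate in Hz
WINDOW = 10 * FS                      # non-overlapping 10-second windows

def segment(signal):
    """signal: (n_samples, 3) raw tri-axial acceleration -> (n_windows, WINDOW, 3)."""
    n = signal.shape[0] // WINDOW
    return signal[: n * WINDOW].reshape(n, WINDOW, 3)

class LSTMClassifier(nn.Module):
    def __init__(self, n_classes=4):  # sleep, sedentary, LPA, MVPA
        super().__init__()
        self.lstm = nn.LSTM(input_size=3, hidden_size=64, batch_first=True)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):             # x: (batch, WINDOW, 3)
        _, (h, _) = self.lstm(x)
        return self.fc(h[-1])         # logits over the four behavior classes

windows = torch.tensor(segment(np.random.randn(60 * FS, 3)), dtype=torch.float32)
logits = LSTMClassifier()(windows)    # one prediction per 10-second window
```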

[253] Towards Interpretable Deep Neural Networks for Tabular Data

Khawla Elhadri, Jörg Schlötterer, Christin Seifert

Main category: cs.LG

TL;DR: XNNTab is an interpretable neural architecture for tabular data that uses sparse autoencoders to learn monosemantic features with human-interpretable semantics, achieving competitive performance while maintaining full interpretability.

DetailsMotivation: Tabular data is widely used in critical domains like finance and healthcare, but current DNNs for tabular data are blackboxes with poor interpretability, limiting their trustworthiness and adoption in sensitive applications.

Method: Uses a sparse autoencoder (SAE) to learn a dictionary of monosemantic features in the latent prediction space, then applies automated methods to assign human-interpretable semantics to these features, enabling predictions as linear combinations of meaningful components.

Result: Empirical evaluations show XNNTab achieves performance on par with or exceeding state-of-the-art black-box neural models and classical machine learning approaches while being fully interpretable.

Conclusion: XNNTab successfully bridges the gap between performance and interpretability in tabular data analysis, providing a transparent neural architecture that maintains competitive predictive power while offering complete interpretability through semantically meaningful feature representations.

Abstract: Tabular data is the foundation of many applications in fields such as finance and healthcare. Although DNNs tailored for tabular data achieve competitive predictive performance, they are blackboxes with little interpretability. We introduce XNNTab, a neural architecture that uses a sparse autoencoder (SAE) to learn a dictionary of monosemantic features within the latent space used for prediction. Using an automated method, we assign human-interpretable semantics to these features. This allows us to represent predictions as linear combinations of semantically meaningful components. Empirical evaluations demonstrate that XNNTab attains performance on par with or exceeding that of state-of-the-art, black-box neural models and classical machine learning approaches while being fully interpretable.
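
The core mechanism, a sparse autoencoder that learns an overcomplete dictionary of (ideally monosemantic) features in the latent prediction space, can be sketched as follows; the dimensions and L1 coefficient are illustrative:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete dictionary with an L1 penalty that pushes codes toward monosemanticity."""
    def __init__(self, d_latent, d_dict):
        super().__init__()
        self.encoder = nn.Linear(d_latent, d_dict)
        self.decoder = nn.Linear(d_dict, d_latent)

    def forward(self, h):
        code = torch.relu(self.encoder(h))   # sparse, non-negative dictionary activations
        return self.decoder(code), code

sae = SparseAutoencoder(d_latent=128, d_dict=1024)

def sae_loss(h, l1_coeff=1e-3):
    recon, code = sae(h)
    return ((recon - h) ** 2).mean() + l1_coeff * code.abs().mean()
```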

[254] An upper bound of the silhouette validation metric for clustering

Hugo Sträng, Tai Dinh

Main category: cs.LG

TL;DR: This paper presents a method to compute data-dependent upper bounds for silhouette coefficients, providing more realistic benchmarks for clustering quality evaluation than the theoretical maximum of 1.

DetailsMotivation: The standard upper limit of 1 for silhouette width is often unattainable in practice, making it difficult to assess how close a clustering result is to the dataset-specific optimum.

Method: The authors derive sharp upper bounds for individual data points’ silhouette width and aggregate these to create a canonical data-dependent upper bound on the average silhouette width (ASW).

Result: The bounds are provably near-tight across synthetic and real datasets, enabling better evaluation of clustering quality and early stopping in optimization loops.

Conclusion: The proposed data-dependent bounds significantly enrich cluster quality evaluation by providing realistic, dataset-specific benchmarks for silhouette-based clustering assessment.

Abstract: The silhouette coefficient summarizes, per observation, cohesion versus separation in [-1, 1]; the average silhouette width (ASW) is a common internal measure of clustering quality where higher values indicate more coveted results. However, the dataset-specific maximum of ASW is typically unknown, and the standard upper limit 1 is often unattainable. In this work, we derive for each data point in a given dataset a sharp upper bound on its silhouette width. By aggregating these individual bounds, we present a canonical data-dependent upper bound on ASW that often assumes values well below 1. The presented bounds can indicate whether individual data points can ever be well placed, enable early stopping of silhouette-based optimization loops, and help answer a key question: How close is my clustering result to the best possible outcome on this specific data? Across synthetic and real datasets, the bounds are provably near-tight in many cases and offer significant enrichment of cluster quality evaluation.
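
For context, the per-point silhouette width that the paper bounds is s(i) = (b_i - a_i) / max(a_i, b_i), with a_i the mean intra-cluster distance and b_i the mean distance to the nearest other cluster; the ASW is the mean of the s(i). A small reference implementation (not the paper's bound computation):

```python
import numpy as np
from scipy.spatial.distance import cdist

def silhouette_widths(X, labels):
    """Per-point silhouette s(i) = (b_i - a_i) / max(a_i, b_i)."""
    labels = np.asarray(labels)
    D = cdist(X, X)
    s = np.zeros(len(X))
    for i, li in enumerate(labels):
        same = (labels == li)
        same[i] = False
        if not same.any():                 # singleton cluster: s(i) = 0 by convention
            continue
        a = D[i, same].mean()              # mean intra-cluster distance
        b = min(D[i, labels == lj].mean() for lj in set(labels.tolist()) if lj != li)
        s[i] = (b - a) / max(a, b)
    return s                               # ASW is s.mean(); the paper bounds each s(i) from above
```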

[255] Replicable Reinforcement Learning with Linear Function Approximation

Eric Eaton, Marcel Hussing, Michael Kearns, Aaron Roth, Sikata Bela Sengupta, Jessica Sorrell

Main category: cs.LG

TL;DR: First provably efficient replicable reinforcement learning algorithms for linear function approximation settings, addressing the challenge of experimental replication in ML.

DetailsMotivation: Replicability is a fundamental challenge in machine learning, especially in reinforcement learning where algorithms are known to be unstable. While replicable algorithms exist for tabular RL, extending these guarantees to practical function approximation settings remained an open problem.

Method: Developed replicable methods for linear function approximation in RL, including two efficient algorithms for replicable random design regression and uncentered covariance estimation. Leveraged these tools to create provably efficient replicable RL algorithms for linear Markov decision processes in both generative model and episodic settings.

Result: Successfully developed the first provably efficient replicable RL algorithms for linear function approximation. Experimental evaluation shows these algorithms can inspire more consistent neural policies.

Conclusion: This work makes significant progress in addressing replicability challenges in RL by providing the first efficient replicable algorithms for linear function approximation settings, bridging the gap between theoretical replicability guarantees and practical RL applications.

Abstract: Replication of experimental results has been a challenge faced by many scientific disciplines, including the field of machine learning. Recent work on the theory of machine learning has formalized replicability as the demand that an algorithm produce identical outcomes when executed twice on different samples from the same distribution. Provably replicable algorithms are especially interesting for reinforcement learning (RL), where algorithms are known to be unstable in practice. While replicable algorithms exist for tabular RL settings, extending these guarantees to more practical function approximation settings has remained an open problem. In this work, we make progress by developing replicable methods for linear function approximation in RL. We first introduce two efficient algorithms for replicable random design regression and uncentered covariance estimation, each of independent interest. We then leverage these tools to provide the first provably efficient replicable RL algorithms for linear Markov decision processes in both the generative model and episodic settings. Finally, we evaluate our algorithms experimentally and show how they can inspire more consistent neural policies.

[256] Signal Fidelity Index-Aware Calibration for Dementia Predictions Across Heterogeneous Real-World Data

Jingya Cheng, Jiazi Tian, Federica Spoto, Alaleh Azhir, Daniel Mork, Hossein Estiri

Main category: cs.LG

TL;DR: Developed Signal Fidelity Index (SFI) to quantify diagnostic data quality in dementia EHRs and SFI-aware calibration to improve ML model performance across healthcare systems without outcome labels.

DetailsMotivation: Machine learning models trained on EHRs degrade across healthcare systems due to distributional shift, particularly diagnostic signal decay - variability in diagnostic quality and consistency across institutions.

Method: Built simulation framework with 2,500 synthetic datasets. SFI derived from six components: diagnostic specificity, temporal consistency, entropy, contextual concordance, medication alignment, and trajectory stability. Applied SFI-aware calibration with multiplicative adjustment optimized across simulation batches.

Result: At optimal parameter (α=2.0), SFI-aware calibration significantly improved all metrics (p<0.001): 10.3% Balanced Accuracy, 32.5% Recall, 31.9% Precision, 26.1% F1-score. Performance approached reference standards with substantial improvements.

Conclusion: Diagnostic signal decay is a tractable barrier to model generalization. SFI-aware calibration provides practical, label-free strategy to enhance prediction across healthcare contexts, especially for administrative datasets lacking outcome labels.

Abstract: Background: Machine learning models trained on electronic health records (EHRs) often degrade across healthcare systems due to distributional shift. A fundamental but underexplored factor is diagnostic signal decay: variability in diagnostic quality and consistency across institutions, which affects the reliability of codes used for training and prediction. Objective: To develop a Signal Fidelity Index (SFI) quantifying diagnostic data quality at the patient level in dementia, and to test SFI-aware calibration for improving model performance across heterogeneous datasets without outcome labels. Methods: We built a simulation framework generating 2,500 synthetic datasets, each with 1,000 patients and realistic demographics, encounters, and coding patterns based on dementia risk factors. The SFI was derived from six interpretable components: diagnostic specificity, temporal consistency, entropy, contextual concordance, medication alignment, and trajectory stability. SFI-aware calibration applied a multiplicative adjustment, optimized across 50 simulation batches. Results: At the optimal parameter ($\alpha$ = 2.0), SFI-aware calibration significantly improved all metrics (p < 0.001). Gains ranged from 10.3% for Balanced Accuracy to 32.5% for Recall, with notable increases in Precision (31.9%) and F1-score (26.1%). Performance approached reference standards, with F1-score and Recall within 1% and Balanced Accuracy and Detection Rate improved by 52.3% and 41.1%, respectively. Conclusions: Diagnostic signal decay is a tractable barrier to model generalization. SFI-aware calibration provides a practical, label-free strategy to enhance prediction across healthcare contexts, particularly for large-scale administrative datasets lacking outcome labels.

[257] Perfectly-Private Analog Secure Aggregation in Federated Learning

Delio Jaramillo-Velez, Charul Rajput, Ragnar Freij-Hollanti, Camilla Hollanti, Alexandre Graell i Amat

Main category: cs.LG

TL;DR: A novel secure aggregation method for federated learning using torus instead of finite fields, ensuring perfect privacy without accuracy loss.

DetailsMotivation: Address privacy risks in federated learning by preventing information leakage from local models while maintaining model accuracy, overcoming limitations of finite field approaches.

Method: Proposes secure parameter aggregation using the torus (rather than finite fields) with uniform distribution masking, avoiding fixed-point modular arithmetic issues of finite field methods.

Result: Experimental results show similar performance to non-secure models while maintaining perfect privacy. Outperforms finite field secure aggregation in accuracy and cosine similarity.

Conclusion: Torus-based secure aggregation provides perfect privacy protection without accuracy degradation, making it a safer and more effective choice for federated learning compared to finite field approaches.

Abstract: In federated learning, multiple parties train models locally and share their parameters with a central server, which aggregates them to update a global model. To address the risk of exposing sensitive data through local models, secure aggregation via secure multiparty computation has been proposed to enhance privacy. At the same time, perfect privacy can only be achieved by a uniform distribution of the masked local models to be aggregated. This raises a problem when working with real-valued data, as there is no measure on the reals that is invariant under the masking operation, and hence information leakage is bound to occur. Shifting the data to a finite field circumvents this problem, but as a downside runs into an inherent accuracy-complexity trade-off due to fixed-point modular arithmetic, as opposed to floating-point numbers, which can simultaneously handle numbers of varying magnitudes. In this paper, a novel secure parameter aggregation method is proposed that employs the torus rather than a finite field. This approach guarantees perfect privacy for each party’s data by utilizing the uniform distribution on the torus, while avoiding accuracy losses. Experimental results show that the new protocol performs similarly to the model without secure aggregation while maintaining perfect privacy. Compared to finite-field secure aggregation, the torus-based protocol can in some cases significantly outperform it in terms of model accuracy and cosine similarity, hence making it a safer choice.
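
The key property is that addition modulo 1 (the torus) preserves the uniform distribution, so a uniformly masked share reveals nothing about the underlying update. The toy sketch below uses pairwise cancelling masks, one common secure-aggregation construction, and is not necessarily the paper's exact protocol:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_clients = 4, 3
updates = [0.01 * rng.normal(size=d) for _ in range(n_clients)]   # small local model deltas

# Pairwise masks: client i adds r_ij and client j subtracts it, so all masks cancel in the sum.
masks = {(i, j): rng.uniform(0, 1, size=d)
         for i in range(n_clients) for j in range(i + 1, n_clients)}

def masked_share(i):
    share = updates[i].copy()
    for (a, b), r in masks.items():
        if a == i:
            share += r
        elif b == i:
            share -= r
    return np.mod(share, 1.0)               # each share is uniform on the torus [0, 1)^d

aggregate = np.mod(sum(masked_share(i) for i in range(n_clients)), 1.0)
# The server recovers the true sum on the torus without seeing any individual update.
assert np.allclose(aggregate, np.mod(sum(updates), 1.0))
```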

[258] Reshaping the Forward-Forward Algorithm with a Similarity-Based Objective

James Gong, Raymond Luo, Emma Wang, Leon Ge, Bruce Li, Felix Marattukalam, Waleed Abdulla

Main category: cs.LG

TL;DR: FAUST algorithm combines Forward-Forward with similarity learning to eliminate multiple forward passes during inference while improving accuracy close to backpropagation levels.

DetailsMotivation: Backpropagation has biological plausibility issues (backward locking, global error propagation), and while Forward-Forward addresses these, it suffers from accuracy deficits and inefficient inference requiring multiple forward passes.

Method: Integrates Forward-Forward algorithm with similarity learning frameworks to create FAUST (Forward-Forward Algorithm Unified with Similarity-based Tuplet loss), eliminating the need for multiple forward passes during inference.

Result: FAUST substantially improves accuracy on MNIST, Fashion-MNIST, and CIFAR-10 datasets. On CIFAR-10, achieves 56.22% accuracy with MLP, approaching backpropagation’s 57.63% benchmark.

Conclusion: FAUST successfully addresses both biological plausibility concerns and inference efficiency issues of Forward-Forward while narrowing the performance gap with backpropagation.

Abstract: Backpropagation is the pivotal algorithm underpinning the success of artificial neural networks, yet it has critical limitations such as biologically implausible backward locking and global error propagation. To circumvent these constraints, the Forward-Forward algorithm was proposed as a more biologically plausible method that replaces the backward pass with an additional forward pass. Despite this advantage, the Forward-Forward algorithm significantly trails backpropagation in accuracy, and its optimal form exhibits low inference efficiency due to multiple forward passes required. In this work, the Forward-Forward algorithm is reshaped through its integration with similarity learning frameworks, eliminating the need for multiple forward passes during inference. This proposed algorithm is named Forward-Forward Algorithm Unified with Similarity-based Tuplet loss (FAUST). Empirical evaluations on MNIST, Fashion-MNIST, and CIFAR-10 datasets indicate that FAUST substantially improves accuracy, narrowing the gap with backpropagation. On CIFAR-10, FAUST achieves 56.22% accuracy with a simple multi-layer perceptron architecture, approaching the backpropagation benchmark of 57.63% accuracy.
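
The Forward-Forward ingredient of FAUST trains each layer locally with a goodness objective: squared activations are pushed above a threshold for positive data and below it for negative data. A per-layer sketch in that style, with the similarity/tuplet-loss component omitted:

```python
import torch
import torch.nn as nn

class FFLayer(nn.Module):
    """One layer trained locally with the Forward-Forward 'goodness' objective."""
    def __init__(self, d_in, d_out, threshold=2.0, lr=1e-3):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.threshold = threshold
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        x = x / (x.norm(dim=1, keepdim=True) + 1e-8)   # normalize so goodness cannot leak forward
        return torch.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)  # goodness of positive examples
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)  # goodness of negative examples
        # Push positive goodness above the threshold and negative goodness below it.
        logits = torch.cat([self.threshold - g_pos, g_neg - self.threshold])
        loss = torch.nn.functional.softplus(logits).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        # Pass (detached) outputs on to the next layer's local training step.
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()
```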

[259] A layered architecture for log analysis in complex IT systems

Thorsten Wittkopp

Main category: cs.LG

TL;DR: A three-layered AIOps architecture for automated log analysis that enables DevOps teams to efficiently detect and resolve system failures through autonomous log labeling, anomaly detection, and root cause analysis.

DetailsMotivation: The growing complexity of IT systems challenges DevOps teams in maintaining stability and reliability. Log analysis provides critical insights but requires automation to handle complex behaviors and failures efficiently.

Method: A three-layered architecture: 1) Log Investigation layer with autonomous log labeling and anomaly classification taxonomy, 2) Anomaly Detection layer with flexible method adaptable to different training scenarios, 3) Root Cause Analysis layer that identifies minimal log sets describing failures and their sequences.

Result: Achieved F1-scores between 0.98-1.0 on public and industry datasets for anomaly detection. Root cause analysis consistently detects 90-98% of root cause log lines within top 10 candidates.

Conclusion: The integrated three-layer architecture provides DevOps teams with robust methods to enhance IT system reliability by automating log analysis and failure resolution processes.

Abstract: In the evolving IT landscape, stability and reliability of systems are essential, yet their growing complexity challenges DevOps teams in implementation and maintenance. Log analysis, a core element of AIOps, provides critical insights into complex behaviors and failures. This dissertation introduces a three-layered architecture to support DevOps in failure resolution. The first layer, Log Investigation, performs autonomous log labeling and anomaly classification. We propose a method that labels log data without manual effort, enabling supervised training and precise evaluation of anomaly detection. Additionally, we define a taxonomy that groups anomalies into three categories, ensuring appropriate method selection. The second layer, Anomaly Detection, detects behaviors deviating from the norm. We propose a flexible Anomaly Detection method adaptable to unsupervised, weakly supervised, and supervised training. Evaluations on public and industry datasets show F1-scores between 0.98 and 1.0, ensuring reliable anomaly detection. The third layer, Root Cause Analysis, identifies minimal log sets describing failures, their origin, and event sequences. By balancing training data and identifying key services, our Root Cause Analysis method consistently detects 90-98% of root cause log lines within the top 10 candidates, providing actionable insights for mitigation. Our research addresses how log analysis methods can be designed and optimized to help DevOps resolve failures efficiently. By integrating these three layers, the architecture equips teams with robust methods to enhance IT system reliability.

[260] Machine Learning-Based Prediction of Speech Arrest During Direct Cortical Stimulation Mapping

Nikasadat Emami, Amirhossein Khalilian-Gourtani, Jianghao Qian, Antoine Ratouchniak, Xupeng Chen, Yao Wang, Adeen Flinker

Main category: cs.LG

TL;DR: Machine learning models using ECoG data can predict speech-critical brain regions with high accuracy, matching clinical gold standard ESM but non-invasively.

DetailsMotivation: Electrical Stimulation Mapping (ESM) is invasive and time-consuming for identifying speech-critical brain regions during presurgical evaluation. A non-invasive alternative is needed.

Method: Analyzed ECoG data from 16 participants performing speech tasks. Developed classification models combining neural activity, anatomical regions, and functional connectivity features. Used trial-level RBF-kernel SVM with MLP-based aggregation for electrode classification.

Result: Best model achieved ROC-AUC: 0.87 and PR-AUC: 0.57 on held-out participants. Models combining region and connectivity features outperformed single-feature models.

Conclusion: Combining spatial and network information with non-linear modeling improves functional mapping for presurgical evaluation, offering a promising non-invasive alternative to ESM.

Abstract: Identifying cortical regions critical for speech is essential for safe brain surgery in or near language areas. While Electrical Stimulation Mapping (ESM) remains the clinical gold standard, it is invasive and time-consuming. To address this, we analyzed intracranial electrocorticographic (ECoG) data from 16 participants performing speech tasks and developed machine learning models to directly predict if the brain region underneath each ECoG electrode is critical. Ground truth labels indicating speech arrest were derived independently from Electrical Stimulation Mapping (ESM) and used to train classification models. Our framework integrates neural activity signals, anatomical region labels, and functional connectivity features to capture both local activity and network-level dynamics. We found that models combining region and connectivity features matched the performance of the full feature set, and outperformed models using either type alone. To classify each electrode, trial-level predictions were aggregated using an MLP applied to histogram-encoded scores. Our best-performing model, a trial-level RBF-kernel Support Vector Machine together with MLP-based aggregation, achieved strong accuracy on held-out participants (ROC-AUC: 0.87, PR-AUC: 0.57). These findings highlight the value of combining spatial and network information with non-linear modeling to improve functional mapping in presurgical evaluation.

[261] DEQuify your force field: More efficient simulations using deep equilibrium models

Andreas Burger, Luca Thiede, Alán Aspuru-Guzik, Nandita Vijaykumar

Main category: cs.LG

TL;DR: Recasting equivariant force field models as deep equilibrium models to exploit temporal continuity in molecular dynamics, improving accuracy and speed by 10-20% while reducing memory usage.

DetailsMotivation: Leverage the inherent temporal continuity in molecular dynamics simulations - where successive states are extremely similar - as prior knowledge that hasn't been exploited in machine learning force fields.

Method: Transform a state-of-the-art equivariant base model into a deep equilibrium model to recycle intermediate neural network features from previous time steps.

Result: Achieved 10-20% improvements in both accuracy and speed on MD17, MD22, and OC20 200k datasets compared to non-DEQ base models, with significantly more memory-efficient training.

Conclusion: Exploiting temporal continuity through deep equilibrium modeling is an effective approach for improving molecular dynamics force fields, enabling training of more expressive models on larger systems.

Abstract: Machine learning force fields show great promise in enabling more accurate molecular dynamics simulations compared to manually derived ones. Much of the progress in recent years was driven by exploiting prior knowledge about physical systems, in particular symmetries under rotation, translation, and reflections. In this paper, we argue that there is another important piece of prior information that, thus far, hasn’t been explored: simulating a molecular system is necessarily continuous, and successive states are therefore extremely similar. Our contribution is to show that we can exploit this information by recasting a state-of-the-art equivariant base model as a deep equilibrium model. This allows us to recycle intermediate neural network features from previous time steps, enabling us to improve both accuracy and speed by 10%-20% on the MD17, MD22, and OC20 200k datasets, compared to the non-DEQ base model. The training is also much more memory efficient, allowing us to train more expressive models on larger systems.
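
The central trick is to treat the network as a fixed-point equation z* = f(z*, x) and warm-start the solver with the previous time step's equilibrium, which is cheap because successive MD states are nearly identical. A schematic fixed-point iteration (the real model uses an equivariant f and implicit differentiation for the backward pass, both omitted; `f` and `trajectory` are placeholders):

```python
import torch

def deq_forward(f, x, z_init, tol=1e-4, max_iter=50):
    """Solve the fixed point z = f(z, x) by simple iteration, starting from z_init."""
    z = z_init
    for _ in range(max_iter):
        z_next = f(z, x)
        if (z_next - z).norm() < tol * (z.norm() + 1e-8):
            return z_next
        z = z_next
    return z

# Successive MD states are nearly identical, so the equilibrium found at step t
# is a good initialization at step t+1 and the solver needs fewer iterations.
z_prev = torch.zeros(1, 64)               # illustrative feature shape
for x_t in trajectory:                    # `f` and `trajectory` are assumed to exist elsewhere
    z_prev = deq_forward(f, x_t, z_init=z_prev)
```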

[262] Securing Private Federated Learning in a Malicious Setting: A Scalable TEE-Based Approach with Client Auditing

Shun Takagi, Satoshi Hasegawa

Main category: cs.LG

TL;DR: A novel server extension using ephemeral TEEs to achieve maliciously secure DP-FTRL in federated learning, providing verifiable proofs with minimal client overhead.

DetailsMotivation: Existing DP-FTRL approaches assume semi-honest servers and cannot handle client dropouts or corruption in practical settings. TEEs alone introduce forking attacks and availability issues.

Method: Introduces a trusted computing base (TCB) implemented as an ephemeral TEE module on server side to produce verifiable proofs of server actions. Selected clients audit these proofs with small additional costs.

Result: Formal proofs demonstrate privacy guarantees in malicious settings. Experimental results show small constant overhead to clients in realistic scenarios.

Conclusion: The proposed framework reduces TCB size while maintaining system scalability and liveness, providing maliciously secure DP-FTRL with practical efficiency.

Abstract: In cross-device private federated learning, differentially private follow-the-regularized-leader (DP-FTRL) has emerged as a promising privacy-preserving method. However, existing approaches assume a semi-honest server and have not addressed the challenge of securely removing this assumption. This is due to its statefulness, which becomes particularly problematic in practical settings where clients can drop out or be corrupted. While trusted execution environments (TEEs) might seem like an obvious solution, a straightforward implementation can introduce forking attacks or availability issues due to state management. To address this problem, our paper introduces a novel server extension that acts as a trusted computing base (TCB) to realize maliciously secure DP-FTRL. The TCB is implemented with an ephemeral TEE module on the server side to produce verifiable proofs of server actions. Some clients, upon being selected, participate in auditing these proofs with small additional communication and computational demands. This extension solution reduces the size of the TCB while maintaining the system’s scalability and liveness. We provide formal proofs based on interactive differential privacy, demonstrating privacy guarantee in malicious settings. Finally, we experimentally show that our framework adds small constant overhead to clients in several realistic settings.

[263] Compressing CNN models for resource-constrained systems by channel and layer pruning

Ahmed Sadaqa, Di Liu

Main category: cs.LG

TL;DR: Hybrid pruning framework combining channel and layer pruning to reduce CNN complexity for edge deployment, inspired by EfficientNet scaling principles in reverse.

DetailsMotivation: Large CNN models are difficult to deploy on edge devices due to their complexity and size, requiring effective model compression techniques.

Method: Proposes a hybrid pruning approach that combines both channel and layer pruning, inspired by EfficientNet’s scaling principles but applied in reverse to scale down networks.

Result: Significant reduction in model complexity with minimal accuracy loss, and reduced latency when deployed on NVIDIA JETSON TX2 embedded AI device.

Conclusion: The hybrid pruning framework effectively compresses CNN models for edge deployment while maintaining competitive performance.

Abstract: Convolutional Neural Networks (CNNs) have achieved significant breakthroughs in various fields. However, these advancements have led to a substantial increase in the complexity and size of these networks. This poses a challenge when deploying large and complex networks on edge devices. Consequently, model compression has emerged as a research field aimed at reducing the size and complexity of CNNs. One prominent technique in model compression is model pruning. This paper will present a new technique of pruning that combines both channel and layer pruning in what is called a “hybrid pruning framework”. Inspired by EfficientNet, a renowned CNN architecture known for scaling up networks from both channel and layer perspectives, this hybrid approach applies the same principles but in reverse, where it scales down the network through pruning. Experiments on the hybrid approach demonstrated a notable decrease in the overall complexity of the model, with only a minimal reduction in accuracy compared to the baseline model. This complexity reduction translates into reduced latency when deploying the pruned models on an NVIDIA JETSON TX2 embedded AI device.
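
A common building block behind channel pruning is ranking filters by an importance score (for example, the L1 norm of their weights) and dropping the lowest-ranked ones. The sketch below shows this generic step on a single Conv2d layer; it is not the paper's full hybrid channel-and-layer framework, and the keep ratio is an arbitrary illustration.

```python
import torch
import torch.nn as nn

def prune_conv_channels(conv: nn.Conv2d, keep_ratio: float = 0.5) -> nn.Conv2d:
    """Keep the output channels of `conv` with the largest L1 weight norm (toy channel pruning)."""
    importance = conv.weight.detach().abs().sum(dim=(1, 2, 3))   # one score per output channel
    n_keep = max(1, int(keep_ratio * conv.out_channels))
    keep_idx = torch.argsort(importance, descending=True)[:n_keep]

    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep_idx].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep_idx].clone()
    return pruned

if __name__ == "__main__":
    conv = nn.Conv2d(16, 64, kernel_size=3, padding=1)
    smaller = prune_conv_channels(conv, keep_ratio=0.25)
    x = torch.randn(1, 16, 32, 32)
    print(smaller(x).shape)   # torch.Size([1, 16, 32, 32])
```

In a real network the layers consuming the pruned output must also have their input channels shrunk, and the model is typically fine-tuned afterwards to recover accuracy.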

[264] Data-driven generative simulation of SDEs using diffusion models

Xuefeng Gao, Jiale Zha, Xun Yu Zhou

Main category: cs.LG

TL;DR: A diffusion model-based approach for generating sample paths of unknown SDEs without requiring explicit drift/diffusion coefficients, outperforming traditional methods and showing promise in financial applications.

DetailsMotivation: Traditional Monte Carlo methods for simulating stochastic differential equations require explicit specifications of drift and diffusion coefficients, which may not be available. There's a need for model-free, data-driven approaches that can generate synthetic paths from finite sample data.

Method: Uses conditional diffusion models to generate new synthetic paths of SDEs given only a finite set of sample paths, without requiring explicit knowledge of the underlying coefficients.

Result: The method demonstrates effectiveness in simulation experiments, outperforming benchmark methods including neural SDEs. In empirical studies, synthetically generated paths enhance reinforcement learning performance for continuous-time mean-variance portfolio selection.

Conclusion: Diffusion models show promising applications in financial analysis and decision-making, providing a powerful data-driven alternative to traditional SDE simulation methods.

Abstract: This paper introduces a new approach to generating sample paths of unknown stochastic differential equations (SDEs) using diffusion models, a class of generative AI models commonly employed in image and video applications. Unlike the traditional Monte Carlo methods for simulating SDEs, which require explicit specifications of the drift and diffusion coefficients, our method takes a model-free, data-driven approach. Given a finite set of sample paths from an SDE, we utilize conditional diffusion models to generate new, synthetic paths of the same SDE. To demonstrate the effectiveness of our approach, we conduct a simulation experiment to compare our method with alternative benchmark ones including neural SDEs. Furthermore, in an empirical study we leverage these synthetically generated sample paths to enhance the performance of reinforcement learning algorithms for continuous-time mean-variance portfolio selection, hinting promising applications of diffusion models in financial analysis and decision-making.
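
To make the conditional-diffusion idea concrete, here is a minimal DDPM-style training step on path segments: the network predicts the noise injected into a future window of the path, conditioned on the preceding window. The architecture, window length, and noise schedule are illustrative assumptions, not the authors' setup.

```python
import torch
import torch.nn as nn

T = 100                                               # diffusion steps
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

class NoisePredictor(nn.Module):
    """Toy eps-network: sees the noisy future window, the timestep, and the past window."""
    def __init__(self, window=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * window + 1, 128), nn.ReLU(),
                                 nn.Linear(128, window))
    def forward(self, x_t, t, cond):
        t_feat = (t.float() / T).unsqueeze(-1)
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

def training_step(model, x0, cond):
    """One DDPM training step: noise the clean segment at a random level, predict the noise."""
    t = torch.randint(0, T, (x0.shape[0],))
    a_bar = alpha_bars[t].unsqueeze(-1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return nn.functional.mse_loss(model(x_t, t, cond), eps)

if __name__ == "__main__":
    model = NoisePredictor()
    past, future = torch.randn(8, 32), torch.randn(8, 32)   # stand-in path segments
    loss = training_step(model, future, past)
    loss.backward()
    print(float(loss))
```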

[265] Using AI to Optimize Patient Transfer and Resource Utilization During Mass-Casualty Incidents: A Simulation Platform

Zhaoxun “Lorenz” Liu, Wagner H. Souza, Jay Han, Amin Madani

Main category: cs.LG

TL;DR: AI agent using deep reinforcement learning outperforms trauma surgeons in mass casualty incident patient allocation, enabling non-experts to achieve expert-level performance when assisted.

DetailsMotivation: Mass casualty incidents overwhelm healthcare systems and require rapid, accurate patient-hospital allocation decisions under extreme pressure, necessitating effective decision-support tools.

Method: Developed and validated a deep reinforcement learning-based AI agent that optimizes patient transfer decisions by balancing patient acuity, care requirements, hospital capacities, and transport logistics. Integrated into MasTER web dashboard and tested with 30 participants (experts and non-experts) across different MCI scenarios.

Result: Increasing AI involvement significantly improves decision quality and consistency. AI outperforms trauma surgeons (p < 0.001) and enables non-experts to achieve expert-level performance when assisted, contrasting with their inferior unassisted performance (p < 0.001).

Conclusion: AI-driven decision support has strong potential to enhance both MCI preparedness training and real-world emergency response management by improving allocation decisions and enabling non-experts to perform at expert levels.

Abstract: Mass casualty incidents (MCIs) overwhelm healthcare systems and demand rapid, accurate patient-hospital allocation decisions under extreme pressure. Here, we developed and validated a deep reinforcement learning-based decision-support AI agent to optimize patient transfer decisions during simulated MCIs by balancing patient acuity levels, specialized care requirements, hospital capacities, and transport logistics. To integrate this AI agent, we developed MasTER, a web-accessible command dashboard for MCI management simulations. Through a controlled user study with 30 participants (6 trauma experts and 24 non-experts), we evaluated three interaction approaches with the AI agent (human-only, human-AI collaboration, and AI-only) across 20- and 60-patient MCI scenarios in the Greater Toronto Area. Results demonstrate that increasing AI involvement significantly improves decision quality and consistency. The AI agent outperforms trauma surgeons (p < 0.001) and enables non-experts to achieve expert-level performance when assisted, contrasting sharply with their significantly inferior unassisted performance (p < 0.001). These findings establish the potential for our AI-driven decision support to enhance both MCI preparedness training and real-world emergency response management.

[266] ChemBOMAS: Accelerated BO in Chemistry with LLM-Enhanced Multi-Agent System

Dong Han, Zhehong Ai, Pengxiang Cai, Shuzhou Sun, Shanya Lu, Jianpeng Chen, Ben Gao, Lingli Ge, Weida Wang, Xiangxin Zhou, Xihui Liu, Mao Su, Wanli Ouyang, Lei Bai, Dongzhan Zhou, Tao XU, Yuqiang Li, Shufei Zhang

Main category: cs.LG

TL;DR: ChemBOMAS is an LLM-enhanced multi-agent system that accelerates Bayesian optimization in chemistry through knowledge-driven coarse-grained optimization and data-driven fine-grained optimization, achieving 96% success in real-world pharmaceutical experiments vs 15% by human experts.

DetailsMotivation: Bayesian optimization in chemistry is hindered by sparse experimental data and complex reaction mechanisms, requiring a more efficient approach to accelerate chemical discovery.

Method: Two-stage framework: 1) Knowledge-driven coarse-grained optimization where LLMs decompose search space using chemical knowledge, 2) Data-driven fine-grained optimization where LLMs generate pseudo-data points to enhance BO within candidate regions.

Result: Achieved 96% optimal objective value in wet-lab pharmaceutical experiments (vs 15% by domain experts) and demonstrated superior performance compared to various BO algorithms in benchmark evaluations.

Conclusion: ChemBOMAS is a powerful tool that significantly enhances optimization effectiveness and efficiency, accelerating chemical discovery through LLM-enhanced Bayesian optimization.

Abstract: The efficiency of Bayesian optimization (BO) in chemistry is often hindered by sparse experimental data and complex reaction mechanisms. To overcome these limitations, we introduce ChemBOMAS, a new framework named LLM-Enhanced Multi-Agent System for accelerating BO in chemistry. ChemBOMAS's optimization process is enhanced by LLMs and synergistically employs two strategies: knowledge-driven coarse-grained optimization and data-driven fine-grained optimization. First, in the knowledge-driven coarse-grained optimization stage, LLMs intelligently decompose the vast search space by reasoning over existing chemical knowledge to identify promising candidate regions. Subsequently, in the data-driven fine-grained optimization stage, LLMs enhance the BO process within these candidate regions by generating pseudo-data points, thereby improving data utilization efficiency and accelerating convergence. Benchmark evaluations further confirm that ChemBOMAS significantly enhances optimization effectiveness and efficiency compared to various BO algorithms. Importantly, the practical utility of ChemBOMAS was validated through wet-lab experiments conducted under pharmaceutical industry protocols, targeting conditional optimization for a previously unreported and challenging chemical reaction. In the wet experiment, ChemBOMAS achieved an optimal objective value of 96%. This was substantially higher than the 15% achieved by domain experts. This real-world success, together with strong performance on benchmark evaluations, highlights ChemBOMAS as a powerful tool to accelerate chemical discovery.

[267] PracMHBench: Re-evaluating Model-Heterogeneous Federated Learning Based on Practical Edge Device Constraints

Yuanchun Guo, Bingyan Liu, Yulong Sha, Zhensheng Xian

Main category: cs.LG

TL;DR: First systematic platform (PracMHBench) to evaluate model-heterogeneous federated learning under practical edge device constraints, testing various algorithms across multiple data tasks and metrics.

DetailsMotivation: Existing model-heterogeneous FL algorithms lack evaluation under practical edge device constraints and quantitative analysis across diverse data scenarios and metrics.

Method: Constructed PracMHBench platform to classify and test diverse model heterogeneity algorithms on multiple data tasks and metrics under different edge constraints.

Result: Extensive experiments conducted to observe algorithm applicability and corresponding heterogeneity patterns under various edge constraints.

Conclusion: Provides the first comprehensive evaluation framework for model-heterogeneous FL in practical edge computing environments, enabling better understanding of algorithm performance under real-world constraints.

Abstract: Federating heterogeneous models on edge devices with diverse resource constraints has been a notable trend in recent years. Compared to traditional federated learning (FL) that assumes an identical model architecture to cooperate, model-heterogeneous FL is more practical and flexible since the model can be customized to satisfy the deployment requirement. Unfortunately, no prior work ever dives into the existing model-heterogeneous FL algorithms under the practical edge device constraints and provides quantitative analysis on various data scenarios and metrics, which motivates us to rethink and re-evaluate this paradigm. In our work, we construct the first system platform PracMHBench to evaluate model-heterogeneous FL on practical constraints of edge devices, where diverse model heterogeneity algorithms are classified and tested on multiple data tasks and metrics. Based on the platform, we perform extensive experiments on these algorithms under the different edge constraints to observe their applicability and the corresponding heterogeneity pattern.

[268] Fourier Learning Machines: Nonharmonic Fourier-Based Neural Networks for Scientific Machine Learning

Mominul Rubel, Adam Meyers, Gabriel Nicolosi

Main category: cs.LG

TL;DR: FLM is a novel neural network architecture that represents multidimensional nonharmonic Fourier series using cosine activations, learning frequencies, amplitudes and phase shifts as trainable parameters for scientific computing problems.

DetailsMotivation: To create a neural network architecture that can represent complete, separable Fourier basis in multiple dimensions using standard MLP-like structure, overcoming limitations of previous Fourier-inspired models.

Method: Simple feedforward network with cosine activation functions that learns frequencies, amplitudes, and phase shifts as trainable parameters to create problem-specific spectral basis adaptable to both periodic and nonperiodic functions.

Result: FLM demonstrates one-to-one correspondence between Fourier coefficients and amplitudes/phase-shifts, and shows comparable or superior performance to established architectures like SIREN and vanilla feedforward NNs on benchmark PDEs and optimal control problems.

Conclusion: FLM successfully represents complete multidimensional Fourier basis using standard architecture, providing effective performance on scientific computing problems with direct interpretability of Fourier coefficients.

Abstract: We introduce the Fourier Learning Machine (FLM), a neural network (NN) architecture designed to represent a multidimensional nonharmonic Fourier series. The FLM uses a simple feedforward structure with cosine activation functions to learn the frequencies, amplitudes, and phase shifts of the series as trainable parameters. This design allows the model to create a problem-specific spectral basis adaptable to both periodic and nonperiodic functions. Unlike previous Fourier-inspired NN models, the FLM is the first architecture able to represent a complete, separable Fourier basis in multiple dimensions using a standard Multilayer Perceptron-like architecture. A one-to-one correspondence between the Fourier coefficients and amplitudes and phase-shifts is demonstrated, allowing for the translation between a full, separable basis form and the cosine phase–shifted one. Additionally, we evaluate the performance of FLMs on several scientific computing problems, including benchmark Partial Differential Equations (PDEs) and a family of Optimal Control Problems (OCPs). Computational experiments show that the performance of FLMs is comparable, and often superior, to that of established architectures like SIREN and vanilla feedforward NNs.
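
The basic idea -- a sum of cosine terms whose frequencies, amplitudes, and phase shifts are all trainable -- can be written as a tiny PyTorch module. This is a minimal one-layer sketch inferred from the abstract, not the authors' implementation; initialization and sizes are arbitrary.

```python
import torch
import torch.nn as nn

class FourierLearningLayer(nn.Module):
    """y(x) = sum_k a_k * cos(w_k . x + phi_k), with w_k, a_k, phi_k all trainable."""
    def __init__(self, in_dim: int, n_terms: int = 64):
        super().__init__()
        self.freqs = nn.Parameter(torch.randn(n_terms, in_dim))   # w_k
        self.phases = nn.Parameter(torch.zeros(n_terms))          # phi_k
        self.amps = nn.Parameter(torch.randn(n_terms) * 0.1)      # a_k

    def forward(self, x):                                   # x: (batch, in_dim)
        angle = x @ self.freqs.t() + self.phases             # (batch, n_terms)
        return torch.cos(angle) @ self.amps                  # (batch,)

if __name__ == "__main__":
    model = FourierLearningLayer(in_dim=2)
    x = torch.rand(128, 2)
    y = torch.sin(3.0 * x[:, 0]) + x[:, 1]                  # toy target, not periodic in x[:, 1]
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(200):
        opt.zero_grad()
        loss = ((model(x) - y) ** 2).mean()
        loss.backward()
        opt.step()
    print(f"final MSE: {loss.item():.4f}")
```

Because the learned frequencies need not be integer multiples of a base frequency, such a layer can fit nonperiodic targets as well as periodic ones, which is the "nonharmonic" aspect the abstract emphasizes.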

[269] ADHDeepNet From Raw EEG to Diagnosis: Improving ADHD Diagnosis through Temporal-Spatial Processing, Adaptive Attention Mechanisms, and Explainability in Raw EEG Signals

Ali Amini, Mohammad Alijanpour, Behnam Latifi, Ali Motie Nasrabadi

Main category: cs.LG

TL;DR: ADHDeepNet, a deep learning model using EEG signals, achieves near-perfect accuracy (99.17%) in ADHD diagnosis with comprehensive temporal-spatial analysis and explainability techniques.

DetailsMotivation: Early ADHD diagnosis is crucial but labor-intensive and time-consuming. The paper aims to improve diagnosis precision and timeliness using deep learning and EEG signals.

Method: Developed ADHDeepNet - a DL model with temporal-spatial characterization, attention modules, and explainability techniques for EEG signals. Used 121 participants (61 ADHD, 60 controls) with nested cross-validation, hyperparameter optimization, and Additive Gaussian Noise for data augmentation.

Result: Achieved 100% sensitivity and 99.17% accuracy in ADHD/HC classification. Model explainability analysis identified key brain regions and frequency bands for ADHD diagnosis.

Conclusion: DL and EEG approaches show strong potential for enhancing ADHD diagnosis accuracy and efficiency, with ADHDeepNet demonstrating exceptional performance in automated diagnosis.

Abstract: Attention Deficit Hyperactivity Disorder (ADHD) is a common brain disorder in children that can persist into adulthood, affecting social, academic, and career life. Early diagnosis is crucial for managing these impacts on patients and the healthcare system but is often labor-intensive and time-consuming. This paper presents a novel method to improve ADHD diagnosis precision and timeliness by leveraging Deep Learning (DL) approaches and electroencephalogram (EEG) signals. We introduce ADHDeepNet, a DL model that utilizes comprehensive temporal-spatial characterization, attention modules, and explainability techniques optimized for EEG signals. ADHDeepNet integrates feature extraction and refinement processes to enhance ADHD diagnosis. The model was trained and validated on a dataset of 121 participants (61 ADHD, 60 Healthy Controls), employing nested cross-validation for robust performance. The proposed two-stage methodology uses a 10-fold cross-subject validation strategy. Initially, each iteration optimizes the model’s hyper-parameters with inner 2-fold cross-validation. Then, Additive Gaussian Noise (AGN) with various standard deviations and magnification levels is applied for data augmentation. ADHDeepNet achieved 100% sensitivity and 99.17% accuracy in classifying ADHD/HC subjects. To clarify model explainability and identify key brain regions and frequency bands for ADHD diagnosis, we analyzed the learned weights and activation patterns of the model’s primary layers. Additionally, t-distributed Stochastic Neighbor Embedding (t-SNE) visualized high-dimensional data, aiding in interpreting the model’s decisions. This study highlights the potential of DL and EEG in enhancing ADHD diagnosis accuracy and efficiency.
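
The Additive Gaussian Noise augmentation mentioned in the methodology is straightforward to reproduce in spirit; the sketch below shows the generic operation on EEG epochs, with placeholder noise levels rather than the paper's values.

```python
import numpy as np

def augment_with_agn(eeg: np.ndarray, stds=(0.01, 0.05, 0.1), copies_per_std: int = 1):
    """Return the original EEG epochs plus noisy copies at several noise levels.
    eeg: array of shape (n_epochs, n_channels, n_samples)."""
    augmented = [eeg]
    for std in stds:
        for _ in range(copies_per_std):
            augmented.append(eeg + np.random.normal(0.0, std, size=eeg.shape))
    return np.concatenate(augmented, axis=0)

if __name__ == "__main__":
    epochs = np.random.randn(10, 19, 256)          # toy EEG: 10 epochs, 19 channels, 256 samples
    print(augment_with_agn(epochs).shape)          # (40, 19, 256)
```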

[270] A Survey of TinyML Applications in Beekeeping for Hive Monitoring and Management

Willy Sucipto, Jianlong Zhou, Ray Seung Min Kwon, Fang Chen

Main category: cs.LG

TL;DR: Survey paper on TinyML applications in beekeeping, covering hive monitoring, behavior recognition, pest detection, and swarming prediction using edge AI devices for sustainable pollinator management.

DetailsMotivation: Honey bee colonies face increasing threats but traditional monitoring methods are disruptive and cloud-based solutions are impractical for remote apiaries, creating need for low-power edge computing solutions.

Method: Synthesizes current innovations in TinyML and apiculture across four functional areas: hive condition monitoring, bee behavior recognition, pest/disease detection, and swarming event forecasting, including examination of datasets, lightweight models, and benchmarking strategies.

Result: Identifies critical limitations like data scarcity, generalization challenges, and deployment barriers in off-grid environments, while highlighting opportunities in efficient inference, adaptive edge learning, and dataset standardization.

Conclusion: Provides foundation for scalable, AI-driven monitoring systems to support sustainable pollinator management through consolidated research and engineering practices in TinyML applications for beekeeping.

Abstract: Honey bee colonies are essential for global food security and ecosystem stability, yet they face escalating threats from pests, diseases, and environmental stressors. Traditional hive inspections are labor-intensive and disruptive, while cloud-based monitoring solutions remain impractical for remote or resource-limited apiaries. Recent advances in Internet of Things (IoT) and Tiny Machine Learning (TinyML) enable low-power, real-time monitoring directly on edge devices, offering scalable and non-invasive alternatives. This survey synthesizes current innovations at the intersection of TinyML and apiculture, organized around four key functional areas: monitoring hive conditions, recognizing bee behaviors, detecting pests and diseases, and forecasting swarming events. We further examine supporting resources, including publicly available datasets, lightweight model architectures optimized for embedded deployment, and benchmarking strategies tailored to field constraints. Critical limitations such as data scarcity, generalization challenges, and deployment barriers in off-grid environments are highlighted, alongside emerging opportunities in ultra-efficient inference pipelines, adaptive edge learning, and dataset standardization. By consolidating research and engineering practices, this work provides a foundation for scalable, AI-driven, and ecologically informed monitoring systems to support sustainable pollinator management.

[271] FedComLoc: Communication-Efficient Distributed Training of Sparse and Quantized Models

Kai Yi, Georg Meinhardt, Laurent Condat, Peter Richtárik

Main category: cs.LG

TL;DR: FedComLoc combines compression techniques with the Scaffnew algorithm to reduce communication costs in Federated Learning while maintaining privacy.

DetailsMotivation: Federated Learning faces high communication costs, especially with heterogeneous clients. Local training helps but still requires efficient communication strategies to handle the communication bottleneck.

Method: Integrates practical compression (TopK compressor and quantization) into the Scaffnew algorithm to create FedComLoc, which enhances communication efficiency while preserving privacy.

Result: Extensive experiments show FedComLoc substantially reduces communication overheads in heterogeneous FL settings while maintaining performance.

Conclusion: FedComLoc successfully enhances communication efficiency in FL by combining compression techniques with the Scaffnew framework, making it suitable for practical deployment in heterogeneous environments.

Abstract: Federated Learning (FL) has garnered increasing attention due to its unique characteristic of allowing heterogeneous clients to process their private data locally and interact with a central server, while being respectful of privacy. A critical bottleneck in FL is the communication cost. A pivotal strategy to mitigate this burden is Local Training, which involves running multiple local stochastic gradient descent iterations between communication phases. Our work is inspired by the innovative Scaffnew algorithm, which has considerably advanced the reduction of communication complexity in FL. We introduce FedComLoc (Federated Compressed and Local Training), integrating practical and effective compression into Scaffnew to further enhance communication efficiency. Extensive experiments, using the popular TopK compressor and quantization, demonstrate its prowess in substantially reducing communication overheads in heterogeneous settings.
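
The TopK compressor named in the abstract can be sketched in a few lines: keep only the k largest-magnitude entries of a model update and transmit their values and indices. This is the generic operator only, not FedComLoc's full algorithm or its integration with Scaffnew.

```python
import torch

def topk_compress(update: torch.Tensor, k: int):
    """Keep the k largest-magnitude entries of a flattened update (toy TopK compressor)."""
    flat = update.flatten()
    idx = torch.topk(flat.abs(), k).indices
    return flat[idx], idx, update.shape            # values + indices are what gets transmitted

def topk_decompress(values, idx, shape):
    """Rebuild a dense (mostly zero) update from the transmitted values and indices."""
    flat = torch.zeros(shape).flatten()
    flat[idx] = values
    return flat.reshape(shape)

if __name__ == "__main__":
    delta = torch.randn(4, 8)                      # a local model update
    vals, idx, shape = topk_compress(delta, k=5)
    recovered = topk_decompress(vals, idx, shape)
    print((recovered != 0).sum().item(), "of", delta.numel(), "entries transmitted")
```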

[272] A Transformer approach for Electricity Price Forecasting

Oscar Llorente, Jose Portela

Main category: cs.LG

TL;DR: Pure Transformer model for electricity price forecasting outperforms traditional methods without needing recurrent networks, demonstrating attention mechanisms alone can capture temporal patterns effectively.

DetailsMotivation: To demonstrate that pure Transformer models without recurrent networks can effectively handle electricity price forecasting tasks, providing a simpler yet powerful alternative to hybrid approaches.

Method: Used a pure Transformer architecture with attention mechanisms only (no recurrent networks), benchmarked against traditional methods using open-source EPF toolbox for fair comparison.

Result: Transformer model outperformed traditional forecasting methods, showing superior performance in capturing temporal patterns for electricity price prediction.

Conclusion: Pure Transformer models offer a promising solution for reliable electricity price forecasting, with attention mechanisms being sufficient for temporal pattern capture without recurrent network components.

Abstract: This paper presents a novel approach to electricity price forecasting (EPF) using a pure Transformer model. As opposed to other alternatives, no other recurrent network is used in combination to the attention mechanism. Hence, showing that the attention layer is enough for capturing the temporal patterns. The paper also provides fair comparison of the models using the open-source EPF toolbox and provide the code to enhance reproducibility and transparency in EPF research. The results show that the Transformer model outperforms traditional methods, offering a promising solution for reliable and sustainable power system operation.
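
A minimal pure-Transformer forecaster of this kind -- an encoder over a window of past hourly prices with no recurrent layers, mapped to the next day's prices -- can be sketched as follows. Layer sizes, window, and horizon are illustrative assumptions, not the paper's configuration, and positional encodings and exogenous features are omitted for brevity.

```python
import torch
import torch.nn as nn

class TransformerPriceForecaster(nn.Module):
    """Encode a window of past prices with self-attention only, then predict the next horizon."""
    def __init__(self, d_model=64, nhead=4, num_layers=2, horizon=24):
        super().__init__()
        self.input_proj = nn.Linear(1, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, horizon)

    def forward(self, past_prices):                          # (batch, window)
        h = self.input_proj(past_prices.unsqueeze(-1))       # (batch, window, d_model)
        h = self.encoder(h)                                  # full self-attention, no recurrence
        return self.head(h[:, -1])                           # forecast from the last position

if __name__ == "__main__":
    model = TransformerPriceForecaster()
    window = torch.randn(16, 168)                # 16 series, one week of hourly prices each
    print(model(window).shape)                   # torch.Size([16, 24])
```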

[273] The Quest for the Right Mediator: Surveying Mechanistic Interpretability Through the Lens of Causal Mediation Analysis

Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, Nikhil Prakash, Can Rager, Aruna Sankaranarayanan, Arnab Sen Sharma, Jiuding Sun, Eric Todd, David Bau, Yonatan Belinkov

Main category: cs.LG

TL;DR: The paper proposes a causal mediation analysis framework to unify interpretability research, categorizing methods by causal units (mediators) and search techniques to provide systematic evaluation and comparison.

DetailsMotivation: Current interpretability research lacks unity with ad-hoc evaluations and undefined causal units, making progress measurement and technique comparison difficult despite frequent discussions of mechanistic understanding.

Method: The authors propose a perspective grounded in causal mediation analysis, taxonomizing interpretability research history and current state according to types of causal units (mediators) and methods used to search over mediators.

Result: The framework provides insights into when particular mediators and search methods are most appropriate, yielding a cohesive narrative of the field and helping researchers select methods based on their objectives.

Conclusion: The causal mediation analysis approach offers actionable recommendations for future work, including discovering new mediators and developing standardized evaluations tailored to interpretability research goals.

Abstract: Interpretability provides a toolset for understanding how and why language models behave in certain ways. However, there is little unity in the field: most studies employ ad-hoc evaluations and do not share theoretical foundations, making it difficult to measure progress and compare the pros and cons of different techniques. Furthermore, while mechanistic understanding is frequently discussed, the basic causal units underlying these mechanisms are often not explicitly defined. In this article, we propose a perspective on interpretability research grounded in causal mediation analysis. Specifically, we describe the history and current state of interpretability taxonomized according to the types of causal units (mediators) employed, as well as methods used to search over mediators. We discuss the pros and cons of each mediator, providing insights as to when particular kinds of mediators and search methods are most appropriate. We argue that this framing yields a more cohesive narrative of the field and helps researchers select appropriate methods based on their research objective. Our analysis yields actionable recommendations for future work, including the discovery of new mediators and the development of standardized evaluations tailored to these goals.

[274] Recursive Training Loops in LLMs: How training data properties modulate distribution shift in generated data?

Grgur Kovač, Jérémy Perez, Rémy Portelas, Peter Ford Dominey, Pierre-Yves Oudeyer

Main category: cs.LG

TL;DR: LLM training on synthetic data causes distribution shifts (model collapse), with human data properties determining shift magnitude - lexical diversity amplifies shifts while semantic diversity and quality mitigate them.

DetailsMotivation: To understand how human data properties affect distribution shifts in recursive LLM training as models increasingly train on synthetic content, creating feedback loops.

Method: Empirical examination through exhaustive manipulation of dataset properties combined with regression analyses on different human datasets.

Result: Lexical diversity amplifies distribution shifts, semantic diversity and data quality mitigate them, effects are domain-specific, and human data properties determine whether initial biases are amplified or reduced.

Conclusion: Different parts of the internet may undergo different types of distribution shift based on their underlying human data properties, providing a novel view of model collapse dynamics.

Abstract: Large language models (LLMs) are increasingly used in the creation of online content, creating feedback loops as subsequent generations of models will be trained on this synthetic data. Such loops were shown to lead to distribution shifts - models misrepresenting the true underlying distributions of human data (also called model collapse). However, how human data properties affect such shifts remains poorly understood. In this paper, we provide the first empirical examination of the effect of such properties on the outcome of recursive training. We first confirm that using different human datasets leads to distribution shifts of different magnitudes. Through exhaustive manipulation of dataset properties combined with regression analyses, we then identify a set of properties predicting distribution shift magnitudes. Lexical diversity is found to amplify these shifts, while semantic diversity and data quality mitigate them. Furthermore, we find that these influences are highly modular: data scraped from a given internet domain has little influence on the content generated for another domain. Finally, experiments on political bias reveal that human data properties affect whether the initial bias will be amplified or reduced. Overall, our results portray a novel view, where different parts of the internet may undergo different types of distribution shift.

[275] TweakLLM: A Routing Architecture for Dynamic Tailoring of Cached Responses

Muhammad Taha Cheema, Abeer Aamir, Khawaja Gul Muhammad, Naveed Anwar Bhatti, Ihsan Ayyub Qazi, Zafar Ayyub Qazi

Main category: cs.LG

TL;DR: TweakLLM is a novel caching architecture that uses a lightweight LLM to dynamically adapt cached responses to new prompts, improving cache effectiveness while maintaining response quality comparable to frontier models.

DetailsMotivation: LLMs process millions of queries daily, but efficient caching is challenging due to personalized chatbot interactions and limited accuracy of semantic similarity search in preserving relevance to user queries.

Method: A routing architecture that employs a lightweight LLM to dynamically adapt cached responses to incoming prompts, evaluated through user studies with side-by-side comparisons, satisfaction voting, and multi-agent LLM debates.

Result: TweakLLM maintains response quality comparable to frontier models while significantly improving cache effectiveness, demonstrating scalability and resource-efficiency for high-volume LLM deployments.

Conclusion: TweakLLM provides a scalable, resource-efficient caching solution for high-volume LLM deployments without compromising user experience.

Abstract: Large Language Models (LLMs) process millions of queries daily, making efficient response caching a compelling optimization for reducing cost and latency. However, preserving relevance to user queries using this approach proves difficult due to the personalized nature of chatbot interactions and the limited accuracy of semantic similarity search. To address this, we present TweakLLM, a novel routing architecture that employs a lightweight LLM to dynamically adapt cached responses to incoming prompts. Through comprehensive evaluation, including user studies with side-by-side comparisons, satisfaction voting, as well as multi-agent LLM debates, we demonstrate that TweakLLM maintains response quality comparable to frontier models while significantly improving cache effectiveness. Our results across real-world datasets highlight TweakLLM as a scalable, resource-efficient caching solution for high-volume LLM deployments without compromising user experience.
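
The routing logic can be sketched as a semantic-cache lookup followed by a cheap adaptation call. Everything below is a toy illustration: `embed` is a stand-in for a real sentence encoder, `small_llm_adapt` is a hypothetical call to a lightweight model, and the similarity threshold is arbitrary rather than taken from the paper.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.85        # illustrative threshold, not from the paper

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; a real system would use a sentence encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def small_llm_adapt(cached_answer: str, new_prompt: str) -> str:
    """Hypothetical lightweight-LLM call that rewrites the cached answer for the new prompt."""
    return f"[adapted for: {new_prompt!r}] {cached_answer}"

def answer(prompt: str, cache: list, frontier_llm) -> str:
    q = embed(prompt)
    if cache:
        sims = [float(q @ entry["emb"]) for entry in cache]
        best = int(np.argmax(sims))
        if sims[best] >= SIMILARITY_THRESHOLD:
            return small_llm_adapt(cache[best]["answer"], prompt)   # hit: tweak, don't regenerate
    fresh = frontier_llm(prompt)                                    # miss: pay full cost once
    cache.append({"emb": q, "answer": fresh})
    return fresh

if __name__ == "__main__":
    cache = []
    frontier = lambda p: f"(frontier model answer to {p!r})"
    print(answer("What drives electricity prices?", cache, frontier))   # cache miss
    print(answer("What drives electricity prices?", cache, frontier))   # cache hit, adapted
```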

[276] A general language model for peptide identification

Jixiu Zhai, Zikun Wang, Tianchi Lu, Haitian Zhong, Ziyang Xu, Yuhuan Liu, Shengrui Xu, Jingwan Wang, Dan Huang

Main category: cs.LG

TL;DR: PDeepPP is a unified deep learning framework that combines pretrained protein language models with transformer-convolutional architecture for accurate identification of bioactive peptides and protein post-translational modifications across diverse biological tasks.

DetailsMotivation: Current computational methods lack generalizability across diverse peptide functions, limiting accurate identification of bioactive peptides and protein post-translational modifications which are essential for understanding protein function and therapeutic discovery.

Method: Integrates pretrained protein language models with hybrid transformer-convolutional architecture, uses comprehensive benchmark datasets with strategies to address data imbalance, and systematically extracts both global and local sequence features.

Result: Achieves state-of-the-art performance in 25 of 33 biological identification tasks, with high accuracy in antimicrobial (0.9726) and phosphorylation site (0.9984) identification, 99.5% specificity in glycosylation site prediction, and substantial reduction in false negatives in antimalarial tasks.

Conclusion: PDeepPP enables large-scale, accurate peptide analysis to support biomedical research and discovery of novel therapeutic targets for disease treatment, with all code, datasets, and pretrained models publicly available.

Abstract: Accurate identification of bioactive peptides (BPs) and protein post-translational modifications (PTMs) is essential for understanding protein function and advancing therapeutic discovery. However, most computational methods remain limited in their generalizability across diverse peptide functions. Here, we present PDeepPP, a unified deep learning framework that integrates pretrained protein language models with a hybrid transformer-convolutional architecture, enabling robust identification across diverse peptide classes and PTM sites. We curated comprehensive benchmark datasets and implemented strategies to address data imbalance, allowing PDeepPP to systematically extract both global and local sequence features. Through extensive analyses-including dimensionality reduction and comparison studies-PDeepPP demonstrates strong, interpretable peptide representations and achieves state-of-the-art performance in 25 of the 33 biological identification tasks. Notably, PDeepPP attains high accuracy in antimicrobial (0.9726) and phosphorylation site (0.9984) identification, with 99.5% specificity in glycosylation site prediction and substantial reduction in false negatives in antimalarial tasks. By enabling large-scale, accurate peptide analysis, PDeepPP supports biomedical research and the discovery of novel therapeutic targets for disease treatment. All code, datasets, and pretrained models are publicly available via GitHub:https://github.com/fondress/PDeepPP and Hugging Face:https://huggingface.co/fondress/PDeppPP.

[277] To See a World in a Spark of Neuron: Disentangling Multi-task Interference for Training-free Model Merging

Zitao Fang, Guodong DU, Shuyang Yu, Yifei Guo, Yiwei Zhang, Yiyao Cao, Jing Li, Ho-Kin Tang, Sim Kuan Goh

Main category: cs.LG

TL;DR: NeuroMerging is a novel model merging framework that uses neuronal subspace decomposition to mitigate task interference and enable training-free fusion of fine-tuned models across diverse tasks.

DetailsMotivation: Fine-tuning pre-trained models improves task-specific performance but reduces generalization. Model merging techniques help create multi-task models but suffer from task interference due to overlooking neuronal mechanisms, connectivity, and activation patterns.

Method: Decomposed task-specific representations into two complementary neuronal subspaces regulating input sensitivity and task adaptability. Developed NeuroMerging framework to mitigate task interference within these neuronal subspaces for training-free model fusion.

Result: NeuroMerging achieved superior performance compared to existing methods on multi-task benchmarks across both natural language and vision domains through extensive experiments.

Conclusion: Aligning neuronal mechanisms is crucial for effective model merging. The approach offers new insights into mitigating task interference and improving knowledge fusion, demonstrating the importance of considering how neurons relay and process information in model merging.

Abstract: Fine-tuning pre-trained models on targeted datasets enhances task-specific performance but often comes at the expense of generalization. Model merging techniques, which integrate multiple fine-tuned models into a single multi-task model through task arithmetic, offer a promising solution. However, task interference remains a fundamental challenge, leading to performance degradation and suboptimal merged models. Existing approaches largely overlooked the fundamental roles of neurons, their connectivity, and activation, resulting in a merging process and a merged model that does not consider how neurons relay and process information. In this work, we present the first study that relies on neuronal mechanisms for model merging. Specifically, we decomposed task-specific representations into two complementary neuronal subspaces that regulate input sensitivity and task adaptability. Leveraging this decomposition, we introduced NeuroMerging, a novel merging framework developed to mitigate task interference within neuronal subspaces, enabling training-free model fusion across diverse tasks. Through extensive experiments, we demonstrated that NeuroMerging achieved superior performance compared to existing methods on multi-task benchmarks across both natural language and vision domains. Our findings highlighted the importance of aligning neuronal mechanisms in model merging, offering new insights into mitigating task interference and improving knowledge fusion. Our project is available at https://ZzzitaoFang.github.io/projects/NeuroMerging/.
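
NeuroMerging builds on task arithmetic; the generic training-free merge it starts from can be sketched as below (task vectors are the fine-tuned-minus-base weight differences). The neuronal-subspace decomposition itself is the paper's contribution and is not reproduced here; the scaling coefficient is an arbitrary example.

```python
import torch

def merge_task_arithmetic(base_state, finetuned_states, scale=0.4):
    """Training-free merge: theta = theta_base + scale * sum_i (theta_i - theta_base)."""
    merged = {}
    for name, base_w in base_state.items():
        task_vectors = [ft[name] - base_w for ft in finetuned_states]
        merged[name] = base_w + scale * sum(task_vectors)
    return merged

if __name__ == "__main__":
    base = {"layer.weight": torch.zeros(2, 2)}
    ft_a = {"layer.weight": torch.ones(2, 2)}          # "task A" fine-tune
    ft_b = {"layer.weight": -0.5 * torch.ones(2, 2)}   # "task B" fine-tune
    print(merge_task_arithmetic(base, [ft_a, ft_b])["layer.weight"])
```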

[278] VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making

Mohamed Salim Aissi, Clemence Grislain, Mohamed Chetouani, Olivier Sigaud, Laure Soulier, Nicolas Thome

Main category: cs.LG

TL;DR: VIPER is a multimodal framework that combines VLM perception and LLM reasoning for visual instruction-based planning, achieving state-of-the-art performance on ALFWorld benchmark.

DetailsMotivation: While LLMs excel at text reasoning and VLMs are effective for visual perception, applying these models for visual instruction-based planning remains an open problem that needs to be addressed.

Method: Uses a modular pipeline with frozen VLM to generate textual descriptions of images, then processes them with LLM policy to predict actions. Fine-tunes reasoning module using behavioral cloning and reinforcement learning.

Result: Significantly outperforms state-of-the-art visual instruction-based planners on ALFWorld benchmark and narrows the gap with purely text-based oracles.

Conclusion: VIPER’s text-based intermediate representation enhances explainability and enables fine-grained analysis of perception and reasoning components, paving the way for improved multimodal planning systems.

Abstract: While Large Language Models (LLMs) excel at reasoning on text and Vision-Language Models (VLMs) are highly effective for visual perception, applying those models for visual instruction-based planning remains a widely open problem. In this paper, we introduce VIPER, a novel framework for multimodal instruction-based planning that integrates VLM-based perception with LLM-based reasoning. Our approach uses a modular pipeline where a frozen VLM generates textual descriptions of image observations, which are then processed by an LLM policy to predict actions based on the task goal. We fine-tune the reasoning module using behavioral cloning and reinforcement learning, improving our agent’s decision-making capabilities. Experiments on the ALFWorld benchmark show that VIPER significantly outperforms state-of-the-art visual instruction-based planners while narrowing the gap with purely text-based oracles. By leveraging text as an intermediate representation, VIPER also enhances explainability, paving the way for a fine-grained analysis of perception and reasoning components.

[279] FlexFringe: Modeling Software Behavior by Learning Probabilistic Automata

Sicco Verwer, Christian Hammerschmidt

Main category: cs.LG

TL;DR: Efficient implementations of probabilistic deterministic finite automaton learning methods in FlexFringe, showing competitive results and improvements over default implementations, with applications in software log analysis and anomaly detection.

DetailsMotivation: To improve the performance and practicality of state-merging algorithms for probabilistic deterministic finite automaton learning, making them more efficient and effective for real-world applications like software log analysis and anomaly detection.

Method: Implemented efficient versions of well-known state-merging strategies with several modifications to enhance practical performance, using FlexFringe as the platform for learning probabilistic deterministic finite automata.

Result: The algorithms achieved competitive results and significant improvements over default implementations. Learned interpretable models from software logs for anomaly detection, with smaller convoluted models outperforming neural network-based solutions.

Conclusion: FlexFringe provides efficient and effective probabilistic deterministic finite automaton learning methods that outperform default implementations and neural network approaches in anomaly detection tasks, while offering interpretable models for software log analysis.

Abstract: We present the efficient implementations of probabilistic deterministic finite automaton learning methods available in FlexFringe. These implement well-known strategies for state-merging including several modifications to improve their performance in practice. We show experimentally that these algorithms obtain competitive results and significant improvements over a default implementation. We also demonstrate how to use FlexFringe to learn interpretable models from software logs and use these for anomaly detection. Although less interpretable, we show that learning smaller more convoluted models improves the performance of FlexFringe on anomaly detection, outperforming an existing solution based on neural nets.

[280] Joint Optimization of Energy Consumption and Completion Time in Federated Learning

Xinyu Zhou, Jun Zhao, Huimei Han, Claude Guet

Main category: cs.LG

TL;DR: Proposed a resource allocation algorithm for Federated Learning that optimizes energy consumption and completion time through bandwidth, power, and CPU frequency allocation.

DetailsMotivation: To balance the trade-off between energy and execution latency in Federated Learning systems to accommodate different application demands and scenarios while preserving privacy.

Method: Formulated an optimization problem to minimize weighted sum of energy and completion time, decomposed it into two subproblems, and devised a resource allocation algorithm for bandwidth, transmission power, and CPU frequency allocation.

Result: Numerical results show the proposed algorithm outperforms state-of-the-art methods across different weight parameters and demands.

Conclusion: The developed resource allocation algorithm effectively optimizes FL system performance, providing better energy-latency trade-offs than existing approaches.

Abstract: Federated Learning (FL) is an intriguing distributed machine learning approach due to its privacy-preserving characteristics. To balance the trade-off between energy and execution latency, and thus accommodate different demands and application scenarios, we formulate an optimization problem to minimize a weighted sum of total energy consumption and completion time through two weight parameters. The optimization variables include bandwidth, transmission power and CPU frequency of each device in the FL system, where all devices are linked to a base station and train a global model collaboratively. Through decomposing the non-convex optimization problem into two subproblems, we devise a resource allocation algorithm to determine the bandwidth allocation, transmission power, and CPU frequency for each participating device. We further present the convergence analysis and computational complexity of the proposed algorithm. Numerical results show that our proposed algorithm not only has better performance at different weight parameters (i.e., different demands) but also outperforms the state of the art.
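
To make the objective concrete, the sketch below evaluates a weighted sum of total energy and round completion time using textbook cost models that are common in FL resource-allocation papers (CMOS computation energy proportional to f^2, Shannon-rate upload time); these formulas and all numbers are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import math

def device_cost(freq_hz, power_w, bandwidth_hz, cycles_per_sample, n_samples,
                upload_bits, channel_gain, noise_psd, kappa=1e-28):
    """Per-round computation + communication time and energy for one device."""
    comp_time = cycles_per_sample * n_samples / freq_hz
    comp_energy = kappa * (freq_hz ** 2) * cycles_per_sample * n_samples
    rate = bandwidth_hz * math.log2(1 + power_w * channel_gain / (noise_psd * bandwidth_hz))
    comm_time = upload_bits / rate
    comm_energy = power_w * comm_time
    return comp_time + comm_time, comp_energy + comm_energy

def weighted_objective(devices, w_energy=0.5, w_time=0.5):
    """Weighted sum of total energy and round completion time (slowest device)."""
    times, energies = zip(*(device_cost(**d) for d in devices))
    return w_energy * sum(energies) + w_time * max(times)

if __name__ == "__main__":
    d = dict(freq_hz=1e9, power_w=0.2, bandwidth_hz=1e6, cycles_per_sample=1e4,
             n_samples=500, upload_bits=1e6, channel_gain=1e-7, noise_psd=1e-13)
    print(weighted_objective([d, {**d, "freq_hz": 5e8}]))
```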

[281] Calibrating Transformers via Sparse Gaussian Processes

Wenlong Chen, Yingzhen Li

Main category: cs.LG

TL;DR: SGPA replaces dot-product attention with sparse Gaussian processes to improve Transformer uncertainty calibration while maintaining competitive accuracy.

DetailsMotivation: Extending Transformer success to safety-critical domains requires better uncertainty estimation, which current models lack.

Method: Replace scaled dot-product operation with symmetric kernel and use sparse Gaussian processes to approximate posterior processes in multi-head attention blocks.

Result: Achieves competitive predictive accuracy while significantly improving both in-distribution calibration and out-of-distribution robustness/detection across text, image, and graph tasks.

Conclusion: SGPA successfully enables Transformers to provide calibrated uncertainty estimates without sacrificing performance, making them suitable for safety-critical applications.

Abstract: Transformer models have achieved profound success in prediction tasks in a wide range of applications in natural language processing, speech recognition and computer vision. Extending Transformer’s success to safety-critical domains requires calibrated uncertainty estimation which remains under-explored. To address this, we propose Sparse Gaussian Process attention (SGPA), which performs Bayesian inference directly in the output space of multi-head attention blocks (MHAs) in transformer to calibrate its uncertainty. It replaces the scaled dot-product operation with a valid symmetric kernel and uses sparse Gaussian processes (SGP) techniques to approximate the posterior processes of MHA outputs. Empirically, on a suite of prediction tasks on text, images and graphs, SGPA-based Transformers achieve competitive predictive accuracy, while noticeably improving both in-distribution calibration and out-of-distribution robustness and detection.
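
A minimal way to see the "replace the scaled dot-product with a symmetric kernel" idea is to compute attention weights from an RBF kernel over the queries. This toy snippet shows kernel attention only; it omits the sparse-GP posterior machinery that actually produces SGPA's uncertainty estimates, and the lengthscale is arbitrary.

```python
import torch

def rbf_kernel_attention(q: torch.Tensor, v: torch.Tensor, lengthscale: float = 1.0):
    """Attention weights from a symmetric RBF kernel k(q_i, q_j) instead of softmax(QK^T).
    q: (batch, seq, d), v: (batch, seq, d_v)."""
    sq_dists = torch.cdist(q, q, p=2) ** 2                    # (batch, seq, seq), symmetric
    weights = torch.exp(-sq_dists / (2 * lengthscale ** 2))
    weights = weights / weights.sum(dim=-1, keepdim=True)     # normalize rows
    return weights @ v

if __name__ == "__main__":
    q = torch.randn(2, 10, 16)
    v = torch.randn(2, 10, 32)
    print(rbf_kernel_attention(q, v).shape)                   # torch.Size([2, 10, 32])
```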

[282] Reasoning Large Language Model Errors Arise from Hallucinating Critical Problem Features

Alex Heyman, Joel Zylberberg

Main category: cs.LG

TL;DR: RLLMs hallucinate graph edges not in prompts, causing significant error rates in constraint-satisfaction problems like graph coloring.

DetailsMotivation: To understand failure modes and causes of reasoning large language models (RLLMs) in constraint-satisfaction tasks, as they remain imperfect reasoners despite CoT training.

Method: Tested multiple RLLMs (o1-mini, o3-mini, DeepSeek-R1, Claude 3.7 Sonnet, Gemini 2.5 Pro Preview, Grok 3 Mini Beta) on graph coloring problems with varying complexity, analyzed error rates and CoT explanations, and validated findings with stable matching problems.

Result: RLLMs consistently hallucinate graph edges not specified in prompts across complexity levels and semantic frames, accounting for significant fractions of incorrect answers (majority for some models).

Conclusion: RLLMs have broader issues with misrepresenting problem specifics; design choices should address input-conflicting hallucinations to improve reasoning reliability.

Abstract: Large language models have recently made great strides in reasoning task performance through chain-of-thought (CoT) strategies trained via reinforcement learning; however, these “reasoning large language models” (RLLMs) remain imperfect reasoners, and understanding the frequencies and causes of their failure modes is important for both users and developers. We test o1-mini, o3-mini, DeepSeek-R1, Claude 3.7 Sonnet, Gemini 2.5 Pro Preview, and Grok 3 Mini Beta on graph coloring as a variable-complexity constraint-satisfaction logic problem, and find evidence from both error rate comparisons and CoT/explanation text analysis that RLLMs are prone to hallucinate graph edges not specified in the prompt. This phenomenon persists across multiple problem complexity levels and semantic frames, and it appears to account for a significant fraction of the incorrect answers from every tested model, and the vast majority of them for some models. We also validate the generalizability of this input-conflicting hallucination phenomenon with smaller-scale experiments on a type of stable matching problem. Our results indicate that RLLMs may possess broader issues with misrepresentation of problem specifics, and we offer suggestions for design choices to mitigate this weakness.
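
The kind of check implied by the error analysis -- scanning a model's chain-of-thought for edges that were never given in the prompt -- is easy to script. The edge-mention pattern below is an illustrative assumption about how an explanation might name vertex pairs, not the paper's parsing pipeline.

```python
import re

def mentioned_edges(explanation: str) -> set:
    """Extract vertex pairs the explanation treats as adjacent.
    Assumes mentions look like 'vertex 3 and vertex 7' (illustrative pattern)."""
    pairs = re.findall(r"vertex (\d+) and vertex (\d+)", explanation)
    return {tuple(sorted((int(a), int(b)))) for a, b in pairs}

def hallucinated_edges(explanation: str, true_edges: set) -> set:
    """Edges the model reasons about that were never specified in the prompt."""
    true_edges = {tuple(sorted(e)) for e in true_edges}
    return mentioned_edges(explanation) - true_edges

if __name__ == "__main__":
    graph = {(0, 1), (1, 2)}
    cot = "vertex 0 and vertex 1 conflict, and vertex 0 and vertex 2 also share an edge."
    print(hallucinated_edges(cot, graph))    # {(0, 2)}
```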

Jialong Zhou, Xing Ai, Yuni Lai, Tomasz Michalak, Gaolei Li, Jianhua Li, Di Tang, Xingxing Zhang, Mengpei Yang, Kai Zhou

Main category: cs.LG

TL;DR: The paper identifies vulnerabilities in signed graph neural networks (SGNNs) due to balance theory, proposes balance-attack to exploit these vulnerabilities, and introduces BA-SGCL framework for robust defense using contrastive learning and balance augmentation.

DetailsMotivation: Balance theory, while essential for modeling signed relationships in SGNNs, introduces exploitable vulnerabilities to black-box attacks, creating security concerns for signed graph analysis in social networks.

Method: Proposes balance-attack strategy to compromise graph balance degree with efficient heuristic algorithm, and develops BA-SGCL framework combining contrastive learning with balance augmentation techniques to maintain high balance degree in latent space.

Result: Extensive experiments show effectiveness of balance-attack in exploiting vulnerabilities and superior robustness of BA-SGCL across multiple SGNN architectures and real-world datasets, effectively addressing the irreversibility challenge.

Conclusion: The proposed BA-SGCL framework advances security and reliability of signed graph analysis by circumventing balance-related vulnerabilities and enhancing model resilience through contrastive learning and balance augmentation techniques.

Abstract: Signed graphs serve as fundamental data structures for representing positive and negative relationships in social networks, with signed graph neural networks (SGNNs) emerging as the primary tool for their analysis. Our investigation reveals that balance theory, while essential for modeling signed relationships in SGNNs, inadvertently introduces exploitable vulnerabilities to black-box attacks. To showcase this, we propose balance-attack, a novel adversarial strategy specifically designed to compromise graph balance degree, and develop an efficient heuristic algorithm to solve the associated NP-hard optimization problem. While existing approaches attempt to restore attacked graphs through balance learning techniques, they face a critical challenge we term “Irreversibility of Balance-related Information,” as restored edges fail to align with original attack targets. To address this limitation, we introduce Balance Augmented-Signed Graph Contrastive Learning (BA-SGCL), an innovative framework that combines contrastive learning with balance augmentation techniques to achieve robust graph representations. By maintaining high balance degree in the latent space, BA-SGCL not only effectively circumvents the irreversibility challenge but also significantly enhances model resilience. Extensive experiments across multiple SGNN architectures and real-world datasets demonstrate both the effectiveness of our proposed balance-attack and the superior robustness of BA-SGCL, advancing the security and reliability of signed graph analysis in social networks. Datasets and codes of the proposed framework are at the github repository https://anonymous.4open.science/r/BA-SGCL-submit-DF41/.

[284] Generative Example-Based Explanations: Bridging the Gap between Generative Modeling and Explainability

Philipp Vaeth, Alexander M. Fruehwald, Benjamin Paassen, Magda Gregorova

Main category: cs.LG

TL;DR: A probabilistic framework for example-based explanations that bridges the gap between deep generative modeling and classical explainability literature.

DetailsMotivation: To address the conceptual and communication gap between deep generative methods for example-based explanations and classical explainability literature, which leads to misunderstandings and misalignments in goals and expectations.

Method: Proposes a probabilistic framework that formally defines example-based explanations in a probabilistic manner suitable for deep generative modeling while maintaining coherence with widely accepted explainability characteristics and desiderata.

Result: Provides a constructive framework for developing well-grounded generative algorithms for example-based explanations and facilitates communication between generative and explainability research communities.

Conclusion: The framework aims to foster rigor, transparency, and improve research quality in example-based explanations by bridging the gap between generative modeling and explainability literature.

Abstract: Recently, several methods have leveraged deep generative modeling to produce example-based explanations of image classifiers. Despite producing visually stunning results, these methods are largely disconnected from classical explainability literature. This conceptual and communication gap leads to misunderstandings and misalignments in goals and expectations. In this paper, we bridge this gap by proposing a probabilistic framework for example-based explanations, formally defining the example-based explanations in a probabilistic manner amenable for modeling via deep generative models while coherent with the critical characteristics and desiderata widely accepted in the explainability community. Our aim is on one hand to provide a constructive framework for the development of well-grounded generative algorithms for example-based explanations and, on the other, to facilitate communication between the generative and explainability research communities, foster rigor and transparency, and improve the quality of peer discussion and research progress in this promising direction.

[285] Symbolic regression via MDLformer-guided search: from minimizing prediction error to minimizing description length

Zihan Yu, Jingtao Ding, Yong Li, Depeng Jin

Main category: cs.LG

TL;DR: Proposes SR4MDL, a symbolic regression method using minimum description length (MDL) as search objective to overcome limitations of prediction error-based approaches, achieving 43.92% better performance than state-of-the-art methods.

DetailsMotivation: Traditional symbolic regression methods use prediction error as search objective, but formulas with similar shapes can have different symbolic forms, causing non-monotonic error decrease and low recovery rates.

Method: Design MDLformer neural network to estimate minimum description length of input data through large-scale training, then use MDL as search objective in symbolic regression algorithm SR4MDL.

Result: Achieves 50 successful formula recoveries across 133 benchmark problems (43.92% improvement over SOTA) and demonstrates strong generalization on 122 unseen black-box problems.

Conclusion: MDL-based search objective provides monotonic distance measure to target formula, enabling more effective symbolic regression with significantly improved recovery rates and generalization performance.

Abstract: Symbolic regression, a task discovering the formula best fitting the given data, is typically based on the heuristical search. These methods usually update candidate formulas to obtain new ones with lower prediction errors iteratively. However, since formulas with similar function shapes may have completely different symbolic forms, the prediction error does not decrease monotonously as the search approaches the target formula, causing the low recovery rate of existing methods. To solve this problem, we propose a novel search objective based on the minimum description length, which reflects the distance from the target and decreases monotonically as the search approaches the correct form of the target formula. To estimate the minimum description length of any input data, we design a neural network, MDLformer, which enables robust and scalable estimation through large-scale training. With the MDLformer’s output as the search objective, we implement a symbolic regression method, SR4MDL, that can effectively recover the correct mathematical form of the formula. Extensive experiments illustrate its excellent performance in recovering formulas from data. Our method successfully recovers around 50 formulas across two benchmark datasets comprising 133 problems, outperforming state-of-the-art methods by 43.92%. Experiments on 122 unseen black-box problems further demonstrate its generalization performance. We release our code at https://github.com/tsinghua-fib-lab/SR4MDL .
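
The paper trains a neural MDLformer to estimate description length; to make the search objective itself concrete, here is the classical two-part MDL approximation (bits for the formula plus bits to encode the residuals under a Gaussian model). This is only the textbook surrogate, not the learned estimator, and the token-cost model is an arbitrary illustration.

```python
import math
import numpy as np

def two_part_mdl(expr_tokens: list, residuals: np.ndarray, n_symbols: int = 32) -> float:
    """Two-part MDL score: bits to describe the formula + bits to encode its residuals.
    Lower is better."""
    model_bits = len(expr_tokens) * math.log2(n_symbols)
    var = max(float(np.mean(residuals ** 2)), 1e-12)
    data_bits = 0.5 * len(residuals) * math.log2(2 * math.pi * math.e * var)
    return model_bits + data_bits

if __name__ == "__main__":
    x = np.linspace(0, 1, 200)
    y = np.sin(3 * x)
    good = two_part_mdl(["sin", "*", "3", "x"], y - np.sin(3 * x))
    rough = two_part_mdl(["*", "2.9", "x"], y - 2.9 * x)
    print(good < rough)   # the correct formula gets the lower description length
```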

[286] Discrete Diffusion in Large Language and Multimodal Models: A Survey

Runpeng Yu, Qi Li, Xinchao Wang

Main category: cs.LG

TL;DR: Survey paper on Discrete Diffusion Language Models (dLLMs) and Multimodal Language Models (dMLLMs) that provide parallel decoding, faster inference, and enhanced control compared to autoregressive models.

DetailsMotivation: To systematically review and analyze the emerging field of discrete diffusion models for language and multimodal tasks, which offer advantages over traditional autoregressive approaches including parallel generation, fine-grained control, and faster inference speeds.

Method: Comprehensive survey methodology including historical development tracing, mathematical framework formalization, model categorization, analysis of training/inference techniques, and application summarization across language, vision-language, and biological domains.

Result: Discrete diffusion models demonstrate comparable performance to autoregressive counterparts while achieving up to 10x inference acceleration, with growing industrial and academic adoption across various domains.

Conclusion: Discrete diffusion models represent a promising alternative to autoregressive intelligence, with the survey providing foundational understanding and identifying future research directions for this emerging paradigm.

Abstract: In this work, we provide a systematic survey of Discrete Diffusion Language Models (dLLMs) and Discrete Diffusion Multimodal Language Models (dMLLMs). Unlike autoregressive (AR) models, dLLMs and dMLLMs adopt a multi-token, parallel decoding paradigm using full attention and a denoising-based generation strategy. This paradigm naturally enables parallel generation, fine-grained output control, and dynamic perception. These capabilities were previously difficult to achieve with AR models. A growing number of industrial-scale proprietary d(M)LLMs, as well as a large number of open-source academic d(M)LLMs, have demonstrated performance comparable to their autoregressive counterparts, while achieving up to 10x acceleration in inference speed. These developments position discrete diffusion models as a promising alternative to intelligence based on the traditional autoregressive approach. In this work, we present a comprehensive overview of research in the dLLM and dMLLM domains. We trace the historical development of dLLMs and dMLLMs, formalize the underlying mathematical frameworks, and categorize representative models. We further analyze key techniques for training and inference, and summarize emerging applications across language, vision-language, biological, and other domains. We conclude by discussing future directions for research and deployment. Related papers are collected at https://github.com/LiQiiiii/Awesome-Discrete-Diffusion-LLM_MLLM
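
The sketch below illustrates the kind of masked, confidence-based parallel decoding loop that distinguishes discrete diffusion LMs from autoregressive decoding; the `model` callable, the unmasking schedule, and all sizes are assumptions for illustration rather than any specific model's API.

```python
# Minimal sketch of denoising-style parallel decoding: all positions start
# masked, and each step fills in the positions the model is most confident
# about. `model` is a hypothetical callable returning per-position logits;
# real dLLMs differ in schedule and remasking details.
import torch

def diffusion_decode(model, seq_len, vocab_size, mask_id, steps=8):
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    per_step = max(1, seq_len // steps)
    for _ in range(steps):
        still_masked = (tokens == mask_id)
        if not still_masked.any():
            break
        logits = model(tokens)                      # (1, seq_len, vocab_size)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)              # confidence per position
        conf = conf.masked_fill(~still_masked, -1.0)
        # Unmask the `per_step` most confident masked positions in parallel.
        idx = conf[0].topk(min(per_step, int(still_masked.sum()))).indices
        tokens[0, idx] = pred[0, idx]
    return tokens

# Toy usage with a random "model" just to show the control flow.
vocab, length, mask = 100, 16, 99
toy_model = lambda t: torch.randn(1, length, vocab)
print(diffusion_decode(toy_model, length, vocab, mask))
```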

[287] FAMES: Fast Approximate Multiplier Substitution for Mixed-Precision Quantized DNNs–Down to 2 Bits!

Yi Ren, Ruge Xu, Xinfei Guo, Weikang Qian

Main category: cs.LG

TL;DR: FAMES enables effective use of approximate multipliers in ultra-low bitwidth (down to 2-bit) quantized DNNs, achieving 28.67% energy reduction with <1% accuracy loss and 300x speedup over previous methods.

DetailsMotivation: Existing approximate multiplier research assumes high bitwidths, but modern quantization uses 2-bit precision, making high-bitwidth approximate multipliers energy-inefficient compared to low-bitwidth exact multipliers.

Method: FAMES (Fast Approximate Multiplier Substitution) - a systematic method for substituting approximate multipliers in mixed-precision DNNs with very low bitwidths.

Result: 28.67% average energy reduction on state-of-the-art mixed-precision quantized models (down to 2 bits) with accuracy losses under 1%, and 300x faster than genetic algorithm-based approaches.

Conclusion: Approximate multipliers can be effectively applied to ultra-low bitwidth quantized DNNs, providing significant energy savings without compromising accuracy, with FAMES offering a fast and efficient solution.

Abstract: A widely-used technique in designing energy-efficient deep neural network (DNN) accelerators is quantization. Recent progress in this direction has reduced the bitwidths used in DNNs down to 2. Meanwhile, many prior works apply approximate multipliers (AppMuls) in designing DNN accelerators to lower their energy consumption. Unfortunately, these works still assume a bitwidth much larger than 2, which falls far behind the state of the art in quantization and even challenges the meaningfulness of applying AppMuls in DNN accelerators, since a high-bitwidth AppMul consumes much more energy than a low-bitwidth exact multiplier! Thus, an important problem to study is: Can approximate multipliers be effectively applied to quantized DNN models with very low bitwidths? In this work, we give an affirmative answer to this question and present a systematic solution that achieves the answer: FAMES, a fast approximate multiplier substitution method for mixed-precision DNNs. Our experiments demonstrate an average 28.67% energy reduction on state-of-the-art mixed-precision quantized models with bitwidths as low as 2 bits and accuracy losses kept under 1%. Additionally, our approach is up to 300x faster than previous genetic algorithm-based methods.

[288] A Nonlinear Low-rank Representation Model with Convolutional Neural Network for Imputing Water Quality Data

Xin Liao, Bing Yang, Cai Yu

Main category: cs.LG

TL;DR: Proposes a Nonlinear Low-rank Representation model with CNNs for imputing missing water quality data, outperforming existing methods in accuracy.

DetailsMotivation: Water quality monitoring systems face challenges with large amounts of missing data due to sensor failures and communication delays, leading to high-dimensional sparse data that traditional methods struggle to handle effectively.

Method: Uses Convolutional Neural Networks to fuse temporal features for modeling temporal dependence and extract nonlinear interactions/local patterns to mine higher-order relationships and achieve deep fusion of multidimensional information.

Result: Experimental studies on three real water quality datasets show the proposed model significantly outperforms existing state-of-the-art data imputation models in estimation accuracy.

Conclusion: Provides an effective approach for handling water quality monitoring data in complex dynamic environments, successfully addressing the challenges of missing data imputation in water quality monitoring systems.

Abstract: The integrity of Water Quality Data (WQD) is critical in environmental monitoring for scientific decision-making and ecological protection. However, water quality monitoring systems are often challenged by large amounts of missing data due to unavoidable problems such as sensor failures and communication delays, which further lead to water quality data becoming High-Dimensional and Sparse (HDS). Traditional data imputation methods struggle to depict the potential dynamics and fail to capture deep data features, resulting in unsatisfactory imputation performance. To effectively address these issues, this paper proposes a Nonlinear Low-rank Representation model (NLR) with Convolutional Neural Networks (CNN) for imputing missing WQD, which utilizes CNNs to implement two ideas: a) fusing temporal features to model the temporal dependence of data between time slots, and b) extracting nonlinear interactions and local patterns to mine higher-order relationship features and achieve deep fusion of multidimensional information. Experimental studies on three real water quality datasets demonstrate that the proposed model significantly outperforms existing state-of-the-art data imputation models in terms of estimation accuracy. It provides an effective approach for handling water quality monitoring data in complex dynamic environments.

[289] Differentially Private Random Feature Model

Chunyang Liao, Deanna Needell, Hayden Schaeffer, Alexander Xue

Main category: cs.LG

TL;DR: Differentially private random feature model for privacy-preserving machine learning in over-parametrized regime with theoretical guarantees and improved fairness.

DetailsMotivation: Privacy-preserving ML is crucial for sensitive data, with differential privacy being a standard mechanism. Existing methods need improvement for over-parametrized settings and fairness concerns.

Method: Private random feature model using output perturbation techniques on min-norm interpolation in over-parametrized regime (more features than samples).

Result: Method preserves privacy with generalization error bounds. Superior performance compared to other privacy-preserving methods on synthetic and benchmark data. Reduces disparate impact and improves fairness.

Conclusion: First privacy-preserving random feature model in over-parametrized regime with theoretical guarantees. Shows potential to mitigate disparate impact in DP mechanisms while maintaining strong generalization performance.

Abstract: Designing privacy-preserving machine learning algorithms has received great attention in recent years, especially in the setting when the data contains sensitive information. Differential privacy (DP) is a widely used mechanism for data analysis with privacy guarantees. In this paper, we produce a differentially private random feature model. Random features, which were proposed to approximate large-scale kernel machines, have been used to study privacy-preserving kernel machines as well. We consider the over-parametrized regime (more features than samples) where the non-private random feature model is learned via solving the min-norm interpolation problem, and then we apply output perturbation techniques to produce a private model. We show that our method preserves privacy and derive a generalization error bound for the method. To the best of our knowledge, we are the first to consider privacy-preserving random feature models in the over-parametrized regime and provide theoretical guarantees. We empirically compare our method with other privacy-preserving learning methods in the literature as well. Our results show that our approach is superior to the other methods in terms of generalization performance on synthetic data and benchmark data sets. Additionally, it was recently observed that DP mechanisms may exhibit and exacerbate disparate impact, which means that the outcomes of DP learning algorithms vary significantly among different groups. We show that both theoretically and empirically, random features have the potential to reduce disparate impact, and hence achieve better fairness.
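
A minimal sketch of the general recipe described above, assuming Gaussian output perturbation on the min-norm random feature solution; the noise scale `sigma` is a placeholder rather than the paper's calibrated value.

```python
# Minimal sketch of a differentially private random feature model via output
# perturbation: fit the min-norm interpolant in an over-parametrized random
# feature space, then add Gaussian noise to the learned coefficients.
# `sigma` is a placeholder; the paper calibrates it from the desired
# (epsilon, delta) budget and a sensitivity analysis.
import numpy as np

rng = np.random.default_rng(0)
n, d, n_features = 50, 5, 200            # more features than samples
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

W = rng.normal(size=(d, n_features))      # random feature weights
b = rng.uniform(0, 2 * np.pi, n_features)
phi = lambda Z: np.cos(Z @ W + b)         # random Fourier-style features

Phi = phi(X)
theta = np.linalg.pinv(Phi) @ y           # min-norm interpolating solution

sigma = 0.05                              # placeholder privacy noise scale
theta_private = theta + rng.normal(scale=sigma, size=theta.shape)

X_test = rng.normal(size=(10, d))
print(phi(X_test) @ theta_private)        # private predictions
```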

[290] HopCast: Calibration of Autoregressive Dynamics Models

Muhammad Bilal Shahid, Cody Fleming

Main category: cs.LG

TL;DR: HOP method uses Modern Hopfield Networks as a Corrector to predict errors of a deterministic Predictor, enabling well-calibrated multi-step predictions for dynamical systems.

DetailsMotivation: Existing deep learning models for dynamical systems produce calibrated one-step predictions but lack proper uncertainty propagation for multi-step autoregressive predictions, leading to poor calibration.

Method: Predictor-Corrector approach with Modern Hopfield Networks that learn errors of a deterministic predictor, creating prediction intervals during autoregression.

Result: Produces sharper and better-calibrated prediction intervals with higher predictive accuracy compared to baselines, evaluated across various dynamical systems.

Conclusion: HOP provides effective uncertainty propagation for multi-step predictions and establishes the first benchmark for calibration errors in uncertainty propagation methods.

Abstract: Deep learning models are often trained to approximate dynamical systems that can be modeled using differential equations. Many of these models are optimized to predict one step ahead; such approaches produce calibrated one-step predictions if the predictive model can quantify uncertainty, such as Deep Ensembles. At inference time, multi-step predictions are generated via autoregression, which needs a sound uncertainty propagation method to produce calibrated multi-step predictions. This work introduces an alternative Predictor-Corrector approach named HopCast that uses Modern Hopfield Networks (MHN) to learn the errors of a deterministic Predictor that approximates the dynamical system. The Corrector predicts a set of errors for the Predictor’s output based on a context state at any timestep during autoregression. The set of errors creates sharper and well-calibrated prediction intervals with higher predictive accuracy compared to baselines without uncertainty propagation. The calibration and prediction performances are evaluated across a set of dynamical systems. This work is also the first to benchmark existing uncertainty propagation methods based on calibration errors.

[291] MDDM: A Molecular Dynamics Diffusion Model to Predict Particle Self-Assembly

Kevin Ferguson, Yu-hsuan Chen, Levent Burak Kara

Main category: cs.LG

TL;DR: MDDM is a diffusion model that predicts molecular conformations from input potential functions, outperforming baseline methods while incorporating domain-specific constraints like periodic boundaries.

DetailsMotivation: Molecular simulations are computationally expensive, creating a need for efficient methods to predict molecular structures from potential functions without running full simulations.

Method: Proposed MDDM (Molecular Dynamics Diffusion Model) trained on molecular dynamics self-assembly data, converts uniform noise to meaningful particle structures for arbitrary input potentials with built-in domain constraints.

Result: MDDM significantly outperforms baseline point-cloud diffusion models for both unconditional and conditional generation tasks.

Conclusion: The proposed diffusion model provides an efficient alternative to computationally expensive molecular simulations for predicting molecular conformations from potential functions.

Abstract: The discovery and study of new material systems rely on molecular simulations that often come with significant computational expense. We propose MDDM, a Molecular Dynamics Diffusion Model, which is capable of predicting a valid output conformation for a given input pair potential function. After training MDDM on a large dataset of molecular dynamics self-assembly results, the proposed model can convert uniform noise into a meaningful output particle structure corresponding to an arbitrary input potential. The model’s architecture has domain-specific properties built-in, such as satisfying periodic boundaries and being invariant to translation. The model significantly outperforms the baseline point-cloud diffusion model for both unconditional and conditional generation tasks.

[292] Comprehensive Evaluation of Prototype Neural Networks

Philipp Schlinge, Steffen Meinert, Martin Atzmueller

Main category: cs.LG

TL;DR: Comprehensive analysis of prototype-based XAI models (ProtoPNet, ProtoPool, PIPNet) using both standard and novel metrics across diverse datasets, with open-source implementation.

DetailsMotivation: Prototype models are crucial for explainable AI, but there's a need for thorough evaluation using comprehensive metrics to assess their interpretability and performance across various scenarios.

Method: Applied multiple prototype models on diverse datasets (fine-grained classification, Non-IID settings, multi-label classification) using a comprehensive set of metrics including both standard literature metrics and newly proposed ones.

Result: The study provides comparative analysis of prototype models’ performance and interpretability, along with an open-source library (quanproto) for easy metric application and extensibility.

Conclusion: The research contributes a systematic evaluation framework for prototype-based XAI models and provides a practical tool for the community to assess and extend interpretability metrics.

Abstract: Prototype models are an important method for explainable artificial intelligence (XAI) and interpretable machine learning. In this paper, we perform an in-depth analysis of a set of prominent prototype models, including ProtoPNet, ProtoPool and PIPNet. For their assessment, we apply a comprehensive set of metrics. In addition to applying standard metrics from the literature, we propose several new metrics to further complement the analysis of model interpretability. In our experimentation, we apply the set of prototype models on a diverse set of datasets including fine-grained classification, Non-IID settings and multi-label classification to further contrast the performance. Furthermore, we also provide our code as an open-source library (https://github.com/uos-sis/quanproto), which facilitates simple application of the metrics themselves, as well as extensibility, providing the option to easily add new metrics and models.

[293] Investigating Compositional Reasoning in Time Series Foundation Models

Willa Potosnak, Cristian Challu, Mononito Goswami, Kin G. Olivares, Michał Wiliński, Nina Żukowska, Artur Dubrawski

Main category: cs.LG

TL;DR: Time series foundation models show promising zero-shot performance, but it’s unclear if they reason or just memorize patterns. This study defines compositional reasoning in forecasting and evaluates 16 models, finding patch-based Transformers and residualized MLPs perform best with significantly lower computational costs.

DetailsMotivation: To determine whether large pre-trained time series foundation models (TSFMs) succeed through memorization or genuine reasoning capabilities, and to formally define and evaluate compositional reasoning in forecasting as distinct from in-distribution generalization.

Method: Evaluated 16 popular deep learning forecasting models on synthetic and real-world datasets, with controlled studies examining design choices in 7 open-source TSFMs to identify factors contributing to improved reasoning capabilities.

Result: Patch-based Transformers showed the best reasoning performance, closely followed by residualized MLP-based architectures which were 97% less computationally complex and 86% smaller in parameters. Some models outperformed statistical baselines in zero-shot out-of-distribution scenarios.

Conclusion: Only a few design choices (like tokenization method) significantly impacted Transformer performance. The study provides key insights into TSFM architecture design’s impact on compositional reasoning and generalization capabilities.

Abstract: Large pre-trained time series foundation models (TSFMs) have demonstrated promising zero-shot performance across a wide range of domains. However, a question remains: Do TSFMs succeed by memorizing patterns in training data, or do they possess the ability to reason about such patterns? While reasoning is a topic of great interest in the study of Large Language Models (LLMs), it is undefined and largely unexplored in the context of TSFMs. In this work, inspired by language modeling literature, we formally define compositional reasoning in forecasting and distinguish it from in-distribution generalization. We evaluate the reasoning and generalization capabilities of 16 popular deep learning forecasting models on multiple synthetic and real-world datasets. Additionally, through controlled studies, we systematically examine which design choices in 7 popular open-source TSFMs contribute to improved reasoning capabilities. Our study yields key insights into the impact of TSFM architecture design on compositional reasoning and generalization. We find that patch-based Transformers have the best reasoning performance, closely followed by residualized MLP-based architectures, which are 97% less computationally complex in terms of FLOPs and 86% smaller in terms of the number of trainable parameters. Interestingly, in some zero-shot out-of-distribution scenarios, these models can outperform moving average and exponential smoothing statistical baselines trained on in-distribution data. Only a few design choices, such as the tokenization method, had a significant (negative) impact on Transformer model performance.

[294] How Should We Meta-Learn Reinforcement Learning Algorithms?

Alexander David Goldie, Zilin Wang, Jaron Cohen, Jakob Nicolaus Foerster, Shimon Whiteson

Main category: cs.LG

TL;DR: Empirical comparison of meta-learning approaches for RL algorithms, evaluating performance, interpretability, and efficiency across different methods including evolution and LLMs.

DetailsMotivation: Meta-learning shows promise for improving RL algorithms but lacks systematic comparison between different approaches like evolutionary optimization and LLM-based code generation.

Method: Conducted empirical comparison of various meta-learning algorithms applied to different parts of the RL pipeline, evaluating meta-train/test performance, interpretability, sample cost, and training time.

Result: The study provides comprehensive performance metrics and analysis of different meta-learning approaches for RL algorithm development.

Conclusion: Proposes guidelines for meta-learning new RL algorithms to ensure optimal performance based on empirical findings comparing different meta-learning methods.

Abstract: The process of meta-learning algorithms from data, instead of relying on manual design, is growing in popularity as a paradigm for improving the performance of machine learning systems. Meta-learning shows particular promise for reinforcement learning (RL), where algorithms are often adapted from supervised or unsupervised learning despite their suboptimality for RL. However, until now there has been a severe lack of comparison between different meta-learning algorithms, such as using evolution to optimise over black-box functions or LLMs to propose code. In this paper, we carry out this empirical comparison of the different approaches when applied to a range of meta-learned algorithms which target different parts of the RL pipeline. In addition to meta-train and meta-test performance, we also investigate factors including the interpretability, sample cost and train time for each meta-learning algorithm. Based on these findings, we propose several guidelines for meta-learning new RL algorithms which will help ensure that future learned algorithms are as performant as possible.

[295] Cauchy Random Features for Operator Learning in Sobolev Space

Chunyang Liao, Deanna Needell, Hayden Schaeffer

Main category: cs.LG

TL;DR: Proposes a random feature operator learning method that provides theoretical guarantees and error bounds, offering similar or better performance than kernel-based and neural network methods with significantly reduced training times and simpler implementation.

DetailsMotivation: Existing operator learning methods (DeepONet, FNO) lack practical guarantees - existence theorems don't ensure obtainable accurate networks in practice. Recent kernel-based frameworks show promise but are computationally expensive.

Method: Random feature operator learning method that serves as randomized approximation of kernel methods, significantly reducing computational requirements for training while maintaining theoretical guarantees.

Result: Achieves similar or better test errors across benchmark examples compared to kernel-based and neural network methods, with significantly reduced training times. Implementation is simple and doesn’t require expensive GPU resources.

Conclusion: The random feature approach provides a practical, theoretically-grounded alternative to existing operator learning methods, offering computational efficiency and strong performance without requiring specialized hardware.

Abstract: Operator learning is the approximation of operators between infinite dimensional Banach spaces using machine learning approaches. While most progress in this area has been driven by variants of deep neural networks such as the Deep Operator Network and Fourier Neural Operator, the theoretical guarantees are often in the form of a universal approximation property. However, the existence theorems do not guarantee that an accurate operator network is obtainable in practice. Motivated by the recent kernel-based operator learning framework, we propose a random feature operator learning method with theoretical guarantees and error bounds. The random feature method can be viewed as a randomized approximation of a kernel method, which significantly reduces the computation requirements for training. We provide a generalization error analysis for our proposed random feature operator learning method along with comprehensive numerical results. Compared to kernel-based and neural network methods, the proposed method obtains similar or better test errors across benchmark examples with significantly reduced training times. An additional advantage is that our implementation is simple and does not require costly computational resources, such as GPUs.
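
The sketch below illustrates random feature operator learning on a toy operator (the antiderivative), assuming Cauchy-distributed feature weights as the title suggests; the target operator, feature map, grid sizes, and regularization are illustrative choices, not the paper's benchmarks or exact construction.

```python
# Minimal sketch of random feature operator learning: map a discretized input
# function u to a discretized output function v using random features with
# Cauchy-distributed weights and a ridge-regularized least-squares fit.
import numpy as np

rng = np.random.default_rng(0)
m = 64                                        # grid points per function
x = np.linspace(0, 1, m)

def sample_function():
    a, b, c = rng.normal(size=3)
    return a * np.sin(2 * np.pi * x) + b * np.cos(2 * np.pi * x) + c

U = np.stack([sample_function() for _ in range(200)])   # input functions u(x)
V = np.cumsum(U, axis=1) / m                            # toy target operator: antiderivative

n_features = 500
W = rng.standard_cauchy(size=(m, n_features)) * 0.1     # Cauchy random weights
b = rng.uniform(0, 2 * np.pi, n_features)
features = lambda F: np.cos(F @ W + b)

# Ridge-regularized least squares from random features to output grid values.
Phi = features(U)
lam = 1e-3
C = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_features), Phi.T @ V)

u_test = sample_function()
v_pred = features(u_test[None]) @ C
print(np.abs(v_pred - np.cumsum(u_test) / m).max())     # error on one test function
```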

[296] Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training

Vaibhav Singh, Paul Janson, Paria Mehrbod, Adam Ibrahim, Irina Rish, Eugene Belilovsky, Benjamin Thérien

Main category: cs.LG

TL;DR: Infinite learning rate schedule outperforms repeated cosine decay for continual self-supervised pre-training across image and language datasets, effectively addressing forgetting issues without fixed iteration budget constraints.

DetailsMotivation: Existing self-supervised learning methods struggle with non-stationary, non-IID real-world data streams and suffer from forgetting during re-warming phases in continual pre-training with cosine annealing schedules.

Method: Systematic comparison of cosine schedule vs infinite learning rate schedule for continual pre-training, evaluated across diverse image and language datasets including MAE pre-training and autoregressive language models.

Result: Infinite learning rate schedule consistently enhances continual pre-training performance, outperforming repeated cosine decay in both small-scale and large-scale MAE pre-training, as well as zero-shot language model benchmarks.

Conclusion: The infinite learning rate schedule is a more effective alternative to cosine decay for continual self-supervised pre-training, demonstrating scalability and superior performance across multiple domains without iteration budget restrictions.

Abstract: The ever-growing availability of unlabeled data presents both opportunities and challenges for training artificial intelligence systems. While self-supervised learning (SSL) has emerged as a powerful paradigm for extracting meaningful representations from vast amounts of unlabeled data, existing methods still struggle to adapt to the non-stationary, non-IID nature of real-world data streams without forgetting previously learned knowledge. Recent works have adopted a repeated cosine annealing schedule for large-scale continual pre-training; however, these schedules (1) inherently cause forgetting during the re-warming phase and (2) have not been systematically compared to existing continual SSL methods. In this work, we systematically compare the widely used cosine schedule with the recently proposed infinite learning rate schedule and empirically find the latter to be a more effective alternative. Our extensive empirical evaluation across diverse image and language datasets demonstrates that the infinite learning rate schedule consistently enhances continual pre-training performance compared to a repeated cosine decay without being restricted to a fixed iteration budget. For instance, in a small-scale MAE pre-training setup, it outperforms several strong baselines from the literature. We then scale up our experiments to larger MAE pre-training and autoregressive language model pre-training. Our results show that the infinite learning rate schedule remains effective at scale, surpassing repeated cosine decay for both MAE pre-training and zero-shot LM benchmarks.
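
A minimal sketch contrasting a repeated cosine schedule with an "infinite" schedule (warmup, constant plateau, cooldown only when a checkpoint is needed); the specific plateau and cooldown shapes below are assumptions for illustration and may differ from the paper's schedule.

```python
# Minimal sketch: repeated cosine decay vs. an "infinite" learning rate
# schedule that keeps a constant plateau and only cools down on demand,
# so training is not tied to a fixed iteration budget.
import math

def repeated_cosine(step, cycle_len=1000, lr_max=3e-4, lr_min=3e-5):
    t = (step % cycle_len) / cycle_len
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

def infinite_schedule(step, warmup=100, lr_max=3e-4, lr_const=1.5e-4,
                      cooldown_start=None, cooldown_len=200, lr_min=3e-5):
    if step < warmup:                            # linear warmup
        return lr_max * step / warmup
    if cooldown_start is None or step < cooldown_start:
        return lr_const                          # constant plateau, no fixed budget
    t = min(1.0, (step - cooldown_start) / cooldown_len)
    return lr_const + t * (lr_min - lr_const)    # linear cooldown to a checkpoint

for s in (0, 50, 500, 1500, 2100):
    print(s, round(repeated_cosine(s), 6),
          round(infinite_schedule(s, cooldown_start=2000), 6))
```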

[297] Self-Questioning Language Models

Lili Chen, Mihir Prabhudesai, Katerina Fragkiadaki, Hao Liu, Deepak Pathak

Main category: cs.LG

TL;DR: Self-Questioning Language Models (SQLM) enable LLMs to improve reasoning skills through asymmetric self-play where a proposer generates questions and a solver attempts to answer them, without external data.

DetailsMotivation: To test whether language models can improve their reasoning abilities autonomously by generating and solving their own questions rather than relying on external curated datasets.

Method: Asymmetric self-play framework with proposer and solver components trained via RL. Proposer gets rewarded for generating appropriately difficult questions, solver gets rewarded based on majority voting correctness. For coding, uses unit tests for verification.

Result: The method successfully improves language models on three benchmarks: three-digit multiplication, algebra problems from OMEGA, and programming problems from Codeforces without access to external training data.

Conclusion: Language models can autonomously improve their reasoning skills through self-generated question-answer cycles, demonstrating the potential for self-supervised learning without curated datasets.

Abstract: Can large language models improve without external data – by generating their own questions and answers? We hypothesize that a pre-trained language model can improve its reasoning skills given only a single prompt specifying the topic (e.g., algebra word problems) and asking the model to generate its own questions. To do this, we propose Self-Questioning Language Models (SQLM): an asymmetric self-play framework where a proposer is given the topic and generates a question for a solver, who tries to answer it. Both the proposer and solver are trained via reinforcement learning. The proposer receives a reward if the problem is not too easy or too difficult, and the solver receives a reward based on majority voting, a proxy for correctness in the absence of ground-truth answers. For coding, the proposer can instead generate unit tests which are used for verification. We study this asymmetric self-play framework on three benchmarks: three-digit multiplication, algebra problems from the OMEGA benchmark, and programming problems from Codeforces. By continually generating more interesting problems and attempting to solve them, language models can improve on downstream benchmarks without access to any curated training datasets.
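
The sketch below captures only the reward logic described in the abstract: majority-vote rewards for the solver and an intermediate-difficulty reward for the proposer. The difficulty thresholds are illustrative assumptions, and the RL training loop itself is omitted.

```python
# Minimal sketch of the reward logic in asymmetric self-play: the solver is
# rewarded by agreement with the majority-vote answer (a proxy for
# correctness), and the proposer is rewarded when its question is neither
# too easy nor too hard for the solver. Thresholds are illustrative.
from collections import Counter

def solver_rewards(answers):
    """answers: answers sampled from the solver for one generated question."""
    majority, count = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers], count / len(answers)

def proposer_reward(solve_rate, low=0.2, high=0.8):
    # Reward questions of intermediate difficulty; too easy or too hard earns 0.
    return 1.0 if low <= solve_rate <= high else 0.0

answers = ["112", "112", "121", "112", "211", "112", "112", "121"]
rewards, solve_rate = solver_rewards(answers)
print(rewards, solve_rate, proposer_reward(solve_rate))
```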

[298] Task-based Loss Functions in Computer Vision: A Comprehensive Review

Omar Elharrouss, Yasir Mahmood, Yassine Bechqito, Mohamed Adel Serhani, Elarbi Badidi, Jamal Riffi, Hamid Tairi

Main category: cs.LG

TL;DR: Comprehensive review of loss functions in deep learning, covering fundamental to advanced functions, their mathematical foundations, applications across domains, and future directions.

DetailsMotivation: Loss functions are critical for deep learning performance but selecting the right one is challenging. The paper aims to provide a systematic review to guide researchers and practitioners in choosing appropriate loss functions for various applications.

Method: The paper conducts a comprehensive literature review, analyzing mathematical foundations of loss functions, their impact on model training, and strategic selection criteria. It covers applications in computer vision, tabular data prediction, and time series forecasting.

Result: The review systematically categorizes and explains various loss functions from basic (MSE, Cross-Entropy) to advanced (Adversarial, Diffusion losses), providing insights into their computational efficiency, historical evolution, and performance across different domains.

Conclusion: Loss function design needs more adaptive and robust solutions, especially for complex scenarios like multi-modal data and class imbalances. Future directions should focus on enhancing interpretability, scalability, and generalization for more effective deep learning models.

Abstract: Loss functions are at the heart of deep learning, shaping how models learn and perform across diverse tasks. They are used to quantify the difference between predicted outputs and ground truth labels, guiding the optimization process to minimize errors. Selecting the right loss function is critical, as it directly impacts model convergence, generalization, and overall performance across various applications, from computer vision to time series forecasting. This paper presents a comprehensive review of loss functions, covering fundamental metrics like Mean Squared Error and Cross-Entropy to advanced functions such as Adversarial and Diffusion losses. We explore their mathematical foundations, impact on model training, and strategic selection for various applications, including computer vision (discriminative and generative), tabular data prediction, and time series forecasting. For each of these categories, we discuss the loss functions most used in recent advancements of deep learning techniques. This review also explores the historical evolution, computational efficiency, and ongoing challenges in loss function design, underlining the need for more adaptive and robust solutions. Emphasis is placed on complex scenarios involving multi-modal data, class imbalances, and real-world constraints. Finally, we identify key future directions, advocating for loss functions that enhance interpretability, scalability, and generalization, leading to more effective and resilient deep learning models.

[299] Traversal Learning: A Lossless And Efficient Distributed Learning Framework

Erdenebileg Batbaatar, Jeonggeol Kim, Yongcheol Kim, Young Yoon

Main category: cs.LG

TL;DR: Traversal Learning (TL) is a novel distributed learning approach that addresses accuracy drops in FL, SL, and SFL by implementing centralized learning principles in distributed environments through model traversal and orchestrator-managed virtual batches.

DetailsMotivation: Traditional distributed learning paradigms like Federated Learning, Split Learning, and SplitFed Learning suffer from accuracy degradation due to averaging functions in aggregation and independent gradient updates, leading to decreased model quality.

Method: TL uses a unique strategy where the model traverses nodes during forward propagation while performing backward propagation on the orchestrator. The orchestrator generates virtual batches and plans sequential node visits aligned with data order, effectively implementing centralized learning in distributed settings.

Result: TL matches classic centralized learning in inference accuracy and outperforms other DL methods: 7.85% accuracy improvement for IID datasets, 1.06% macro F1-score improvement for non-IID datasets, 2.60% accuracy improvement for text classification, and 3.88-4.54% AUC improvements for medical/financial datasets.

Conclusion: TL represents a significant advancement in distributed learning by effectively preserving data privacy while maintaining performance comparable to centralized learning, offering a robust solution for various DL tasks across diverse domains.

Abstract: In this paper, we introduce Traversal Learning (TL), a novel approach designed to address the problem of decreased quality encountered in popular distributed learning (DL) paradigms such as Federated Learning (FL), Split Learning (SL), and SplitFed Learning (SFL). Traditional FL experiences an accuracy drop during aggregation due to its averaging function, while SL and SFL face increased loss due to the independent gradient updates on each split network. TL adopts a unique strategy where the model traverses the nodes during forward propagation (FP) and performs backward propagation (BP) on the orchestrator, effectively implementing centralized learning (CL) principles within a distributed environment. The orchestrator is tasked with generating virtual batches and planning the sequential node visits of the model during FP, aligning them with the ordered index of the data within these batches. We conducted experiments on six datasets representing diverse characteristics across various domains. Our evaluation demonstrates that TL is on par with classic CL approaches in terms of accurate inference, thereby offering a viable and robust solution for DL tasks. TL outperformed other DL methods, improving accuracy by 7.85% for independent and identically distributed (IID) datasets, macro F1-score by 1.06% for non-IID datasets, accuracy by 2.60% for text classification, and AUC by 3.88% and 4.54% for medical and financial datasets, respectively. By effectively preserving data privacy while maintaining performance, TL represents a significant advancement in DL methodologies. The implementation of TL is available at https://github.com/neouly-inc/Traversal-Learning

[300] Training Deep Morphological Neural Networks as Universal Approximators

Konstantinos Fotopoulos, Petros Maragos

Main category: cs.LG

TL;DR: Deep morphological neural networks require ’linear’ activations despite their non-linearity. Proposed constrained architectures preserve sparsity through morphological operations, with improved generalization via residual connections and weight dropout. Networks are trainable, more prunable than linear networks, and hybrid architectures accelerate convergence.

DetailsMotivation: To investigate deep morphological neural networks (DMNNs) and demonstrate the essential role of 'linear' activations despite their inherent non-linearity, while preserving sparsity through constrained parameter architectures.

Method: Propose two constrained architectures: one where majority of parameters are morphological operations, another where majority of learnable parameters are morphological. Use residual connections and weight dropout for improved generalization. Develop hybrid networks combining linear and morphological layers.

Result: Successfully trained DMNNs under constraints, showing they are more prunable than linear networks. Hybrid architectures with morphological layers significantly accelerate gradient descent convergence with large batches.

Conclusion: Morphological neural networks can be effectively trained with constrained parameter architectures, offering better pruning capabilities and faster convergence in hybrid configurations with linear layers.

Abstract: We investigate deep morphological neural networks (DMNNs). We demonstrate that despite their inherent non-linearity, “linear” activations are essential for DMNNs. To preserve their inherent sparsity, we propose architectures that constrain the parameters of the “linear” activations: for the first (resp. second) architecture, we work under the constraint that the majority of parameters (resp. learnable parameters) should be part of morphological operations. We improve the generalization ability of our networks via residual connections and weight dropout. Our proposed networks can be successfully trained, and are more prunable than linear networks. To the best of our knowledge, we are the first to successfully train DMNNs under such constraints. Finally, we propose a hybrid network architecture combining linear and morphological layers, showing empirically that the inclusion of morphological layers significantly accelerates the convergence of gradient descent with large batches.
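
A minimal sketch of a morphological (max-plus dilation) layer followed by a learnable "linear" activation, reflecting the abstract's pairing of morphological operations with linear activations; the paper's specific parameter constraints and architectures are not reproduced here.

```python
# Minimal sketch of a morphological layer: a max-plus dilation
# (out_j = max_i (x_i + w_ji)) followed by a learnable affine scale/shift
# playing the role of a "linear" activation.
import torch
import torch.nn as nn

class DilationLayer(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        # "Linear" activation: per-output learnable scale and shift.
        self.scale = nn.Parameter(torch.ones(out_features))
        self.shift = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):                     # x: (batch, in_features)
        # Max-plus dilation over the input dimension.
        out = (x.unsqueeze(1) + self.weight.unsqueeze(0)).amax(dim=-1)
        return self.scale * out + self.shift

layer = DilationLayer(8, 4)
print(layer(torch.randn(2, 8)).shape)         # torch.Size([2, 4])
```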

[301] HOFT: Householder Orthogonal Fine-tuning

Alejandro Moreno Arcas, Albert Sanchis, Jorge Civera, Alfons Juan

Main category: cs.LG

TL;DR: HOFT and SHOFT are novel orthogonal fine-tuning methods that improve time/space efficiency while maintaining or surpassing SOTA performance in various downstream tasks.

DetailsMotivation: Existing orthogonal fine-tuning methods have good generalization but are time and memory inefficient compared to low-rank adaptation methods.

Method: Proposed Householder Orthogonal Fine-tuning (HOFT) to reduce complexity, and Scaled Householder Orthogonal Fine-tuning (SHOFT) based on theoretical exploration of orthogonal fine-tuning properties.

Result: Both HOFT and SHOFT achieve comparable or better results than state-of-the-art adaptation methods in commonsense reasoning, machine translation, subject-driven generation, and mathematical reasoning tasks.

Conclusion: The proposed HOFT and SHOFT methods successfully address efficiency issues in orthogonal fine-tuning while maintaining strong performance across multiple downstream applications.

Abstract: Adaptation of foundation models using low-rank methods is a widespread approach. Another way to adapt these models is to employ orthogonal fine-tuning methods, which are less time and memory efficient despite their good generalization properties. In this work, we propose Householder Orthogonal Fine-tuning (HOFT), a novel orthogonal fine-tuning method that aims to alleviate time and space complexity. Moreover, some theoretical properties of the orthogonal fine-tuning paradigm are explored. From this exploration, Scaled Householder Orthogonal Fine-tuning (SHOFT) is proposed. Both HOFT and SHOFT are evaluated in downstream tasks, namely commonsense reasoning, machine translation, subject-driven generation and mathematical reasoning. Compared with state-of-the-art adaptation methods, HOFT and SHOFT show comparable or better results.
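
A minimal sketch of the basic mechanism, assuming the orthogonal adapter is a product of learnable Householder reflections applied to a frozen weight matrix; the number of reflections and the scaling used in SHOFT are the paper's design choices and are not reproduced here.

```python
# Minimal sketch of Householder-based orthogonal fine-tuning: a frozen weight
# matrix W is adapted by an orthogonal matrix Q built as a product of
# learnable Householder reflections, so only the reflection vectors train.
import torch
import torch.nn as nn

class HouseholderAdapter(nn.Module):
    def __init__(self, dim, n_reflections=4):
        super().__init__()
        self.vs = nn.Parameter(torch.randn(n_reflections, dim) * 0.01)

    def orthogonal(self):
        Q = torch.eye(self.vs.shape[1])
        for v in self.vs:
            v = v / (v.norm() + 1e-8)
            Q = Q - 2.0 * torch.outer(Q @ v, v)   # right-multiply by I - 2 v v^T
        return Q

    def forward(self, W_frozen, x):
        return x @ (W_frozen @ self.orthogonal()).T

W = torch.randn(16, 16)                            # frozen pre-trained weight
adapter = HouseholderAdapter(16)
Q = adapter.orthogonal()
print(torch.allclose(Q @ Q.T, torch.eye(16), atol=1e-5))   # orthogonality check
print(adapter(W, torch.randn(2, 16)).shape)
```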

[302] Learning Fluid-Structure Interaction Dynamics with Physics-Informed Neural Networks and Immersed Boundary Methods

Afrah Farea, Saiful Khan, Reza Daryani, Emre Cenk Ersan, Mustafa Serdar Celebi

Main category: cs.LG

TL;DR: Novel Eulerian-Lagrangian PINN architecture with domain-specific networks and learnable B-spline activations significantly improves accuracy for fluid-structure interaction problems with moving boundaries.

DetailsMotivation: Traditional unified PINN architectures struggle to capture distinct physics in fluid-structure interaction problems with moving boundaries and deformable interfaces, leading to substantial errors particularly in pressure predictions.

Method: Developed an Eulerian-Lagrangian PINN architecture that integrates immersed boundary method principles, using separate neural networks for fluid (Eulerian) and structural (Lagrangian) domains coupled through physics-based constraints, plus learnable B-spline activation functions with SiLU.

Result: EL-L architecture achieved 24.1-91.4% accuracy improvement over baseline PINNs, reducing pressure errors from 12.9% to 2.39% in 2D cavity flow with moving solid structure.

Conclusion: Domain decomposition aligned with physical principles combined with locality-aware activation functions is essential for accurate FSI modeling within the PINN framework.

Abstract: Physics-informed neural networks (PINNs) have emerged as a promising approach for solving complex fluid dynamics problems, yet their application to fluid-structure interaction (FSI) problems with moving boundaries remains largely unexplored. This work addresses the critical challenge of modeling FSI systems with deformable interfaces, where traditional unified PINN architectures struggle to capture the distinct physics governing fluid and structural domains simultaneously. We present an innovative Eulerian-Lagrangian PINN architecture that integrates immersed boundary method (IBM) principles to solve FSI problems with moving boundary conditions. Our approach fundamentally departs from conventional unified architectures by introducing domain-specific neural networks: an Eulerian network for fluid dynamics and a Lagrangian network for structural interfaces, coupled through physics-based constraints. Additionally, we incorporate learnable B-spline activation functions with SiLU to capture both localized high-gradient features near interfaces and global flow patterns. Empirical studies on a 2D cavity flow problem involving a moving solid structure show that while baseline unified PINNs achieve reasonable velocity predictions, they suffer from substantial pressure errors (12.9%) in structural regions. Our Eulerian-Lagrangian architecture with learnable activations (EL-L) achieves better performance across all metrics, improving accuracy by 24.1-91.4% and particularly reducing pressure errors from 12.9% to 2.39%. These results demonstrate that domain decomposition aligned with physical principles, combined with locality-aware activation functions, is essential for accurate FSI modeling within the PINN framework.

[303] A Certified Unlearning Approach without Access to Source Data

Umit Yigit Basaran, Sk Miraj Ahmed, Amit Roy-Chowdhury, Basak Guler

Main category: cs.LG

TL;DR: Certified unlearning framework for data removal without access to original training data, using surrogate datasets and statistical distance-based noise scaling.

DetailsMotivation: Address the challenge of data privacy regulations requiring model unlearning when original training data is unavailable, overcoming limitations of traditional methods that assume full dataset access.

Method: Uses surrogate dataset approximating source data statistics, applies controlled noise scaling based on statistical distance between datasets, with theoretical bounds and practical noise calibration techniques.

Result: Effective data removal demonstrated through experiments on synthetic and real-world datasets, providing strong privacy guarantees while maintaining model utility.

Conclusion: Proposed framework enables reliable certified unlearning without original data access, making it practical for privacy-sensitive applications where source data is unavailable.

Abstract: With the growing adoption of data privacy regulations, the ability to erase private or copyrighted information from trained models has become a crucial requirement. Traditional unlearning methods often assume access to the complete training dataset, which is unrealistic in scenarios where the source data is no longer available. To address this challenge, we propose a certified unlearning framework that enables effective data removal without access to the original training data samples. Our approach utilizes a surrogate dataset that approximates the statistical properties of the source data, allowing for controlled noise scaling based on the statistical distance between the two. While our theoretical guarantees assume knowledge of the exact statistical distance, practical implementations typically approximate this distance, resulting in potentially weaker but still meaningful privacy guarantees. This ensures strong guarantees on the model’s behavior post-unlearning while maintaining its overall utility. We establish theoretical bounds, introduce practical noise calibration techniques, and validate our method through extensive experiments on both synthetic and real-world datasets. The results demonstrate the effectiveness and reliability of our approach in privacy-sensitive settings.

[304] Rescaled Influence Functions: Accurate Data Attribution in High Dimension

Ittai Rubinstein, Samuel B. Hopkins

Main category: cs.LG

TL;DR: RIF (Rescaled Influence Functions) improves upon traditional IF by providing more accurate data attribution with minimal computational overhead, especially in high-dimensional settings.

DetailsMotivation: Influence functions (IF) are widely used for data attribution but tend to underestimate sample removal effects in high-dimensional regimes, limiting their practical effectiveness.

Method: Proposes rescaled influence functions (RIF) as a drop-in replacement for IF that maintains computational efficiency while improving accuracy through better scaling.

Result: RIF demonstrates significantly better prediction accuracy compared to IF across real-world datasets and can detect data poisoning attacks that fool IF-based methods.

Conclusion: RIF provides a practical, computationally efficient improvement over traditional influence functions for data attribution tasks, with both theoretical justification and empirical validation.

Abstract: How does the training data affect a model’s behavior? This is the question we seek to answer with data attribution. The leading practical approaches to data attribution are based on influence functions (IF). IFs utilize a first-order Taylor approximation to efficiently predict the effect of removing a set of samples from the training set without retraining the model, and are used in a wide variety of machine learning applications. However, especially in the high-dimensional regime (where the number of parameters is at least on the order of the number of samples), they are often imprecise and tend to underestimate the effect of sample removals, even for simple models such as logistic regression. We present rescaled influence functions (RIF), a new tool for data attribution which can be used as a drop-in replacement for influence functions, with little computational overhead but significant improvement in accuracy. We compare IF and RIF on a range of real-world datasets, showing that RIFs offer significantly better predictions in practice, and present a theoretical analysis explaining this improvement. Finally, we present a simple class of data poisoning attacks that would fool IF-based detections but would be detected by RIF.
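
For context, the sketch below implements the classical influence-function estimate that RIF improves on, for L2-regularized logistic regression: the effect of removing one sample is approximated by a single inverse-Hessian step. The rescaling that defines RIF itself is not reproduced here.

```python
# Minimal sketch of the classical influence-function (IF) estimate for
# removing one sample from L2-regularized logistic regression, compared
# against actual retraining. RIF replaces this estimate with a rescaled one.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1e-2
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(float)

def fit(Xs, ys, iters=500, lr=0.5):
    w = np.zeros(d)
    for _ in range(iters):                       # plain gradient descent
        p = 1 / (1 + np.exp(-Xs @ w))
        w -= lr * (Xs.T @ (p - ys) / len(ys) + lam * w)
    return w

w = fit(X, y)
p = 1 / (1 + np.exp(-X @ w))
H = X.T @ (X * (p * (1 - p))[:, None]) / n + lam * np.eye(d)   # Hessian

i = 7                                                           # sample to remove
grad_i = (p[i] - y[i]) * X[i]                                   # its loss gradient
w_if = w + np.linalg.solve(H, grad_i) / n                       # IF approximation

w_retrain = fit(np.delete(X, i, 0), np.delete(y, i, 0))
print("IF estimate error:", np.linalg.norm(w_if - w_retrain))
```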

[305] KLLM: Fast LLM Inference with K-Means Quantization

Xueying Wu, Baijun Zhou, Zhihui Gao, Yuzhe Fu, Qilin Zheng, Yintao He, Hai Li

Main category: cs.LG

TL;DR: KLLM is an LLM inference accelerator that enables efficient execution with K-Means-quantized weights and activations through index-based computation and lightweight outlier detection.

DetailsMotivation: Large language model inference faces memory and computation challenges. While K-Means quantization offers higher accuracy than uniform quantization, its non-uniform structure prevents direct hardware execution and activation outliers hinder effective low-precision quantization.

Method: Proposes KLLM accelerator with index-based computation scheme for efficient MatMuls on K-Means-quantized data, avoiding dequantization. Includes Orizuru outlier detection engine for online identification of top-k largest/smallest activation elements.

Result: Enables efficient deployment of K-Means-based weight and activation quantization for LLM inference by overcoming hardware execution barriers and outlier detection challenges.

Conclusion: KLLM fully unleashes the potential of K-Means quantization for LLM inference by providing hardware-efficient computation and lightweight online outlier detection, addressing key deployment challenges.

Abstract: Large language model (LLM) inference poses significant challenges due to its intensive memory and computation demands. Weight and activation quantization (WAQ) offers a promising solution by reducing both memory footprint and arithmetic complexity. Traditional WAQ designs rely on uniform integer quantization for hardware efficiency, but often suffer from significant model performance degradation at low precision. In contrast, K-Means quantization, a non-uniform technique, achieves higher accuracy by aligning with the Gaussian-like distributions of weights and activations in LLMs. However, two key challenges prevent the efficient deployment of K-Means-based WAQ designs for LLM inference: (1) The non-uniform structure of K-Means-quantized data precludes direct execution on low-precision compute units, necessitating dequantization and floating-point matrix multiplications (MatMuls) during inference. (2) Activation outliers hinder effective low-precision quantization. Offline thresholding methods for outlier detection degrade model performance substantially, while existing online detection techniques introduce significant runtime overhead. To address the aforementioned challenges and fully unleash the potential of K-Means-based WAQ for LLM inference, in this paper, we propose KLLM, an LLM inference accelerator for efficient execution with K-Means-quantized weights and activations. KLLM features an index-based computation scheme for efficient execution of MatMuls and nonlinear operations on K-Means-quantized data, which avoids most of the dequantization and full-precision computations. Moreover, KLLM incorporates a lightweight outlier detection engine, Orizuru, that efficiently identifies the top-$k$ largest and smallest elements in the activation data stream during online inference.
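
A minimal software analogue of the idea, assuming per-row K-Means weight quantization: activations are accumulated per centroid index so the dot product needs only one multiply per centroid instead of per-weight dequantization. The actual KLLM dataflow and its Orizuru outlier handling are hardware-level and more involved.

```python
# Minimal sketch of K-Means weight quantization and an index-based dot
# product: weights become centroid indices, and a row's contribution is
# computed by summing activations per index, then one multiply per centroid.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=256)                       # one weight row
x = rng.normal(size=256)                       # activation vector

k = 4                                          # 2-bit quantization: 4 centroids
centroids = np.quantile(w, [0.125, 0.375, 0.625, 0.875])   # init codebook
for _ in range(20):                            # 1-D Lloyd (K-Means) iterations
    idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
    for j in range(k):
        if np.any(idx == j):
            centroids[j] = w[idx == j].mean()

# Index-based dot product: accumulate activations falling on each index,
# then do one multiply per centroid instead of one per weight.
sums = np.zeros(k)
np.add.at(sums, idx, x)
y_indexed = float(sums @ centroids)

y_dequant = float((centroids[idx] * x).sum())  # naive dequantize-then-multiply
print(np.isclose(y_indexed, y_dequant), y_indexed)
```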

[306] Predicting the Performance of Graph Convolutional Networks with Spectral Properties of the Graph Laplacian

Shalima Binta Manir, Tim Oates

Main category: cs.LG

TL;DR: The Fiedler value (algebraic connectivity) predicts GCN performance - graphs with similar Fiedler values have analogous structural properties and respond similarly to GCN filters and hyperparameters.

DetailsMotivation: Stacking GCN layers doesn't consistently improve performance on tasks like node classification and edge prediction, suggesting the need for better predictors of GCN effectiveness.

Method: Theoretical and empirical exploration using synthetic and real graph data (Cora, CiteSeer, Polblogs), with multiple aggregation methods for Fiedler values across connected components.

Result: Fiedler value is a good predictor of GCN performance, enabling better hyperparameter selection and indicating transfer learning potential between graphs with similar algebraic connectivity.

Conclusion: Algebraic connectivity serves as an effective predictor for GCN performance, providing insights for filter design, hyperparameter tuning, and transfer learning across graphs with similar structural properties.

Abstract: A common observation in the Graph Convolutional Network (GCN) literature is that stacking GCN layers may or may not result in better performance on tasks like node classification and edge prediction. We have found empirically that a graph’s algebraic connectivity, which is known as the Fiedler value, is a good predictor of GCN performance. Intuitively, graphs with similar Fiedler values have analogous structural properties, suggesting that the same filters and hyperparameters may yield similar results when used with GCNs, and that transfer learning may be more effective between graphs with similar algebraic connectivity. We explore this theoretically and empirically with experiments on synthetic and real graph data, including the Cora, CiteSeer and Polblogs datasets. We explore multiple ways of aggregating the Fiedler value for connected components in the graphs to arrive at a value for the entire graph, and show that it can be used to predict GCN performance. We also present theoretical arguments as to why the Fiedler value is a good predictor.
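
A minimal sketch of the quantity in question: the Fiedler value is the second-smallest eigenvalue of the graph Laplacian, computed here per connected component and aggregated by a mean; the mean is only one of the aggregation choices the paper compares.

```python
# Minimal sketch: compute the Fiedler value (algebraic connectivity) per
# connected component and aggregate across components.
import numpy as np
import networkx as nx

def fiedler_values(G):
    vals = []
    for nodes in nx.connected_components(G):
        sub = G.subgraph(nodes)
        if sub.number_of_nodes() < 2:
            continue
        L = nx.laplacian_matrix(sub).toarray().astype(float)
        eig = np.sort(np.linalg.eigvalsh(L))
        vals.append(eig[1])                 # second-smallest Laplacian eigenvalue
    return vals

G = nx.karate_club_graph()
vals = fiedler_values(G)
print("per-component Fiedler values:", vals, "mean:", np.mean(vals))
```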

[307] Metis: Training Large Language Models with Advanced Low-Bit Quantization

Hengjie Cao, Mengyi Chen, Yifeng Yang, Ruijun Huang, Fang Dong, Jixian Zhou, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Yuan Cheng, Fan Wu, Fan Yang, Tun Lu, Ning Gu, Li Shang

Main category: cs.LG

TL;DR: Metis enables stable low-bit quantization (FP8/FP4) training for LLMs by addressing anisotropic parameter distributions through spectral decomposition, adaptive learning rates, and dual-range regularization.

DetailsMotivation: Anisotropic parameter distributions in LLMs create wide numerical ranges that conflict with block-wise quantization bias, causing training instability and poor performance when using low-bit quantization.

Method: Combines spectral decomposition with random embedding to compress distributions, adaptive learning rates in spectral domain to amplify underrepresented features, and dual-range regularizer to constrain numerical precision and parameter range distribution.

Result: FP8 training surpasses FP32 baselines, and FP4 training achieves accuracy comparable to FP32, enabling robust low-bit quantization for LLM training.

Conclusion: Metis provides a framework for stable and scalable LLM training under advanced low-bit quantization, overcoming fundamental barriers posed by anisotropic parameter distributions.

Abstract: This work identifies anisotropic parameter distributions as a fundamental barrier to training large language models (LLMs) with low-bit quantization: a few dominant singular values create wide numerical ranges that conflict with the inherent bias of block-wise quantization. This bias disproportionately preserves high-magnitude values while discarding smaller ones, causing training instability and low model performance. This work introduces Metis, a training framework that combines (i) spectral decomposition with random embedding to efficiently disentangle dominant from long-tail components, compressing broad distributions into quantization-friendly narrow ranges; (ii) adaptive learning rates in the spectral domain to amplify underrepresented directions and better capture diverse features critical for performance; and (iii) a dual-range regularizer that jointly constrains numerical precision and parameter range distribution, ensuring stable, unbiased low-bit training. With Metis, FP8 training surpasses FP32 baselines, and FP4 training achieves accuracy comparable to FP32, paving the way for robust and scalable LLM training under advanced low-bit quantization. The code implementation for Metis is available at: https://github.com/sii-research/Metis.

[308] RoFt-Mol: Benchmarking Robust Fine-Tuning with Molecular Graph Foundation Models

Shikun Liu, Deyu Zou, Nima Shoghi, Victor Fung, Kai Liu, Pan Li

Main category: cs.LG

TL;DR: This paper analyzes fine-tuning methods for molecular graph foundation models (MGFMs), classifies eight approaches into three categories, benchmarks them on regression/classification tasks, and proposes ROFT-MOL - a robust fine-tuning method combining weight interpolation and ensemble techniques.

DetailsMotivation: Molecular graph foundation models face unique challenges including smaller pre-training datasets, severe downstream data scarcity, and diverse objectives (regression/classification tasks), requiring enhanced generalization and robust fine-tuning methods.

Method: Classified eight fine-tuning methods into three mechanisms (weight-based, representation-based, partial fine-tuning), benchmarked them on downstream tasks across supervised and self-supervised pre-trained models, then designed ROFT-MOL combining weight interpolation with weight ensemble techniques.

Result: Extensive evaluation provided insights leading to ROFT-MOL, which delivers improved performance across both regression and classification tasks while maintaining ease of use from post-hoc weight interpolation.

Conclusion: ROFT-MOL effectively addresses the unique fine-tuning challenges in molecular graph foundation models by combining the strengths of different fine-tuning approaches, offering robust performance across diverse task types and labeling settings.

Abstract: In the era of foundation models, fine-tuning pre-trained models for specific downstream tasks has become crucial. This drives the need for robust fine-tuning methods to address challenges such as model overfitting and sparse labeling. Molecular graph foundation models (MGFMs) face unique difficulties that complicate fine-tuning. These models are limited by smaller pre-training datasets and more severe data scarcity for downstream tasks, both of which require enhanced model generalization. Moreover, MGFMs must accommodate diverse objectives, including both regression and classification tasks. To better understand and improve fine-tuning techniques under these conditions, we classify eight fine-tuning methods into three mechanisms: weight-based, representation-based, and partial fine-tuning. We benchmark these methods on downstream regression and classification tasks across supervised and self-supervised pre-trained models in diverse labeling settings. This extensive evaluation provides valuable insights and informs the design of a refined robust fine-tuning method, ROFT-MOL. This approach combines the strengths of simple post-hoc weight interpolation with more complex weight ensemble fine-tuning methods, delivering improved performance across both task types while maintaining the ease of use inherent in post-hoc weight interpolation.
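
For concreteness, here is a minimal sketch (parameter names are illustrative, not ROFT-MOL's API) of the post-hoc weight interpolation ingredient mentioned above, which averages a pre-trained and a fine-tuned checkpoint parameter by parameter:

```python
def interpolate_state_dicts(pretrained, finetuned, alpha=0.5):
    """Post-hoc weight interpolation over two state dicts (e.g. torch tensors):
    alpha * finetuned + (1 - alpha) * pretrained, key by key."""
    return {k: alpha * finetuned[k] + (1.0 - alpha) * pretrained[k] for k in pretrained}

# Usage sketch with hypothetical pre-trained / fine-tuned models of the same architecture:
# merged = interpolate_state_dicts(model_pt.state_dict(), model_ft.state_dict(), alpha=0.5)
# model_ft.load_state_dict(merged)
```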

[309] Second-Order Tensorial Partial Differential Equations on Graphs

Aref Einizade, Fragkiskos D. Malliaros, Jhony H. Giraldo

Main category: cs.LG

TL;DR: Second-order tensorial PDEs on graphs for continuous product graph neural networks that preserve high-frequency signals and enable efficient spectral decomposition.

DetailsMotivation: Existing approaches for processing multiple interacting graphs rely on discrete filtering or first-order continuous models that dampen high frequencies and propagate information slowly, limiting their effectiveness.

Method: Introduces second-order tensorial partial differential equations on graphs (So-TPDEGs) that exploit the separability of cosine kernels in Cartesian product graphs for efficient spectral decomposition.

Result: The framework preserves high-frequency signals while enabling efficient computation through spectral methods.

Conclusion: Provides a theoretically grounded foundation for second-order continuous graph learning with rigorous analyses of stability under graph perturbations and over-smoothing.

Abstract: Processing data on multiple interacting graphs is crucial for many applications, but existing approaches rely mostly on discrete filtering or first-order continuous models that dampen high frequencies and propagate information slowly. We introduce second-order tensorial partial differential equations on graphs (So-TPDEGs) and propose the first theoretically grounded framework for second-order continuous product graph neural networks. Our method exploits the separability of cosine kernels in Cartesian product graphs to enable efficient spectral decomposition while preserving high-frequency signals. We further provide rigorous analyses of stability under graph perturbations and over-smoothing, establishing a solid theoretical foundation for continuous graph learning.
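
As a rough orientation (a schematic special case, not the paper's full tensorial formulation on Cartesian product graphs), a second-order, wave-type PDE on a single graph with Laplacian \(\mathbf{L} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{\top}\) and zero initial velocity reads

\[
\frac{\partial^2 \mathbf{X}(t)}{\partial t^2} = -\mathbf{L}\,\mathbf{X}(t),
\qquad
\mathbf{X}(t) = \cos\!\big(t\sqrt{\mathbf{L}}\big)\,\mathbf{X}(0)
= \mathbf{U}\cos\!\big(t\sqrt{\boldsymbol{\Lambda}}\big)\mathbf{U}^{\top}\mathbf{X}(0),
\]

so each graph frequency \(\lambda\) oscillates as \(\cos(t\sqrt{\lambda})\) rather than decaying as \(e^{-t\lambda}\) under a first-order heat-type model, which is why high-frequency signal content is preserved.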

[310] CAME-AB: Cross-Modality Attention with Mixture-of-Experts for Antibody Binding Site Prediction

Hongzong Li, Jiahao Ma, Zhanpeng Shi, Rui Xiao, Fanming Jin, Ye-Fan Hu, Jian-Dong Huang

Main category: cs.LG

TL;DR: CAME-AB is a novel cross-modality attention framework with Mixture-of-Experts backbone for antibody binding site prediction, integrating five biological modalities and outperforming existing methods.

DetailsMotivation: Existing antibody binding site prediction methods rely on single-view features and fail to identify antibody-specific binding sites on antigens effectively.

Method: Integrates five biological modalities (amino acid encodings, BLOSUM profiles, language model embeddings, structure features, GCN-refined graphs) with adaptive modality fusion, Transformer encoder, MoE module, and supervised contrastive learning with stochastic weight averaging.

Result: Outperforms strong baselines on multiple metrics including Precision, Recall, F1-score, AUC-ROC, and MCC on benchmark antibody-antigen datasets.

Conclusion: CAME-AB provides robust antibody binding site prediction through effective multimodal integration and advanced architectural components, with ablation studies validating each component’s effectiveness.

Abstract: Antibody binding site prediction plays a pivotal role in computational immunology and therapeutic antibody design. Existing sequence- or structure-based methods rely on single-view features and fail to identify antibody-specific binding sites on the antigens. In this paper, we propose CAME-AB, a novel Cross-modality Attention framework with a Mixture-of-Experts (MoE) backbone for robust antibody binding site prediction. CAME-AB integrates five biologically grounded modalities, including raw amino acid encodings, BLOSUM substitution profiles, pretrained language model embeddings, structure-aware features, and GCN-refined biochemical graphs, into a unified multimodal representation. To enhance adaptive cross-modal reasoning, we propose an adaptive modality fusion module that learns to dynamically weight each modality based on its global relevance and input-specific contribution. A Transformer encoder combined with an MoE module further promotes feature specialization and capacity expansion. We additionally incorporate a supervised contrastive learning objective to explicitly shape the latent space geometry, encouraging intra-class compactness and inter-class separability. To improve optimization stability and generalization, we apply stochastic weight averaging during training. Extensive experiments on benchmark antibody-antigen datasets demonstrate that CAME-AB consistently outperforms strong baselines on multiple metrics, including Precision, Recall, F1-score, AUC-ROC, and MCC. Ablation studies further validate the effectiveness of each architectural component and the benefit of multimodal feature integration. The model implementation details and the code are available at https://anonymous.4open.science/r/CAME-AB-C525
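
A small illustrative sketch (not the authors' implementation; dimensions and module names are assumed) of what an adaptive modality fusion step can look like, with a learned gate producing softmax weights over the five modality embeddings:

```python
import torch
import torch.nn as nn

class AdaptiveModalityFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 1)  # scores each modality embedding

    def forward(self, modality_embeddings):                      # (batch, M, dim)
        scores = self.gate(modality_embeddings).squeeze(-1)      # (batch, M)
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)    # (batch, M, 1)
        return (weights * modality_embeddings).sum(dim=1)        # (batch, dim)

fusion = AdaptiveModalityFusion(dim=128)
fused = fusion(torch.randn(2, 5, 128))   # five modality embeddings per sample
print(fused.shape)                       # torch.Size([2, 128])
```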

[311] RoseCDL: Robust and Scalable Convolutional Dictionary Learning for Rare-event Detection

Jad Yehya, Mansour Benbakoura, Cédric Allain, Benoît Malezieux, Matthieu Kowalski, Thomas Moreau

Main category: cs.LG

TL;DR: RoseCDL is a scalable and robust convolutional dictionary learning algorithm for unsupervised rare event detection in large signals, addressing computational efficiency and outlier sensitivity challenges.

DetailsMotivation: Convolutional Dictionary Learning (CDL) shows promise for pattern recognition but faces computational limitations and sensitivity to artifacts when detecting rare events in large-scale signals across astronomy, physics, and biomedical fields.

Method: RoseCDL combines stochastic windowing for efficient training on large datasets with inline outlier detection to enhance robustness and isolate anomalous patterns.

Result: The method reframes CDL as a practical tool for event discovery and characterization, extending its capabilities beyond traditional compression and denoising tasks.

Conclusion: RoseCDL provides a scalable and robust solution for unsupervised rare event detection in long signals, making CDL more applicable to real-world signal analysis challenges.

Abstract: Identifying recurring patterns and rare events in large-scale signals is a fundamental challenge in fields such as astronomy, physical simulations, and biomedical science. Convolutional Dictionary Learning (CDL) offers a powerful framework for modeling local structures in signals, but its use for detecting rare or anomalous events remains largely unexplored. In particular, CDL faces two key challenges in this setting: high computational cost and sensitivity to artifacts and outliers. In this paper, we introduce RoseCDL, a scalable and robust CDL algorithm designed for unsupervised rare event detection in long signals. RoseCDL combines stochastic windowing for efficient training on large datasets with inline outlier detection to enhance robustness and isolate anomalous patterns. This reframes CDL as a practical tool for event discovery and characterization in real-world signals, extending its role beyond traditional tasks like compression or denoising.
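
A minimal sketch of the stochastic-windowing idea in isolation (names are illustrative; the actual algorithm couples this sampling with convolutional sparse coding and inline outlier detection):

```python
import numpy as np

def sample_windows(signal, window_len, batch_size, rng):
    """Draw random windows from a long 1-D signal for one stochastic dictionary update."""
    starts = rng.integers(0, len(signal) - window_len, size=batch_size)
    return np.stack([signal[s:s + window_len] for s in starts])

rng = np.random.default_rng(0)
signal = rng.standard_normal(1_000_000)          # stand-in for a long recording
batch = sample_windows(signal, window_len=2048, batch_size=32, rng=rng)
print(batch.shape)                               # (32, 2048)
```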

cs.MA

[312] Teamwork as Linear Interpersonal Dynamics

Andrew Jun Lee, Grace Qiyuan Miao, Rick Dale, Alexia Galati, Hongjing Lu

Main category: cs.MA

TL;DR: The paper introduces the context matrix as a unified representation for interpersonal dynamics that captures both synchrony and directional influence, validated through simulations and eye-tracking data.

DetailsMotivation: Existing measures of interpersonal dynamics capture only single dimensions (either synchrony/coordination or directional influence), lacking a psychologically meaningful unified representation.

Method: Proposed the context matrix as a linear dynamical system transition matrix, developed a sequential Bayesian model to infer context matrices from timeseries data, and validated it through noisy simulations and human eye-tracking data.

Result: The model accurately recovered context matrices in simulations, and summary features captured task-based differences in interpersonal dynamics, predicted task accuracy, and showed correspondence with existing measures like CRQA and Granger causality.

Conclusion: The context matrix provides a psychologically meaningful unified representation of interpersonal dynamics and fits within a broader agenda for modeling team coordination and influence.

Abstract: Successful teamwork depends on interpersonal dynamics, the ways in which individuals coordinate, influence, and adapt to one another over time. Existing measures of interpersonal dynamics, such as CRQA, correlation, Granger causality, and transfer entropy, typically capture only a single dimension: either the synchrony/coordination or the direction of influence between individuals. What is missing is a psychologically meaningful representation that unifies these dimensions and varies systematically with behavior. We propose the context matrix as one such representation. The context matrix is the transition matrix in a linear dynamical system, with entries specifying how much each individual’s current behavior is attributable to their own versus every other group member’s past behaviors. Its values can be distilled into psychologically interpretable summary features of synchrony and directional influence. Evidence for the context matrix as psychologically meaningful is provided in two steps. First, we develop a sequential Bayesian model that infers context matrices from timeseries data and show that it accurately recovers them in noisy simulations. Second, applying the model to human eyetracking data, we show that summary features of the inferred context matrices capture expected task-based differences in interpersonal dynamics (or lack thereof), predict task accuracy in psychologically reasonable ways, and show some correspondence with existing measures (CRQA and Granger causality). We conclude by situating the context matrix within a broader agenda for modeling interpersonal dynamics.
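
In equation form (schematic, following the abstract's description), the context matrix \(\mathbf{A}\) is the transition matrix of a linear dynamical system

\[
\mathbf{x}_t = \mathbf{A}\,\mathbf{x}_{t-1} + \boldsymbol{\varepsilon}_t,
\]

where \(\mathbf{x}_t\) stacks each group member's behavior at time \(t\); the entry \(A_{ij}\) quantifies how much member \(i\)'s current behavior is attributable to member \(j\)'s past behavior, diagonal entries capture self-persistence, and the pattern of off-diagonal entries yields the synchrony and directional-influence summary features.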

[313] Synergy Over Spiral: A Logistics 5.0 Game-Theoretic Model for Trust-Fatigue Co-regulation in Human-Cobot Order Picking

Soumyadeep Dhar, Ariyan Kumar Saha

Main category: cs.MA

TL;DR: This paper develops a Stackelberg game model for human-cobot collaboration in logistics, showing that accounting for trust and fatigue can increase productivity by 100% and reduce trust recovery time by 75% compared to naive approaches.

DetailsMotivation: To address the challenges of human-robot symbiosis in Logistics 5.0 by investigating how trust and fatigue impact collaborative order picking performance.

Method: Proposed a dynamic leader-follower Stackelberg game with utility functions that explicitly model human fatigue and trust, validated through agent-based simulations.

Result: Refined trust model created a “trust synergy cycle” increasing productivity by nearly 100%, and Trust-Recovery Mode reduced trust recovery time by over 75% after disruptions.

Conclusion: Provides a framework for designing intelligent cobot behaviors that support Industry 5.0 pillars of human-centricity, sustainability, and resilience in logistics operations.

Abstract: This paper investigates the critical role of trust and fatigue in human-cobot collaborative order picking, framing the challenge within the scope of Logistics 5.0: the implementation of human-robot symbiosis in smart logistics. We propose a dynamic, leader-follower Stackelberg game to model this interaction, where utility functions explicitly account for human fatigue and trust. Through agent-based simulations, we demonstrate that while a naive model leads to a “trust death spiral,” a refined trust model creates a “trust synergy cycle,” increasing productivity by nearly 100 percent. Finally, we show that a cobot operating in a Trust-Recovery Mode can overcome system brittleness after a disruption, reducing trust recovery time by over 75 percent compared to a non-adaptive model. Our findings provide a framework for designing intelligent cobot behaviors that fulfill the Industry 5.0 pillars of human-centricity, sustainability, and resilience.

cs.MM

[314] Memory-Anchored Multimodal Reasoning for Explainable Video Forensics

Chen Chen, Runze Li, Zejun Zhang, Pukun Zhao, Fanqing Zhou, Longxiang Wang, Haojian Huang

Main category: cs.MM

TL;DR: FakeHunter is a multimodal deepfake detection framework that combines memory-guided retrieval with structured reasoning and adaptive forensic tools to achieve both robustness and interpretability.

DetailsMotivation: Address the need for deepfake detection systems that are both robust against sophisticated manipulations and provide interpretable explanations rather than opaque scores.

Method: Uses CLIP for visual and CLAP for audio representations to retrieve authentic exemplars from memory, employs Observation-Thought-Action reasoning loop, and selectively triggers fine-grained forensic tools when confidence is low.

Result: Outperforms strong multimodal baselines, with ablation studies confirming both contextual retrieval and selective tool activation are essential for improved robustness and explanatory precision.

Conclusion: The proposed framework successfully combines retrieval-augmented reasoning with adaptive forensic analysis to achieve state-of-the-art performance while providing interpretable detection results.

Abstract: We address multimodal deepfake detection requiring both robustness and interpretability by proposing FakeHunter, a unified framework that combines memory-guided retrieval, a structured Observation-Thought-Action reasoning loop, and adaptive forensic tool invocation. Visual representations from a Contrastive Language-Image Pretraining (CLIP) model and audio representations from a Contrastive Language-Audio Pretraining (CLAP) model retrieve semantically aligned authentic exemplars from a large-scale memory, providing contextual anchors that guide iterative localization and explanation of suspected manipulations. Under low internal confidence, the framework selectively triggers fine-grained analyses such as spatial region zoom and mel spectrogram inspection to gather discriminative evidence instead of relying on opaque marginal scores. We also release X-AVFake, a comprehensive audio-visual forgery benchmark with fine-grained annotations of manipulation type, affected region or entity, reasoning category, and explanatory justification, designed to stress contextual grounding and explanation fidelity. Extensive experiments show that FakeHunter surpasses strong multimodal baselines, and ablation studies confirm that both contextual retrieval and selective tool activation are indispensable for improved robustness and explanatory precision.
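
A minimal sketch of the retrieval step (the embeddings below are random stand-ins; the actual system uses CLIP/CLAP features over a large curated memory of authentic clips):

```python
import numpy as np

def retrieve_exemplars(query_emb, memory_embs, k=5):
    """Indices of the k most similar memory entries by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    m = memory_embs / np.linalg.norm(memory_embs, axis=1, keepdims=True)
    return np.argsort(-(m @ q))[:k]

memory = np.random.default_rng(0).standard_normal((10_000, 512))  # stand-in for CLIP/CLAP embeddings
query = np.random.default_rng(1).standard_normal(512)
print(retrieve_exemplars(query, memory, k=3))
```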

eess.AS

[315] A Bottom-up Framework with Language-universal Speech Attribute Modeling for Syllable-based ASR

Hao Yen, Pin-Jui Ku, Sabato Marco Siniscalchi, Chin-Hui Lee

Main category: eess.AS

TL;DR: Bottom-up ASR framework that uses articulatory attributes as universal pronunciation representation, then converts to syllables, showing competitive performance and better robustness in low-resource settings with strong cross-lingual transfer.

DetailsMotivation: To develop a more robust and language-universal approach for syllable-based language ASR that can handle pronunciation variations and work well in low-resource scenarios with cross-lingual transfer capabilities.

Method: Two-stage framework: 1) Recognize sequences of articulatory attributes as universal pronunciation representation, 2) Transform attributes into syllables through structured knowledge integration. Introduces PrER and SHER metrics for evaluation.

Result: Achieves competitive performance on AISHELL-1 Mandarin corpus, better robustness in low-resource conditions compared to direct syllable prediction. 40% error rate reduction in zero-shot cross-lingual transfer to Japanese over character/phoneme baselines.

Conclusion: The bottom-up approach using articulatory attributes provides effective language-universal representation for syllable-based ASR, enabling robust performance and strong cross-lingual transfer capabilities.

Abstract: We propose a bottom-up framework for automatic speech recognition (ASR) in syllable-based languages by unifying language-universal articulatory attribute modeling with syllable-level prediction. The system first recognizes sequences or lattices of articulatory attributes that serve as a language-universal, interpretable representation of pronunciation, and then transforms them into syllables through a structured knowledge integration process. We introduce two evaluation metrics, namely Pronunciation Error Rate (PrER) and Syllable Homonym Error Rate (SHER), to evaluate the model’s ability to capture pronunciation and handle syllable ambiguities. Experimental results on the AISHELL-1 Mandarin corpus demonstrate that the proposed bottom-up framework achieves competitive performance and exhibits better robustness under low-resource conditions compared to the direct syllable prediction model. Furthermore, we investigate the zero-shot cross-lingual transferability on Japanese and demonstrate significant improvements over character- and phoneme-based baselines by 40% error rate reduction.

[316] Context-Aware Query Refinement for Target Sound Extraction: Handling Partially Matched Queries

Ryo Sato, Chiho Haruta, Nobuhiko Hiruma, Keisuke Imoto

Main category: eess.AS

TL;DR: Proposes context-aware query refinement for target sound extraction to handle partially matched queries containing both active and inactive sounds in the mixture.

DetailsMotivation: Real-world TSE scenarios often involve queries with inactive sounds not present in the mixture, particularly in Partially Matched Query conditions where performance degradation has been overlooked.

Method: Context-aware query refinement that eliminates inactive classes from queries during inference based on estimated sound class activity.

Result: Conventional methods suffer performance degradation under PMQ conditions, but proposed method effectively mitigates this and achieves high robustness across diverse query conditions.

Conclusion: The context-aware query refinement approach successfully addresses the PMQ problem in TSE, providing robust performance when queries contain both active and inactive sounds.

Abstract: Target sound extraction (TSE) is the task of extracting a target sound specified by a query from an audio mixture. Much prior research has focused on the problem setting under the Fully Matched Query (FMQ) condition, where the query specifies only active sounds present in the mixture. However, in real-world scenarios, queries may include inactive sounds that are not present in the mixture. This leads to scenarios such as the Fully Unmatched Query (FUQ) condition, where only inactive sounds are specified in the query, and the Partially Matched Query (PMQ) condition, where both active and inactive sounds are specified. Among these conditions, the performance degradation under the PMQ condition has been largely overlooked. To achieve robust TSE under the PMQ condition, we propose context-aware query refinement. This method eliminates inactive classes from the query during inference based on the estimated sound class activity. Experimental results demonstrate that while conventional methods suffer from performance degradation under the PMQ condition, the proposed method effectively mitigates this degradation and achieves high robustness under diverse query conditions.
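
The refinement step itself is simple to picture; here is a schematic example (class names and the activity estimator are hypothetical):

```python
def refine_query(query_classes, activity_scores, threshold=0.5):
    """Drop query classes whose estimated activity in the mixture is below the threshold."""
    return [c for c in query_classes if activity_scores.get(c, 0.0) >= threshold]

query = ["dog_bark", "siren", "speech"]                       # PMQ: "siren" is not in the mixture
activity = {"dog_bark": 0.92, "siren": 0.08, "speech": 0.76}  # from a sound-activity estimator
print(refine_query(query, activity))                          # ['dog_bark', 'speech']
```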

[317] Few-shot Personalization via In-Context Learning for Speech Emotion Recognition based on Speech-Language Model

Mana Ihori, Taiga Yamane, Naotaka Kawata, Naoki Makishima, Tomohiro Tanaka, Satoshi Suzuki, Shota Orihashi, Ryo Masumura

Main category: eess.AS

TL;DR: Personalized speech emotion recognition using in-context learning with few emotional utterances from target speakers, outperforming conventional methods.

DetailsMotivation: Emotion expression varies by person, making speaker-specific adaptation crucial for SER performance. Traditional methods require emotional utterances for all emotion labels, which is often impractical.

Method: Meta-train a speech-language model extended from LLMs to perform personalized SER via in-context learning, conditioning on few emotional utterances of target speakers during inference.

Result: Experimental results on a newly collected SER dataset show the proposed method outperforms conventional SER approaches.

Conclusion: In-context learning enables effective personalization for speech emotion recognition using only a few emotional utterances, overcoming limitations of conventional methods.

Abstract: This paper proposes a personalization method for speech emotion recognition (SER) through in-context learning (ICL). Since the expression of emotions varies from person to person, speaker-specific adaptation is crucial for improving the SER performance. Conventional SER methods have been personalized using emotional utterances of a target speaker, but it is often difficult to prepare utterances corresponding to all emotion labels in advance. Our idea to overcome this difficulty is to obtain speaker characteristics by conditioning a few emotional utterances of the target speaker in ICL-based inference. ICL is a method to perform unseen tasks by conditioning a few input-output examples through inference in large language models (LLMs). We meta-train a speech-language model extended from the LLM to learn how to perform personalized SER via ICL. Experimental results using our newly collected SER dataset demonstrate that the proposed method outperforms conventional methods.

[318] Joint Learning using Mixture-of-Expert-Based Representation for Enhanced Speech Generation and Robust Emotion Recognition

Jing-Tong Tzeng, Carlos Busso, Chi-Chun Lee

Main category: eess.AS

TL;DR: Sparse MERIT is a multi-task learning framework that uses frame-wise expert routing on self-supervised speech representations to jointly optimize speech enhancement and emotion recognition, outperforming baselines in noisy conditions.

DetailsMotivation: Speech emotion recognition performance degrades significantly under noisy conditions. Speech enhancement can help but introduces artifacts and computational overhead. Multi-task learning faces gradient interference and representational conflicts between tasks.

Method: Proposed Sparse MERIT framework with task-specific gating networks that dynamically select from a shared pool of experts for each frame, enabling parameter-efficient and task-adaptive representation learning from self-supervised speech representations.

Result: On MSP-Podcast corpus, Sparse MERIT improved SER F1-macro by 12.0% over SE pre-processing baseline and 3.4% over naive MTL baseline at -5 dB SNR. For SE, improved SSNR by 28.2% and 20.0% respectively, with statistical significance on unseen noise conditions.

Conclusion: Sparse MERIT provides robust and generalizable performance for both emotion recognition and enhancement tasks in noisy environments, effectively addressing gradient interference and representational conflicts in multi-task learning.

Abstract: Speech emotion recognition (SER) plays a critical role in building emotion-aware speech systems, but its performance degrades significantly under noisy conditions. Although speech enhancement (SE) can improve robustness, it often introduces artifacts that obscure emotional cues and adds computational overhead to the pipeline. Multi-task learning (MTL) offers an alternative by jointly optimizing SE and SER tasks. However, conventional shared-backbone models frequently suffer from gradient interference and representational conflicts between tasks. To address these challenges, we propose the Sparse Mixture-of-Experts Representation Integration Technique (Sparse MERIT), a flexible MTL framework that applies frame-wise expert routing over self-supervised speech representations. Sparse MERIT incorporates task-specific gating networks that dynamically select from a shared pool of experts for each frame, enabling parameter-efficient and task-adaptive representation learning. Experiments on the MSP-Podcast corpus show that Sparse MERIT consistently outperforms baseline models on both SER and SE tasks. Under the most challenging condition of -5 dB signal-to-noise ratio (SNR), Sparse MERIT improves SER F1-macro by an average of 12.0% over a baseline relying on a SE pre-processing strategy, and by 3.4% over a naive MTL baseline, with statistical significance on unseen noise conditions. For SE, Sparse MERIT improves segmental SNR (SSNR) by 28.2% over the SE pre-processing baseline and by 20.0% over the naive MTL baseline. These results demonstrate that Sparse MERIT provides robust and generalizable performance for both emotion recognition and enhancement tasks in noisy environments.
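
A compact sketch of frame-wise top-k expert routing (not the authors' code; for clarity it evaluates all experts densely, whereas a truly sparse implementation would run only the selected ones):

```python
import torch
import torch.nn as nn

class FrameWiseSparseMoE(nn.Module):
    def __init__(self, dim, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)   # the paper uses one such gate per task
        self.top_k = top_k

    def forward(self, frames):                                              # (B, T, D)
        weights, idx = torch.topk(
            torch.softmax(self.gate(frames), dim=-1), self.top_k, dim=-1)   # (B, T, k)
        expert_out = torch.stack([e(frames) for e in self.experts], dim=-2)  # (B, T, E, D)
        gathered = torch.gather(
            expert_out, -2, idx.unsqueeze(-1).expand(-1, -1, -1, frames.size(-1)))
        return (weights.unsqueeze(-1) * gathered).sum(dim=-2)               # (B, T, D)

moe = FrameWiseSparseMoE(dim=768)
out = moe(torch.randn(2, 100, 768))   # e.g. 100 frames of self-supervised speech features
print(out.shape)                      # torch.Size([2, 100, 768])
```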

[319] Audio Deepfake Verification

Li Wang, Junyi Ao, Linyong Gan, Yuancheng Wang, Xueyao Zhang, Zhizheng Wu

Main category: eess.AS

TL;DR: This paper introduces Audio Deepfake Verification (ADV) task and Audity dual-branch architecture for open-set deepfake source tracing, outperforming single-branch approaches in both detection and verification.

DetailsMotivation: With rapid deepfake technology development, binary true/false audio judgments are insufficient. There's a need to accurately determine specific deepfake methods and overcome limitations of existing source tracing methods in closed-set scenarios.

Method: Proposes Audity dual-branch architecture that extracts deepfake features from two dimensions: audio structure and generation artifacts, enabling open-set deepfake source tracing.

Result: Experimental results show the dual-branch Audity architecture outperforms any single-branch configuration and achieves excellent performance in both deepfake detection and verification tasks.

Conclusion: The proposed ADV task and Audity dual-branch architecture effectively address open-set deepfake source tracing challenges and demonstrate superior performance compared to single-branch approaches.

Abstract: With the rapid development of deepfake technology, simply making a binary judgment of true or false on audio is no longer sufficient to meet practical needs. Accurately determining the specific deepfake method has become crucial. This paper introduces the Audio Deepfake Verification (ADV) task, effectively addressing the limitations of existing deepfake source tracing methods in closed-set scenarios, aiming to achieve open-set deepfake source tracing. Meanwhile, the Audity dual-branch architecture is proposed, extracting deepfake features from two dimensions: audio structure and generation artifacts. Experimental results show that the dual-branch Audity architecture outperforms any single-branch configuration, and it can simultaneously achieve excellent performance in both deepfake detection and verification tasks.

[320] Accelerating Diffusion Transformer-Based Text-to-Speech with Transformer Layer Caching

Siratish Sakpiboonchit

Main category: eess.AS

TL;DR: SmoothCache accelerates DiT-based TTS inference by caching transformer layer outputs during denoising, reducing redundant computations while maintaining quality through calibrated cache scheduling.

DetailsMotivation: To optimize diffusion transformer TTS models by reducing inference time without architectural changes or retraining, addressing computational inefficiencies in the denoising process.

Method: Integrates SmoothCache into F5-TTS, caching self-attention and FFN layer outputs. Uses calibration phase with L1 error analysis to determine optimal cache schedules and adopts unified caching to handle inter-layer dependencies.

Result: Caching at higher denoising steps reduces inference time without quality loss, while caching at lower steps degrades quality similar to reducing total steps. Maintains performance with improved computational efficiency.

Conclusion: Transformer layer caching is an effective practical solution for optimizing DiT-based TTS models, offering computational benefits without requiring model modifications or retraining.

Abstract: This paper presents a method to accelerate the inference process of diffusion transformer (DiT)-based text-to-speech (TTS) models by applying a selective caching mechanism to transformer layers. Specifically, I integrate SmoothCache into the F5-TTS architecture, focusing on caching outputs of self-attention and feed-forward network layers to reduce redundant computations during the denoising process. A calibration phase is introduced to analyze L1 relative errors between timesteps, guiding the selection of cache schedules that minimize quality degradation. To address the problem of inter-layer dependency, a unified caching schedule is adopted, applying the cache pattern derived from self-attention layers to both layer types. Experiments on LibriSpeech-PC and Seed-TTS datasets evaluate various cache thresholds and denoising step configurations. Results show that caching at higher denoising steps reduces inference time without compromising output quality, whereas caching at lower steps can negatively impact synthesis quality similarly to reducing the total number of denoising steps. Objective and subjective metrics confirm the effectiveness of SmoothCache in maintaining performance while improving computational efficiency. Comparisons between cached inference and reduced-step inference further highlight the benefits of selective caching, especially under high-step configurations. This work demonstrates that transformer layer caching is a practical solution for optimizing diffusion transformer-based TTS models without requiring architectural changes or retraining. Example inference results can be heard at https://siratish.github.io/F5-TTS_SmoothCache/ .
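
Conceptually, the caching mechanism wraps each eligible layer with a per-step reuse decision. The sketch below is illustrative (interfaces assumed, not F5-TTS's actual code) and assumes the set of cacheable steps has already been derived in the calibration phase from the L1 relative errors between adjacent timesteps:

```python
class CachedLayer:
    """Reuse a layer's cached output at denoising steps flagged by a calibrated schedule."""

    def __init__(self, layer, cache_steps):
        self.layer = layer              # e.g. a self-attention or feed-forward block
        self.cache_steps = cache_steps  # set of denoising steps at which to reuse the cache
        self.cached_output = None

    def __call__(self, x, step):
        if step in self.cache_steps and self.cached_output is not None:
            return self.cached_output   # skip recomputation at this step
        self.cached_output = self.layer(x)
        return self.cached_output
```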

[321] MF-PAM: Accurate Pitch Estimation through Periodicity Analysis and Multi-level Feature Fusion

Woo-Jin Chung, Doyeon Kim, Soo-Whan Chung, Hong-Goo Kang

Main category: eess.AS

TL;DR: MF-PAM is a deep learning model that accurately estimates pitch in noisy environments using periodic feature extraction and multi-level fusion, achieving 99.20% accuracy on clean music data with fewer parameters than state-of-the-art methods.

DetailsMotivation: To address the challenge of accurate pitch estimation in noisy and reverberant acoustic environments by leveraging the periodic characteristics of audio signals.

Method: Uses periodic non-periodic convolution (PNP-Conv) blocks to extract pitch periodicity, then aggregates multi-level features using a modified bi-directional feature pyramid network (BiFPN).

Result: Achieves superior pitch estimation performance compared to state-of-the-art baselines with fewer model parameters, including 99.20% accuracy on a clean musical dataset.

Conclusion: Provides a promising solution for accurate pitch estimation in challenging acoustic environments with potential applications in audio signal processing.

Abstract: We introduce Multi-level feature Fusion-based Periodicity Analysis Model (MF-PAM), a novel deep learning-based pitch estimation model that accurately estimates pitch trajectory in noisy and reverberant acoustic environments. Our model leverages the periodic characteristics of audio signals and involves two key steps: extracting pitch periodicity using periodic non-periodic convolution (PNP-Conv) blocks and estimating pitch by aggregating multi-level features using a modified bi-directional feature pyramid network (BiFPN). We evaluate our model on speech and music datasets and achieve superior pitch estimation performance compared to state-of-the-art baselines while using fewer model parameters. Our model achieves 99.20 % accuracy in pitch estimation on a clean musical dataset. Overall, our proposed model provides a promising solution for accurate pitch estimation in challenging acoustic environments and has potential applications in audio signal processing.

eess.IV

[322] STROKEVISION-BENCH: A Multimodal Video And 2D Pose Benchmark For Tracking Stroke Recovery

David Robinson, Animesh Gupta, Rizwan Quershi, Qiushi Fu, Mubarak Shah

Main category: eess.IV

TL;DR: StrokeVision-Bench is the first dedicated dataset of stroke patients performing clinically structured block transfer tasks (Box and Block Test) with 1,000 annotated videos in two modalities, addressing gaps in existing datasets for objective UE function assessment.

DetailsMotivation: Current clinical assessment of upper extremity function after stroke is subjective and lacks sensitivity for detecting subtle motor improvements. Existing datasets focus on daily living activities rather than structured clinical assessments and mix healthy/affected individuals.

Method: Created StrokeVision-Bench dataset with 1,000 annotated videos of stroke patients performing block transfer tasks, categorized into four action classes. Provided two modalities: raw video frames and 2D skeletal keypoints. Benchmarked state-of-the-art video action recognition and skeleton-based action classification methods.

Result: Established performance baselines for automated assessment of stroke rehabilitation using structured clinical tasks. The dataset enables objective, quantitative analysis of UE motor function through computer vision approaches.

Conclusion: StrokeVision-Bench addresses critical gaps in stroke rehabilitation datasets by providing dedicated, clinically structured assessment data, facilitating future research in automated, objective assessment of upper extremity function recovery after stroke.

Abstract: Despite advancements in rehabilitation protocols, clinical assessment of upper extremity (UE) function after stroke largely remains subjective, relying heavily on therapist observation and coarse scoring systems. This subjectivity limits the sensitivity of assessments to detect subtle motor improvements, which are critical for personalized rehabilitation planning. Recent progress in computer vision offers promising avenues for enabling objective, quantitative, and scalable assessment of UE motor function. Among standardized tests, the Box and Block Test (BBT) is widely utilized for measuring gross manual dexterity and tracking stroke recovery, providing a structured setting that lends itself well to computational analysis. However, existing datasets targeting stroke rehabilitation primarily focus on daily living activities and often fail to capture clinically structured assessments such as block transfer tasks. Furthermore, many available datasets include a mixture of healthy and stroke-affected individuals, limiting their specificity and clinical utility. To address these critical gaps, we introduce StrokeVision-Bench, the first-ever dedicated dataset of stroke patients performing clinically structured block transfer tasks. StrokeVision-Bench comprises 1,000 annotated videos categorized into four clinically meaningful action classes, with each sample represented in two modalities: raw video frames and 2D skeletal keypoints. We benchmark several state-of-the-art video action recognition and skeleton-based action classification methods to establish performance baselines for this domain and facilitate future research in automated stroke rehabilitation assessment.

[323] BodyWave: Egocentric Body Tracking using mmWave Radars on an MR Headset

Yin Li, Sean Korphi, Sam Shiu, Yasuo Morimoto, Jiang Zhu, Rajalakshimi Nandakumar

Main category: eess.IV

TL;DR: BodyWave is a millimeter-wave radar-based inside-out body tracking system that addresses limitations of camera-based solutions by detecting non-line-of-sight movements with better privacy and robustness, achieving comparable accuracy to state-of-the-art camera systems.

DetailsMotivation: Camera-based inside-out body tracking (IOBT) faces challenges with limited view angles, sparse input, and self-occlusion, especially for lower-body parts in mixed reality applications.

Method: Uses millimeter-wave radar technology placed 4cm from face (Meta Quest 3 form factor), processes raw signals into high-resolution range profiles to predict 3D body keypoint coordinates.

Result: Achieved MPJPE of 9.85cm on unseen users, 4.94cm with brief calibration, and 3.86cm in user-dependent setting - comparable to camera-based IOBT systems.

Conclusion: BodyWave provides a robust, privacy-preserving alternative to camera-based IOBT with low SWAP+C and non-line-of-sight detection capabilities suitable for mixed reality applications.

Abstract: Egocentric body tracking, also known as inside-out body tracking (IOBT), is an essential technology for applications like gesture control and codec avatar in mixed reality (MR), including augmented reality (AR) and virtual reality (VR). However, it is more challenging than exocentric body tracking due to the limited view angles of camera-based solutions, which provide only sparse and self-occluded input from head-mounted cameras, especially for lower-body parts. To address these challenges, we propose, BodyWave, an IOBT system based on millimeter-wave (mmWave) radar, which can detect non-line-of-sight. It offers low SWAP+C (size, weight, and power consumption), robustness to environmental and user factors, and enhanced privacy over camera-based solutions. Our prototype, modeled after the Meta Quest 3 form factor, places radars just 4cm away from the face, which significantly advances the practicality of radar-based IOBT. We tackle the sparsity issue of mmWave radar by processing the raw signal into high-resolution range profiles to predict fine-grained 3D coordinates of body keypoints. In a user study with 14 participants and around 500,000 frames of collected data, we achieved a mean per-joint position error (MPJPE) of 9.85 cm on unseen users, 4.94 cm with a few minutes of user calibration, and 3.86 cm in a fully-adapted user-dependent setting. This is comparable to state-of-the-art camera-based IOBT systems, introducing a robust and privacy-preserving alternative for MR applications.

[324] Expert-Guided Explainable Few-Shot Learning for Medical Image Diagnosis

Ifrat Ikhtear Uddin, Longwei Wang, KC Santosh

Main category: eess.IV

TL;DR: Expert-guided explainable few-shot learning framework that integrates radiologist ROIs with Grad-CAM supervision to improve both classification accuracy and interpretability in medical imaging.

DetailsMotivation: Address limited expert-annotated data in medical image analysis that hinders model generalization and clinical adoption by incorporating expert knowledge to enhance both performance and interpretability.

Method: Proposes an explanation loss based on Dice similarity to align model attention with diagnostically relevant regions, jointly optimized with prototypical network objective. Uses Grad-CAM for spatial attention supervision during training.

Result: Achieved significant accuracy improvements: from 77.09% to 83.61% on BraTS (MRI) and from 54.33% to 73.29% on VinDr-CXR (Chest X-ray) compared to non-guided models. Grad-CAM visualizations confirmed better attention alignment with diagnostic regions.

Conclusion: Expert-guided attention supervision effectively bridges the gap between performance and interpretability in few-shot medical image diagnosis, improving both predictive reliability and clinical trustworthiness.

Abstract: Medical image analysis often faces significant challenges due to limited expert-annotated data, hindering both model generalization and clinical adoption. We propose an expert-guided explainable few-shot learning framework that integrates radiologist-provided regions-of-interests (ROIs) into model training to simultaneously enhance classification performance and interpretability. Leveraging Grad-CAM for spatial attention supervision, we introduce an explanation loss based on Dice similarity to align model attention with diagnostically relevant regions during training. This explanation loss is jointly optimized with a standard prototypical network objective, encouraging the model to focus on clinically meaningful features even under limited data conditions. We evaluate our framework on two distinct datasets: BraTS (MRI) and VinDr-CXR (Chest X-ray), achieving significant accuracy improvements from 77.09% to 83.61% on BraTS and from 54.33% to 73.29% on VinDr-CXR compared to non-guided models. Grad-CAM visualizations further confirm that expert-guided training consistently aligns attention with diagnostic regions, improving both predictive reliability and clinical trustworthiness. Our findings demonstrate the effectiveness of incorporating expert-guided attention supervision to bridge the gap between performance and interpretability in few-shot medical image diagnosis.
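
The explanation loss is straightforward to state; here is a minimal sketch (assuming Grad-CAM maps and ROI masks are provided as same-size tensors in [0, 1]):

```python
import torch

def dice_explanation_loss(cam, roi_mask, eps=1e-6):
    """1 - Dice similarity between Grad-CAM attention maps and expert ROI masks."""
    cam = cam.flatten(1)          # (batch, H*W)
    roi = roi_mask.flatten(1)
    intersection = (cam * roi).sum(dim=1)
    dice = (2 * intersection + eps) / (cam.sum(dim=1) + roi.sum(dim=1) + eps)
    return (1.0 - dice).mean()

# Joint objective sketch: prototypical-network loss plus weighted explanation loss
# loss = proto_loss + lambda_expl * dice_explanation_loss(cam, roi_mask)
```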

[325] Validation of a CT-brain analysis tool for measuring global cortical atrophy in older patient cohorts

Sukhdeep Bal, Emma Colbourne, Jasmine Gan, Ludovica Griffanti, Taylor Hanayik, Nele Demeyere, Jim Davies, Sarah T Pendlebury, Mark Jenkinson

Main category: eess.IV

TL;DR: Deep learning tool accurately measures brain atrophy from CT scans, matching human raters’ performance and correlating with age and cognitive scores.

DetailsMotivation: Current brain atrophy quantification requires time-consuming visual rating scales, necessitating automated analysis tools for efficient and standardized measurement.

Method: Developed and validated a deep learning tool to measure Global Cerebral Atrophy (GCA) scores from CT-brain scans, comparing against trained human raters using mean absolute error and Cohen’s weighted kappa metrics.

Result: The DL tool achieved mean absolute error of 3.2 overall compared to human raters, with half of predictions having error between -2 and 2. Inter-rater agreement was comparable to human-human agreement (Kappa=0.45 vs 0.28). GCA scores correlated significantly with age and cognitive impairment (p<0.001).

Conclusion: The automated DL tool provides accurate, standardized brain atrophy quantification without user input, enabling large-scale health data research and serving as proof-of-concept for clinical point-of-care applications.

Abstract: Quantification of brain atrophy currently requires visual rating scales, which are time consuming, so automated brain image analysis is warranted. We validated our automated deep learning (DL) tool measuring the Global Cerebral Atrophy (GCA) score against trained human raters, and associations with age and cognitive impairment, in representative older (>65 years) patients. CT-brain scans were obtained from patients in acute medicine (ORCHARD-EPR), acute stroke (OCS studies) and a legacy sample. Scans were divided in a 60/20/20 ratio for training, optimisation and testing. CT-images were assessed by two trained raters (rater-1=864 scans, rater-2=20 scans). Agreement between DL tool-predicted GCA scores (range 0-39) and the visual ratings was evaluated using mean absolute error (MAE) and Cohen’s weighted kappa. Among 864 scans (ORCHARD-EPR=578, OCS=200, legacy scans=86), MAE between the DL tool and rater-1 GCA scores was 3.2 overall, 3.1 for ORCHARD-EPR, 3.3 for OCS and 2.6 for the legacy scans, and half had DL-predicted GCA error between -2 and 2. Inter-rater agreement was Kappa=0.45 between the DL-tool and rater-1, and 0.41 between the tool and rater-2, whereas it was lower at 0.28 for rater-1 and rater-2. There was no difference in GCA scores from the DL-tool and the two raters (one-way ANOVA, p=0.35) or in mean GCA scores between the DL-tool and rater-1 (paired t-test, t=-0.43, p=0.66), the tool and rater-2 (t=1.35, p=0.18) or between rater-1 and rater-2 (t=0.99, p=0.32). DL-tool GCA scores correlated with age and cognitive scores (both p<0.001). Our DL CT-brain analysis tool measured the GCA score accurately and without user input in real-world scans acquired from older patients. Our tool will enable extraction of standardised quantitative measures of atrophy at scale for use in health data research and will act as proof-of-concept towards a point-of-care clinically approved tool.
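
For reference, the two agreement metrics reported above can be computed as follows (a brief sketch; the weighting scheme for kappa is assumed to be quadratic, since the paper only states "weighted kappa"):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def agreement(pred_scores, rater_scores, weights="quadratic"):
    """MAE and Cohen's weighted kappa between tool-predicted and rater GCA scores (0-39)."""
    pred = np.asarray(pred_scores)
    rater = np.asarray(rater_scores)
    mae = float(np.mean(np.abs(pred - rater)))
    # pass the full score range so the weight matrix follows the 0-39 ordinal scale
    kappa = cohen_kappa_score(pred, rater, labels=np.arange(40), weights=weights)
    return mae, kappa

mae, kappa = agreement([10, 12, 30, 5], [11, 15, 28, 5])
print(mae, kappa)
```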

[326] CardioComposer: Flexible and Compositional Anatomical Structure Generation with Disentangled Geometric Guidance

Karim Kadry, Shoaib Goraya, Ajay Manicka, Abdalla Abdelwahed, Farhad Nezami, Elazer Edelman

Main category: eess.IV

TL;DR: A programmable framework using ellipsoidal primitives to guide unconditional diffusion models for generating controllable and anatomically realistic 3D human anatomy.

DetailsMotivation: Current generative models of 3D anatomy face a trade-off between controllability and anatomical realism, limiting their utility for clinical research and medical device design.

Method: Uses interpretable ellipsoidal primitives embedded in 3D space with geometric moment losses applied to selected tissues in multi-tissue segmentation maps to guide the reverse diffusion process.

Result: The framework enables independent control over size, shape, position, and composition of multi-component constraints during inference.

Conclusion: The proposed method provides a programmable and compositional approach to generate anatomically realistic 3D human anatomy with enhanced controllability for structure-function relationship studies.

Abstract: Generative models of 3D anatomy, when integrated with biophysical simulators, enable the study of structure-function relationships for clinical research and medical device design. However, current models face a trade-off between controllability and anatomical realism. We propose a programmable and compositional framework for guiding unconditional diffusion models of human anatomy using interpretable ellipsoidal primitives embedded in 3D space. Our method involves the selection of certain tissues within multi-tissue segmentation maps, upon which we apply geometric moment losses to guide the reverse diffusion process. This framework supports the independent control over size, shape, and position, as well as the composition of multi-component constraints during inference.
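
As one concrete but simplified instance of a geometric moment loss (not the authors' code), the sketch below pushes the soft centroid, i.e. the first-order moment, of a selected tissue channel toward a target position; the paper's guidance uses richer moment constraints derived from the ellipsoidal primitives:

```python
import torch

def centroid_moment_loss(tissue_prob, target_xyz):
    """tissue_prob: (D, H, W) soft segmentation of one tissue; target_xyz: (3,) voxel coords."""
    grids = torch.meshgrid(
        *[torch.arange(s, dtype=tissue_prob.dtype) for s in tissue_prob.shape],
        indexing="ij")
    mass = tissue_prob.sum() + 1e-6
    centroid = torch.stack([(g * tissue_prob).sum() / mass for g in grids])
    return ((centroid - target_xyz) ** 2).sum()

seg = torch.rand(32, 32, 32)   # stand-in for one tissue channel of a diffusion sample
print(centroid_moment_loss(seg, torch.tensor([16.0, 16.0, 16.0])))
```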

[327] Enhancing Privacy Preservation and Reducing Analysis Time with Federated Transfer Learning in Digital Twins-based Computed Tomography Scan Analysis

Avais Jan, Qasim Zia, Murray Patterson

Main category: eess.IV

TL;DR: Federated Transfer Learning (FTL) combines Digital Twin technology and Federated Learning for CT scan analysis, addressing data privacy, resource constraints, and data heterogeneity while outperforming traditional FL methods.

DetailsMotivation: To overcome challenges in biomedical image analysis including data privacy concerns, limited computing resources, and data heterogeneity in CT scan analysis while enabling real-time collaboration.

Method: Proposes FTL framework using pre-trained models and knowledge transfer between peer nodes, allowing cloud servers and Digital Twin-enabled CT scanners to collaborate while protecting patient privacy.

Result: FTL outperforms conventional FL and Clustered FL methods with better precision, accuracy, recall, and F1-score, particularly effective for non-IID data distributions.

Conclusion: FTL provides reliable, efficient, and secure solutions for medical diagnosis, enabling improved decision-making in digital twin-based CT analysis and opening new possibilities for precision medicine and smart healthcare systems.

Abstract: The application of Digital Twin (DT) technology and Federated Learning (FL) has great potential to change the field of biomedical image analysis, particularly for Computed Tomography (CT) scans. This paper presents Federated Transfer Learning (FTL) as a new Digital Twin-based CT scan analysis paradigm. FTL uses pre-trained models and knowledge transfer between peer nodes to solve problems such as data privacy, limited computing resources, and data heterogeneity. The proposed framework allows real-time collaboration between cloud servers and Digital Twin-enabled CT scanners while protecting patient identity. We apply the FTL method to a heterogeneous CT scan dataset and assess model performance using convergence time, model accuracy, precision, recall, F1 score, and confusion matrix. It outperforms conventional FL and Clustered Federated Learning (CFL) methods in precision, accuracy, recall, and F1-score. The technique is beneficial in settings where the data is not independently and identically distributed (non-IID), and it offers reliable, efficient, and secure solutions for medical diagnosis. These findings highlight the potential of FTL to improve decision-making in digital twin-based CT scan analysis, enable secure and efficient medical image analysis, promote privacy, and open new possibilities for applying precision medicine and smart healthcare systems.

[328] Physics-Guided Rectified Flow for Low-light RAW Image Enhancement

Juntai Zeng

Main category: eess.IV

TL;DR: Proposes PGRF - a physics-guided rectified flow framework that combines composite noise modeling with per-pixel calibration for enhanced low-light RAW image enhancement, validated on new LLID dataset.

DetailsMotivation: Existing synthetic datasets for low-light RAW enhancement only consider additive noise, ignore multiplicative components, and use global calibration that overlooks pixel-level manufacturing variations, leading to inaccurate sensor noise reproduction.

Method: Derives composite noise model integrating additive and multiplicative noise from physical noise generation mechanisms. Introduces physics-based per-pixel noise simulation and calibration. Combines with rectified flow generative framework (PGRF) that uses physical guidance to steer generation toward clean images.

Result: Established LLID indoor low-light benchmark dataset. Experimental results demonstrate significant improvements in low-light RAW image enhancement compared to existing methods.

Conclusion: The proposed PGRF framework effectively addresses limitations of existing noise modeling approaches by capturing spatial noise variations and leveraging both physical guidance and rectified flow generative capabilities for superior low-light image enhancement.

Abstract: Enhancing RAW images captured under low-light conditions is a challenging task. Recent deep-learning-based RAW enhancement methods have shifted from using real paired data to relying on synthetic datasets. These synthetic datasets are typically generated by physically modeling sensor noise, but existing approaches often consider only additive noise, ignore multiplicative components, and rely on global calibration that overlooks pixel-level manufacturing variations. As a result, such methods struggle to accurately reproduce real sensor noise. To address these limitations, this paper derives a noise model from the physical noise generation mechanisms that occur under low illumination and proposes a novel composite model that integrates both additive and multiplicative noise. To solve the model, we introduce a physics-based per-pixel noise simulation and calibration scheme that estimates and synthesizes noise for each individual pixel, thereby overcoming the restrictions of traditional global calibration and capturing spatial noise variations induced by microscopic CMOS manufacturing differences. Motivated by the strong performance of rectified flow methods in image generation and processing, we further combine the physics-based noise synthesis with a rectified flow generative framework and present PGRF, a physics-guided rectified flow framework for low-light image enhancement. PGRF leverages the ability of rectified flows to model complex data distributions and uses physical guidance to steer the generation toward the desired clean image. To validate the effectiveness of the proposed model, we established the LLID dataset, an indoor low-light benchmark captured with the Sony A7S II camera. Experimental results demonstrate that the proposed framework achieves significant improvements in low-light RAW image enhancement.
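
A hedged sketch of what a composite additive-plus-multiplicative, per-pixel noise model can look like (distributions and parameter names are illustrative; the paper's calibrated model may differ in detail):

```python
import numpy as np

def synthesize_noisy_raw(clean, gain_map, read_sigma_map, rng):
    """clean: expected photon counts per pixel; gain_map / read_sigma_map: per-pixel
    multiplicative gain and additive read-noise std obtained from calibration."""
    shot = rng.poisson(np.clip(clean, 0, None))   # signal-dependent shot noise
    multiplicative = shot * gain_map              # per-pixel gain variation (PRNU-like)
    additive = rng.normal(0.0, read_sigma_map)    # per-pixel read noise
    return multiplicative + additive

rng = np.random.default_rng(0)
clean = rng.uniform(0, 50, size=(4, 4))           # low photon counts
gain = rng.normal(1.0, 0.02, size=(4, 4))         # per-pixel gain map
read_sigma = rng.uniform(1.0, 2.0, size=(4, 4))   # per-pixel read-noise std
print(synthesize_noisy_raw(clean, gain, read_sigma, rng))
```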

[329] Multispectral CT Denoising via Simulation-Trained Deep Learning: Experimental Results at the ESRF BM18

Peter Gänz, Steffen Kieß, Guangpu Yang, Jajnabalkya Guhathakurta, Tanja Pienkny, Charls Clark, Paul Tafforeau, Andreas Balles, Astrid Hölzing, Simon Zabler, Sven Simon

Main category: eess.IV

TL;DR: Neural network-based denoising method for multispectral CT projections that leverages angular, spatial, and spectral redundancies to reduce noise while preserving structural details.

DetailsMotivation: Multispectral CT suffers from reduced photon count per energy bin leading to substantial noise, which either prolongs acquisition times or degrades image quality with strong noise artifacts.

Method: Specialized sub-networks combined via stacked generalization and attention mechanism to exploit redundancies across angular, spatial, and spectral domains, leveraging non-local similarities and spectral correlations. Trained exclusively on simulated data replicating BM18 beamline characteristics.

Result: Substantial improvements in image quality compared to classical denoising methods and baseline CNN models. Superior performance across broad spectral range with effective generalization to real-world experimental data.

Conclusion: The proposed method significantly reduces noise without compromising structural fidelity, demonstrating robust noise suppression while preserving fine details in multispectral CT imaging.

Abstract: Multispectral computed tomography (CT) enables advanced material characterization by acquiring energy-resolved projection data. However, since the incoming X-ray flux is distributed across multiple narrow energy bins, the photon count per bin is greatly reduced compared to standard energy-integrated imaging. This inevitably introduces substantial noise, which can either prolong acquisition times and make scan durations infeasible or degrade image quality with strong noise artifacts. To address this challenge, we present a dedicated neural network-based denoising approach tailored for multispectral CT projections acquired at the BM18 beamline of the ESRF. The method exploits redundancies across angular, spatial, and spectral domains through specialized sub-networks combined via stacked generalization and an attention mechanism. Non-local similarities in the angular-spatial domain are leveraged alongside correlations between adjacent energy bands in the spectral domain, enabling robust noise suppression while preserving fine structural details. Training was performed exclusively on simulated data replicating the physical and noise characteristics of the BM18 setup, with validation conducted on CT scans of custom-designed phantoms containing both high-Z and low-Z materials. The denoised projections and reconstructions demonstrate substantial improvements in image quality compared to classical denoising methods and baseline CNN models. Quantitative evaluations confirm that the proposed method achieves superior performance across a broad spectral range, generalizing effectively to real-world experimental data while significantly reducing noise without compromising structural fidelity.

[330] CNN-ViT Hybrid for Pneumonia Detection: Theory and Empiric on Limited Data without Pretraining

Prashant Singh Basnet, Roshan Chitrakar

Main category: eess.IV

TL;DR: Hybrid CNN-ViT model achieves superior performance with limited data and class imbalance, outperforming standalone CNN and ViT models while maintaining comparable training time.

DetailsMotivation: To explore the architectural strengths of hybrid CNN-ViT models when trained from scratch on limited datasets with class imbalance, addressing the need for reliable diagnostic models in data-constrained scenarios.

Method: Trained hybrid CNN-ViT model from scratch on varied data fractions with both balanced and imbalanced datasets, comparing performance against standalone CNN and ViT models across different training conditions.

Result: Achieved highest recall of 0.9443 (50% data fraction in balanced dataset) and consistent F1 score around 0.85. Outperformed both CNN and ViT in imbalanced datasets while requiring comparable training time to transformers.

Conclusion: The hybrid CNN-ViT model effectively combines the strengths of both architectures, demonstrating reliability in diagnosis and robustness to class imbalance, making it suitable for practical applications with limited training data.

Abstract: This research explored the hybridization of CNN and ViT on a training dataset of limited size with a distinct class imbalance. Training was performed from scratch, focusing on theoretically and experimentally examining the architectural strengths of the proposed hybrid model. Experiments were conducted across varied data fractions with balanced and imbalanced training datasets. The hybrid model, which complements the strengths of CNN and ViT, achieved the highest recall of 0.9443 (at the 50% data fraction of the balanced dataset) and a consistent F1 score of around 0.85, suggesting reliability in diagnosis. It also outperformed both CNN and ViT on the imbalanced datasets. Despite its more complex architecture, it required training time comparable to the transformer across all data fractions.
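
A minimal sketch of one common way to hybridize the two architectures, with a small CNN stem producing patch tokens that feed a transformer encoder. The layer counts, widths, and input size are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class HybridCNNViT(nn.Module):
    """Toy hybrid: a small CNN stem produces patch tokens for a transformer encoder."""
    def __init__(self, n_classes=2, dim=64):
        super().__init__()
        self.stem = nn.Sequential(                        # CNN extracts local features
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # global context via attention
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):                                 # x: (B, 1, H, W) chest X-ray
        f = self.stem(x)                                  # (B, dim, H/4, W/4)
        tokens = f.flatten(2).transpose(1, 2)             # (B, N, dim) patch tokens
        z = self.encoder(tokens).mean(dim=1)              # pooled representation
        return self.head(z)

logits = HybridCNNViT()(torch.randn(2, 1, 224, 224))
```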

[331] RoentMod: A Synthetic Chest X-Ray Modification Model to Identify and Correct Image Interpretation Model Shortcuts

Lauren H. Cooke, Matthias Jung, Jan M. Brendel, Nora M. Kerkovits, Borek Foldyna, Michael T. Lu, Vineet K. Raghu

Main category: eess.IV

TL;DR: RoentMod is a counterfactual image editing framework that generates realistic chest X-rays with synthetic pathology to detect and correct shortcut learning in medical AI models, improving model performance by 3-19% AUC.

DetailsMotivation: Deep learning models for chest X-ray interpretation are vulnerable to shortcut learning, where they rely on spurious correlations rather than clinically relevant features, limiting their specificity and reliability.

Method: Combines an open-source medical image generator (RoentGen) with an image-to-image modification model to generate anatomically realistic CXRs with user-specified synthetic pathology while preserving original anatomical features.

Result: RoentMod images appeared realistic in 93% of cases, correctly incorporated specified findings in 89-99% of cases, and improved model discrimination by 3-19% AUC in internal validation and 1-11% for 5/6 pathologies in external testing.

Conclusion: RoentMod is an effective tool for probing and correcting shortcut learning in medical AI, enhancing robustness and interpretability of CXR interpretation models through controlled counterfactual interventions.

Abstract: Chest radiographs (CXRs) are among the most common tests in medicine. Automated image interpretation may reduce radiologists' workload and expand access to diagnostic expertise. Deep learning multi-task and foundation models have shown strong performance for CXR interpretation but are vulnerable to shortcut learning, where models rely on spurious and off-target correlations rather than clinically relevant features to make decisions. We introduce RoentMod, a counterfactual image editing framework that generates anatomically realistic CXRs with user-specified, synthetic pathology while preserving unrelated anatomical features of the original scan. RoentMod combines an open-source medical image generator (RoentGen) with an image-to-image modification model without requiring retraining. In reader studies with board-certified radiologists and radiology residents, RoentMod-produced images appeared realistic in 93% of cases, correctly incorporated the specified finding in 89-99% of cases, and preserved native anatomy comparable to real follow-up CXRs. Using RoentMod, we demonstrate that state-of-the-art multi-task and foundation models frequently exploit off-target pathology as shortcuts, limiting their specificity. Incorporating RoentMod-generated counterfactual images during training mitigated this vulnerability, improving model discrimination across multiple pathologies by 3-19% AUC in internal validation and by 1-11% for 5 out of 6 tested pathologies in external testing. These findings establish RoentMod as a broadly applicable tool for probing and correcting shortcut learning in medical AI. By enabling controlled counterfactual interventions, RoentMod enhances the robustness and interpretability of CXR interpretation models and provides a generalizable strategy for improving foundation models in medical imaging.
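
The mitigation step amounts to mixing counterfactual images into the training stream. Below is a minimal sketch of that idea under the assumption of a fixed mixing fraction; the `frac` hyperparameter and helper name are illustrative, not taken from the paper.

```python
import torch

def mix_counterfactual_batch(real_imgs, real_labels, cf_imgs, cf_labels, frac=0.3):
    """Illustrative mixing of counterfactual images into a training batch.

    frac controls the share of counterfactual samples (an assumed hyperparameter).
    """
    n_cf = int(frac * real_imgs.shape[0])
    idx = torch.randperm(cf_imgs.shape[0])[:n_cf]           # sample counterfactuals
    imgs = torch.cat([real_imgs[n_cf:], cf_imgs[idx]], dim=0)
    labels = torch.cat([real_labels[n_cf:], cf_labels[idx]], dim=0)
    perm = torch.randperm(imgs.shape[0])                    # shuffle so batches stay mixed
    return imgs[perm], labels[perm]
```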

[332] Deep Unrolling of Sparsity-Induced RDO for 3D Point Cloud Attribute Coding

Tam Thuc Do, Philip A. Chou, Gene Cheung

Main category: eess.IV

TL;DR: Lossy attribute compression for 3D point clouds using multi-resolution B-spline projection with end-to-end differentiable RD optimization and data-driven prediction.

DetailsMotivation: To efficiently compress 3D point cloud attributes when geometry is already available at the decoder, using a structured multi-resolution framework.

Method: Projects attribute function onto nested B-spline subspaces, computes coefficients via RD-optimized feed-forward network with sparsity-promoting L1-norm, and uses data-driven coarse-to-fine prediction adjustment.

Result: Creates an end-to-end differentiable compression framework that optimizes both rate and distortion through learned projections and predictions.

Conclusion: The method provides an optimized, differentiable approach for multi-resolution attribute compression in 3D point clouds with available geometry.

Abstract: Given encoded 3D point cloud geometry available at the decoder, we study the problem of lossy attribute compression in a multi-resolution B-spline projection framework. A target continuous 3D attribute function is first projected onto a sequence of nested subspaces $\mathcal{F}^{(p)}_{l_0} \subseteq \cdots \subseteq \mathcal{F}^{(p)}_{L}$, where $\mathcal{F}^{(p)}_{l}$ is a family of functions spanned by a B-spline basis function of order $p$ at a chosen scale and its integer shifts. The projected low-pass coefficients $F_l^*$ are computed by variable-complexity unrolling of a rate-distortion (RD) optimization algorithm into a feed-forward network, where the rate term is the sparsity-promoting $\ell_1$-norm. Thus, the projection operation is end-to-end differentiable. For a chosen coarse-to-fine predictor, the coefficients are then adjusted to account for the prediction from a lower-resolution to a higher-resolution, which is also optimized in a data-driven manner.
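
Unrolling a sparsity-regularized RD objective typically means fixing the depth of an iterative soft-thresholding (ISTA-style) solver and treating each iteration as a network layer. Here is a minimal NumPy sketch of that pattern, with a generic synthesis matrix `A` standing in for the B-spline projection operator and a hand-set trade-off weight `lam`; both are assumptions for illustration, not the paper's operator or learned parameters.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the l1-norm (the sparsity-promoting rate term)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def unrolled_ista(A, y, n_layers=20, lam=0.1):
    """Unrolled ISTA: each 'layer' is one gradient step on ||A c - y||^2 plus soft-thresholding.

    A: (m, n) synthesis matrix (stand-in for the B-spline projection operator).
    y: (m,) target attribute samples.  lam: rate-distortion trade-off (assumed).
    """
    step = 1.0 / np.linalg.norm(A, 2) ** 2          # Lipschitz-based step size
    c = np.zeros(A.shape[1])
    for _ in range(n_layers):                       # fixed depth = feed-forward network
        c = soft_threshold(c - step * A.T @ (A @ c - y), step * lam)
    return c

# Example: recover sparse coefficients from noisy samples.
rng = np.random.default_rng(0)
A = rng.standard_normal((80, 120))
c_true = np.zeros(120)
c_true[rng.choice(120, 5, replace=False)] = rng.standard_normal(5)
c_hat = unrolled_ista(A, A @ c_true + 0.01 * rng.standard_normal(80))
```

In a learned unrolling, the step size, threshold, and even the operator applied at each layer would be trained end to end rather than fixed as above.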

[333] Spatial-Spectral Chromatic Coding of Interference Signatures in SAR Imagery: Signal Modeling and Physical-Visual Interpretation

Huizhang Yang, Chengzhi Chen, Liyuan Chen, Zhongling Huang, Zhong Liu, Jian Yang

Main category: eess.IV

TL;DR: Novel color coding method for SAR imagery that visualizes interference patterns through spatial-spectral decomposition and RGB/HSV dual-space coding, enabling rapid visual interpretation of interference characteristics.

DetailsMotivation: Traditional grayscale SAR amplitude representations fail to explicitly reveal interference patterns caused by external radio emitters and unfocused signals, limiting visual analysis capabilities.

Method: Spectral subband decomposition to generate spatial-spectral images, followed by chromatic coding using RGB/HSV dual-space with specially designed color palettes to encode spatial-spectral properties into visually discernible patterns.

Result: Effectively highlights interference regions and unfocused echo/signal responses (blurring, ambiguities, moving target effects) in real datasets, enabling rapid visual interpretation without additional processing.

Conclusion: Provides analysts with a practical tool for visual interpretation, quality assessment, and data diagnosis in SAR imagery by making interference patterns visually discernible through chromatic coding.

Abstract: Synthetic Aperture Radar (SAR) images are conventionally visualized as grayscale amplitude representations, which often fail to explicitly reveal interference characteristics caused by external radio emitters and unfocused signals. This paper proposes a novel spatial-spectral chromatic coding method for visual analysis of interference patterns in single-look complex (SLC) SAR imagery. The method first generates a series of spatial-spectral images via spectral subband decomposition that preserve both spatial structures and spectral signatures. These images are subsequently chromatically coded into a color representation via RGB/HSV dual-space coding with a set of specifically designed color palettes. This method intrinsically encodes the spatial-spectral properties of interference into visually discernible patterns, enabling rapid visual interpretation without additional processing. To facilitate physical interpretation, mathematical models are established to theoretically analyze the physical mechanisms of responses to various interference types. Experiments using real datasets demonstrate that the method effectively highlights interference regions and unfocused echo or signal responses (e.g., blurring, ambiguities, and moving target effects), providing analysts with a practical tool for visual interpretation, quality assessment, and data diagnosis in SAR imagery.
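
A toy version of the coding pipeline: split the spectrum into subbands, form a spatial image per subband, and map spectral location to hue and amplitude to value before converting HSV to RGB. The even hue spacing and single-axis (azimuth) decomposition are simplifying assumptions rather than the paper's designed palette.

```python
import numpy as np
from matplotlib.colors import hsv_to_rgb

def chromatic_code_slc(slc, n_subbands=3):
    """Toy spatial-spectral chromatic coding of a single-look complex SAR image.

    slc: (H, W) complex array. The spectrum along one axis is split into
    subbands; the dominant subband per pixel sets the hue, its amplitude the value.
    """
    H, W = slc.shape
    spec = np.fft.fft(slc, axis=0)                    # decompose along one (azimuth) axis
    edges = np.linspace(0, H, n_subbands + 1, dtype=int)
    hsv = np.zeros((H, W, 3))
    for k in range(n_subbands):
        band = np.zeros_like(spec)
        band[edges[k]:edges[k + 1]] = spec[edges[k]:edges[k + 1]]
        amp = np.abs(np.fft.ifft(band, axis=0))       # subband spatial image
        amp = amp / (amp.max() + 1e-12)
        mask = amp > hsv[..., 2]                      # keep the dominant subband per pixel
        hsv[..., 0][mask] = k / n_subbands            # hue encodes spectral location
        hsv[..., 2][mask] = amp[mask]                 # value encodes amplitude
    hsv[..., 1] = 1.0                                 # full saturation
    return hsv_to_rgb(hsv)

rgb = chromatic_code_slc(np.random.randn(64, 64) + 1j * np.random.randn(64, 64))
```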

[334] Recursive Aperture Decoded Ultrasound Imaging (READI) With Estimated Motion-Compensated Compounding (EMC2)

Tyler Keith Henry, Darren Dahunsi, Randy Palamar, Negar Majidi, Mohammad Rahim Sobhani, Roger Zemp

Main category: eess.IV

TL;DR: READI with EMC2 is a novel motion compensation technique for FORCES ultrasound imaging that produces multiple low-resolution images from subsets of data, estimates motion between them, and aligns them for coherent compounding to restore image quality in moving tissues.

DetailsMotivation: FORCES imaging provides higher SNR and better penetration than traditional STA techniques but suffers from motion sensitivity due to ensemble size and aperture encoding, limiting its effectiveness in dynamic imaging scenarios like cardiac imaging.

Method: READI produces multiple low-resolution images from subsets of the FORCES sequence that are less motion-sensitive. EMC2 compares these images to estimate underlying motion, warps them for alignment, and performs coherent compounding to form the complete image.

Result: READI with EMC2 fully recovers images corrupted by probe motion, restores tissue speckle and sharpness in beating heart images, improves over sparse STA schemes with same transmit count, and recovers blood speckle at 42 cm/s flow rates.

Conclusion: The combined READI and EMC2 approach effectively addresses motion sensitivity in FORCES imaging, enabling high-quality ultrasound imaging in dynamic scenarios while maintaining the SNR and penetration advantages of the FORCES technique.

Abstract: Fast Orthogonal Row-Column Electronic Scanning (FORCES) is a Hadamard-encoded Synthetic Transmit Aperture (STA) imaging sequence using bias-sensitive Top-Orthogonal to Bottom Electrode (TOBE) arrays. It produces images with a higher Signal-to-Noise Ratio (SNR) and improved penetration depth compared to traditional STA techniques, but suffers from motion sensitivity due to ensemble size and aperture encoding. This work presents Recursive Aperture Decoded Ultrasound Imaging (READI), a novel decoding and beamforming technique for FORCES that produces multiple low-resolution images out of subsets of the FORCES sequence that are less susceptible to motion, but sum to form the complete FORCES image. Estimated Motion-Compensated Compounding (EMC2) describes the process of comparing these low-resolution images to estimate the underlying motion, then warping them to align before coherent compounding. READI with EMC2 is shown to fully recover images corrupted by probe motion, and restore tissue speckle and sharpness to an image of a beating heart. READI low-resolution images by themselves are demonstrated to be a marked improvement over sparse STA schemes with the same transmit count, and are shown to recover blood speckle at a flow rate of 42 cm/s.
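
A much-simplified sketch of the compensate-then-compound idea: estimate a rigid shift between low-resolution images by phase correlation, warp each image onto the reference, and average. Real FORCES data are complex beamformed aperture subsets and the motion model is richer; the helper names and integer-pixel motion below are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import shift as nd_shift

def estimate_shift(ref, mov):
    """Integer-pixel shift between two images via phase correlation."""
    cross = np.fft.fft2(ref) * np.conj(np.fft.fft2(mov))
    corr = np.abs(np.fft.ifft2(cross / (np.abs(cross) + 1e-12)))
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap peak coordinates to signed shifts.
    return [p if p <= s // 2 else p - s for p, s in zip(peak, corr.shape)]

def motion_compensated_compound(low_res_images):
    """Align each low-resolution image to the first, then average (compound) them."""
    ref = low_res_images[0]
    acc = np.zeros_like(ref, dtype=float)
    for img in low_res_images:
        dy, dx = estimate_shift(ref, img)
        acc += nd_shift(img.astype(float), (dy, dx), order=1)   # warp before compounding
    return acc / len(low_res_images)
```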

[335] Low-Cost and Detunable Wireless Resonator Glasses for Enhanced Eye MRI with Concurrent High-Quality Whole Brain MRI

Ming Lu, Xiaoyue Yang, Jason Moore, Pingping Li, Adam W. Anderson, John C. Gore, Seth A. Smith, Xinqiang Yan

Main category: eess.IV

TL;DR: Wireless resonator glasses developed for 7T MRI that provide 3x SNR improvement in eye region without compromising whole-brain image quality.

DetailsMotivation: To enhance eye MRI signal-to-noise ratio at ultrahigh field (7T) while maintaining whole-brain imaging capabilities, addressing the need for better ocular imaging without specialized hardware modifications.

Method: Integrated two detunable LC loop resonators into lightweight 3D-printed glasses frame positioned near eyes. The resonators passively couple to standard head coil without hardware changes. Bench tests evaluated tuning/isolation/detuning, with B1+ mapping in phantom and SNR measurements in both phantom and human subjects.

Result: Bench tests confirmed accurate tuning, strong isolation, and effective detuning. Phantom B1+ mapping showed negligible differences. Both phantom and in vivo imaging demonstrated ~3x SNR gain in eye region with no measurable SNR loss in brain areas.

Conclusion: The wireless resonator glasses offer a low-cost, easy-to-use solution that improves ocular SNR while preserving whole-brain image quality, enabling both dedicated eye MRI and simultaneous eye-brain imaging at 7T.

Abstract: Purpose: To develop and evaluate a wearable wireless resonator glasses design that enhances eye MRI signal-to-noise ratio (SNR) without compromising whole-brain image quality at 7 T. Methods: The device integrates two detunable LC loop resonators into a lightweight, 3D-printed frame positioned near the eyes. The resonators passively couple to a standard 2Tx/32Rx head coil without hardware modifications. Bench tests assessed tuning, isolation, and detuning performance. B1$^+$ maps were measured in a head/shoulder phantom, and SNR maps were obtained in both phantom and in vivo experiments. Results: Bench measurements confirmed accurate tuning, strong inter-element isolation, and effective passive detuning. Phantom B1$^+$ mapping showed negligible differences between configurations with and without the resonators. Phantom and in vivo imaging demonstrated up to about a 3-fold SNR gain in the eye region, with no measurable SNR loss in the brain. Conclusion: The wireless resonator glasses provide a low-cost, easy-to-use solution that improves ocular SNR while preserving whole-brain image quality, enabling both dedicated eye MRI and simultaneous eye-brain imaging at ultrahigh field.
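
For context on the tuning requirement: at 7 T the proton Larmor frequency is roughly 298 MHz, and an LC loop resonates at f = 1/(2π√(LC)). The small check below solves for the tuning capacitance given an assumed 50 nH loop inductance, which is an illustrative value rather than the paper's component choice.

```python
import math

GAMMA_MHZ_PER_T = 42.577                        # proton gyromagnetic ratio / 2*pi
f_larmor = GAMMA_MHZ_PER_T * 7.0 * 1e6          # ~298 MHz at 7 T, in Hz

def capacitance_for_resonance(f_hz, inductance_h):
    """Capacitance so that f = 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / ((2 * math.pi * f_hz) ** 2 * inductance_h)

L_assumed = 50e-9                               # 50 nH loop inductance (illustrative)
C_needed = capacitance_for_resonance(f_larmor, L_assumed)
print(f"Larmor frequency: {f_larmor / 1e6:.1f} MHz, "
      f"tuning capacitance: {C_needed * 1e12:.1f} pF")
```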

Last updated: 2025-09-15
Built with Hugo, theme modified from Stack