Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 88]
cs.CV [Total: 144]
cs.AI [Total: 68]
cs.SD [Total: 4]
cs.LG [Total: 137]
cs.MA [Total: 7]
cs.MM [Total: 1]
eess.AS [Total: 12]
eess.IV [Total: 16]

cs.CL

[1] Semantic Attractors and the Emergence of Meaning: Towards a Teleological Model of AGI

Hans-Joachim Rudolph

Main category: cs.CL

TL;DR: Proposes a semantic AGI framework using complex-valued meaning spaces with semantic attractors instead of statistical prediction, enabling modeling of irony/homonymy through recursive tensorial transformations.

Details

Motivation: To move beyond statistical next-token prediction in current transformer models and develop a model where meaning is formed through intentional semantic convergence rather than probabilistic inference.

Method: Uses complex-valued meaning spaces with cyclic operations involving imaginary unit i, rotational semantic structures, and semantic attractors (teleological operators) acting as intentional agents guiding meaning through gradient flows and tensor deformations.

Result: Develops a theoretical framework capable of modeling complex semantic phenomena like irony, homonymy, and ambiguity through recursive convergence toward semantic coherence.

Conclusion: True meaning emerges from recursive convergence toward semantic coherence rather than simulation, requiring a fundamentally new cognitive architecture designed to shape language rather than just predict it.

Abstract: This essay develops a theoretical framework for a semantic Artificial General Intelligence (AGI) based on the notion of semantic attractors in complex-valued meaning spaces. Departing from current transformer-based language models, which operate on statistical next-token prediction, we explore a model in which meaning is not inferred probabilistically but formed through recursive tensorial transformation. Using cyclic operations involving the imaginary unit i, we describe a rotational semantic structure capable of modeling irony, homonymy, and ambiguity. At the center of this model, however, is a semantic attractor – a teleological operator that, unlike statistical computation, acts as an intentional agent (Microvitum), guiding meaning toward stability, clarity, and expressive depth. Conceived in terms of gradient flows, tensor deformations, and iterative matrix dynamics, the attractor offers a model of semantic transformation that is not only mathematically suggestive, but also philosophically significant. We argue that true meaning emerges not from simulation, but from recursive convergence toward semantic coherence, and that this requires a fundamentally new kind of cognitive architecture – one designed to shape language, not just predict it.

Maojia Song, Tej Deep Pala, Weisheng Jin, Amir Zadeh, Chuan Li, Dorien Herremans, Soujanya Poria

Main category: cs.CL

TL;DR: KAIROS benchmark tests LLM trust formation and misinformation resistance in multi-agent quiz contests, showing GRPO with outcome rewards works best but reduces social influence robustness.

Details

Motivation: To understand how LLMs form trust from previous interactions, resist misinformation, and integrate peer input in multi-agent systems for collective intelligence.

Method: Created KAIROS benchmark simulating quiz contests with varying peer reliability, testing prompting, supervised fine-tuning, and GRPO reinforcement learning across multiple models.

Result: GRPO with multi-agent context and outcome-based rewards achieved best performance but decreased robustness to social influence compared to base models.

Conclusion: Multi-agent reinforcement learning with outcome rewards improves LLM decision-making in collaborative settings but comes with trade-offs in social influence robustness.

Abstract: Large language models (LLMs) are increasingly deployed in multi-agent systems (MAS) as components of collaborative intelligence, where peer interactions dynamically shape individual decision-making. Although prior work has focused on conformity bias, we extend the analysis to examine how LLMs form trust from previous impressions, resist misinformation, and integrate peer input during interaction, key factors for achieving collective intelligence under complex social dynamics. We present KAIROS, a benchmark simulating quiz contests with peer agents of varying reliability, offering fine-grained control over conditions such as expert-novice roles, noisy crowds, and adversarial peers. LLMs receive both historical interactions and current peer responses, allowing systematic investigation into how trust, peer action, and self-confidence influence decisions. As for mitigation strategies, we evaluate prompting, supervised fine-tuning, and reinforcement learning, Group Relative Policy Optimisation (GRPO), across multiple models. Our results reveal that GRPO with multi-agent context combined with outcome-based rewards and unconstrained reasoning achieves the best overall performance, but also decreases the robustness to social influence compared to Base models. The code and datasets are available at: https://github.com/declare-lab/KAIROS.
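
As a rough illustration of how outcome-based rewards feed GRPO-style training, the sketch below scores a group of sampled answers to one quiz question and standardises each reward against its own group. The 0/1 reward rule, the function names, and the use of the population standard deviation are assumptions made for illustration; they are not taken from the KAIROS code.

```python
from statistics import mean, pstdev

def outcome_rewards(answers, gold):
    """Hypothetical outcome-based reward: 1.0 for a correct final answer, else 0.0."""
    return [1.0 if a == gold else 0.0 for a in answers]

def group_relative_advantages(rewards, eps=1e-8):
    """Standardise each reward against the mean and spread of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to a question whose correct option is "B".
sampled = ["B", "A", "B", "C"]
rewards = outcome_rewards(sampled, gold="B")
advantages = group_relative_advantages(rewards)
```

Correct samples end up with positive advantages and incorrect ones with negative advantages; a group in which every sample earns the same reward yields all-zero advantages and so contributes no learning signal.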

[3] Not All Visitors are Bilingual: A Measurement Study of the Multilingual Web from an Accessibility Perspective

Masudul Hasan Masud Bhuiyan, Matteo Varvello, Yasir Zaki, Cristian-Alexandru Staicu

Main category: cs.CL

TL;DR: LangCrUX dataset reveals widespread multilingual web accessibility issues where language hints often don’t match visible content, reducing screen reader effectiveness. Kizuki extension proposed for automated testing.

Details

Motivation: Multilingual web content creates barriers for visually impaired users as screen readers lack robust non-Latin script support and misrender non-English text, but large-scale studies have been limited by dataset availability.

Method: Created LangCrUX dataset of 120,000 popular websites across 12 non-Latin script languages, then conducted systematic analysis of multilingual web accessibility and language hint consistency.

Result: Found widespread neglect of accessibility hints that fail to reflect language diversity of visible content, reducing screen reader effectiveness and limiting web accessibility.

Conclusion: Proposed Kizuki, a language-aware automated accessibility testing extension to address the limited utility of language-inconsistent accessibility hints in multilingual web environments.

Abstract: English is the predominant language on the web, powering nearly half of the world’s top ten million websites. Support for multilingual content is nevertheless growing, with many websites increasingly combining English with regional or native languages in both visible content and hidden metadata. This multilingualism introduces significant barriers for users with visual impairments, as assistive technologies like screen readers frequently lack robust support for non-Latin scripts and misrender or mispronounce non-English text, compounding accessibility challenges across diverse linguistic contexts. Yet, large-scale studies of this issue have been limited by the lack of comprehensive datasets on multilingual web content. To address this gap, we introduce LangCrUX, the first large-scale dataset of 120,000 popular websites across 12 languages that primarily use non-Latin scripts. Leveraging this dataset, we conduct a systematic analysis of multilingual web accessibility and uncover widespread neglect of accessibility hints. We find that these hints often fail to reflect the language diversity of visible content, reducing the effectiveness of screen readers and limiting web accessibility. We finally propose Kizuki, a language-aware automated accessibility testing extension to account for the limited utility of language-inconsistent accessibility hints.
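
One simple form of the consistency check described above can be sketched as follows: compare the script implied by a declared language code with the dominant script of the visible text. This is only an illustrative approximation, not the Kizuki implementation; the `EXPECTED_SCRIPT` table and both function names are hypothetical.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from language codes to the script they are usually written in.
EXPECTED_SCRIPT = {
    "en": "LATIN",
    "ru": "CYRILLIC",
    "ar": "ARABIC",
    "hi": "DEVANAGARI",
    "bn": "BENGALI",
}

def dominant_script(text):
    """Return the most common script among the letters of `text`, or None."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # Unicode character names start with the script, e.g. "CYRILLIC SMALL LETTER PE".
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None

def hint_matches_content(declared_lang, visible_text):
    """True/False if the declared language fits the visible script, None if unknown."""
    expected = EXPECTED_SCRIPT.get(declared_lang.split("-")[0].lower())
    actual = dominant_script(visible_text)
    if expected is None or actual is None:
        return None
    return expected == actual
```

A page that declares `en` but shows Cyrillic text would fail this check, which is the kind of mismatch that can lead a screen reader to pick the wrong voice. Script detection is a coarse proxy: it cannot tell apart languages that share a script.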

[4] Language-Specific Layer Matters: Efficient Multilingual Enhancement for Large Vision-Language Models

Yuchun Fan, Yilin Wang, Yongyu Mu, Lei Huang, Bei Li, Xiaocheng Feng, Tong Xiao, Jingbo Zhu

Main category: cs.CL

TL;DR: PLAST is a training method that efficiently enhances multilingual capabilities in large vision-language models by identifying and fine-tuning only language-specific layers (14% of parameters) using question-translation pairs.

Details

Motivation: Large vision-language models exhibit imbalanced multilingual capabilities despite strong visual understanding, with a correlation found between multilingual ability and language-specific neuron activations in shallow layers.

Method: PLAST identifies layers involved in multilingual understanding by monitoring language-specific neuron activations, then precisely fine-tunes these layers using question-translation pairs to achieve multilingual alignment.

Result: PLAST significantly improves multilingual capabilities on MM-Bench and MMMB benchmarks while being highly efficient, and generalizes well to low-resource and complex visual reasoning tasks.

Conclusion: The method effectively facilitates language-specific visual information engagement in shallow layers, providing an efficient approach to enhance multilingual performance in vision-language models.

Abstract: Large vision-language models (LVLMs) have demonstrated exceptional capabilities in understanding visual information with human languages but also exhibit an imbalance in multilingual capabilities. In this work, we delve into the multilingual working pattern of LVLMs and identify a salient correlation between the multilingual understanding ability of LVLMs and language-specific neuron activations in shallow layers. Building on this insight, we introduce PLAST, a training recipe that achieves efficient multilingual enhancement for LVLMs by Precise LAnguage-Specific layers fine-Tuning. PLAST first identifies layers involved in multilingual understanding by monitoring language-specific neuron activations. These layers are then precisely fine-tuned with question-translation pairs to achieve multilingual alignment. Our empirical results on MM-Bench and MMMB demonstrate that PLAST effectively improves the multilingual capabilities of LVLMs and achieves significant efficiency with only 14% of the parameters tuned. Further analysis reveals that PLAST can be generalized to low-resource and complex visual reasoning tasks, facilitating the language-specific visual information engagement in shallow layers.

[5] Tailored Teaching with Balanced Difficulty: Elevating Reasoning in Multimodal Chain-of-Thought via Prompt Curriculum

Xinglong Yang, Quan Feng, Zhongying Pan, Xiang Chen, Yu Tian, Wentong Li, Shuofei Qiao, Yuxia Geng, Xingyu Zhao, Sheng-Jun Huang

Main category: cs.CL

TL;DR: Proposes a novel framework for selecting multimodal chain-of-thought examples using difficulty-balanced sampling based on model-perceived and intrinsic complexity, improving performance stability.

Details

Motivation: Random or manual example selection in MCoT prompting fails to account for model-specific knowledge and task complexity, leading to suboptimal and unstable performance.

Method: Reframes prompt selection as curriculum design using two signals: model-perceived difficulty (prediction disagreement) and intrinsic sample complexity, with difficulty-balanced sampling strategy.

Result: Extensive experiments on five benchmarks and multiple MLLMs show substantial, consistent improvements and greatly reduced performance discrepancies caused by random sampling.

Conclusion: Provides a principled and robust approach for enhancing multimodal reasoning through difficulty-aware prompt curriculum design.

Abstract: The effectiveness of Multimodal Chain-of-Thought (MCoT) prompting is often limited by the use of randomly or manually selected examples. These examples fail to account for both model-specific knowledge distributions and the intrinsic complexity of the tasks, resulting in suboptimal and unstable model performance. To address this, we propose a novel framework inspired by the pedagogical principle of “tailored teaching with balanced difficulty”. We reframe prompt selection as a prompt curriculum design problem: constructing a well ordered set of training examples that align with the model’s current capabilities. Our approach integrates two complementary signals: (1) model-perceived difficulty, quantified through prediction disagreement in an active learning setup, capturing what the model itself finds challenging; and (2) intrinsic sample complexity, which measures the inherent difficulty of each question-image pair independently of any model. By jointly analyzing these signals, we develop a difficulty-balanced sampling strategy that ensures the selected prompt examples are diverse across both dimensions. Extensive experiments conducted on five challenging benchmarks and multiple popular Multimodal Large Language Models (MLLMs) demonstrate that our method yields substantial and consistent improvements and greatly reduces performance discrepancies caused by random sampling, providing a principled and robust approach for enhancing multimodal reasoning.
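
The two signals and the balanced selection can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' procedure: disagreement is approximated as one minus the share of the most common prediction, both scores are assumed to lie in [0, 1], and the 2x2 binning with a 0.5 threshold is a hypothetical choice.

```python
import random
from collections import Counter, defaultdict

def disagreement(predictions):
    """Proxy for model-perceived difficulty: 1 - share of the most common prediction."""
    top_votes = Counter(predictions).most_common(1)[0][1]
    return 1.0 - top_votes / len(predictions)

def balanced_sample(examples, k, threshold=0.5, seed=0):
    """Pick k example ids spread across (perceived, intrinsic) difficulty cells.

    examples: list of (example_id, perceived_difficulty, intrinsic_complexity).
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for ex_id, perceived, intrinsic in examples:
        cells[(perceived >= threshold, intrinsic >= threshold)].append(ex_id)
    for ids in cells.values():
        rng.shuffle(ids)
    chosen = []
    # Round-robin over the cells so every difficulty combination is represented.
    while len(chosen) < k and any(cells.values()):
        for key in sorted(cells):
            if cells[key] and len(chosen) < k:
                chosen.append(cells[key].pop())
    return chosen
```

With two candidates in each of the four cells and `k=4`, the function returns one example per cell, so the prompt mixes easy and hard items along both dimensions instead of depending on a single random draw.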

[6] Backprompting: Leveraging Synthetic Production Data for Health Advice Guardrails

Kellen Tan Cheng, Anna Lisa Gentile, Chad DeLuca, Guang-Jie Ren

Main category: cs.CL

TL;DR: Backprompting method generates production-like labeled data for health advice guardrails, improving detector performance with sparse human labeling.

Details

Motivation: Address the challenge of acquiring production-quality labeled data for LLM guardrails development, particularly for health advice detection where real LLM outputs are scarce before deployment.

Method: Propose backprompting to generate synthetic LLM outputs resembling production data, combined with sparse human-in-the-loop clustering for labeling. Augment existing datasets with synthetic examples to create robust training data.

Result: The detector outperforms GPT-4o by up to 3.73% in identifying health advice in LLM outputs, despite having 400x fewer parameters.

Conclusion: Backprompting with human-in-the-loop clustering effectively generates production-like training data, enabling development of efficient and robust guardrail detectors for LLM safety applications.

Abstract: The pervasiveness of large language models (LLMs) in enterprise settings has also brought forth a significant number of risks associated with their usage. Guardrails technologies aim to mitigate this risk by filtering LLMs' input/output text through various detectors. However, developing and maintaining robust detectors faces many challenges, one of which is the difficulty in acquiring production-quality labeled data on real LLM outputs prior to deployment. In this work, we propose backprompting, a simple yet intuitive solution to generate production-like labeled data for health advice guardrails development. Furthermore, we pair our backprompting method with a sparse human-in-the-loop clustering technique to label the generated data. Our aim is to construct a parallel corpus roughly representative of the original dataset yet resembling real LLM output. We then infuse existing datasets with our synthetic examples to produce robust training data for our detector. We test our technique in one of the most difficult and nuanced guardrails: the identification of health advice in LLM output, and demonstrate improvement versus other solutions. Our detector is able to outperform GPT-4o by up to 3.73%, despite having 400x fewer parameters.

[7] Integral Transformer: Denoising Attention, Not Too Much Not Too Little

Ivan Kobyzev, Abbas Ghaddar, Dingtao Hu, Boxing Chen

Main category: cs.CL

TL;DR: The Integral Transformer addresses attention noise in softmax self-attention by integrating signals from logit distribution, preserving useful special tokens while denoising attention.

Details

Motivation: Softmax self-attention often assigns disproportionate weight to semantically uninformative tokens like special tokens and punctuation (attention noise), and existing solutions risk discarding useful information.

Method: Proposes Integral Transformer with novel self-attention mechanism that denoises attention by integrating signals sampled from the logit distribution.

Result: Outperforms vanilla, Cog, and Differential attention variants on knowledge and reasoning benchmarks; balances attention distributions and reduces rank collapse in upper layers.

Conclusion: Integral Transformer effectively mitigates attention noise while preserving critical special tokens, with analysis showing vanilla self-attention works better in lower layers and the proposed method excels in upper layers.

Abstract: Softmax self-attention often assigns disproportionate weight to semantically uninformative tokens such as special tokens and punctuation, a phenomenon known as attention noise. While recent methods like Cog Attention and the Differential Transformer have addressed this by introducing negative attention scores, they risk discarding useful information. In this paper, we propose the Integral Transformer, a novel self-attention mechanism that denoises attention by integrating signals sampled from the logit distribution. Our approach mitigates noise while preserving the contributions of special tokens critical for model performance. Extensive experiments demonstrate that our model outperforms vanilla, Cog, and Differential attention variants on well-established knowledge and reasoning language benchmarks. Moreover, our analysis reveals that employing vanilla self-attention in the lower Transformer layers enhances performance and that the Integral Transformer effectively balances attention distributions and reduces rank collapse in upper layers.

[8] Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning

Jeong-seok Oh, Jay-yoon Lee

Main category: cs.CL

TL;DR: Latent Self-Consistency (LSC) is a new method that uses learnable token embeddings to select the most semantically consistent response, outperforming existing consistency methods on both short-form and long-form reasoning benchmarks with minimal computational overhead.

Details

Motivation: Existing probabilistic decoding in LLMs produces inconsistent outputs, especially on complex or long-form questions. Current methods like Self-Consistency work for short-form QA but fail on long-form, while Universal SC and WUCS extend to long-form but lose accuracy on short-form benchmarks.

Method: LSC selects the most semantically consistent response using learnable token embeddings. It uses a lightweight forward generation of summary tokens that increases inference time by less than 1% and requires no changes to model architecture.

Result: Across 6 short-form and 5 long-form reasoning benchmarks (e.g., MATH, MMLU, TruthfulQA), LSC on average surpasses SC, USC and WUCS on both the short-form and the long-form sets while maintaining negligible computational overhead. It also provides well-calibrated confidence estimates with low Expected Calibration Error.

Conclusion: LSC is a practical consistency-selection method that works reliably across both short-form and long-form answer formats with minimal computational cost, positioning it as an effective solution for improving output consistency in LLMs.

Abstract: Probabilistic decoding in Large Language Models (LLMs) often yields inconsistent outputs, particularly on complex or long-form questions. Self-Consistency (SC) mitigates this for short-form QA by majority voting over exact strings, whereas Universal Self-Consistency (USC) and Weighted Unigram Consistency Score (WUCS) extend to long-form responses but lose accuracy on short-form benchmarks. We introduce Latent Self-Consistency (LSC), which selects the most semantically consistent response using learnable token embeddings. A lightweight forward generation of summary tokens increases inference time by less than 1% and requires no changes to the model architecture. Across 6 short-form and 5 long-form reasoning benchmarks (e.g., MATH, MMLU, TruthfulQA), LSC surpasses SC, USC and WUCS on all short-form and long-form ones on average, while maintaining negligible computational overhead. These results position LSC as a practical consistency-selection method that works reliably across answer formats. Additionally, LSC provides well-calibrated confidence estimates, maintaining low Expected Calibration Error across both answer formats.
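
To make the contrast concrete, the sketch below shows the exact-string majority vote used by Self-Consistency next to a similarity-based selection that also works when no two long answers match verbatim. The second function is only a stand-in for semantic consistency: it assumes response embeddings are already available as plain vectors, whereas LSC obtains them from learnable summary tokens, which are not reproduced here. All function names are hypothetical.

```python
import math
from collections import Counter

def majority_vote(answers):
    """Self-Consistency baseline: most frequent exact answer and its vote share."""
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_consistent(embeddings):
    """Index of the response that is on average closest to all others (needs >= 2)."""
    n = len(embeddings)

    def avg_similarity(i):
        return sum(cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i) / (n - 1)

    return max(range(n), key=avg_similarity)
```

Exact-string voting is well defined for short answers such as "42", but for long-form responses every string tends to be unique, so each gets a single vote; selecting by similarity in an embedding space avoids that failure mode.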

[9] Can Out-of-Distribution Evaluations Uncover Reliance on Shortcuts? A Case Study in Question Answering

Michal Štefánik, Timothee Mickus, Marek Kadlčík, Michal Spiegel, Josef Kuchař

Main category: cs.CL

TL;DR: This paper challenges the assumption that out-of-distribution (OOD) evaluations reliably capture real-world failure modes in AI models, showing that OOD datasets for question answering vary widely in how well they estimate robustness to spurious shortcuts.

Details

Motivation: The authors question whether current OOD evaluation methods effectively reflect real-world deployment failures, particularly focusing on spurious feature reliance in question-answering models.

Method: The study compares results from OOD evaluations with documented failure modes in QA models, specifically examining reliance on spurious features and prediction shortcuts across different datasets.

Result: The research found that different OOD datasets provide vastly different quality estimates of model robustness to shortcuts, with some performing worse than simple in-distribution evaluations. Spurious shortcuts were found to be shared across ID and OOD datasets.

Conclusion: The work highlights limitations of commonly-used OOD evaluations for generalization assessment and provides methodology and recommendations for more robust evaluation of generalization within and beyond QA systems.

Abstract: A majority of recent work in AI assesses models’ generalization capabilities through the lens of performance on out-of-distribution (OOD) datasets. Despite their practicality, such evaluations build upon a strong assumption: that OOD evaluations can capture and reflect upon possible failures in a real-world deployment. In this work, we challenge this assumption and confront the results obtained from OOD evaluations with a set of specific failure modes documented in existing question-answering (QA) models, referred to as a reliance on spurious features or prediction shortcuts. We find that different datasets used for OOD evaluations in QA provide an estimate of models’ robustness to shortcuts that have a vastly different quality, some largely under-performing even a simple, in-distribution evaluation. We partially attribute this to the observation that spurious shortcuts are shared across ID+OOD datasets, but also find cases where a dataset’s quality for training and evaluation is largely disconnected. Our work underlines limitations of commonly-used OOD-based evaluations of generalization, and provides methodology and recommendations for evaluating generalization within and beyond QA more robustly.

[10] How Reliable are LLMs for Reasoning on the Re-ranking task?

Nafis Tanveer Islam, Zhiming Zhao

Main category: cs.CL

TL;DR: Analysis of how different LLM training methods affect semantic understanding and explainability in re-ranking tasks, particularly in data-scarce domains like environmental science.

Details

Motivation: LLMs show improved human value alignment but lack transparency, and face challenges in re-ranking tasks with limited training data, raising questions about their true reliability.

Method: Utilize a small ranking dataset from environmental and Earth science domains to analyze LLM re-ranking performance and examine explainable reasoning generated by different training methods.

Result: Some training methods demonstrate better explainability than others, suggesting that not all methods achieve accurate semantic understanding - some may only learn abstract knowledge for evaluation optimization.

Conclusion: Training methodology significantly impacts LLM explainability and semantic understanding in re-ranking tasks, highlighting the need for transparent reasoning capabilities especially in data-limited scenarios.

Abstract: With the improving semantic understanding capability of Large Language Models (LLMs), they exhibit a greater awareness and alignment with human values, but this comes at the cost of transparency. Although promising results are achieved via experimental analysis, an in-depth understanding of the LLM’s internal workings is unavoidable to comprehend the reasoning behind the re-ranking, which provides end users with an explanation that enables them to make an informed decision. Moreover, in newly developed systems with limited user engagement and insufficient ranking data, accurately re-ranking content remains a significant challenge. While various training methods affect the training of LLMs and generate inference, our analysis has found that some training methods exhibit better explainability than others, implying that an accurate semantic understanding has not been learned through all training methods; instead, abstract knowledge has been gained to optimize evaluation, which raises questions about the true reliability of LLMs. Therefore, in this work, we analyze how different training methods affect the semantic understanding of the re-ranking task in LLMs and investigate whether these models can generate more informed textual reasoning to overcome the challenges of transparency of LLMs and limited training data. To analyze the LLMs for re-ranking tasks, we utilize a relatively small ranking dataset from the environment and the Earth science domain to re-rank retrieved content. Furthermore, we also analyze the explainable information to see if the re-ranking can be reasoned using explainability.

[11] Emotion Omni: Enabling Empathetic Speech Response Generation through Large Language Models

Haoyu Wang, Guangyan Zhang, Jiale Chen, Jingyu Li, Yuehai Wang, Yiwen Guo

Main category: cs.CL

TL;DR: Emotion Omni is a novel speech LLM architecture that understands emotional cues in user speech and generates empathetic responses without requiring massive training datasets.

Details

Motivation: Existing speech LLMs lack emotional understanding of user queries, failing to capture different meanings conveyed through emotional expression, which is crucial for human-machine interaction. Current empathetic models require massive datasets and computational resources.

Method: Proposed Emotion Omni model architecture with a data generation pipeline using an open-source TTS framework to create a 200k emotional dialogue dataset, enabling empathetic speech assistant development with limited data.

Result: Developed a functional empathetic speech assistant capable of understanding emotional content in user speech and generating appropriate empathetic responses without large-scale training requirements.

Conclusion: The Emotion Omni approach successfully addresses the challenge of building empathetic speech LLMs with limited data, providing a more efficient alternative to massive dataset training while improving emotional understanding in human-machine interactions.

Abstract: With the development of speech large language models (speech LLMs), users can now interact directly with assistants via speech. However, most existing models simply convert the response content into speech without fully understanding the rich emotional and paralinguistic cues embedded in the user’s query. In many cases, the same sentence can have different meanings depending on the emotional expression. Furthermore, emotional understanding is essential for improving user experience in human-machine interaction. Currently, most speech LLMs with empathetic capabilities are trained on massive datasets. This approach requires vast amounts of data and significant computational resources. Therefore, a key challenge lies in how to develop a speech LLM capable of generating empathetic responses with limited data and without the need for large-scale training. To address this challenge, we propose Emotion Omni, a novel model architecture designed to understand the emotional content of user speech input and generate empathetic speech responses. Additionally, we developed a data generation pipeline based on an open-source TTS framework to construct a 200k emotional dialogue dataset, which supports the construction of an empathetic speech assistant. The demos are available at https://w311411.github.io/omni_demo/

[12] Integrating gender inclusivity into large language models via instruction tuning

Alina Wróblewska, Bartosz Żuk

Main category: cs.CL

TL;DR: This study addresses masculine bias in Polish language models by tuning LLMs with gender-inclusive guidelines using the IPIS dataset to create more balanced outputs.

Details

Motivation: Polish language predominantly uses masculine forms for all genders due to historical conventions, causing LLMs to inherit and reinforce gender bias in their outputs.

Method: Used IPIS dataset (human-crafted gender-inclusive proofreading) to tune multilingual LLMs (Llama-8B, Mistral-7B, Mistral-Nemo) and Polish-specific models (Bielik, PLLuM) with explicit gender-inclusive guidelines in system prompts.

Result: The approach aims to make gender inclusivity an inherent feature of the tuned models and to systematically mitigate gender bias in Polish language generation; the abstract reports no quantitative results.

Conclusion: This research provides a systematic solution to address masculine bias in Polish LLMs through targeted tuning with gender-inclusive guidelines, making gender inclusivity a built-in feature rather than an afterthought.

Abstract: Imagine a language with masculine, feminine, and neuter grammatical genders, yet, due to historical and political conventions, masculine forms are predominantly used to refer to men, women and mixed-gender groups. This is the reality of contemporary Polish. A social consequence of this unfair linguistic system is that large language models (LLMs) trained on Polish texts inherit and reinforce this masculine bias, generating gender-imbalanced outputs. This study addresses this issue by tuning LLMs using the IPIS dataset, a collection of human-crafted gender-inclusive proofreading in Polish and Polish-to-English translation instructions. Grounded in a theoretical linguistic framework, we design a system prompt with explicit gender-inclusive guidelines for Polish. In our experiments, we IPIS-tune multilingual LLMs (Llama-8B, Mistral-7B and Mistral-Nemo) and Polish-specific LLMs (Bielik and PLLuM). Our approach aims to integrate gender inclusivity as an inherent feature of these models, offering a systematic solution to mitigate gender bias in Polish language generation.

[13] Principled Detection of Hallucinations in Large Language Models via Multiple Testing

Jiawei Li, Akshayaa Magesh, Venugopal V. Veeravalli

Main category: cs.CL

TL;DR: A multiple-testing-inspired method for detecting hallucinations in LLMs by framing it as a hypothesis testing problem similar to out-of-distribution detection.

Details

Motivation: Large Language Models are prone to generating confident but incorrect responses (hallucinations), creating a need for reliable detection methods.

Method: Formulates hallucination detection as a hypothesis testing problem and proposes a multiple-testing-inspired approach.

Result: Extensive experimental validation shows the proposed method is robust and outperforms state-of-the-art approaches.

Conclusion: The multiple-testing framework provides an effective solution for detecting hallucinations in LLM outputs.

Abstract: While Large Language Models (LLMs) have emerged as powerful foundational models to solve a variety of tasks, they have also been shown to be prone to hallucinations, i.e., generating responses that sound confident but are actually incorrect or even nonsensical. In this work, we formulate the problem of detecting hallucinations as a hypothesis testing problem and draw parallels to the problem of out-of-distribution detection in machine learning models. We propose a multiple-testing-inspired method to solve the hallucination detection problem, and provide extensive experimental results to validate the robustness of our approach against state-of-the-art methods.
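
The abstract does not spell out the test, so the following is only a generic sketch of what a multiple-testing detector can look like, not the authors' procedure. It assumes several uncertainty scores per response (higher meaning more suspicious), turns each into a conformal p-value against calibration scores from responses known to be correct, and then applies a Benjamini-Hochberg-style threshold. The function names and the default `alpha` are hypothetical.

```python
def conformal_p_value(score, calibration_scores):
    """p-value of `score` under the null 'response is fine'; higher score = more suspicious."""
    n = len(calibration_scores)
    at_least_as_extreme = sum(c >= score for c in calibration_scores)
    return (1 + at_least_as_extreme) / (n + 1)

def flag_hallucination(p_values, alpha=0.1):
    """Flag the response if any sorted p-value p_(i) falls below alpha * i / m."""
    m = len(p_values)
    ranked = sorted(p_values)
    return any(p <= alpha * (i + 1) / m for i, p in enumerate(ranked))
```

Combining several scores this way lets one strongly anomalous signal trigger a flag while still controlling the false-alarm level across all the scores that were tested.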

[14] VibeVoice Technical Report

Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, Furu Wei

Main category: cs.CL

TL;DR: VibeVoice is a novel speech synthesis model that uses next-token diffusion to generate long-form multi-speaker conversations with up to 90 minutes of speech and 4 speakers, enabled by an 80x more efficient continuous speech tokenizer.

Details

Motivation: To address the challenge of synthesizing authentic long-form conversational speech with multiple speakers while maintaining computational efficiency and audio fidelity.

Method: Uses next-token diffusion for autoregressive latent vector generation and introduces a novel continuous speech tokenizer that achieves 80x better compression than Encodec while preserving audio quality.

Result: Can synthesize speech up to 90 minutes long with 4 speakers in a 64K context window, capturing authentic conversational vibe and outperforming both open-source and proprietary dialogue models.

Conclusion: VibeVoice successfully demonstrates that next-token diffusion with efficient tokenization enables high-quality long-form multi-speaker speech synthesis while significantly improving computational efficiency.

Abstract: This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, which is a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VibeVoice can synthesize long-form speech for up to 90 minutes (in a 64K context window length) with a maximum of 4 speakers, capturing the authentic conversational “vibe” and surpassing open-source and proprietary dialogue models.
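As a rough sanity check on these figures, the 90-minute limit and the 64K context window together imply an upper bound on how many tokens the model can spend per second of audio. The sketch below assumes that 64K means 65,536 tokens and that the whole window holds only speech tokens, which ignores text and speaker prompts. Neither assumption comes from the report.

```python
# Assumption: "64K" is read as 65,536 tokens; the report does not spell this out.
CONTEXT_TOKENS = 64 * 1024
# 90 minutes of audio, as stated in the abstract.
AUDIO_SECONDS = 90 * 60


def max_tokens_per_second(context_tokens: int, audio_seconds: int) -> float:
    """Upper bound on the token rate if the whole window held only speech tokens."""
    return context_tokens / audio_seconds


rate = max_tokens_per_second(CONTEXT_TOKENS, AUDIO_SECONDS)
print(f"at most {rate:.2f} tokens per second of audio")
```

Under these assumptions the budget is roughly 12 tokens per second of audio, which shows why a highly compressive tokenizer is a precondition for fitting such long sequences into one window.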

[15] COMET-poly: Machine Translation Metric Grounded in Other Candidates

Maike Züfle, Vilém Zouhar, Tu Anh Dinh, Felipe Maia Polo, Jan Niehues, Mrinmaya Sachan

Main category: cs.CL

TL;DR: Proposes two new automated metrics (COMET-polycand and COMET-polyic) that incorporate multiple translations or similar examples to improve machine translation evaluation performance.

Details

Motivation: Current automated metrics only consider source sentence and single translation, unlike humans who assess translations in context of multiple alternatives, which may negatively impact metric performance.

Method: COMET-polycand uses alternative translations of same source sentence for comparison. COMET-polyic uses translations of similar source texts with human-labeled quality scores for in-context learning.

Result: Adding single additional translation improved segment-level performance (0.079 to 0.118 Kendall’s tau-b correlation), with further gains from more translations. Retrieved examples in COMET-polyic yielded similar improvements (0.079 to 0.116 correlation).

Conclusion: Incorporating multiple translations or similar examples significantly improves automated translation evaluation metrics, better replicating human judgment processes.

Abstract: Automated metrics for machine translation attempt to replicate human judgment. Unlike humans, who often assess a translation in the context of multiple alternatives, these metrics typically consider only the source sentence and a single translation. This discrepancy in the evaluation setup may negatively impact the performance of automated metrics. We propose two automated metrics that incorporate additional information beyond the single translation. COMET-polycand uses alternative translations of the same source sentence to compare and contrast with the translation at hand, thereby providing a more informed assessment of its quality. COMET-polyic, inspired by retrieval-based in-context learning, takes in translations of similar source texts along with their human-labeled quality scores to guide the evaluation. We find that including a single additional translation in COMET-polycand improves the segment-level metric performance (0.079 to 0.118 Kendall’s tau-b correlation), with further gains when more translations are added. Incorporating retrieved examples in COMET-polyic yields similar improvements (0.079 to 0.116 Kendall’s tau-b correlation). We release our models publicly.
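The correlations quoted above are Kendall's tau-b values between metric scores and human judgments. For readers unfamiliar with the statistic, here is a minimal pure-Python sketch of tau-b with tie handling. The sample scores are made up for illustration and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty)),
    where Tx and Ty count pairs tied only in x or only in y."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: contributes to neither term
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x) * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical human ratings and metric scores for four translations.
human = [1, 2, 2, 3]
metric = [0.1, 0.4, 0.3, 0.9]
print(round(kendall_tau_b(human, metric), 3))
```

Tau-b is the usual choice at segment level because human ratings contain many ties, which the plain tau-a statistic does not account for.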

[16] The Mind’s Eye: A Multi-Faceted Reward Framework for Guiding Visual Metaphor Generation

Girish A. Koushik, Fatemeh Nazarieh, Katherine Birch, Shenbin Qian, Diptesh Kanojia

Main category: cs.CL

TL;DR: A self-evaluating visual metaphor generation framework with two novel approaches: training-free S-T-M decomposition prompting and training-based alignment improvement using self-evaluation rewards, achieving strong performance on metaphor alignment metrics.

Details

Motivation: Visual metaphor generation requires understanding language to bind source and target concepts while maintaining visual coherence, which is challenging for current systems.

Method: Proposed framework combines training-free pipeline with explicit source-target-meaning (S-T-M) decomposition for image synthesis, and training-based pipeline using self-evaluation reward schema for improved alignment without large-scale retraining.

Result: Training-free approach surpassed GPT-4o and Imagen on decomposition, CLIP, and meaning alignment scores. User study showed GPT-4o preferred overall, but training-free pipeline led open-source methods and beat Imagen on abstract metaphors.

Conclusion: Structured prompting and lightweight reinforcement learning effectively perform metaphor alignment with modest compute, with remaining gaps to human preference driven by aesthetics and sampling sensitivity.

Abstract: Visual metaphor generation is a challenging task that aims to generate an image given an input text metaphor. Inherently, it needs language understanding to bind a source concept with a target concept, in a way that preserves meaning while ensuring visual coherence. We propose a self-evaluating visual metaphor generation framework that focuses on metaphor alignment. Our self-evaluation approach combines existing metrics with our newly proposed metaphor decomposition score and a meaning alignment (MA) metric. Within this setup, we explore two novel approaches: a training-free pipeline that explicitly decomposes prompts into source-target-meaning (S-T-M) mapping for image synthesis, and a complementary training-based pipeline that improves alignment using our proposed self-evaluation reward schema, without any large-scale retraining. On the held-out test set, the training-free approach surpasses strong closed baselines (GPT-4o, Imagen) on decomposition, CLIP, and MA scores, with the training-based approach close behind. We evaluate our framework output using a user-facing study, and observed that participants preferred GPT-4o overall, while our training-free pipeline led open-source methods and edged Imagen on abstract metaphors. Our analyses show S-T-M prompting helps longer or more abstract metaphors, with closed models excelling on short, concrete cases; we also observe sensitivity to sampler settings. Overall, structured prompting and lightweight RL perform metaphor alignment well under modest compute, and remaining gaps to human preference appear driven by aesthetics and sampling.

[17] What do language models model? Transformers, automata, and the format of thought

Colin Klein

Main category: cs.CL

TL;DR: LLMs model corpus patterns, not human cognition, due to architectural differences in computational formats between transformers and human language processing.

Details

Motivation: To clarify whether large language models actually model human cognitive capacities or simply reflect patterns in their training corpora, addressing debates about what LLMs truly represent.

Method: Analyzes computational architecture invariants of transformers, contrasting linear processing formats with human supralinear computation, and examines Liu et al.’s shortcut automata concept.

Result: Transformers operate with linear computational formats unlike human supralinear processing, suggesting they model corpus patterns rather than human cognitive capabilities.

Conclusion: The corpus-modelling account is not deflationary: language also works as a ‘discourse machine’ that generates new language from context, and LLMs have learned to use it through very different means than humans.

Abstract: What do large language models actually model? Do they tell us something about human capacities, or are they models of the corpus we’ve trained them on? I give a non-deflationary defence of the latter position. Cognitive science tells us that linguistic capabilities in humans rely on supralinear formats for computation. The transformer architecture, by contrast, supports at best linear formats for processing. This argument will rely primarily on certain invariants of the computational architecture of transformers. I then suggest a positive story about what transformers are doing, focusing on Liu et al. (2022)’s intriguing speculations about shortcut automata. I conclude with why I don’t think this is a terribly deflationary story. Language is not (just) a means for expressing inner state but also a kind of ‘discourse machine’ that lets us make new language given appropriate context. We have learned to use this technology in one way; LLMs have learned to use it too, but via very different means.

[18] A New NMT Model for Translating Clinical Texts from English to Spanish

Rumeng Li, Xun Wang, Hong Yu

Main category: cs.CL

TL;DR: NOOV is a neural machine translation system that translates English EHR narratives to Spanish with minimal parallel training data, using bilingual lexicons and biomedical phrase tables to handle unknown words and improve translation quality.

Details

Motivation: Translating electronic health records from English to Spanish is clinically important but challenging due to lack of parallel-aligned corpora and abundant unknown medical terms.

Method: Proposes NOOV system that integrates automatically learned bilingual lexicons from parallel corpora and phrase look-up tables from large biomedical knowledge resources to address unknown words and word-repeat challenges.

Result: Evaluation shows NOOV generates better EHR translations with improvements in both accuracy and fluency compared to existing approaches.

Conclusion: NOOV effectively addresses the challenges of EHR translation with minimal parallel data requirements, demonstrating superior performance in medical text translation tasks.

Abstract: Translating electronic health record (EHR) narratives from English to Spanish is a clinically important yet challenging task due to the lack of a parallel-aligned corpus and the abundance of unknown words. To address such challenges, we propose NOOV (for No OOV), a new neural machine translation (NMT) system that requires little in-domain parallel-aligned corpus for training. NOOV integrates a bilingual lexicon automatically learned from parallel-aligned corpora and a phrase look-up table extracted from a large biomedical knowledge resource, to alleviate both the unknown word problem and the word-repeat challenge in NMT, improving phrase generation in NMT systems. Evaluation shows that NOOV is able to generate better translations of EHR narratives, with improvements in both accuracy and fluency.

[19] EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Spoken Dialogue Systems

Jingwen Liu, Kan Jen Cheng, Jiachen Lian, Akshay Anand, Rishi Jain, Faith Qiao, Robin Netzorg, Huang-Cheng Chou, Tingle Li, Guan-Ting Lin, Gopala Anumanchipalli

Main category: cs.CL

TL;DR: EMO-Reasoning benchmark for evaluating emotional coherence in dialogue systems using TTS-generated emotional speech data and cross-turn emotion reasoning metrics.

Details

Motivation: Address the lack of holistic evaluation systems for emotional reasoning in spoken dialogue systems despite recent advances in human-computer interaction.

Method: Created curated dataset via text-to-speech to simulate diverse emotional states, proposed Cross-turn Emotion Reasoning Score to assess emotion transitions, and evaluated seven dialogue systems using continuous, categorical, and perceptual metrics.

Result: The framework effectively detects emotional inconsistencies in dialogue systems and provides insights for improvement.

Conclusion: The released systematic evaluation benchmark aims to advance emotion-aware spoken dialogue modeling for more natural and adaptive human-computer interactions.

Abstract: Speech emotions play a crucial role in human-computer interaction, shaping engagement and context-aware communication. Despite recent advances in spoken dialogue systems, a holistic system for evaluating emotional reasoning is still lacking. To address this, we introduce EMO-Reasoning, a benchmark for assessing emotional coherence in dialogue systems. It leverages a curated dataset generated via text-to-speech to simulate diverse emotional states, overcoming the scarcity of emotional speech data. We further propose the Cross-turn Emotion Reasoning Score to assess the emotion transitions in multi-turn dialogues. Evaluating seven dialogue systems through continuous, categorical, and perceptual metrics, we show that our framework effectively detects emotional inconsistencies, providing insights for improving current dialogue systems. By releasing a systematic evaluation benchmark, we aim to advance emotion-aware spoken dialogue modeling toward more natural and adaptive interactions.

[20] Scaling Laws for Task-Stratified Knowledge in Post-Training Quantized Large Language Models

Chenxi Zhou, Pengfei Cao, Jiang Li, Jun Zhao, Kang Liu

Main category: cs.CL

TL;DR: This paper investigates how post-training quantization affects different LLM knowledge capabilities, finding that knowledge memorization is more sensitive to quantization parameters than knowledge utilization.

Details

Motivation: There's a lack of comprehensive understanding about how PTQ precisely impacts diverse LLM knowledge capabilities, and existing scaling laws often overlook PTQ-specific parameters and task-specific sensitivities.

Method: Conducted extensive empirical investigation to establish task-stratified scaling laws, disentangled LLM knowledge into memorization and utilization capabilities, and developed a unified quantitative framework incorporating model size, effective bit-width, calibration set size, and group size.

Result: Knowledge memorization exhibits markedly greater sensitivity to variations in effective bit-width, calibration set size, and model size compared to the more robust knowledge utilization.

Conclusion: The findings provide a fine-grained understanding of PTQ’s impact and offer guidance for developing knowledge-aware quantization strategies that can better preserve targeted cognitive functions.

Abstract: Large language models (LLMs) present significant deployment challenges due to their scale, with post-training quantization (PTQ) emerging as a practical compression solution. However, a comprehensive understanding of how PTQ precisely impacts diverse LLM knowledge capabilities remains elusive, and existing scaling laws for quantized models often overlook crucial PTQ-specific parameters and task-specific sensitivities. This paper addresses these gaps by conducting an extensive empirical investigation to establish task-stratified scaling laws. We disentangle LLM knowledge into memorization and utilization capabilities and develop a unified quantitative framework that incorporates model size, effective bit-width, calibration set size, and group size. Our central finding reveals that knowledge memorization exhibits markedly greater sensitivity to variations in effective bit-width, calibration set size, and model size compared to the more robust knowledge utilization. These findings offer a fine-grained understanding of PTQ’s impact and provide guidance for developing knowledge-aware quantization strategies that can better preserve targeted cognitive functions.
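To make two of the PTQ parameters concrete, the sketch below applies symmetric round-to-nearest quantization group by group and measures the reconstruction error. This is the simplest possible scheme, swapped in for illustration. It uses no calibration set and is not one of the methods studied in the paper. The weights are made up.

```python
def quantize_group(weights, bits):
    """Symmetric round-to-nearest quantization of one group; returns the
    dequantized values so the error can be measured directly."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / qmax  # one scale per group
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def quantize(weights, bits, group_size):
    """Quantize a flat weight list in consecutive groups of `group_size`."""
    out = []
    for start in range(0, len(weights), group_size):
        out.extend(quantize_group(weights[start:start + group_size], bits))
    return out


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical weights: four small values followed by four large ones.
weights = [0.01, -0.02, 0.03, 0.015, 1.0, -2.0, 3.0, 1.5]
for bits, group_size in [(4, 8), (4, 4), (8, 8)]:
    error = mean_squared_error(weights, quantize(weights, bits, group_size))
    print(f"bits={bits} group_size={group_size} mse={error:.6f}")
```

With a single group of eight, the large weights set the scale and the small ones collapse to zero. Smaller groups or more bits reduce that error, at the cost of storing more scales or more bits per weight.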

[21] Thinking Before You Speak: A Proactive Test-time Scaling Approach

Cong Li, Wenchang Chai, Hejun Wu, Yan Pan, Pengxu Wei, Liang Lin

Main category: cs.CL

TL;DR: TBYS framework inserts proactive insights between reasoning steps to bridge gaps in LLM training data, improving complex reasoning performance without human labeling or fine-tuning.

Details

Motivation: LLMs struggle with complex reasoning tasks due to missing inner thought processes in training data - humans think carefully but don't articulate their intentions and methodologies.

Method: Proposes Thinking Before You Speak (TBYS) framework that generates proactive insights between reasoning steps, with automated pipeline for collecting and filtering in-context examples.

Result: Experiments on challenging mathematical datasets verify the effectiveness of the TBYS approach.

Conclusion: Inserting proactive insights between reasoning steps effectively bridges the gap in LLM training data and improves complex reasoning capabilities.

Abstract: Large Language Models (LLMs) often exhibit deficiencies with complex reasoning tasks, such as maths, which we attribute to the discrepancy between human reasoning patterns and those presented in the LLMs’ training data. When dealing with complex problems, humans tend to think carefully before expressing solutions. However, they often do not articulate their inner thoughts, including their intentions and chosen methodologies. Consequently, critical insights essential for bridging reasoning steps may be absent in training data collected from human sources. To bridge this gap, we propose inserting insights between consecutive reasoning steps, which review the status and initiate the next reasoning steps. Unlike prior prompting strategies that rely on a single static prompt or a workflow of static prompts to facilitate reasoning, insights are proactively generated to guide reasoning processes. We implement our idea as a reasoning framework, named Thinking Before You Speak (TBYS), and design a pipeline for automatically collecting and filtering in-context examples for the generation of insights, which alleviates human labeling efforts and fine-tuning overheads. Experiments on challenging mathematical datasets verify the effectiveness of TBYS. Project website: https://gitee.com/jswrt/TBYS

[22] Breaking the Trade-Off Between Faithfulness and Expressiveness for Large Language Models

Chenxu Yang, Qingyi Si, Zheng Lin

Main category: cs.CL

TL;DR: CoDe framework breaks faithfulness-expressiveness trade-off in LLMs by dynamically integrating knowledge-grounded and parametric outputs using distribution divergence and confidence metrics.

Details

Motivation: Current LLMs struggle to integrate external knowledge while maintaining both faithfulness (fidelity to knowledge) and expressiveness (natural language quality), resulting in outputs that are either unsupported or unnatural.

Method: Collaborative Decoding (CoDe) - a plug-and-play approach that dynamically combines output probabilities from knowledge-grounded and parametric generation, guided by distribution divergence and model confidence. Includes knowledge-aware reranking to balance external and internal knowledge.

Result: Superior performance in enhancing faithfulness without compromising expressiveness across diverse LLMs and evaluation metrics, demonstrating effectiveness and generalizability.

Conclusion: CoDe successfully addresses the faithfulness-expressiveness trade-off through selective activation of relevant expressions and proper knowledge utilization, providing a practical solution for knowledge-grounded generation.

Abstract: Grounding responses in external knowledge represents an effective strategy for mitigating hallucinations in Large Language Models (LLMs). However, current LLMs struggle to seamlessly integrate knowledge while simultaneously maintaining faithfulness (or fidelity) and expressiveness, capabilities that humans naturally possess. This limitation results in outputs that either lack support from external knowledge, thereby compromising faithfulness, or appear overly verbose and unnatural, thus sacrificing expressiveness. In this work, to break the trade-off between faithfulness and expressiveness, we propose Collaborative Decoding (CoDe), a novel approach that dynamically integrates output probabilities generated with and without external knowledge. This integration is guided by distribution divergence and model confidence, enabling the selective activation of relevant and reliable expressions from the model’s internal parameters. Furthermore, we introduce a knowledge-aware reranking mechanism that prevents over-reliance on prior parametric knowledge while ensuring proper utilization of provided external information. Through comprehensive experiments, our plug-and-play CoDe framework demonstrates superior performance in enhancing faithfulness without compromising expressiveness across diverse LLMs and evaluation metrics, validating both its effectiveness and generalizability.
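The idea of mixing two next-token distributions under the guidance of divergence and confidence can be sketched as follows. The weighting rule here is hypothetical and chosen only for illustration; the abstract does not give the actual CoDe formula. The sketch assumes one distribution computed with external knowledge in the prompt and one computed without it.

```python
from math import log


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded in [0, 1]."""
    def kl(a, b):
        return sum(x * log(x / y, 2) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def collaborative_step(p_grounded, p_parametric):
    """Mix the two next-token distributions for one decoding step."""
    divergence = js_divergence(p_grounded, p_parametric)
    confidence = max(p_grounded)
    # Hypothetical rule: lean on the knowledge-grounded distribution when the
    # two disagree strongly or when the grounded prediction is confident.
    weight = max(divergence, confidence)
    mixed = [weight * g + (1 - weight) * q for g, q in zip(p_grounded, p_parametric)]
    total = sum(mixed)
    return [x / total for x in mixed]


# Hypothetical three-token vocabulary.
print(collaborative_step([0.7, 0.2, 0.1], [0.2, 0.5, 0.3]))
```

When the two distributions agree and the grounded one is unsure, more weight falls on the parametric distribution, which is one way to read the "selective activation" of internal expressions described in the abstract.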

[23] An Agentic System for Rare Disease Diagnosis with Traceable Reasoning

Weike Zhao, Chaoyi Wu, Yanjie Fan, Xiaoman Zhang, Pengcheng Qiu, Yuze Sun, Xiao Zhou, Yanfeng Wang, Xin Sun, Ya Zhang, Yongguo Yu, Kun Sun, Weidi Xie

Main category: cs.CL

TL;DR: DeepRare is an LLM-powered rare disease diagnosis system that processes heterogeneous clinical inputs to generate ranked diagnostic hypotheses with transparent reasoning chains, achieving superior performance over existing methods.

Details

Motivation: Rare diseases affect over 300 million people worldwide but face diagnostic challenges due to clinical heterogeneity, low prevalence, and limited clinician familiarity with rare conditions.

Method: A modular agentic system with three components: central host with long-term memory, specialized agent servers with 40+ tools, and web-scale medical knowledge sources for processing heterogeneous clinical inputs with transparent reasoning.

Result: Achieved 100% accuracy for 1,013 diseases, 57.18% average Recall@1 in HPO-based evaluations (23.79 percentage points above the second-best method), 70.60% Recall@1 for multi-modal inputs vs Exomiser’s 53.20%, and 95.40% expert agreement on reasoning chains.

Conclusion: DeepRare demonstrates exceptional diagnostic performance for rare diseases, significantly outperforming existing methods while providing transparent reasoning, and is available as a web application for clinical use.

Abstract: Rare diseases collectively affect over 300 million individuals worldwide, yet timely and accurate diagnosis remains a pervasive challenge. This is largely due to their clinical heterogeneity, low individual prevalence, and the limited familiarity most clinicians have with rare conditions. Here, we introduce DeepRare, the first rare disease diagnosis agentic system powered by a large language model (LLM), capable of processing heterogeneous clinical inputs. The system generates ranked diagnostic hypotheses for rare diseases, each accompanied by a transparent chain of reasoning that links intermediate analytic steps to verifiable medical evidence. DeepRare comprises three key components: a central host with a long-term memory module; specialized agent servers responsible for domain-specific analytical tasks, integrating over 40 specialized tools; and web-scale, up-to-date medical knowledge sources, ensuring access to the most current clinical information. This modular and scalable design enables complex diagnostic reasoning while maintaining traceability and adaptability. We evaluate DeepRare on eight datasets. The system demonstrates exceptional diagnostic performance among 2,919 diseases, achieving 100% accuracy for 1,013 diseases. In HPO-based evaluations, DeepRare significantly outperforms 15 other methods, including traditional bioinformatics diagnostic tools, LLMs, and other agentic systems, achieving an average Recall@1 score of 57.18% and surpassing the second-best method (Reasoning LLM) by a substantial margin of 23.79 percentage points. For multi-modal input scenarios, DeepRare achieves 70.60% at Recall@1 compared to Exomiser’s 53.20% in 109 cases. Manual verification of reasoning chains by clinical experts yields 95.40% agreement. Furthermore, the DeepRare system has been implemented as a user-friendly web application: http://raredx.cn/doctor.
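Recall@1, the headline metric above, is the share of cases whose correct diagnosis is ranked first. A minimal sketch of Recall@k over ranked hypothesis lists follows. The case data and disease labels are hypothetical placeholders, not drawn from the paper's datasets.

```python
def recall_at_k(ranked_predictions, gold_labels, k=1):
    """Fraction of cases whose gold diagnosis appears in the top-k hypotheses."""
    hits = sum(
        1
        for ranked, gold in zip(ranked_predictions, gold_labels)
        if gold in ranked[:k]
    )
    return hits / len(gold_labels)


# Hypothetical ranked hypotheses for three cases, with their gold diagnoses.
predictions = [
    ["disease_a", "disease_b"],
    ["disease_c", "disease_a"],
    ["disease_b", "disease_c"],
]
gold = ["disease_a", "disease_a", "disease_c"]

print(recall_at_k(predictions, gold, k=1))
print(recall_at_k(predictions, gold, k=2))
```

Recall@1 is the strictest setting of this metric, since a system gets no credit for a correct diagnosis that appears anywhere below the top of its list.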

[24] Knowing or Guessing? Robust Medical Visual Question Answering via Joint Consistency and Contrastive Learning

Songtao Jiang, Yuxi Chen, Sibo Song, Yan Zhang, Yeying Jin, Yang Feng, Jian Wu, Zuozhu Liu

Main category: cs.CL

TL;DR: Med-VLMs show fragility in medical VQA with inconsistent answers to semantically equivalent question rephrasings. The paper introduces RoMed dataset and CCL method to improve robustness and consistency.

Details

Motivation: Current Medical Vision-Language Models exhibit concerning fragility in Medical Visual Question Answering, with answers fluctuating significantly when faced with semantically equivalent rephrasings of medical questions, which is critical for reliable diagnosis in high-stakes medical applications.

Method: Constructed RoMed dataset with 144k question variations across word-level, sentence-level, and semantic-level perturbations. Proposed Consistency and Contrastive Learning (CCL) with two components: knowledge-anchored consistency learning and bias-aware contrastive learning.

Result: Evaluation on SOTA models like LLaVA-Med showed alarming performance drops (40% decline in Recall) on RoMed vs original benchmarks. CCL achieved SOTA performance on three VQA benchmarks and improved answer consistency by 50% on RoMed test set.

Conclusion: CCL demonstrates significantly enhanced robustness for Med-VLMs, addressing the critical fragility issues in medical visual question answering through better alignment with medical knowledge and mitigation of data biases.

Abstract: In high-stakes medical applications, consistent answering across diverse question phrasings is essential for reliable diagnosis. However, we reveal that current Medical Vision-Language Models (Med-VLMs) exhibit concerning fragility in Medical Visual Question Answering, as their answers fluctuate significantly when faced with semantically equivalent rephrasings of medical questions. We attribute this to two limitations: (1) insufficient alignment of medical concepts, leading to divergent reasoning patterns, and (2) hidden biases in training data that prioritize syntactic shortcuts over semantic understanding. To address these challenges, we construct RoMed, a dataset built upon original VQA datasets containing 144k questions with variations spanning word-level, sentence-level, and semantic-level perturbations. When evaluating state-of-the-art (SOTA) models like LLaVA-Med on RoMed, we observe alarming performance drops (e.g., a 40% decline in Recall) compared to original VQA benchmarks, exposing critical robustness gaps. To bridge this gap, we propose Consistency and Contrastive Learning (CCL), which integrates two key components: (1) knowledge-anchored consistency learning, aligning Med-VLMs with medical knowledge rather than shallow feature patterns, and (2) bias-aware contrastive learning, mitigating data-specific priors through discriminative representation refinement. CCL achieves SOTA performance on three popular VQA benchmarks and notably improves answer consistency by 50% on the challenging RoMed test set, demonstrating significantly enhanced robustness. Code will be released.

[25] Attention2Probability: Attention-Driven Terminology Probability Estimation for Robust Speech-to-Text System

Yanfan Du, Jun Zhang, Bin Wang, Jin Qiu, Lu Huang, Yuan Ge, Xiaoqian Liu, Tong Xiao, Jingbo Zhu

Main category: cs.CL

TL;DR: Attention2Probability is a lightweight attention-driven method that converts cross-attention weights into presence probabilities to improve domain-specific terminology recognition in speech-to-text systems, achieving high recall rates with low latency.

Details

Motivation: Speech large language models struggle with accurately generating domain-specific terms and neologisms, creating a need for robust terminology handling in speech-to-text systems.

Method: Proposes Attention2Probability which converts cross-attention weights between speech and terminology into presence probabilities, employs curriculum learning for enhanced retrieval accuracy, and creates a new speech dataset with terminology.

Result: Significantly outperforms VectorDB method with maximum recall rates of 92.57% (Chinese) and 86.83% (English) at only 8.71ms latency per query. Improves terminology accuracy by 6-17% in SLM tasks.

Conclusion: The method effectively addresses terminology challenges in speech recognition/translation while revealing current limitations in SLMs’ terminology utilization, providing a lightweight and accurate solution with released dataset for future research.

Abstract: Recent advances in speech large language models (SLMs) have improved speech recognition and translation in general domains, but accurately generating domain-specific terms or neologisms remains challenging. To address this, we propose Attention2Probability: attention-driven terminology probability estimation for robust speech-to-text system, which is lightweight, flexible, and accurate. Attention2Probability converts cross-attention weights between speech and terminology into presence probabilities, and it further employs curriculum learning to enhance retrieval accuracy. Furthermore, to tackle the lack of data for speech-to-text tasks with terminology intervention, we create and release a new speech dataset with terminology to support future research in this area. Experimental results show that Attention2Probability significantly outperforms the VectorDB method on our test set. Specifically, its maximum recall rates reach 92.57% for Chinese and 86.83% for English. This high recall is achieved with a latency of only 8.71ms per query. Intervening in SLMs’ recognition and translation tasks using Attention2Probability-retrieved terms improves terminology accuracy by 6-17%, while revealing that the current utilization of terminology by SLMs has limitations.

[26] Filtering for Creativity: Adaptive Prompting for Multilingual Riddle Generation in LLMs

Duy Le, Kent Ziti, Evan Girard-Sun, Sean O’Brien, Vasu Sharma, Kevin Zhu

Main category: cs.CL

TL;DR: AOF prompting framework improves multilingual riddle generation by filtering redundant content and enforcing lexical novelty, achieving better diversity than standard prompting methods.

Details

Motivation: Standard prompting strategies for multilingual riddle generation tend to reuse memorized content or perform shallow paraphrasing, lacking cultural fluency and creative abstraction.

Method: Adaptive Originality Filtering (AOF) - a prompting framework that uses cosine-based similarity rejection to filter redundant generations while enforcing lexical novelty and cross-lingual fidelity.

Result: AOF-enhanced GPT-4o achieved 0.177 Self-BLEU and 0.915 Distinct-2 in Japanese, showing improved lexical diversity and reduced redundancy across three LLMs and four language pairs.

Conclusion: Semantic rejection through AOF can guide culturally grounded, creative generation without requiring task-specific fine-tuning, making it effective for multilingual riddle generation.

Abstract: Multilingual riddle generation challenges large language models (LLMs) to balance cultural fluency with creative abstraction. Standard prompting strategies (zero-shot, few-shot, chain-of-thought) tend to reuse memorized riddles or perform shallow paraphrasing. We introduce Adaptive Originality Filtering (AOF), a prompting framework that filters redundant generations using cosine-based similarity rejection, while enforcing lexical novelty and cross-lingual fidelity. Evaluated across three LLMs and four language pairs, AOF-enhanced GPT-4o achieves 0.177 Self-BLEU and 0.915 Distinct-2 in Japanese, signaling improved lexical diversity and reduced redundancy compared to other prompting methods and language pairs. Our findings show that semantic rejection can guide culturally grounded, creative generation without task-specific fine-tuning.
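Cosine-based similarity rejection and the Distinct-2 metric can both be sketched in a few lines. The code below is an illustration only: the bag-of-words `toy_embed` function, its vocabulary, and the 0.8 threshold are hypothetical stand-ins, since the abstract does not say which embedding model or threshold AOF uses.

```python
from math import sqrt

# Hypothetical toy vocabulary; a real system would use a sentence encoder.
VOCAB = ("what", "has", "keys", "but", "no", "locks", "hands", "cannot", "clap")


def toy_embed(text):
    """Bag-of-words count vector over the toy vocabulary."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in VOCAB]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_original(candidates, embed, threshold=0.8):
    """Keep a candidate only if it is below the similarity threshold
    against everything kept so far."""
    kept, kept_vectors = [], []
    for text in candidates:
        vector = embed(text)
        if all(cosine(vector, seen) < threshold for seen in kept_vectors):
            kept.append(text)
            kept_vectors.append(vector)
    return kept


def distinct_n(texts, n=2):
    """Distinct-n: unique n-grams divided by total n-grams across all texts."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            ngrams.append(tuple(tokens[i:i + n]))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


candidates = [
    "what has keys but no locks",
    "what has keys but no locks",
    "what has hands but cannot clap",
]
kept = filter_original(candidates, toy_embed)
print(kept)
print(distinct_n(kept, n=2))
```

Higher Distinct-2 and lower Self-BLEU both indicate less repetition across generations, which is why the abstract reports them together.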

[27] EMMM, Explain Me My Model! Explainable Machine Generated Text Detection in Dialogues

Angela Yifei Yuan, Haoyi Li, Soyeon Caren Han, Christopher Leckie

Main category: cs.CL

TL;DR: EMMM framework provides explainable machine-generated text detection for customer service, balancing accuracy, low latency, and non-expert-friendly explanations.

Details

Motivation: Address risks of LLM exploitation for user impersonation in customer service, where current detection methods lack reliability and interpretability for non-expert users.

Method: Explanation-then-detection framework (EMMM) that first provides explanations then performs detection, designed for conversational settings with non-expert operators.

Result: 70% human evaluator preference for outputs, competitive accuracy with state-of-the-art models, and low latency (<1 second generation time).

Conclusion: EMMM successfully provides trustworthy, interpretable MGT detection for customer service scenarios with non-expert users while maintaining performance efficiency.

Abstract: The rapid adoption of large language models (LLMs) in customer service introduces new risks, as malicious actors can exploit them to conduct large-scale user impersonation through machine-generated text (MGT). Current MGT detection methods often struggle in online conversational settings, reducing the reliability and interpretability essential for trustworthy AI deployment. In customer service scenarios where operators are typically non-expert users, explanations become crucial for trustworthy MGT detection. In this paper, we propose EMMM, an explanation-then-detection framework that balances latency, accuracy, and non-expert-oriented interpretability. Experimental results demonstrate that EMMM provides explanations accessible to non-expert users, with 70% of human evaluators preferring its outputs, while achieving competitive accuracy compared to state-of-the-art models and maintaining low latency, generating outputs within 1 second. Our code and dataset are open-sourced at https://github.com/AngieYYF/EMMM-explainable-chatbot-detection.

[28] Beyond Quality: Unlocking Diversity in Ad Headline Generation with Large Language Models

Chang Wang, Siyu Yan, Depeng Yuan, Yuqi Chen, Yanhua Huang, Yuanhang Zheng, Shuhao Li, Yinqi Zhang, Kedi Chen, Mingrui Zhu, Ruiwen Xu

Main category: cs.CL

TL;DR: DIVER is a novel LLM-based framework that jointly optimizes for both quality and diversity in ad headline generation, improving advertiser value by 4.0% and CTR by 1.4% through multi-stage optimization.

Details

Motivation: Current approaches focus primarily on optimizing for headline quality or CTR, often resulting in homogeneous outputs that fail to engage diverse audience segments, highlighting the need for both quality and diversity in ad headline generation.

Method: Proposes a semantic- and stylistic-aware data generation pipeline to create training pairs, followed by a multi-stage multi-objective optimization framework combining supervised fine-tuning (SFT) and reinforcement learning (RL) to generate diverse, high-quality headlines in a single forward pass.

Result: Experiments on real-world industrial datasets show DIVER effectively balances quality and diversity. Deployment on a large-scale platform serving hundreds of millions of users achieved 4.0% improvement in advertiser value (ADVV) and 1.4% improvement in CTR.

Conclusion: The DIVER framework successfully addresses the limitation of homogeneous outputs in ad headline generation by jointly optimizing for both diversity and quality, demonstrating significant improvements in key advertising metrics when deployed at scale.

Abstract: The generation of ad headlines plays a vital role in modern advertising, where both quality and diversity are essential to engage a broad range of audience segments. Current approaches primarily optimize language models for headline quality or click-through rates (CTR), often overlooking the need for diversity and resulting in homogeneous outputs. To address this limitation, we propose DIVER, a novel framework based on large language models (LLMs) that are jointly optimized for both diversity and quality. We first design a semantic- and stylistic-aware data generation pipeline that automatically produces high-quality training pairs with ad content and multiple diverse headlines. To achieve the goal of generating high-quality and diversified ad headlines within a single forward pass, we propose a multi-stage multi-objective optimization framework with supervised fine-tuning (SFT) and reinforcement learning (RL). Experiments on real-world industrial datasets demonstrate that DIVER effectively balances quality and diversity. Deployed on a large-scale content-sharing platform serving hundreds of millions of users, our framework improves advertiser value (ADVV) and CTR by 4.0% and 1.4%.

[29] M3HG: Multimodal, Multi-scale, and Multi-type Node Heterogeneous Graph for Emotion Cause Triplet Extraction in Conversations

Qiao Liang, Ying Shen, Tiantian Chen, Lin Zhang

Main category: cs.CL

TL;DR: M3HG is a model for multimodal emotion-cause triplet extraction that explicitly models emotional and causal contexts and fuses information at inter- and intra-utterance levels via a multimodal heterogeneous graph; the paper also introduces the MECAD dataset and reports gains over state-of-the-art methods.

Details

Motivation: Address scarcity of multimodal MECTEC datasets and limitations of existing methods that fail to explicitly model emotional/causal contexts and neglect multi-level semantic fusion.

Method: Proposes M3HG model that captures emotional/causal contexts and fuses contextual information at inter- and intra-utterance levels via multimodal heterogeneous graph.

Result: Extensive experiments demonstrate M3HG’s effectiveness compared to state-of-the-art methods on the new MECAD dataset.

Conclusion: M3HG successfully addresses dataset scarcity and modeling limitations in multimodal emotion-cause triplet extraction, achieving superior performance through explicit context modeling and multi-level fusion.

Abstract: Emotion Cause Triplet Extraction in Multimodal Conversations (MECTEC) has recently gained significant attention in social media analysis, aiming to extract emotion utterances, cause utterances, and emotion categories simultaneously. However, the scarcity of related datasets, with only one published dataset featuring highly uniform dialogue scenarios, hinders model development in this field. To address this, we introduce MECAD, the first multimodal, multi-scenario MECTEC dataset, comprising 989 conversations from 56 TV series spanning a wide range of dialogue contexts. In addition, existing MECTEC methods fail to explicitly model emotional and causal contexts and neglect the fusion of semantic information at different levels, leading to performance degradation. In this paper, we propose M3HG, a novel model that explicitly captures emotional and causal contexts and effectively fuses contextual information at both inter- and intra-utterance levels via a multimodal heterogeneous graph. Extensive experiments demonstrate the effectiveness of M3HG compared with existing state-of-the-art methods. The codes and dataset are available at https://github.com/redifinition/M3HG.

[30] Chronological Passage Assembling in RAG framework for Temporal Question Answering

Byeongjeong Kim, Jeonghyun Park, Joonho Yang, Hwanhee Lee

Main category: cs.CL

TL;DR: ChronoRAG is a RAG framework that improves narrative QA by refining retrieved content into coherent, structured passages and preserving their temporal order, showing substantial improvements on the NarrativeQA dataset.

Details

Motivation: Existing RAG methods struggle with narrative texts because they fail to capture the broader context and sequential relationships crucial for understanding stories and timelines.

Method: Proposes ChronoRAG framework that refines dispersed document information into structured passages and explicitly captures/maintains temporal order among retrieved passages.

Result: Substantial improvements on NarrativeQA dataset for tasks requiring both factual identification and comprehension of complex sequential relationships.

Conclusion: Reasoning over temporal order is crucial for resolving narrative QA, and ChronoRAG effectively addresses the limitations of existing RAG approaches for narrative texts.

Abstract: Long-context question answering over narrative tasks is challenging because correct answers often hinge on reconstructing a coherent timeline of events while preserving contextual flow in a limited context window. Retrieval-augmented generation (RAG) indexing methods aim to address this challenge by selectively retrieving only necessary document segments. However, narrative texts possess unique characteristics that limit the effectiveness of these existing approaches. Specifically, understanding narrative texts requires more than isolated segments, as the broader context and sequential relationships between segments are crucial for comprehension. To address these limitations, we propose ChronoRAG, a novel RAG framework specialized for narrative texts. This approach focuses on two essential aspects: refining dispersed document information into coherent and structured passages, and preserving narrative flow by explicitly capturing and maintaining the temporal order among retrieved passages. We empirically demonstrate the effectiveness of ChronoRAG through experiments on the NarrativeQA dataset, showing substantial improvements in tasks requiring both factual identification and comprehension of complex sequential relationships, underscoring that reasoning over temporal order is crucial in resolving narrative QA.
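The idea of preserving temporal order among retrieved passages can be illustrated with a minimal sketch. It assumes that a passage's position in the source document is a usable proxy for narrative order, which is a simplification and not necessarily how ChronoRAG determines order; the `Passage` fields and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    position: int  # index of the passage in the source narrative (assumed proxy for time)
    text: str
    score: float   # retrieval relevance score (hypothetical)


def assemble_chronologically(passages, k=3):
    """Select the k most relevant passages, then present them in narrative order."""
    top_k = sorted(passages, key=lambda p: p.score, reverse=True)[:k]
    ordered = sorted(top_k, key=lambda p: p.position)
    return "\n".join(p.text for p in ordered)
```

The design point is that relevance decides which passages enter the context, while position decides the order in which the model reads them.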

[31] ThinkDial: An Open Recipe for Controlling Reasoning Effort in Large Language Models

Qianyu He, Siyu Yuan, Xuefeng Li, Mingxuan Wang, Jiangjie Chen

Main category: cs.CL

TL;DR: ThinkDial is the first open-recipe end-to-end framework that enables gpt-oss-style controllable reasoning through discrete operational modes (High/Medium/Low) with significant token reduction while maintaining performance.

Details

Motivation: Controlling the computational effort of reasoning LLMs remains a challenge for practical deployment. Proprietary systems like OpenAI’s gpt-oss series have introduced discrete reasoning modes, but the open-source community has largely failed to achieve similar capabilities.

Method: End-to-end training paradigm with budget-mode supervised fine-tuning and two-phase budget-aware reinforcement learning with adaptive reward shaping to embed controllable reasoning capabilities.

Result: Achieves three operational modes: High (full capability), Medium (50% token reduction with <10% performance degradation), Low (75% token reduction with <15% performance degradation). Shows strong generalization on out-of-distribution tasks.

Conclusion: ThinkDial successfully implements controllable reasoning with clear performance-compression trade-offs, providing the first open-source solution for discrete operational mode control in LLMs.

Abstract: Large language models (LLMs) with chain-of-thought reasoning have demonstrated remarkable problem-solving capabilities, but controlling their computational effort remains a significant challenge for practical deployment. Recent proprietary systems like OpenAI’s gpt-oss series have introduced discrete operational modes for intuitive reasoning control, but the open-source community has largely failed to achieve such capabilities. In this paper, we introduce ThinkDial, the first open-recipe end-to-end framework that successfully implements gpt-oss-style controllable reasoning through discrete operational modes. Our system enables seamless switching between three distinct reasoning regimes: High mode (full reasoning capability), Medium mode (50 percent token reduction with <10 percent performance degradation), and Low mode (75 percent token reduction with <15 percent performance degradation). We achieve this through an end-to-end training paradigm that integrates budget-mode control throughout the entire pipeline: budget-mode supervised fine-tuning that embeds controllable reasoning capabilities directly into the learning process, and two-phase budget-aware reinforcement learning with adaptive reward shaping. Extensive experiments demonstrate that ThinkDial achieves target compression-performance trade-offs with clear response length reductions while maintaining performance thresholds. The framework also exhibits strong generalization capabilities on out-of-distribution tasks.
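The mode targets quoted above reduce to simple arithmetic, sketched below. The dictionary layout and function name are hypothetical, and the sketch assumes the degradation bounds are relative to High-mode performance; the abstract does not say whether they are relative or absolute.

```python
# Targets taken from the abstract; High mode is the reference point.
MODES = {
    "high": {"token_reduction": 0.00, "max_degradation": 0.00},
    "medium": {"token_reduction": 0.50, "max_degradation": 0.10},
    "low": {"token_reduction": 0.75, "max_degradation": 0.15},
}


def mode_targets(high_tokens, high_accuracy, mode):
    """Return (token budget, accuracy floor) for a mode, given High-mode numbers.

    Assumes degradation is measured relative to High-mode accuracy.
    """
    spec = MODES[mode]
    token_budget = round(high_tokens * (1 - spec["token_reduction"]))
    accuracy_floor = high_accuracy * (1 - spec["max_degradation"])
    return token_budget, accuracy_floor
```

For example, if High mode averages 8000 reasoning tokens at 0.80 accuracy, the Medium target under these assumptions is about 4000 tokens with accuracy above roughly 0.72, and the Low target is about 2000 tokens with accuracy above roughly 0.68.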

[32] Harnessing Rule-Based Reinforcement Learning for Enhanced Grammatical Error Correction

Yilin Li, Xunjian Yin, Yilin Chen, Xiaojun Wan

Main category: cs.CL

TL;DR: Proposes Rule-Based RL framework for grammatical error correction that outperforms traditional supervised fine-tuning methods, achieving state-of-the-art performance with improved recall on Chinese datasets.

Details

Motivation: Traditional encoder-decoder models have limitations, and current LLM approaches rely on supervised fine-tuning which restricts the model's reasoning capabilities. The application of LLMs in grammatical error correction is underexplored.

Method: A novel Rule-Based Reinforcement Learning (RL) framework that steers large language models for grammatical error correction, moving beyond direct supervised fine-tuning approaches.

Result: Achieves state-of-the-art performance on Chinese datasets with a notable increase in recall, demonstrating superior performance compared to traditional methods.

Conclusion: The Rule-Based RL framework offers a more controllable and reliable paradigm for grammatical error correction, highlighting the advantages of using reinforcement learning to steer LLMs for this task.

Abstract: Grammatical error correction is a significant task in NLP. Traditional methods based on encoder-decoder models have achieved certain success, but the application of LLMs in this field is still underexplored. Current research predominantly relies on supervised fine-tuning to train LLMs to directly generate the corrected sentence, which limits the model’s powerful reasoning ability. To address this limitation, we propose a novel framework based on Rule-Based RL. Through experiments on the Chinese datasets, our Rule-Based RL framework achieves state-of-the-art performance, with a notable increase in recall. This result clearly highlights the advantages of using RL to steer LLMs, offering a more controllable and reliable paradigm for future development in GEC.

[33] Controllable Conversational Theme Detection Track at DSTC 12

Igor Shalyminov, Hang Su, Jake Vincent, Siffi Singh, Jason Cai, James Gung, Raphael Shu, Saab Mansour

Main category: cs.CL

TL;DR: Introduces Theme Detection as a key conversational analytics task for automatically identifying topics in conversations, with controllable granularity through user preferences, presented as a DSTC 12 competition track.

Details

Motivation: To reduce manual effort in analyzing large-scale conversations (e.g., customer support, sales) by automatically detecting and categorizing conversation themes, moving beyond traditional fixed-intent dialog systems.

Method: Frames the problem as joint clustering and theme labeling of dialog utterances with controllable granularity via user preference data. Uses a public competition format with provided datasets and evaluation metrics.

Result: The paper presents a publicly available competition track with a dataset and an evaluation framework (automatic and human metrics) for controllable theme detection, and discusses the participant teams’ submissions and the insights drawn from them.

Conclusion: Theme Detection represents an important advancement in conversational analytics, offering flexible user-facing summaries of conversations with controllable granularity, with the competition framework enabling further research and development in this area.

Abstract: Conversational analytics has been at the forefront of transformation driven by the advances in Speech and Natural Language Processing techniques. Rapid adoption of Large Language Models (LLMs) in the analytics field has taken the problems that can be automated to a new level of complexity and scale. In this paper, we introduce Theme Detection as a critical task in conversational analytics, aimed at automatically identifying and categorizing topics within conversations. This process can significantly reduce the manual effort involved in analyzing expansive dialogs, particularly in domains like customer support or sales. Unlike traditional dialog intent detection, which often relies on a fixed set of intents for downstream system logic, themes are intended as a direct, user-facing summary of the conversation’s core inquiry. This distinction allows for greater flexibility in theme surface forms and user-specific customizations. We pose the Controllable Conversational Theme Detection problem as a public competition track at the Dialog System Technology Challenge (DSTC) 12 – it is framed as joint clustering and theme labeling of dialog utterances, with the distinctive aspect being controllability of the resulting theme clusters’ granularity achieved via the provided user preference data. We give an overview of the problem, the associated dataset and the evaluation metrics, both automatic and human. Finally, we discuss the participant teams’ submissions and provide insights from those. The track materials (data and code) are openly available in the GitHub repository.

[34] LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination

Ziming Zhu, Chenglong Wang, Shunjie Xing, Yifu Huo, Fengning Tian, Quan Du, Di Yang, Chunliang Zhang, Tong Xiao, Jingbo Zhu

Main category: cs.CL

TL;DR: LaTeXTrans is a multi-agent system that translates LaTeX documents while preserving formatting and structure, outperforming mainstream MT systems.

Details

Motivation: Modern MT systems struggle with LaTeX documents that mix natural language with domain-specific syntax like equations, tables, and cross-references, requiring accurate preservation for semantic integrity and compilability.

Method: Uses six specialized agents: Parser (decomposes LaTeX via placeholder substitution), Translator, Validator, Summarizer, Terminology Extractor (collaborative context-aware translation), and Generator (reconstructs translated LaTeX).

Result: Outperforms mainstream MT systems in both translation accuracy and structural fidelity.

Conclusion: Provides an effective and practical solution for translating LaTeX-formatted documents while maintaining format preservation and structural fidelity.

Abstract: Despite the remarkable progress of modern machine translation (MT) systems on general-domain texts, translating structured LaTeX-formatted documents remains a significant challenge. These documents typically interleave natural language with domain-specific syntax, such as mathematical equations, tables, figures, and cross-references, all of which must be accurately preserved to maintain semantic integrity and compilability. In this paper, we introduce LaTeXTrans, a collaborative multi-agent system designed to address this challenge. LaTeXTrans ensures format preservation, structural fidelity, and terminology consistency through six specialized agents: 1) a Parser that decomposes LaTeX into translation-friendly units via placeholder substitution and syntax filtering; 2) a Translator, Validator, Summarizer, and Terminology Extractor that work collaboratively to ensure context-aware, self-correcting, and terminology-consistent translations; 3) a Generator that reconstructs the translated content into well-structured LaTeX documents. Experimental results demonstrate that LaTeXTrans can outperform mainstream MT systems in both translation accuracy and structural fidelity, offering an effective and practical solution for translating LaTeX-formatted documents.
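Placeholder substitution, the technique attributed to the Parser agent, can be sketched in a few lines. This is an illustration only: the regular expression covers just inline math and three commands, the `<PHn>` placeholder format is invented here, and LaTeXTrans itself may handle far more syntax in a different way.

```python
import re

# Matches inline math ($...$) and a few common commands with one braced argument.
PROTECTED = re.compile(r"\$[^$]+\$|\\(?:ref|cite|label)\{[^}]*\}")


def mask(latex):
    """Replace protected spans with numbered placeholders.

    Returns the masked text and the list of original spans.
    """
    spans = []

    def repl(match):
        spans.append(match.group(0))
        return f"<PH{len(spans) - 1}>"

    return PROTECTED.sub(repl, latex), spans


def unmask(text, spans):
    """Restore the original spans in place of their placeholders."""
    return re.sub(r"<PH(\d+)>", lambda m: spans[int(m.group(1))], text)
```

A translator then only sees text such as "The energy <PH0> is shown in Eq.~<PH1>.", and the math and cross-reference are restored unchanged afterwards, which is what keeps the output compilable in this simplified setting.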

[35] LLM-based Contrastive Self-Supervised AMR Learning with Masked Graph Autoencoders for Fake News Detection

Shubham Gupta, Shraban Kumar Chatterjee, Suman Kundu

Main category: cs.CL

TL;DR: Novel self-supervised misinformation detection framework combining semantic relations (AMR) and social propagation dynamics with LLM-based contrastive learning, achieving state-of-the-art performance without extensive labeled data.

Details

Motivation: Address limitations of existing misinformation detection methods that struggle with long-range dependencies, complex semantic relations, and social dynamics while requiring extensive labeled datasets.

Method: Integrates Abstract Meaning Representation for semantic relations and social propagation dynamics using multi-view graph masked autoencoder. Introduces LLM-based graph contrastive loss with negative anchor points for zero-shot feature separability enhancement.

Result: Extensive experiments show superior performance compared to state-of-the-art methods, even with limited labeled datasets, while improving generalizability.

Conclusion: The proposed self-supervised framework effectively combines semantic and propagation-based features to differentiate fake from real news, offering a resource-efficient solution for misinformation detection.

Abstract: The proliferation of misinformation in the digital age has led to significant societal challenges. Existing approaches often struggle with capturing long-range dependencies, complex semantic relations, and the social dynamics influencing news dissemination. Furthermore, these methods require extensive labelled datasets, making their deployment resource-intensive. In this study, we propose a novel self-supervised misinformation detection framework that integrates both complex semantic relations using Abstract Meaning Representation (AMR) and news propagation dynamics. We introduce an LLM-based graph contrastive loss (LGCL) that utilizes negative anchor points generated by a Large Language Model (LLM) to enhance feature separability in a zero-shot manner. To incorporate social context, we employ a multi-view graph masked autoencoder, which learns news propagation features from the social context graph. By combining these semantic and propagation-based features, our approach effectively differentiates between fake and real news in a self-supervised manner. Extensive experiments demonstrate that our self-supervised framework achieves superior performance compared to other state-of-the-art methodologies, even with limited labelled datasets, while improving generalizability.

[36] Arrows of Math Reasoning Data Synthesis for Large Language Models: Diversity, Complexity and Correctness

Sirui Chen, Changxin Tian, Binbin Hu, Kunlong Chen, Ziqi Liu, Zhiqiang Zhang, Jun Zhou

Main category: cs.CL

TL;DR: A program-assisted synthesis framework generates high-quality mathematical training data for LLMs through executable programs and bilateral validation, producing 12.3M problem-solving triples; models fine-tuned on the data achieve SOTA results on several benchmarks.

Details

Motivation: Conventional methods for creating mathematical training data face scalability, cost, and reliability challenges, limiting LLM mathematical reasoning capabilities.

Method: Integrates mathematical knowledge systems and domain-specific tools to create executable programs, then translates them into natural language problem-solution pairs with bilateral validation (verifying solution correctness against program outputs and ensuring program-problem consistency).

Result: Generated 12.3 million problem-solving triples; models fine-tuned on this data significantly improve inference capabilities and achieve state-of-the-art performance on multiple benchmark datasets.

Conclusion: The program-assisted synthesis framework effectively addresses data quality challenges and demonstrates superior effectiveness in enhancing LLM mathematical reasoning capabilities.

Abstract: Enhancing the mathematical reasoning of large language models (LLMs) demands high-quality training data, yet conventional methods face critical challenges in scalability, cost, and data reliability. To address these limitations, we propose a novel program-assisted synthesis framework that systematically generates a high-quality mathematical corpus with guaranteed diversity, complexity, and correctness. This framework integrates mathematical knowledge systems and domain-specific tools to create executable programs. These programs are then translated into natural language problem-solution pairs and vetted by a bilateral validation mechanism that verifies solution correctness against program outputs and ensures program-problem consistency. We have generated 12.3 million such problem-solving triples. Experiments demonstrate that models fine-tuned on our data significantly improve their inference capabilities, achieving state-of-the-art performance on several benchmark datasets and showcasing the effectiveness of our synthesis approach.
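The two checks described as bilateral validation can be illustrated with a toy example. Everything below is a hypothetical stand-in: a one-line "program", a fixed text template, and a number-matching consistency check. The paper's programs, translation step, and validators are certainly more elaborate.

```python
import re


def program(a, b):
    """Executable ground truth: total cost of a items at b dollars each."""
    return a * b


def render_problem(a, b):
    """Toy translation of the program into a natural-language problem."""
    return f"A shop sells {a} notebooks at {b} dollars each. What is the total cost in dollars?"


def solution_matches_program(solution_answer, a, b):
    """Check 1: the stated answer equals the program's output."""
    return solution_answer == program(a, b)


def problem_matches_program(problem_text, a, b):
    """Check 2: the numbers in the problem text are exactly the program's inputs."""
    numbers = [int(n) for n in re.findall(r"\d+", problem_text)]
    return numbers == [a, b]


def validate(problem_text, solution_answer, a, b):
    """Accept a sample only if both checks pass."""
    return (solution_matches_program(solution_answer, a, b)
            and problem_matches_program(problem_text, a, b))
```

A sample with a wrong answer fails the first check, and a sample whose wording drifted away from the program's inputs fails the second, so either kind of error keeps it out of the corpus.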

[37] ConfTuner: Training Large Language Models to Express Their Confidence Verbally

Yibo Li, Miao Xiong, Jiaying Wu, Bryan Hooi

Main category: cs.CL

TL;DR: ConfTuner is a fine-tuning method that improves LLM confidence calibration using a tokenized Brier score loss function, achieving better uncertainty expression without requiring ground-truth confidence scores.

Details

Motivation: LLMs are often overconfident in high-stakes domains, generating incorrect answers with high confidence. Existing calibration methods have limited effectiveness and generalizability.

Method: ConfTuner uses a novel tokenized Brier score loss function that acts as a proper scoring rule, fine-tuning models to better express their true uncertainty without needing ground-truth confidence estimates.

Result: ConfTuner improves calibration across diverse reasoning tasks, generalizes to black-box models like GPT-4o, and enables downstream gains in self-correction and model cascading.

Conclusion: The method advances trustworthy LLM systems by providing simple, efficient calibration that doesn’t require proxy confidence estimates, making LLMs more reliable in critical applications.

Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as science, law, and healthcare, where accurate expressions of uncertainty are essential for reliability and trust. However, current LLMs are often observed to generate incorrect answers with high confidence, a phenomenon known as “overconfidence”. Recent efforts have focused on calibrating LLMs’ verbalized confidence: i.e., their expressions of confidence in text form, such as “I am 80% confident that…”. Existing approaches either rely on prompt engineering or fine-tuning with heuristically generated uncertainty estimates, both of which have limited effectiveness and generalizability. Motivated by the notion of proper scoring rules for calibration in classical machine learning models, we introduce ConfTuner, a simple and efficient fine-tuning method that introduces minimal overhead and does not require ground-truth confidence scores or proxy confidence estimates. ConfTuner relies on a new loss function, tokenized Brier score, which we theoretically prove to be a proper scoring rule, intuitively meaning that it “correctly incentivizes the model to report its true probability of being correct”. ConfTuner improves calibration across diverse reasoning tasks and generalizes to black-box models such as GPT-4o. Our results further show that better-calibrated confidence enables downstream gains in self-correction and model cascade, advancing the development of trustworthy LLM systems. The code is available at https://github.com/liushiliushi/ConfTuner.
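What it means for a loss to be a proper scoring rule can be checked numerically with the classical Brier score. The sketch below is not the tokenized Brier score from the paper, whose exact definition is not given in this summary; it only shows that, for the standard Brier score, the expected loss is smallest when the reported confidence equals the true probability of being correct. Function names and the grid search are illustrative choices.

```python
def brier(confidence, correct):
    """Classical Brier score for one prediction: squared gap between the
    stated confidence and the 0/1 outcome."""
    return (confidence - (1.0 if correct else 0.0)) ** 2


def expected_brier(confidence, p_correct):
    """Expected Brier score when the answer is truly correct with probability p_correct."""
    return (p_correct * brier(confidence, True)
            + (1 - p_correct) * brier(confidence, False))


def best_report(p_correct, grid_size=101):
    """Confidence on an evenly spaced 0..1 grid that minimizes the expected Brier score."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return min(grid, key=lambda c: expected_brier(c, p_correct))
```

Expanding the expectation gives c^2 - 2pc + p for reported confidence c and true probability p, which is minimized at c = p. A model that is right 70% of the time therefore does best by saying 70%, and both overclaiming and underclaiming are penalized.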

[38] ReflectivePrompt: Reflective evolution in autoprompting algorithms

Viktor N. Zhuravlev, Artur R. Khairullin, Ernest A. Dyagin, Alena N. Sitkina, Nikita I. Kulin

Main category: cs.CL

TL;DR: ReflectivePrompt is an evolutionary algorithm-based autoprompting method that uses reflective evolution with short-term and long-term reflection operations to optimize prompts for LLMs, achieving significant performance improvements over state-of-the-art methods.

Details

Motivation: With the rapid advancement of prompt engineering and large language models, there is a growing need for automated methods to select optimized prompts. Current autoprompting approaches can be improved through more precise and comprehensive search strategies.

Method: ReflectivePrompt uses evolutionary algorithms with a reflective evolution approach. It employs short-term and long-term reflection operations before crossover and elitist mutation to enhance modification quality. The method accumulates knowledge throughout evolution and updates it at each epoch based on the current population.

Result: Tested on 33 datasets for classification and text generation tasks using the t-lite-instruct-0.1 and gemma3-27b-it models. On average, it shows a significant improvement in metrics over current state-of-the-art approaches (e.g., 28% on BBH compared to EvoPrompt).

Conclusion: ReflectivePrompt establishes itself as one of the most effective solutions in evolutionary algorithm-based autoprompting, showing substantial performance gains over existing methods through its reflective evolution approach.

Abstract: Autoprompting is the process of automatically selecting optimized prompts for language models, which has been gaining popularity with the rapid advancement of prompt engineering, driven by extensive research in the field of large language models (LLMs). This paper presents ReflectivePrompt, a novel autoprompting method based on evolutionary algorithms that employs a reflective evolution approach for more precise and comprehensive search of optimal prompts. ReflectivePrompt utilizes short-term and long-term reflection operations before crossover and elitist mutation to enhance the quality of the modifications they introduce. This method allows for the accumulation of knowledge obtained throughout the evolution process and updates it at each epoch based on the current population. ReflectivePrompt was tested on 33 datasets for classification and text generation tasks using open-access large language models: t-lite-instruct-0.1 and gemma3-27b-it. The method demonstrates, on average, a significant improvement (e.g., 28% on BBH compared to EvoPrompt) in metrics relative to current state-of-the-art approaches, thereby establishing itself as one of the most effective solutions in evolutionary algorithm-based autoprompting.

[39] Empowering Computing Education Researchers Through LLM-Assisted Content Analysis

Laurie Gale, Sebastian Mateos Nicolajsen

Main category: cs.CL

TL;DR: Proposes LLM-assisted content analysis (LACA) method to help computing education researchers analyze large volumes of qualitative data more efficiently and rigorously.

Details

Motivation: Many computing education researchers lack resources to conduct generalizable research. Current methods are burdensome for analyzing large qualitative datasets, limiting research quality and scalability.

Method: A variation of LLM-assisted content analysis (LACA) that combines traditional content analysis with large language models to enable rigorous analysis of large textual datasets.

Result: Demonstrated how LACA can be applied to computing education datasets in a reproducible and rigorous manner, enabling larger-scale research that was previously infeasible.

Conclusion: LACA has significant potential in CER to produce more generalizable findings and advance both teaching practice and research quality in the discipline.

Abstract: Computing education research (CER) is often instigated by practitioners wanting to improve both their own and the wider discipline’s teaching practice. However, the latter is often difficult as many researchers lack the colleagues, resources, or capacity to conduct research that is generalisable or rigorous enough to advance the discipline. As a result, research methods that enable sense-making with larger volumes of qualitative data, while not increasing the burden on the researcher, have significant potential within CER. In this discussion paper, we propose such a method for conducting rigorous analysis on large volumes of textual data, namely a variation of LLM-assisted content analysis (LACA). This method combines content analysis with the use of large language models, empowering researchers to conduct larger-scale research which they would otherwise not be able to perform. Using a computing education dataset, we illustrate how LACA could be applied in a reproducible and rigorous manner. We believe this method has potential in CER, enabling more generalisable findings from a wider range of research. This, together with the development of similar methods, can help to advance both the practice and research quality of the CER discipline.

[40] Affective Polarization across European Parliaments

Bojan Evkoski, Igor Mozetič, Nikola Ljubešić, Petra Kralj Novak

Main category: cs.CL

TL;DR: Automated analysis of parliamentary speeches from six European countries reveals consistent affective polarization patterns, with parliamentarians showing more negativity towards opposing groups than their own, driven by reciprocity mechanisms.

Details

Motivation: To examine the presence and patterns of affective polarization in European parliaments using automated natural language processing techniques on parliamentary speeches.

Method: Utilized comprehensive corpus of parliamentary speeches from six European countries, employed NLP techniques to estimate parliamentarian sentiment, compared negativity levels in references to opposing vs own groups.

Result: Found consistent affective polarization across all six parliaments. Activity correlates with negativity, but polarization does not differ between less active and more active MPs. Reciprocity was identified as a contributing mechanism.

Conclusion: Affective polarization is a consistent feature in European parliamentary discourse, driven by reciprocal negativity between opposing political groups, with activity levels not affecting polarization intensity.

Abstract: Affective polarization, characterized by increased negativity and hostility towards opposing groups, has become a prominent feature of political discourse worldwide. Our study examines the presence of this type of polarization in a selection of European parliaments in a fully automated manner. Utilizing a comprehensive corpus of parliamentary speeches from the parliaments of six European countries, we employ natural language processing techniques to estimate parliamentarian sentiment. By comparing the levels of negativity conveyed in references to individuals from opposing groups versus one’s own, we discover patterns of affectively polarized interactions. The findings demonstrate the existence of consistent affective polarization across all six European parliaments. Although activity correlates with negativity, there is no observed difference in affective polarization between less active and more active members of parliament. Finally, we show that reciprocity is a contributing mechanism in affective polarization between parliamentarians across all six parliaments.
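The comparison described above (negativity toward opposing groups versus one's own) can be sketched as a simple gap between two means. This is only an illustration under stated assumptions: the function name `polarization_gap`, the tuple layout, and the toy scores are hypothetical, and the paper's actual sentiment model and statistics are not specified in this summary.

```python
from statistics import mean


def polarization_gap(mentions):
    """Return mean out-group negativity minus mean in-group negativity.

    `mentions` is a list of (speaker_group, target_group, negativity) tuples.
    The negativity score is assumed to come from some sentiment model and to
    lie in [0, 1]; that scale is an assumption made for this sketch.
    """
    in_group = [neg for src, tgt, neg in mentions if src == tgt]
    out_group = [neg for src, tgt, neg in mentions if src != tgt]
    if not in_group or not out_group:
        raise ValueError("need at least one in-group and one out-group mention")
    return mean(out_group) - mean(in_group)


# Toy, made-up data for illustration only (not from the paper).
toy_mentions = [
    ("A", "A", 0.10),
    ("A", "B", 0.60),
    ("B", "B", 0.20),
    ("B", "A", 0.70),
]
gap = polarization_gap(toy_mentions)
print(f"out-group minus in-group negativity: {gap:.2f}")
```

A positive gap indicates more negativity toward opposing groups than toward one's own, which is the pattern the abstract reports.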

[41] Diverse And Private Synthetic Datasets Generation for RAG evaluation: A multi-agent framework

Ilias Driouich, Hongliu Cao, Eoin Thomas

Main category: cs.CL

TL;DR: A multi-agent framework for generating synthetic QA datasets that ensures semantic diversity and privacy preservation in RAG system evaluation.

Details

Motivation: Current RAG evaluation focuses mainly on performance metrics but neglects dataset quality, particularly privacy protection and semantic diversity needed for real-world constraints.

Method: Three-agent framework: (1) Diversity agent uses clustering for topical coverage, (2) Privacy Agent detects and masks sensitive information, (3) QA curation agent synthesizes private and diverse QA pairs as ground truth.

Result: Evaluation sets outperform baselines in diversity and achieve robust privacy masking on domain-specific datasets.

Conclusion: Provides a practical, ethically aligned approach for safer and more comprehensive RAG evaluation, establishing foundation for future AI regulation compliance.

Abstract: Retrieval-augmented generation (RAG) systems improve large language model outputs by incorporating external knowledge, enabling more informed and context-aware responses. However, the effectiveness and trustworthiness of these systems critically depend on how they are evaluated, particularly on whether the evaluation process captures real-world constraints like protecting sensitive information. While current evaluation efforts for RAG systems have primarily focused on the development of performance metrics, far less attention has been given to the design and quality of the underlying evaluation datasets, despite their pivotal role in enabling meaningful, reliable assessments. In this work, we introduce a novel multi-agent framework for generating synthetic QA datasets for RAG evaluation that prioritize semantic diversity and privacy preservation. Our approach involves: (1) a Diversity agent leveraging clustering techniques to maximize topical coverage and semantic variability, (2) a Privacy Agent that detects and masks sensitive information across multiple domains and (3) a QA curation agent that synthesizes private and diverse QA pairs suitable as ground truth for RAG evaluation. Extensive experiments demonstrate that our evaluation sets outperform baseline methods in diversity and achieve robust privacy masking on domain-specific datasets. This work offers a practical and ethically aligned pathway toward safer, more comprehensive RAG system evaluation, laying the foundation for future enhancements aligned with evolving AI regulations and compliance standards.
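To make the "detect and mask" step concrete, here is a deliberately simple regex-based stand-in. It is not the paper's Privacy Agent, whose implementation is not described in this summary; the patterns, the placeholder tags, and the function name `mask_pii` are all assumptions chosen for illustration, and real systems would need far broader coverage.

```python
import re

# Hypothetical, intentionally narrow patterns: they only catch simple
# email addresses and phone-like digit sequences.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_pii(text):
    """Replace email addresses and phone-like numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


masked = mask_pii("Contact jane.doe@example.com or +1 415-555-0199.")
print(masked)
```

Masking before QA synthesis means the generated question-answer pairs never see the raw sensitive strings, which is the ordering the three-agent description suggests.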

[42] Interpretable by AI Mother Tongue: Native Symbolic Reasoning in Neural Models

Hung Ming Liu

Main category: cs.CL

TL;DR: A framework for neural models to develop an AI Mother Tongue - a native symbolic language that enables intuitive reasoning, compositional symbol chains, and inherent interpretability through embedded reasoning representations.

Details

Motivation: To create neural models with built-in interpretability and reasoning capabilities, moving beyond post-hoc explanation methods by embedding reasoning directly into model representations.

Method: Uses complementary training objectives for symbol purity and decision sparsity, gated induction mechanisms for selective focus, and sequential specialization strategy (broad symbolic competence first, then intuitive judgment refinement).

Result: Achieves competitive accuracy on AI tasks while providing verifiable reasoning traces, demonstrating the framework’s effectiveness.

Conclusion: AI Mother Tongue serves as a unified mechanism for interpretability, intuition, and symbolic reasoning in neural models, offering transparent yet flexible reasoning capabilities.

Abstract: We present a framework where neural models develop an AI Mother Tongue, a native symbolic language that simultaneously supports intuitive reasoning, compositional symbol chains, and inherent interpretability. Unlike post-hoc explanation methods, our approach embeds reasoning directly into the model’s representations: symbols capture meaningful semantic patterns, chains trace decision paths, and gated induction mechanisms guide selective focus, yielding transparent yet flexible reasoning. We introduce complementary training objectives to enhance symbol purity and decision sparsity, and employ a sequential specialization strategy to first build broad symbolic competence and then refine intuitive judgments. Experiments on AI tasks demonstrate competitive accuracy alongside verifiable reasoning traces, showing that AI Mother Tongue can serve as a unified mechanism for interpretability, intuition, and symbolic reasoning in neural models.

[43] Automatic Prompt Optimization with Prompt Distillation

Viktor N. Zhuravlev, Artur R. Khairullin, Ernest A. Dyagin, Alena N. Sitkina, Nikita I. Kulin

Main category: cs.CL

TL;DR: DistillPrompt is a novel autoprompting method that uses distillation, compression, and aggregation operations to automatically generate optimized prompts for language models, achieving significant performance improvements over existing methods.

Details

Motivation: With the rapid development of prompt engineering and large language models, there is a growing need for automated methods to select optimized prompts rather than relying on manual prompt engineering.

Method: DistillPrompt employs a multi-stage integration of task-specific information into prompts using training data, utilizing distillation, compression, and aggregation operations to thoroughly explore the prompt space.

Result: The method demonstrated a significant average improvement of 20.12% across datasets compared to Grips, establishing it as one of the most effective non-gradient approaches in autoprompting for both text classification and generation tasks.

Conclusion: DistillPrompt represents an effective non-gradient autoprompting approach that significantly outperforms existing methods, making it a valuable contribution to automated prompt optimization for language models.

Abstract: Autoprompting is the process of automatically selecting optimized prompts for language models, which is gaining popularity due to the rapid development of prompt engineering driven by extensive research in the field of large language models (LLMs). This paper presents DistillPrompt – a novel autoprompting method based on large language models that employs a multi-stage integration of task-specific information into prompts using training data. DistillPrompt utilizes distillation, compression, and aggregation operations to explore the prompt space more thoroughly. The method was tested on different datasets for text classification and generation tasks using the t-lite-instruct-0.1 language model. The results demonstrate a significant average improvement (e.g., 20.12% across the entire dataset compared to Grips) in key metrics over existing methods in the field, establishing DistillPrompt as one of the most effective non-gradient approaches in autoprompting.

[44] MovieCORE: COgnitive REasoning in Movies

Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Ying Cheng, Hung-Ting Su, Yung-Hao Tang, Shang-Hong Lai, Winston H. Hsu

Main category: cs.CL

TL;DR: MovieCORE is a new video QA dataset focusing on deeper cognitive movie understanding using System-2 thinking questions, generated via agentic brainstorming with LLMs, and enhanced with ACE module that boosts reasoning by 25%.

Details

Motivation: Existing video QA datasets focus on surface-level comprehension, lacking deeper cognitive understanding of movie content that requires System-2 thinking.

Method: Agentic brainstorming approach using multiple LLMs as thought agents to generate refined question-answer pairs, plus Agentic Choice Enhancement (ACE) module to improve VLM reasoning post-training.

Result: Developed MovieCORE dataset with cognitive tests for depth assessment, and ACE module improved model reasoning capabilities by up to 25% on deeper cognitive tasks.

Conclusion: Advances movie understanding in AI systems and provides insights into VQA model limitations when handling nuanced cinematic content questions, with publicly available dataset and code.

Abstract: This paper introduces MovieCORE, a novel video question answering (VQA) dataset designed to probe deeper cognitive understanding of movie content. Unlike existing datasets that focus on surface-level comprehension, MovieCORE emphasizes questions that engage System-2 thinking while remaining specific to the video material. We present an innovative agentic brainstorming approach, utilizing multiple large language models (LLMs) as thought agents to generate and refine high-quality question-answer pairs. To evaluate dataset quality, we develop a set of cognitive tests assessing depth, thought-provocation potential, and syntactic complexity. We also propose a comprehensive evaluation scheme for assessing VQA model performance on deeper cognitive tasks. To address the limitations of existing video-language models (VLMs), we introduce an agentic enhancement module, Agentic Choice Enhancement (ACE), which improves model reasoning capabilities post-training by up to 25%. Our work contributes to advancing movie understanding in AI systems and provides valuable insights into the capabilities and limitations of current VQA models when faced with more challenging, nuanced questions about cinematic content. Our project page, dataset and code can be found at https://joslefaure.github.io/assets/html/moviecore.html.

[45] HiPlan: Hierarchical Planning for LLM-Based Agents with Adaptive Global-Local Guidance

Ziyue Li, Yuan Chang, Gaihong Yu, Xiaoqiu Le

Main category: cs.CL

TL;DR: HiPlan is a hierarchical planning framework that improves LLM-based agents’ decision-making by providing adaptive global-local guidance through milestone decomposition and step-wise hints.

Details

Motivation: LLM-based agents struggle with complex, long-horizon planning due to lack of macroscopic guidance and insufficient continuous oversight during execution, leading to disorientation and failures.

Method: Hierarchical framework that decomposes tasks into milestone action guides and step-wise hints. Offline phase constructs milestone library from expert demonstrations, execution phase dynamically adapts trajectory segments to generate context-aware hints.

Result: Extensive experiments show HiPlan substantially outperforms strong baselines across two challenging benchmarks, with ablation studies validating the complementary benefits of hierarchical components.

Conclusion: HiPlan effectively addresses LLM agents’ planning limitations through structured hierarchical guidance, enabling better performance in complex decision-making scenarios.

Abstract: Large language model (LLM)-based agents have demonstrated remarkable capabilities in decision-making tasks, but struggle significantly with complex, long-horizon planning scenarios. This arises from their lack of macroscopic guidance, causing disorientation and failures in complex tasks, as well as insufficient continuous oversight during execution, rendering them unresponsive to environmental changes and prone to deviations. To tackle these challenges, we introduce HiPlan, a hierarchical planning framework that provides adaptive global-local guidance to boost LLM-based agents’ decision-making. HiPlan decomposes complex tasks into milestone action guides for general direction and step-wise hints for detailed actions. During the offline phase, we construct a milestone library from expert demonstrations, enabling structured experience reuse by retrieving semantically similar tasks and milestones. In the execution phase, trajectory segments from past milestones are dynamically adapted to generate step-wise hints that align current observations with the milestone objectives, bridging gaps and correcting deviations. Extensive experiments across two challenging benchmarks demonstrate that HiPlan substantially outperforms strong baselines, and ablation studies validate the complementary benefits of its hierarchical components.

[46] “Where does it hurt?” – Dataset and Study on Physician Intent Trajectories in Doctor Patient Dialogues

Tom Röhr, Soumyadeep Roy, Fares Al Mohamad, Jens-Michalis Papaioannou, Wolfgang Nejdl, Felix Gers, Alexander Löser

Main category: cs.CL

TL;DR: First study of physician intent trajectories in doctor-patient dialogues using SOAP framework taxonomy, with large-scale annotation and benchmarking of medical intent classification models.

Details

Motivation: To understand how physicians guide medical conversations through targeted questioning and intent patterns for better diagnosis and treatment outcomes.

Method: Developed fine-grained physician intent taxonomy based on SOAP framework, conducted large-scale annotation of 5000+ doctor-patient turns with medical experts, benchmarked state-of-the-art generative and encoder models for intent classification.

Result: Models understand general medical dialogue structure with high accuracy but struggle with SOAP category transitions; identified common medical dialogue trajectories; intent filtering significantly boosts medical dialogue summarization performance.

Conclusion: This work provides valuable insights for differential diagnosis system design and demonstrates the importance of intent understanding in medical dialogues, with publicly available dataset and annotation guidelines as a resource for future research.

Abstract: In a doctor-patient dialogue, the primary objective of physicians is to diagnose patients and propose a treatment plan. Medical doctors guide these conversations through targeted questioning to efficiently gather the information required to provide the best possible outcomes for patients. To the best of our knowledge, this is the first work that studies physician intent trajectories in doctor-patient dialogues. We use the Ambient Clinical Intelligence Benchmark (Aci-bench) dataset for our study. We collaborate with medical professionals to develop a fine-grained taxonomy of physician intents based on the SOAP framework (Subjective, Objective, Assessment, and Plan). We then conduct a large-scale annotation effort to label over 5000 doctor-patient turns with the help of a large number of medical experts recruited using Prolific, a popular crowd-sourcing platform. This large labeled dataset is an important resource contribution that we use for benchmarking the state-of-the-art generative and encoder models for medical intent classification tasks. Our findings show that our models understand the general structure of medical dialogues with high accuracy, but often fail to identify transitions between SOAP categories. We also report for the first time common trajectories in medical dialogue structures that provide valuable insights for designing differential diagnosis systems. Finally, we extensively study the impact of intent filtering for medical dialogue summarization and observe a significant boost in performance. We make the code and data, including annotation guidelines, publicly available at https://github.com/DATEXIS/medical-intent-classification.

[47] It’s All About In-Context Learning! Teaching Extremely Low-Resource Languages to LLMs

Yue Li, Zhixue Zhao, Carolina Scarton

Main category: cs.CL

TL;DR: LLMs struggle with extremely low-resource languages, especially those with rare scripts. This paper shows that zero-shot in-context learning with language alignment works best for such languages, while few-shot ICL or parameter-efficient fine-tuning is better for relatively better-represented languages.

Details

Motivation: Extremely low-resource languages, particularly those written in rare scripts, remain largely unsupported by large language models due to lack of training data and other compounding factors.

Method: Comprehensive analysis of whether LLMs can acquire low-resource languages via in-context learning (with/without auxiliary alignment signals) compared to parameter-efficient fine-tuning. Systematic evaluation of 20 under-represented languages across three state-of-the-art multilingual LLMs.

Result: PEFT has limitations when both language and script are extremely under-represented. Zero-shot ICL with language alignment is highly effective for extremely low-resource languages, while few-shot ICL or PEFT works better for relatively better-represented languages.

Conclusion: Guidelines for LLM practitioners: avoid fine-tuning multilingual models on languages of unseen scripts; use zero-shot ICL with alignment for extremely low-resource languages; use few-shot ICL or PEFT for better-represented languages.

Abstract: Extremely low-resource languages, especially those written in rare scripts, as shown in Figure 1, remain largely unsupported by large language models (LLMs). This is due in part to compounding factors such as the lack of training data. This paper delivers the first comprehensive analysis of whether LLMs can acquire such languages purely via in-context learning (ICL), with or without auxiliary alignment signals, and how these methods compare to parameter-efficient fine-tuning (PEFT). We systematically evaluate 20 under-represented languages across three state-of-the-art multilingual LLMs. Our findings highlight the limitation of PEFT when both language and its script are extremely under-represented by the LLM. In contrast, zero-shot ICL with language alignment is impressively effective on extremely low-resource languages, while few-shot ICL or PEFT is more beneficial for languages relatively better represented by LLMs. For LLM practitioners working on extremely low-resource languages, we summarise guidelines grounded by our results on adapting LLMs to low-resource languages, e.g., avoiding fine-tuning a multilingual model on languages of unseen scripts.

[48] Retrieval-Augmented Generation for Natural Language Art Provenance Searches in the Getty Provenance Index

Mathew Henrickson

Main category: cs.CL

TL;DR: RAG framework for art provenance research that enables natural-language multilingual searches in fragmented archival data, improving retrieval and summarization of auction records.

Details

Motivation: Provenance research is essential for art authentication and historical context but is hindered by fragmented multilingual data and metadata-dependent search systems that limit exploratory research.

Method: Retrieval-Augmented Generation framework with semantic retrieval and contextual summarization, tested on 10,000 records from Getty Provenance Index - German Sales.

Result: The approach provides scalable navigation of art market archives and practical tools for historians, reducing dependence on precise metadata structures.

Conclusion: RAG framework offers an effective solution for art provenance studies by enabling natural-language searches and improving access to fragmented multilingual archival data.

Abstract: This research presents a Retrieval-Augmented Generation (RAG) framework for art provenance studies, focusing on the Getty Provenance Index. Provenance research establishes the ownership history of artworks, which is essential for verifying authenticity, supporting restitution and legal claims, and understanding the cultural and historical context of art objects. The process is complicated by fragmented, multilingual archival data that hinders efficient retrieval. Current search portals require precise metadata, limiting exploratory searches. Our method enables natural-language and multilingual searches through semantic retrieval and contextual summarization, reducing dependence on metadata structures. We assess RAG’s capability to retrieve and summarize auction records using a 10,000-record sample from the Getty Provenance Index - German Sales. The results show this approach provides a scalable solution for navigating art market archives, offering a practical tool for historians and cultural heritage professionals conducting historically sensitive research.

[49] Beyond the Black Box: Integrating Lexical and Semantic Methods in Quantitative Discourse Analysis with BERTopic

Thomas Compton

Main category: cs.CL

TL;DR: A transparent hybrid framework for quantitative discourse analysis combining lexical and semantic methods using custom Python tools to ensure reproducibility and methodological triangulation.

Details

Motivation: Address the lack of transparency and researcher control in black-box QDA software like MAXQDA and NVivo, ensuring better alignment with research goals and methodological rigor.

Method: Custom Python pipelines using NLTK, spaCy, and Sentence Transformers for preprocessing and embeddings, combined with iterative BERTopic modeling (UMAP, HDBSCAN, c-TF-IDF) optimized through parameter tuning.

Result: Demonstrated through historical political discourse case study, showing enhanced topic coherence, coverage, and interpretability through multi-layered lexical-semantic approach.

Conclusion: Advocates for code-level transparency, researcher agency, and methodological triangulation in computational discourse studies to overcome limitations of isolated methods.

Abstract: Quantitative Discourse Analysis has seen growing adoption with the rise of Large Language Models and computational tools. However, reliance on black box software such as MAXQDA and NVivo risks undermining methodological transparency and alignment with research goals. This paper presents a hybrid, transparent framework for QDA that combines lexical and semantic methods to enable triangulation, reproducibility, and interpretability. Drawing from a case study in historical political discourse, we demonstrate how custom Python pipelines using NLTK, spaCy, and Sentence Transformers allow fine-grained control over preprocessing, lemmatisation, and embedding generation. We further detail our iterative BERTopic modelling process, incorporating UMAP dimensionality reduction, HDBSCAN clustering, and c-TF-IDF keyword extraction, optimised through parameter tuning and multiple runs to enhance topic coherence and coverage. By juxtaposing precise lexical searches with context-aware semantic clustering, we argue for a multi-layered approach that mitigates the limitations of either method in isolation. Our workflow underscores the importance of code-level transparency, researcher agency, and methodological triangulation in computational discourse studies. Code and supplementary materials are available via GitHub.
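The c-TF-IDF keyword extraction step mentioned above can be sketched in plain Python. This follows the class-based TF-IDF weighting as described in BERTopic's documentation (term frequency within a cluster, scaled by log(1 + A / f), where A is the average number of words per cluster and f is the term's frequency across all clusters); the L1 normalisation of term frequency, the function names, and the toy clusters are assumptions for this sketch, and it does not reproduce the paper's pipeline or call the BERTopic library.

```python
import math
from collections import Counter


def c_tf_idf(docs_by_cluster):
    """Score terms per cluster with a class-based TF-IDF weighting.

    `docs_by_cluster` maps a cluster id to a list of tokenised documents
    (each document is a list of string tokens).
    """
    cluster_counts = {
        cluster: Counter(tok for doc in docs for tok in doc)
        for cluster, docs in docs_by_cluster.items()
    }
    total_counts = Counter()
    for counts in cluster_counts.values():
        total_counts.update(counts)
    # A: average number of words per cluster.
    avg_words = sum(total_counts.values()) / len(cluster_counts)

    scores = {}
    for cluster, counts in cluster_counts.items():
        size = sum(counts.values())
        scores[cluster] = {
            term: (n / size) * math.log(1 + avg_words / total_counts[term])
            for term, n in counts.items()
        }
    return scores


def top_terms(scores, cluster, k=3):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    ranked = sorted(scores[cluster].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy, made-up clusters for illustration only.
toy_clusters = {
    0: [["tax", "budget", "tax"], ["budget", "vote"]],
    1: [["war", "treaty", "vote"], ["war", "peace"]],
}
toy_scores = c_tf_idf(toy_clusters)
print(top_terms(toy_scores, 0, 2), top_terms(toy_scores, 1, 1))
```

Terms that appear in every cluster (here, "vote") are down-weighted relative to cluster-specific terms, which is what makes the extracted keywords useful as topic labels.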

[50] Do LVLMs Know What They Know? A Systematic Study of Knowledge Boundary Perception in LVLMs

Zhikai Ding, Shiyu Ni, Keping Bi

Main category: cs.CL

TL;DR: LVLMs have reasonable but improvable knowledge boundary perception, with probabilistic and consistency-based confidence being more reliable than verbalized confidence. Calibration methods from LLMs can enhance perception, and LVLMs show better perception than LLMs despite lower performance.

Details

Motivation: To investigate how well large vision-language models (LVLMs) perceive their knowledge boundaries and understand what they know vs. don't know, as reliable models should be aware of their limitations to avoid hallucination.

Method: Evaluated three confidence signals (probabilistic, consistency-based, verbalized) across three LVLMs on three VQA datasets. Adapted LLM confidence calibration methods and proposed three new methods to enhance perception.

Result: LVLMs show reasonable but improvable perception. Probabilistic and consistency-based confidence are reliable indicators, while verbalized confidence causes overconfidence. LVLMs have better perception than LLMs despite lower QA performance.

Conclusion: LVLMs need improved knowledge boundary perception. Effective calibration methods exist, and jointly processing visual and textual inputs yields better knowledge boundary perception than text-only LLMs, though with lower question-answering performance.

Abstract: Large vision-language models (LVLMs) demonstrate strong visual question answering (VQA) capabilities but are shown to hallucinate. A reliable model should perceive its knowledge boundaries: knowing what it knows and what it does not. This paper investigates LVLMs’ perception of their knowledge boundaries by evaluating three types of confidence signals: probabilistic confidence, answer consistency-based confidence, and verbalized confidence. Experiments on three LVLMs across three VQA datasets show that, although LVLMs possess a reasonable perception level, there is substantial room for improvement. Among the three confidences, probabilistic and consistency-based signals are more reliable indicators, while verbalized confidence often leads to overconfidence. To enhance LVLMs’ perception, we adapt several established confidence calibration methods from Large Language Models (LLMs) and propose three effective methods. Additionally, we compare LVLMs with their LLM counterparts, finding that jointly processing visual and textual inputs decreases question-answering performance but reduces confidence, resulting in an improved perception level compared to LLMs.
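One common way to compute an answer consistency-based confidence is to sample several answers to the same question and take the share that agree with the majority answer. The sketch below assumes that formulation; the paper's exact definition is not given in this summary, and the function name, the lowercase/strip normalisation, and the sample answers are hypothetical.

```python
from collections import Counter


def consistency_confidence(sampled_answers):
    """Return (majority_answer, fraction of samples agreeing with it).

    Answers are compared after lowercasing and stripping whitespace, which
    is a simplifying assumption; real systems often need semantic matching.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    normalized = [answer.strip().lower() for answer in sampled_answers]
    majority, count = Counter(normalized).most_common(1)[0]
    return majority, count / len(normalized)


# Made-up samples standing in for repeated model generations.
samples = ["Paris", "paris", "Lyon", "Paris ", "Paris"]
answer, confidence = consistency_confidence(samples)
print(answer, confidence)
```

A low agreement fraction can then be treated as a signal that the question lies outside what the model reliably knows.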

[51] Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning

Alan Li, Yixin Liu, Arpan Sarkar, Doug Downey, Arman Cohan

Main category: cs.CL

TL;DR: The paper introduces SciReas and SciReas-Pro benchmarks for scientific reasoning evaluation, and KRUX framework to analyze knowledge vs reasoning roles. Key findings show knowledge retrieval is a bottleneck, external knowledge helps reasoning, and better reasoning surfaces more relevant knowledge.

Details

Motivation: There is no widely adopted holistic benchmark for evaluating scientific reasoning in LLMs, and few approaches systematically disentangle the roles of knowledge and reasoning in scientific tasks.

Method: Introduced SciReas (suite of scientific reasoning benchmarks) and SciReas-Pro (subset requiring complex reasoning). Proposed KRUX framework to probe knowledge vs reasoning roles. Conducted holistic evaluation and lightweight analysis comparing science-focused data composition.

Result: Key findings: (1) Knowledge retrieval from model parameters is a critical bottleneck; (2) Reasoning models benefit from external knowledge; (3) Enhanced reasoning improves ability to surface relevant knowledge. Also released SciLit01, a strong 8B baseline model.

Conclusion: The study provides comprehensive benchmarks and analysis framework for scientific reasoning, revealing important insights about knowledge-reasoning interplay and offering a strong baseline for future research in scientific problem solving with LLMs.

Abstract: Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for assisting human scientists, there is currently no widely adopted holistic benchmark for evaluating scientific reasoning, and few approaches systematically disentangle the distinct roles of knowledge and reasoning in these tasks. To address these gaps, we introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks, and SciReas-Pro, a selective subset that requires more complex reasoning. Our holistic evaluation surfaces insights about scientific reasoning performance that remain hidden when relying on individual benchmarks alone. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks. Combining the two, we conduct an in-depth analysis that yields several key findings: (1) Retrieving task-relevant knowledge from model parameters is a critical bottleneck for LLMs in scientific reasoning; (2) Reasoning models consistently benefit from external knowledge added in-context on top of the reasoning enhancement; (3) Enhancing verbalized reasoning improves LLMs’ ability to surface task-relevant knowledge. Finally, we conduct a lightweight analysis, comparing our science-focused data composition with concurrent efforts on long CoT SFT, and release SciLit01, a strong 8B baseline for scientific reasoning.

[52] Evaluating the Evaluators: Are readability metrics good measures of readability?

Isabel Cachola, Daniel Khashabi, Mark Dredze

Main category: cs.CL

TL;DR: Traditional readability metrics like Flesch-Kincaid perform poorly for plain language summaries. Language models are better at evaluating readability and capture deeper aspects like background knowledge requirements.

Details

Motivation: Current PLS evaluation relies on traditional readability metrics that haven't been validated against human judgments for plain language summarization tasks.

Method: Conducted thorough survey of PLS literature, evaluated 8 readability metrics against human judgments, tested Language Models as readability judges, and analyzed PLS datasets.

Result: Most traditional metrics correlate poorly with human judgments. Language models achieve up to 0.56 Pearson correlation with human judgments and better capture deeper readability aspects like background knowledge.

Conclusion: Language models are superior to traditional metrics for PLS readability evaluation. Recommendations provided for best practices in plain language summary evaluation.

Abstract: Plain Language Summarization (PLS) aims to distill complex documents into accessible summaries for non-expert audiences. In this paper, we conduct a thorough survey of PLS literature, and identify that the current standard practice for readability evaluation is to use traditional readability metrics, such as Flesch-Kincaid Grade Level (FKGL). However, despite proven utility in other fields, these metrics have not been compared to human readability judgments in PLS. We evaluate 8 readability metrics and show that most correlate poorly with human judgments, including the most popular metric, FKGL. We then show that Language Models (LMs) are better judges of readability, with the best-performing model achieving a Pearson correlation of 0.56 with human judgments. Extending our analysis to PLS datasets, which contain summaries aimed at non-expert audiences, we find that LMs better capture deeper measures of readability, such as required background knowledge, and lead to different conclusions than the traditional metrics. Based on these findings, we offer recommendations for best practices in the evaluation of plain language summaries. We release our analysis code and survey data.
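For reference, the Flesch-Kincaid Grade Level is computed as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59, and agreement with human ratings is measured here with Pearson correlation. The sketch below implements both; the vowel-run syllable counter is a crude assumption (production tools use pronunciation dictionaries or better heuristics), and the function names are hypothetical rather than taken from the paper's released code.

```python
import math
import re


def count_syllables(word):
    """Crude heuristic: count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fkgl(text):
    """Flesch-Kincaid Grade Level with a naive sentence and word splitter."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


grade = fkgl("The cat sat.")
print(round(grade, 2))
```

Because the formula only sees sentence length and syllable counts, it cannot register properties such as required background knowledge, which is consistent with the gap the abstract describes.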

[53] Generative Interfaces for Language Models

Jiaqi Chen, Yanzhe Zhang, Yutong Zhang, Yijia Shao, Diyi Yang

Main category: cs.CL

TL;DR: LLMs generate interactive UIs instead of text responses, enabling more adaptive engagement; human evaluators prefer them over traditional chat interfaces in over 70% of cases.

Details

Motivation: Traditional linear request-response LLM interactions are inefficient for multi-turn, information-dense, and exploratory tasks, limiting user experience.

Method: Proposes Generative Interfaces framework using structured interface-specific representations and iterative refinements to translate queries into task-specific UIs.

Result: Generative interfaces consistently outperform conversational ones, with humans preferring them in over 70% of cases across diverse tasks and interaction patterns.

Conclusion: Generative interfaces represent a superior paradigm for human-AI interaction, clarifying when and why users prefer UI-based responses over text-only conversations.

Abstract: Large language models (LLMs) are increasingly seen as assistants, copilots, and consultants, capable of supporting a wide range of tasks through natural conversation. However, most systems remain constrained by a linear request-response format that often makes interactions inefficient in multi-turn, information-dense, and exploratory tasks. To address these limitations, we propose Generative Interfaces for Language Models, a paradigm in which LLMs respond to user queries by proactively generating user interfaces (UIs) that enable more adaptive and interactive engagement. Our framework leverages structured interface-specific representations and iterative refinements to translate user queries into task-specific UIs. For systematic evaluation, we introduce a multidimensional assessment framework that compares generative interfaces with traditional chat-based ones across diverse tasks, interaction patterns, and query types, capturing functional, interactive, and emotional aspects of user experience. Results show that generative interfaces consistently outperform conversational ones, with humans preferring them in over 70% of cases. These findings clarify when and why users favor generative interfaces, paving the way for future advancements in human-AI interaction.

[54] A Survey on Data Selection for LLM Instruction Tuning

Bolin Zhang, Jiahao Wang, Qianlong Du, Jiajun Zhang, Zhiying Tu, Dianhui Chu

Main category: cs.CL

TL;DR: Survey paper on data selection methods for LLM instruction tuning, focusing on quality over quantity to reduce costs and improve performance.

Details

Motivation: Instruction tuning is crucial for LLMs, and recent research shows dataset quality matters more than quantity. There's a need to systematically survey data selection methods to enhance instruction-following capabilities while reducing training costs.

Method: Comprehensive survey approach: introduces commonly used instruction datasets, proposes a new taxonomy of data selection methods, details recent advances, and elaborates on evaluation strategies and results.

Result: The paper provides a systematic overview of data selection techniques for instruction tuning, categorizing methods and analyzing their effectiveness through detailed evaluation strategies.

Conclusion: The survey identifies open challenges and presents new frontiers in data selection for LLM instruction tuning, emphasizing the importance of quality-focused approaches over large-scale datasets.

Abstract: Instruction tuning is a vital step of training large language models (LLMs), so how to enhance the effect of instruction tuning has received increased attention. Existing works indicate that the quality of the dataset is more crucial than the quantity during instruction tuning of LLMs. Therefore, many recent studies focus on exploring methods of selecting high-quality subsets from instruction datasets, aiming to reduce training costs and enhance the instruction-following capabilities of LLMs. This paper presents a comprehensive survey on data selection for LLM instruction tuning. First, we introduce the widely used instruction datasets. Then, we propose a new taxonomy of the data selection methods and provide a detailed introduction of recent advances; the evaluation strategies and results of data selection methods are also elaborated in detail. Finally, we emphasize the open challenges and present new frontiers of this task.

[55] HateDebias: On the Diversity and Variability of Hate Speech Debiasing

Hongyan Wu, Zhengming Chen, Zijian Li, Nankai Lin, Lianxi Wang, Shengyi Jiang, Aimin Yang

Main category: cs.CL

TL;DR: Proposes HateDebias benchmark to analyze fairness of hate speech detection models under dynamically evolving environments with diverse and changing biases, and introduces a continual debiasing framework to address dynamic biases.

Details

Motivation: Existing hate speech detection datasets lack diversity and variability of bias, making them inadequate for real-world scenarios where biases evolve dynamically. There's a need to address fairness issues in hate speech detection under changing environments.

Method: Collected hate speech data with different bias types from real-world scenarios, constructed a dataset following continuous learning setting, and proposed a continual debiasing framework with memory replay and bias information regularization.

Result: The proposed methods achieved improved performance in mitigating dynamic biases in real-world scenarios, demonstrating effectiveness in maintaining model fairness under evolving bias conditions.

Conclusion: HateDebias benchmark effectively evaluates model fairness in dynamic environments, and the proposed continual debiasing framework shows practical utility for real-world hate speech detection applications with evolving biases.

Abstract: Hate speech frequently appears on social media platforms and urgently needs to be effectively controlled. Alleviating the bias caused by hate speech can help resolve various ethical issues. Although existing research has constructed several datasets for hate speech detection, these datasets seldom consider the diversity and variability of bias, making them far from real-world scenarios. To fill this gap, we propose a benchmark, HateDebias, to analyze the fairness of models under dynamically evolving environments. Specifically, to meet the diversity of biases, we collect hate speech data with different types of biases from real-world scenarios. To further simulate the variability in real-world scenarios (i.e., the changing of bias attributes in datasets), we construct a dataset that follows the continuous learning setting and evaluate the detection accuracy of models on HateDebias, where performance degradation indicates a significant bias toward a specific attribute. To provide a potential direction, we further propose a continual debiasing framework tailored to dynamic bias in real-world scenarios, integrating memory replay and bias information regularization to ensure the fairness of the model. Experimental results on the HateDebias benchmark reveal that our methods achieve improved performance in mitigating dynamic biases in real-world scenarios, highlighting their practicality in real-world applications.

[56] Exploring the Robustness of Language Models for Tabular Question Answering via Attention Analysis

Kushal Raj Bhandari, Sixue Xing, Soham Dan, Jianxi Gao

Main category: cs.CL

TL;DR: LLMs show strong table comprehension capabilities without specific training, but face reliability issues including data contamination and sensitivity to perturbations, with attention analysis revealing performance drops correlate with attention dispersion changes in middle layers.

Details

Motivation: To investigate how in-context learning, model scale, instruction tuning, and domain bias affect the robustness of Large Language Models on Tabular Question Answering tasks across diverse domains.

Method: Testing LLMs under diverse augmentations and perturbations on three domains: Wikipedia-based WTQ, financial TAT-QA, and scientific SCITAB, with in-depth attention analysis to examine correlation between perturbation-induced attention dispersion shifts and performance drops.

Result: Instruction tuning and larger, newer LLMs deliver stronger, more robust TQA performance, but data contamination and reliability issues remain unresolved, especially on WTQ. Attention analysis shows strong correlation between attention dispersion shifts and performance drops, with sensitivity peaking in middle layers.

Conclusion: There is a need for improved interpretable methodologies, structure-aware self-attention mechanisms, and domain-adaptive processing techniques to enhance transparency, generalization, and real-world reliability of LLMs on tabular data.

Abstract: Large Language Models (LLMs), already shown to ace various unstructured text comprehension tasks, have also remarkably been shown to tackle table (structured) comprehension tasks without specific training. Building on earlier studies of LLMs for tabular tasks, we probe how in-context learning (ICL), model scale, instruction tuning, and domain bias affect Tabular QA (TQA) robustness by testing LLMs, under diverse augmentations and perturbations, on diverse domains: Wikipedia-based WTQ, financial TAT-QA, and scientific SCITAB. Although instruction tuning and larger, newer LLMs deliver stronger, more robust TQA performance, data contamination and reliability issues, especially on WTQ, remain unresolved. Through an in-depth attention analysis, we reveal a strong correlation between perturbation-induced shifts in attention dispersion and the drops in performance, with sensitivity peaking in the model’s middle layers. We highlight the need for improved interpretable methodologies to develop more reliable LLMs for table comprehension. Based on these findings, we argue for the development of structure-aware self-attention mechanisms and domain-adaptive processing techniques to improve the transparency, generalization, and real-world reliability of LLMs on tabular data.
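One way to make "attention dispersion" concrete is the Shannon entropy of each attention distribution, compared between a clean and a perturbed input. The abstract does not state which dispersion measure the authors use, so entropy is an assumption of this sketch, and all function names are hypothetical.

```python
import math


def attention_entropy(weights) -> float:
    """Shannon entropy (in nats) of one attention distribution that sums to 1."""
    return -sum(w * math.log(w) for w in weights if w > 0)


def mean_entropy(rows) -> float:
    """Average entropy over the attention rows (one row per query position)."""
    return sum(attention_entropy(row) for row in rows) / len(rows)


def dispersion_shift(clean_rows, perturbed_rows) -> float:
    """Positive values mean attention became more spread out after the perturbation."""
    return mean_entropy(perturbed_rows) - mean_entropy(clean_rows)
```

Computing `dispersion_shift` per layer and correlating it with the accuracy drop would give a layer-wise picture of the kind the abstract describes, though the actual analysis may differ.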

[57] ChatGPT Doesn’t Trust Chargers Fans: Guardrail Sensitivity in Context

Victoria R. Li, Yida Chen, Naomi Saphra

Main category: cs.CL

TL;DR: This paper examines how user demographic and ideological information biases LLM guardrails, finding that GPT-3.5 shows differential refusal rates based on age, gender, ethnicity, and even sports fandom, with guardrails being sycophantic to inferred political ideologies.

Details

Motivation: While language model biases are well-documented, the biases in their guardrail systems have been neglected. The study aims to understand how contextual user information influences LLM refusal behavior.

Method: Researchers generated user biographies containing ideological and demographic information to test how different personas affect GPT-3.5’s likelihood of refusing requests for censored or illegal information.

Result: Younger, female, and Asian-American personas trigger more refusals. Guardrails show sycophantic behavior by refusing requests that contradict inferred political ideologies. Even sports fandom elicits guardrail sensitivity changes similar to direct political statements.

Conclusion: LLM guardrails exhibit significant biases based on user demographics and inferred political ideologies, with seemingly innocuous information like sports team preferences influencing refusal behavior similar to explicit political statements.

Abstract: While the biases of language models in production are extensively documented, the biases of their guardrails have been neglected. This paper studies how contextual information about the user influences the likelihood of an LLM to refuse to execute a request. By generating user biographies that offer ideological and demographic information, we find a number of biases in guardrail sensitivity on GPT-3.5. Younger, female, and Asian-American personas are more likely to trigger a refusal guardrail when requesting censored or illegal information. Guardrails are also sycophantic, refusing to comply with requests for a political position the user is likely to disagree with. We find that certain identity groups and seemingly innocuous information, e.g., sports fandom, can elicit changes in guardrail sensitivity similar to direct statements of political ideology. For each demographic category and even for American football team fandom, we find that ChatGPT appears to infer a likely political ideology and modify guardrail behavior accordingly.

[58] Recognizing Limits: Investigating Infeasibility in Large Language Models

Wenbo Zhang, Zihang Xu, Hengrui Cai

Main category: cs.CL

TL;DR: This paper addresses LLMs’ tendency to provide incorrect responses to infeasible tasks by developing a framework to identify and refuse such tasks, creating a benchmark dataset, and exploring fine-tuning methods to improve refusal capabilities.

Details

Motivation: LLMs often fail to recognize when queries exceed their knowledge and capabilities, leading to incorrect or fabricated responses. There's a need for models to properly identify and refuse infeasible tasks to improve reliability in real-world applications.

Method: The authors conceptualized four categories of infeasible tasks for LLMs, developed a benchmark dataset with diverse infeasible and feasible tasks, and explored fine-tuning approaches to enhance LLMs’ refusal capabilities.

Result: Experiments validated the effectiveness of the trained models in improving refusal capabilities, demonstrating that fine-tuning can significantly enhance LLMs’ ability to recognize and decline infeasible tasks.

Conclusion: The research provides a promising direction for improving LLM performance in real-world applications by enhancing their ability to refuse infeasible tasks, potentially reducing hallucinations and improving reliability.

Abstract: Large language models (LLMs) have shown remarkable performance in various tasks but often fail to handle queries that exceed their knowledge and capabilities, leading to incorrect or fabricated responses. This paper addresses the need for LLMs to recognize and refuse infeasible tasks due to the requests surpassing their capabilities. We conceptualize four main categories of infeasible tasks for LLMs, which cover a broad spectrum of hallucination-related challenges identified in prior literature. We develop and benchmark a new dataset comprising diverse infeasible and feasible tasks to evaluate multiple LLMs’ abilities to decline infeasible tasks. Furthermore, we explore the potential of increasing LLMs’ refusal capabilities with fine-tuning. Our experiments validate the effectiveness of the trained models, suggesting promising directions for improving the performance of LLMs in real-world applications.

[59] Label Set Optimization via Activation Distribution Kurtosis for Zero-shot Classification with Generative Models

Yue Li, Zhixue Zhao, Carolina Scarton

Main category: cs.CL

TL;DR: LOADS is a post-hoc method that optimizes label sets for zero-shot in-context learning by measuring neuron activation kurtosis, improving classification performance across tasks and models.

Details

Motivation: In-context learning performance is highly sensitive to prompt design, particularly class label options (lexicon, order), but this impact remains underexplored in zero-shot classification.

Method: LOADS uses kurtosis to measure neuron activation distribution in LLMs’ feed-forward networks for label selection, requiring only a single forward pass without gradients or labeled data.

Result: LOADS-selected label words consistently improve zero-shot ICL performance across classification tasks, datasets, models and languages, with a maximum performance gain from 0.54 to 0.76 compared to using original dataset labels.

Conclusion: Optimal label words activate fewer outlier neurons, and LOADS provides an effective, efficient method for label set optimization that enhances zero-shot classification performance without requiring additional training data.

Abstract: In-context learning (ICL) performance is highly sensitive to prompt design, yet the impact of class label options (e.g. lexicon or order) in zero-shot classification remains underexplored. This study proposes LOADS (Label set Optimization via Activation Distribution kurtosiS), a post-hoc method for selecting optimal label sets in zero-shot ICL with large language models (LLMs). LOADS is built upon the observations in our empirical analysis, the first to systematically examine how label option design (i.e., lexical choice, order, and elaboration) impacts classification performance. This analysis shows that the lexical choice of the labels in the prompt (such as agree vs. support in stance classification) plays an important role in both model performance and model’s sensitivity to the label order. A further investigation demonstrates that optimal label words tend to activate fewer outlier neurons in LLMs' feed-forward networks. LOADS then leverages kurtosis to measure the neuron activation distribution for label selection, requiring only a single forward pass without gradient propagation or labelled data. The LOADS-selected label words consistently demonstrate effectiveness for zero-shot ICL across classification tasks, datasets, models and languages, achieving maximum performance gain from 0.54 to 0.76 compared to the conventional approach of using original dataset label words.
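The kurtosis-based selection can be sketched as follows. The activation values here would come from a single forward pass per candidate label set; preferring the lowest kurtosis is an assumption drawn from the observation that good label words activate fewer outlier neurons, and `select_label` is a hypothetical helper rather than the LOADS implementation.

```python
def kurtosis(values) -> float:
    """Excess kurtosis: fourth standardized moment minus 3 (population form).

    Undefined for constant input, where the variance is zero.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2) - 3.0


def select_label(activations_by_label: dict) -> str:
    """Return the candidate whose activations have the lowest kurtosis (assumed criterion)."""
    return min(activations_by_label, key=lambda name: kurtosis(activations_by_label[name]))
```

A heavy-tailed activation vector, with a few very large entries, yields high kurtosis, so a candidate that triggers outlier neurons is ranked below one with a flatter activation profile.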

[60] From Intents to Conversations: Generating Intent-Driven Dialogues with Contrastive Learning for Multi-Turn Classification

Junhua Liu, Yong Keat Tan, Bin Fu, Kwan Hui Lim

Main category: cs.CL

TL;DR: Chain-of-Intent framework combines HMMs and LLMs to generate multilingual intent-driven dialogues through self-play, with MINT-CL for classification and MINT-E dataset release.

Details

Motivation: Addressing the challenge of generating large-scale, domain-specific multilingual dialogue datasets for training effective multi-turn intent classification models in conversational AI systems.

Method: Integrates Hidden Markov Models (HMMs) with LLMs to extract intent transition patterns from e-commerce chat logs, parameterize emission probabilities, and generate context-aware dialogues through self-play. Includes MINT-CL multi-task contrastive learning framework.

Result: Outperforms competitive baselines in both dialogue generation quality and classification accuracy, particularly in multilingual settings.

Conclusion: The proposed framework successfully generates high-quality multilingual intent-driven dialogues and improves classification performance while reducing dependence on large annotated datasets, with released dataset and code for future research.

Abstract: In conversational AI systems, a critical challenge in training effective multi-turn intent classification models lies in the generation of large-scale, domain-specific, multilingual dialogue datasets. In this paper, we introduce Chain-of-Intent, a novel framework that integrates Hidden Markov Models (HMMs) with Large Language Models (LLMs) to generate intent-driven, context-aware dialogues through self-play. Our method first extracts domain-specific intent transition patterns from real-world e-commerce chat logs, which guide the modeling of turn-level dynamics and intent sequences. LLMs are then employed to parameterize the emission probabilities of HMMs, enabling the generation of natural, coherent utterances aligned with predicted intents and dialogue context. We further propose MINT-CL, a multi-task contrastive learning framework for multi-turn intent classification, which improves performance while reducing dependence on large-scale annotated datasets. Empirical results demonstrate that our approach outperforms competitive baselines in both dialogue generation quality and classification accuracy, particularly in multilingual settings. To facilitate future research, we release MINT-E, a comprehensive, multilingual, intent-aware multi-turn dialogue corpus derived from the e-commerce domain. The reproduced source code and dataset are available at https://github.com/junhua/chain-of-intent.
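The intent-sequence side of this pipeline can be sketched as a Markov chain over intents. The transition table below is invented for illustration (the paper extracts it from e-commerce chat logs), and `emit_utterance` is a hypothetical stand-in for the LLM that parameterizes the emission step.

```python
import random

# Hypothetical transition probabilities; real values would be estimated from chat logs.
TRANSITIONS = {
    "<start>": {"track_order": 0.6, "refund": 0.4},
    "track_order": {"refund": 0.3, "<end>": 0.7},
    "refund": {"track_order": 0.2, "<end>": 0.8},
}


def sample_intents(transitions, rng, max_turns=10):
    """Walk the chain from <start> until <end> or until max_turns intents are drawn."""
    state, intents = "<start>", []
    while len(intents) < max_turns:
        options = transitions[state]
        state = rng.choices(list(options), weights=list(options.values()))[0]
        if state == "<end>":
            break
        intents.append(state)
    return intents


def emit_utterance(intent, history):
    """Placeholder emission step; in the paper an LLM generates the utterance."""
    return f"[turn {len(history) + 1}] user utterance expressing intent '{intent}'"


def generate_dialogue(transitions, rng):
    """Sample an intent sequence, then emit one utterance per intent."""
    history = []
    for intent in sample_intents(transitions, rng):
        history.append((intent, emit_utterance(intent, history)))
    return history
```

Because each utterance is paired with the intent that produced it, the generated dialogues come with turn-level labels, which is what makes them usable as training data for multi-turn intent classification.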

[61] Adapting Large Language Models to Log Analysis with Interpretable Domain Knowledge

Yuhe Ji, Yilun Liu, Feiyu Yao, Minggui He, Shimin Tao, Xiaofeng Zhao, Su Chang, Xinhua Yang, Weibin Meng, Yuming Xie, Boxing Chen, Shenglin Zhang, Yongqian Sun

Main category: cs.CL

TL;DR: SuperLog model bridges domain gap between natural language and log languages through continual pre-training on interpretable log knowledge, achieving 12.01% average accuracy improvement over existing methods.

Details

Motivation: Existing LLM solutions for log analysis suffer from domain gap between natural language and log languages containing domain-specific tokens, limiting their real-world effectiveness. Direct adaptation using raw logs degrades performance due to inconsistent token distribution.

Method: Domain adaptation approach integrating interpretable domain knowledge into open-source LLMs through continual pre-training on interpretable natural texts with log knowledge (instead of raw logs). Developed NLPLog dataset with 250,000+ QA pairs on log-related knowledge.

Result: SuperLog achieves best performance across four log analysis tasks with 12.01% average accuracy improvement over second-best model. Ablation study confirms advantages of domain adaptation using interpretable log knowledge over raw logs.

Conclusion: The approach successfully bridges the domain gap in log analysis by using interpretable log knowledge for continual pre-training, significantly outperforming existing methods and demonstrating the superiority of this adaptation strategy over raw log processing.

Abstract: Log analysis represents a critical sub-domain within AI applications that facilitates automatic approaches to fault and error management of large-scale software systems, saving the labor of traditional manual methods. While existing solutions using large language models (LLMs) show promise, they are limited by a significant domain gap between natural and log languages (the latter contains rich domain-specific tokens such as status codes, IP addresses, and resource paths), which restricts their effectiveness in real-world applications. However, directly adapting general-purpose LLMs to log analysis using raw logs may degrade their performance due to inconsistent token distribution. In this paper, we present a domain adaptation approach that addresses these limitations by integrating interpretable domain knowledge into open-source LLMs through continual pre-training (CPT), which bridges this domain gap by adapting LLMs on interpretable natural texts with log knowledge (instead of raw logs) to reduce distribution discrepancy. To achieve this, we developed NLPLog, a comprehensive dataset containing over 250,000 question-answer pairs on log-related knowledge. Our resulting model, SuperLog, achieves the best performance across four log analysis tasks, with an average accuracy improvement of 12.01% over the second-best model. An ablation study also suggests advantages of domain adaptation using interpretable log knowledge over using raw logs.

[62] TL-Training: A Task-Feature-Based Framework for Training Large Language Models in Tool Use

Junjie Ye, Yilong Wu, Sixian Li, Yuming Yang, Zhiheng Xi, Tao Gui, Qi Zhang, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan, Zhengyin Du

Main category: cs.CL

TL;DR: TL-Training is a task-feature-based framework that improves LLM tool-use performance by addressing data quality issues, optimizing token weighting, and implementing error-category-specific rewards, achieving state-of-the-art results with minimal training data.

Details

Motivation: Standard supervised fine-tuning approaches for LLM tool use often overlook task-specific characteristics and suffer from performance bottlenecks due to suboptimal training data and uneven token importance distribution.

Method: Proposed TL-Training framework: mitigates effects of suboptimal training data, dynamically adjusts token weights during SFT to prioritize key tokens, and incorporates robust reward mechanism tailored to error categories using proximal policy optimization.

Result: Trained CodeLLaMA-2-7B achieves state-of-the-art tool-use performance on four test sets using only 1,217 training data points, matching or surpassing both open- and closed-source LLMs. Also enhances robustness in noisy environments and improves general task performance.

Conclusion: TL-Training provides a scalable and efficient paradigm for tool-use training in LLMs, addressing key limitations of standard SFT approaches while achieving superior performance with minimal data requirements.

Abstract: Large language models (LLMs) achieve remarkable advancements by leveraging tools to interact with environments, a critical step toward generalized AI. However, the standard supervised fine-tuning (SFT) approach, which relies on large-scale datasets, often overlooks task-specific characteristics in tool use, leading to performance bottlenecks. To address this issue, we analyze three existing LLMs and uncover key insights: training data can inadvertently impede tool-use behavior, token importance is distributed unevenly, and errors in tool calls fall into a small set of categories. Building on these findings, we propose TL-Training, a task-feature-based framework that mitigates the effects of suboptimal training data, dynamically adjusts token weights to prioritize key tokens during SFT, and incorporates a robust reward mechanism tailored to error categories, optimized through proximal policy optimization. We validate TL-Training by training CodeLLaMA-2-7B and evaluating it on four open-source test sets. Our results demonstrate that the LLM trained by our method matches or surpasses both open- and closed-source LLMs in tool-use performance using only 1,217 training data points. Additionally, our method enhances robustness in noisy environments and improves general task performance, offering a scalable and efficient paradigm for tool-use training in LLMs. Code and data are available at https://github.com/Junjie-Ye/TL-Training.
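The idea of prioritizing key tokens during SFT can be illustrated with a weighted negative log-likelihood. How TL-Training actually derives its weights is not stated in the abstract, so the weights below are hand-picked assumptions and `weighted_nll` is a hypothetical helper.

```python
import math


def weighted_nll(token_probs, weights) -> float:
    """Weighted average of per-token negative log-likelihoods.

    token_probs: model probability assigned to each target token.
    weights: non-negative importance per token (assumed given).
    """
    total = sum(w * -math.log(p) for p, w in zip(token_probs, weights))
    return total / sum(weights)


# Example: the second token (say, a tool name) is predicted poorly.
probs = [0.9, 0.2, 0.8]
uniform_loss = weighted_nll(probs, [1.0, 1.0, 1.0])
keyed_loss = weighted_nll(probs, [1.0, 3.0, 1.0])  # up-weight the key token
```

With uniform weights this reduces to the usual mean token loss; raising the weight on a poorly predicted key token increases its share of the loss, and therefore of the gradient.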

[63] Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements

Guangxiang Zhao, Saier Hu, Xiaoqi Jian, Jinzhu Wu, Yuhan Wu, Change Jia, Lin Sun, Xiangzheng Zhang

Main category: cs.CL

TL;DR: LLMs show severe accuracy drops and unexpected biases when faced with minor content-preserving perturbations like option length changes, problem type variations, and irrelevant noun replacements, revealing reliance on superficial cues rather than robust generalization.

Details

Motivation: To assess LLMs' true generalization ability beyond standard benchmarks by testing their performance under controlled, content-preserving perturbations that reveal hidden weaknesses and biases.

Method: Proposed a “Generalization Stress Test” with three types of controlled perturbations: option length modifications, problem type changes, and irrelevant noun replacements, while keeping core content unchanged.

Result: Significant accuracy drops observed across models - Qwen 2.5 1.5B dropped from 89 to 36 on MMLU with option length changes, GPT4o lost 25 points with problem type changes, and 6-point average drop across all perturbation categories.

Conclusion: LLMs heavily rely on superficial cues rather than forming robust abstract representations, indicating poor generalization across formats, lexical variations, and irrelevant content shifts despite high benchmark scores.

Abstract: In this paper, we propose a “Generalization Stress Test” to assess Large Language Models’ (LLMs) generalization ability under slight and controlled perturbations, including option length, problem types, and irrelevant noun replacements. We find that, despite high benchmark scores, LLMs exhibit severe accuracy drops and unexpected biases (e.g., preference for longer distractors) when faced with these minor but content-preserving modifications. For example, Qwen 2.5 1.5B’s MMLU score rises from 60 to 89 and drops from 89 to 36 when option lengths are changed without altering the question. Even GPT4o experiences a 25-point accuracy loss when problem types are changed, with a 6-point drop across all three modification categories. These analyses suggest that LLMs rely heavily on superficial cues rather than forming robust, abstract representations that generalize across formats, lexical variations, and irrelevant content shifts.

[64] Bridging the Editing Gap in LLMs: FineEdit for Precise and Targeted Text Modifications

Yiming Zeng, Wanhao Yu, Zexin Li, Tao Ren, Yu Ma, Jinghan Cao, Xiyan Chen, Tingting Yu

Main category: cs.CL

TL;DR: FineEdit is a specialized LLM for precise text editing that outperforms state-of-the-art models by 10-40% on structured editing tasks across multiple domains.

Details

Motivation: Current LLMs struggle with precise, instruction-driven text editing that requires structural accuracy and domain convention adherence, especially in specialized domains like programming, LaTeX, and databases.

Method: Developed InstrEditBench (30k+ structured editing tasks) and trained FineEdit, a specialized editing model for context-aware text modifications.

Result: FineEdit outperforms Gemini models by ~10% on single-turn edits, Llama-3.2-3B by up to 30%, and Mistral-7B-OpenOrca by over 40% on direct editing tasks. It also generalizes well to multi-turn editing scenarios.

Conclusion: FineEdit demonstrates superior performance in precise text editing tasks and shows practical applicability, with the model and benchmark released for further research.

Abstract: Large Language Models (LLMs) have significantly advanced natural language processing, demonstrating strong capabilities in tasks such as text generation, summarization, and reasoning. Recently, their potential for automating precise text editing tasks across specialized domains, such as programming code, LaTeX, and structured database languages, has gained attention. However, current state-of-the-art LLMs still struggle with executing precise, instruction-driven edits, particularly when structural accuracy and strict adherence to domain conventions are required. To address these challenges, we introduce InstrEditBench, an automated benchmark dataset comprising over 30,000 structured editing tasks spanning diverse domains, including Wikipedia articles, LaTeX documents, source code, and database languages. Using this benchmark, we develop FineEdit, a specialized editing model explicitly trained for accurate, context-aware text modifications. Experimental evaluations demonstrate that FineEdit outperforms state-of-the-art models, achieving improvements of approximately 10% over Gemini models on single-turn edits, up to 30% over Llama-3.2-3B, and exceeding Mistral-7B-OpenOrca performance by over 40% on direct editing tasks. FineEdit also effectively generalizes to realistic multi-turn editing scenarios, highlighting its practical applicability. To facilitate further research and reproducibility, we release FineEdit at https://github.com/StuRinDQB/FineEdit and https://huggingface.co/datasets/YimingZeng/FineEdit_bench.

[65] Collaborative Evaluation of Deepfake Text with Deliberation-Enhancing Dialogue Systems

Jooyoung Lee, Xiaochen Zhu, Georgi Karadzhov, Tom Stafford, Andreas Vlachos, Dongwon Lee

Main category: cs.CL

TL;DR: Group collaboration improves deepfake text detection accuracy over individual efforts. DeepFakeDeLiBot chatbot enhances group dynamics but doesn’t significantly boost performance overall, though it helps participants who value group collaboration.

Details

Motivation: The proliferation of generative models makes it challenging to distinguish authentic human-authored content from deepfake content. Collaborative human efforts augmented by AI tools present a promising solution for this detection problem.

Method: The study explores DeepFakeDeLiBot, a deliberation-enhancing chatbot, to support groups in detecting deepfake text. Researchers compared group-based problem-solving with individual efforts and analyzed engagement with the chatbot.

Result: Group collaboration significantly improves accuracy in identifying machine-generated paragraphs compared to individual efforts. While the chatbot didn’t yield substantial performance gains overall, it enhanced group dynamics by fostering greater engagement, consensus building, and diverse reasoning. Participants with higher perceived effectiveness of group collaboration benefited from the chatbot.

Conclusion: Deliberative chatbots like DeepFakeDeLiBot have potential in fostering interactive and productive group dynamics while ensuring accuracy in collaborative deepfake text detection, particularly for those who value group collaboration.

Abstract: The proliferation of generative models has presented significant challenges in distinguishing authentic human-authored content from deepfake content. Collaborative human efforts, augmented by AI tools, present a promising solution. In this study, we explore the potential of DeepFakeDeLiBot, a deliberation-enhancing chatbot, to support groups in detecting deepfake text. Our findings reveal that group-based problem-solving significantly improves the accuracy of identifying machine-generated paragraphs compared to individual efforts. While engagement with DeepFakeDeLiBot does not yield substantial performance gains overall, it enhances group dynamics by fostering greater participant engagement, consensus building, and the frequency and diversity of reasoning-based utterances. Additionally, participants with higher perceived effectiveness of group collaboration exhibited performance benefits from DeepFakeDeLiBot. These findings underscore the potential of deliberative chatbots in fostering interactive and productive group dynamics while ensuring accuracy in collaborative deepfake text detection. Dataset and source code used in this study will be made publicly available upon acceptance of the manuscript.

[66] SmartBench: Is Your LLM Truly a Good Chinese Smartphone Assistant?

Xudong Lu, Haohao Gao, Renshou Wu, Shuai Ren, Xiaoxin Chen, Hongsheng Li, Fangyuan Li

Main category: cs.CL

TL;DR: SmartBench is the first benchmark for evaluating on-device LLMs in Chinese mobile contexts, addressing gaps in existing English-focused benchmarks by creating 20 practical tasks across 5 categories with automated evaluation criteria.

Details

Motivation: Existing LLM evaluation benchmarks focus on objective English tasks like math and coding, which don't reflect practical mobile usage scenarios for Chinese users, especially for on-device deployment on smartphones.

Method: Analyzed smartphone manufacturer functionalities and created 5 categories (text summarization, text Q&A, information extraction, content creation, notification management) with 20 specific tasks. Built high-quality datasets of 50-200 question-answer pairs per task reflecting everyday mobile interactions, with automated evaluation criteria.

Result: Comprehensive evaluations of on-device LLMs and MLLMs using SmartBench, including performance assessment after quantized deployment on real smartphone NPUs.

Conclusion: Provides a standardized framework for evaluating on-device LLMs in Chinese, promoting further development and optimization in this critical area. Code and data will be publicly available.

Abstract: Large Language Models (LLMs) have become integral to daily life, especially advancing as intelligent assistants through on-device deployment on smartphones. However, existing LLM evaluation benchmarks predominantly focus on objective tasks like mathematics and coding in English, which do not necessarily reflect the practical use cases of on-device LLMs in real-world mobile scenarios, especially for Chinese users. To address these gaps, we introduce SmartBench, the first benchmark designed to evaluate the capabilities of on-device LLMs in Chinese mobile contexts. We analyze functionalities provided by representative smartphone manufacturers and divide them into five categories: text summarization, text Q&A, information extraction, content creation, and notification management, further detailed into 20 specific tasks. For each task, we construct high-quality datasets comprising 50 to 200 question-answer pairs that reflect everyday mobile interactions, and we develop automated evaluation criteria tailored for these tasks. We conduct comprehensive evaluations of on-device LLMs and MLLMs using SmartBench and also assess their performance after quantized deployment on real smartphone NPUs. Our contributions provide a standardized framework for evaluating on-device LLMs in Chinese, promoting further development and optimization in this critical area. Code and data will be available at https://github.com/vivo-ai-lab/SmartBench.

[67] Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence

Yijiong Yu

Main category: cs.CL

TL;DR: A method that accelerates reasoning models by decoding multiple tokens per forward pass using tree-like attention masks, achieving up to nearly 100% decoding speedup while largely maintaining answer quality.

Details

Motivation: Recent reasoning models generate detailed reasoning processes that are computationally expensive and time-consuming, creating efficiency issues that need to be addressed.

Method: Leverage inherent parallelizability of tasks by decoding multiple tokens per forward pass using tree-like attention masks within a single sequence, avoiding additional memory usage.

Result: Experimental results show up to nearly 100% speedup in decoding while basically maintaining the answer quality.

Conclusion: The proposed method successfully accelerates reasoning processes significantly without compromising answer quality, addressing computational inefficiency in detailed reasoning generation.

Abstract: Recent advances in reasoning models have demonstrated significant improvements in accuracy by employing detailed and comprehensive reasoning processes. However, generating these lengthy reasoning sequences is computationally expensive and time-consuming. To address this inefficiency, we leverage the inherent parallelizability of certain tasks to accelerate the reasoning process. Specifically, when multiple parallel reasoning steps exist, we decode multiple tokens per forward pass via a tree-like attention mask within a single sequence, avoiding additional memory usage. Experimental results show that our method achieves up to nearly 100% speedup in decoding while basically maintaining the answer quality.
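To make the masking idea concrete, here is a minimal sketch of a tree-like attention mask. It assumes a shared prefix followed by parallel branches stored contiguously in one sequence. The function name `build_tree_mask` and this layout are illustrative assumptions, not the paper's implementation.

```python
from typing import List


def build_tree_mask(prefix_len: int, branch_lens: List[int]) -> List[List[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k.

    Assumed (hypothetical) layout: [prefix tokens][branch 0 tokens][branch 1 tokens]...
    """
    total = prefix_len + sum(branch_lens)

    # branch_id[i] is -1 for prefix tokens, otherwise the index of the branch.
    branch_id = [-1] * prefix_len
    for b, n in enumerate(branch_lens):
        branch_id.extend([b] * n)

    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(q + 1):  # causal: never attend to later positions
            # A key is visible if it is in the shared prefix
            # or belongs to the same branch as the query.
            if branch_id[k] == -1 or branch_id[k] == branch_id[q]:
                mask[q][k] = True
    return mask


if __name__ == "__main__":
    # Two prefix tokens followed by two branches of two tokens each.
    for row in build_tree_mask(2, [2, 2]):
        print("".join("1" if allowed else "0" for allowed in row))
```

Because each branch sees only the prefix and its own earlier tokens, one forward pass can extend every branch by one token without the branches interfering. A real system would also need per-branch position ids, which this sketch omits.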

[68] An Ontology-Driven Graph RAG for Legal Norms: A Hierarchical, Temporal, and Deterministic Approach

Hudson de Martim

Main category: cs.CL

TL;DR: Graph RAG framework using ontology-driven knowledge graphs to address legal document structure challenges, enabling temporal accuracy and verifiable legal AI systems.

Details

Motivation: Standard flat-text retrieval in legal RAG systems fails to capture hierarchical, temporal, and causal structures of law, leading to unreliable and anachronistic answers.

Method: Ontology-driven Graph RAG framework with formal LRMoo-inspired model distinguishing legal Works from versioned Expressions, temporal state modeling, and explicit legislative event nodes as Action nodes with planner-guided query strategy.

Result: Demonstrated through Brazilian Constitution case study, providing verifiable temporally-correct substrate for LLMs with higher-order analytical capabilities and reduced factual errors.

Conclusion: Practical framework for building more trustworthy and explainable legal AI systems by addressing structural limitations of traditional legal document retrieval.

Abstract: Retrieval-Augmented Generation (RAG) systems in the legal domain face a critical challenge: standard, flat-text retrieval is blind to the hierarchical, diachronic, and causal structure of law, leading to anachronistic and unreliable answers. This paper introduces an ontology-driven Graph RAG framework designed to overcome these limitations. We ground our knowledge graph in a formal, LRMoo-inspired model that distinguishes abstract legal Works from their versioned Expressions. We model temporal states as efficient aggregations that reuse the versioned expressions (CTVs) of unchanged components, and we reify legislative events as first-class Action nodes to make causality explicit and queryable. This structured backbone enables a unified, planner-guided query strategy that applies explicit policies to deterministically resolve complex requests for (i) point-in-time retrieval, (ii) hierarchical impact analysis, and (iii) auditable provenance reconstruction. Through a case study on the Brazilian Constitution, we demonstrate how this approach provides a verifiable, temporally-correct substrate for LLMs, enabling higher-order analytical capabilities while drastically reducing the risk of factual errors. The result is a practical framework for building more trustworthy and explainable legal AI systems.
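Point-in-time retrieval over versioned Expressions can be illustrated with a small sketch. The `Expression` schema, the validity-interval fields, and the sample data are all hypothetical; the paper's graph model with temporal aggregations and Action nodes is richer than this.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Expression:
    """One versioned text of an abstract legal Work (hypothetical schema)."""
    work_id: str
    valid_from: date
    valid_to: Optional[date]  # None means the version is still in force
    text: str


def expression_at(versions: List[Expression], work_id: str, when: date) -> Optional[Expression]:
    """Return the version of `work_id` in force on `when`, or None.

    Assumes validity intervals for one Work do not overlap, so the result
    is deterministic: at most one version matches any given date.
    """
    for v in versions:
        if v.work_id != work_id:
            continue
        if v.valid_from <= when and (v.valid_to is None or when < v.valid_to):
            return v
    return None


if __name__ == "__main__":
    # Placeholder data for illustration only; not taken from any real statute.
    versions = [
        Expression("art-1", date(2000, 1, 1), date(2010, 1, 1), "original wording"),
        Expression("art-1", date(2010, 1, 1), None, "amended wording"),
    ]
    hit = expression_at(versions, "art-1", date(2005, 6, 1))
    print(hit.text if hit else "no version in force")
```

Resolving the date against explicit validity intervals, rather than by text similarity, is what prevents a retriever from returning a later amendment for a question about an earlier date.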

[69] Improving Multilingual Language Models by Aligning Representations through Steering

Omar Mahmoud, Buddhika Laknath Semage, Thommen George Karimpanal, Santu Rana

Main category: cs.CL

TL;DR: Lightweight representation steering method enhances LLM multilingual performance by adding learned vectors to residual streams, outperforming most baselines and matching production translation systems with fewer resources.

Details

Motivation: Despite recent progress, how LLMs represent non-English tokens remains underexplored, creating a need for efficient methods to improve multilingual capabilities.

Method: Proposes representation steering - adding a learned vector to the residual stream at a single model layer to enhance multilingual performance.

Result: Evaluated against seven competitive baselines, including prompt optimization, SFT, and translation methods, and consistently outperforms most of them. Achieves performance on par with production-grade translation systems with far fewer resources. Shows complementarity with SFT for efficient representation realignment.

Conclusion: Activation-level interventions like representation steering are powerful tools for improving LLM multilingual capabilities, offering direct and efficient ways to realign internal representations.

Abstract: This paper investigates how Large Language Models (LLMs) represent non-English tokens – a question that remains underexplored despite recent progress. We propose a lightweight intervention method using representation steering, where a learned vector is added to the residual stream at a single model layer to enhance multilingual performance. Through extensive experiments across seven competitive baselines – including prompt optimization, supervised fine-tuning (SFT), in-context learning, cross-lingual transfer, and translation-based methods – we show that our approach consistently outperforms most alternatives. In particular, it achieves performance on par with production-grade translation systems while requiring far fewer resources. We further explore the complementarity between our method and SFT, demonstrating that steering offers a direct, efficient way to realign internal representations. These findings underscore the potential of activation-level interventions as a powerful tool for improving the multilingual capabilities of LLMs.
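The core operation, adding a vector to the residual stream at one layer, can be sketched in a few lines. This toy version uses plain Python lists and treats layers as arbitrary callables. The names `steer` and `forward_with_steering` are hypothetical, and the steering vector is simply passed in rather than learned as in the paper.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Hidden = List[Vector]  # one hidden-state vector per token position


def steer(residual: Hidden, steering_vector: Vector, scale: float = 1.0) -> Hidden:
    """Add `scale * steering_vector` to the hidden state at every token position."""
    if any(len(h) != len(steering_vector) for h in residual):
        raise ValueError("hidden size mismatch")
    return [[x + scale * s for x, s in zip(h, steering_vector)] for h in residual]


def forward_with_steering(
    layers: Sequence[Callable[[Hidden], Hidden]],
    hidden: Hidden,
    steering_vector: Vector,
    steer_layer: int,
) -> Hidden:
    """Run a stack of layer functions, steering the output of one layer only."""
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        if i == steer_layer:
            hidden = steer(hidden, steering_vector)
    return hidden


if __name__ == "__main__":
    identity = lambda h: h  # stand-in for a real transformer block
    out = forward_with_steering(
        [identity, identity], [[1.0, 2.0], [3.0, 4.0]], [0.5, -0.5], steer_layer=0
    )
    print(out)
```

In a real model the same effect is usually achieved by hooking the chosen layer's output. The intervention is lightweight because only the single vector is trained while the model weights stay frozen.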

[70] Truth or Twist? Optimal Model Selection for Reliable Label Flipping Evaluation in LLM-based Counterfactuals

Qianli Wang, Van Bach Nguyen, Nils Feldhus, Luis Felipe Villa-Arenas, Christin Seifert, Sebastian Möller, Vera Schmitt

Main category: cs.CL

TL;DR: Judge model selection for counterfactual data augmentation evaluation yields inconsistent results; independent non-fine-tuned judge models provide most reliable label flipping assessments, but human intervention is still needed.

Details

Motivation: To understand why different judge models produce inconsistent results when evaluating counterfactual examples for data augmentation in LLMs, and to determine the optimal relationship between generator and judge models.

Method: Conducted extensive experiments with 2 LLM-based methods, 3 datasets, 4 generator models, and 15 judge models across four relationship types (same model, same family, independent, distillation), complemented by a user study with 90 participants.

Result: Independent non-fine-tuned judge models provide the most reliable label flipping evaluations. Relationships aligned with user study results lead to better model performance and robustness, but significant gap remains between automated evaluation and human judgment.

Conclusion: Fully automated pipeline for counterfactual data augmentation may be inadequate and requires human intervention due to the substantial gap between automated judge models and human evaluation results.

Abstract: Counterfactual examples are widely employed to enhance the performance and robustness of large language models (LLMs) through counterfactual data augmentation (CDA). However, the selection of the judge model used to evaluate label flipping, the primary metric for assessing the validity of generated counterfactuals for CDA, yields inconsistent results. To decipher this, we define four types of relationships between the counterfactual generator and judge models: being the same model, belonging to the same model family, being independent models, and having a distillation relationship. Through extensive experiments involving two state-of-the-art LLM-based methods, three datasets, four generator models, and 15 judge models, complemented by a user study (n = 90), we demonstrate that judge models with an independent, non-fine-tuned relationship to the generator model provide the most reliable label flipping evaluations. Relationships between the generator and judge models, which are closely aligned with the user study for CDA, result in better model performance and robustness. Nevertheless, we find that the gap between the most effective judge models and the results obtained from the user study remains considerably large. This suggests that a fully automated pipeline for CDA may be inadequate and requires human intervention.
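As a rough illustration of the metric, the sketch below computes a label flip rate for a set of counterfactuals. It assumes a flip counts as successful when the judge assigns the intended target label; the paper may define it differently. The function name and the keyword-based toy judge are hypothetical stand-ins for an LLM judge.

```python
from typing import Callable, List, Tuple


def label_flip_rate(
    counterfactuals: List[Tuple[str, str]],
    judge: Callable[[str], str],
) -> float:
    """Fraction of counterfactuals whose judge-predicted label equals the target label.

    Each item is (counterfactual_text, target_label).
    """
    if not counterfactuals:
        raise ValueError("need at least one counterfactual")
    flipped = sum(1 for text, target in counterfactuals if judge(text) == target)
    return flipped / len(counterfactuals)


def toy_judge(text: str) -> str:
    """Keyword-based placeholder judge; a real setup would query a judge model."""
    return "positive" if ("great" in text or "fine" in text) else "negative"


if __name__ == "__main__":
    items = [
        ("great movie", "positive"),
        ("terrible movie", "negative"),
        ("fine, I guess", "negative"),  # the toy judge disagrees with this target
    ]
    print(label_flip_rate(items, toy_judge))
```

Swapping in a different `judge` changes the reported rate for the same counterfactuals, which is the inconsistency the paper investigates.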

[71] Can Pruning Improve Reasoning? Revisiting Long-CoT Compression with Capability in Mind for Better Reasoning

Shangziqi Zhao, Jiahao Yuan, Guisong Yang, Usman Naseem

Main category: cs.CL

TL;DR: Prune-on-Logic framework selectively prunes low-utility reasoning steps from Long-CoT through logic graphs and self-verification, improving accuracy while reducing tokens for small language models.

Details

Motivation: Long chain-of-thought reasoning improves LLM accuracy but its verbose style hinders effective distillation into small language models, requiring structural optimization.

Method: Transform Long-CoT into logic graphs and selectively prune low-utility reasoning steps under self-verification constraints using three pruning strategies: entire chains, core reasoning, and verification.

Result: Verification pruning consistently improves accuracy while reducing token usage, while reasoning or indiscriminate pruning degrades performance. Larger models benefit more from pruning due to richer but redundant reasoning.

Conclusion: Effective pruning aligns supervision with model capacity rather than merely shortening inputs, serving as a structural optimization strategy for CoT reasoning alignment with SLM capacity.

Abstract: Long chain-of-thought (Long-CoT) reasoning improves accuracy in LLMs, yet its verbose, self-reflective style often hinders effective distillation into small language models (SLMs). We revisit Long-CoT compression through the lens of capability alignment and ask: Can pruning improve reasoning? We propose Prune-on-Logic, a structure-aware framework that transforms Long-CoT into logic graphs and selectively prunes low-utility reasoning steps under self-verification constraints. Through systematic analysis across three pruning strategies - targeting entire chains, core reasoning, and verification - we find that verification pruning consistently improves accuracy while reducing token usage, whereas reasoning or indiscriminate pruning degrades performance. Our study reveals that effective pruning aligns supervision with model capacity rather than merely shortening inputs. Gains hold across tasks, model scales, and CoT capability, with larger models benefiting more from pruning due to richer but more redundant reasoning. Our empirical findings highlight pruning as a structural optimization strategy for aligning CoT reasoning with SLM capacity.

[72] sudoLLM: On Multi-role Alignment of Language Models

Soumadeep Saha, Akshay Chaturvedi, Joy Mahapatra, Utpal Garain

Main category: cs.CL

TL;DR: sudoLLM is a framework that adds user authorization controls to LLMs, enabling them to provide sensitive information only to authorized users through injected bias signals.

Details

Motivation: User authorization-based access controls are critical for safety-critical systems but haven't been extensively studied in LLMs, creating security vulnerabilities.

Method: sudoLLM injects subtle user-based biases into queries and trains LLMs to use this bias signal to produce sensitive information only when users are authorized.

Result: The approach shows substantially improved alignment, generalization, resistance to jailbreaking attacks, and fails-closed behavior, resolving tension between language modeling and safety objectives.

Conclusion: sudoLLM serves as an additional security layer that complements existing guardrail mechanisms for enhanced end-to-end safety with LLMs.

Abstract: User authorization-based access privileges are a key feature in many safety-critical systems, but have not been extensively studied in the large language model (LLM) realm. In this work, drawing inspiration from such access control systems, we introduce sudoLLM, a novel framework that results in multi-role aligned LLMs, i.e., LLMs that account for, and behave in accordance with, user access rights. sudoLLM injects subtle user-based biases into queries and trains an LLM to utilize this bias signal in order to produce sensitive information if and only if the user is authorized. We present empirical results demonstrating that this approach shows substantially improved alignment, generalization, resistance to prefix-based jailbreaking attacks, and “fails-closed” behavior. The persistent tension between the language modeling objective and safety alignment, which is often exploited to jailbreak LLMs, is somewhat resolved with the aid of the injected bias signal. Our framework is meant as an additional security layer, and complements existing guardrail mechanisms for enhanced end-to-end safety with LLMs.

[73] RePPL: Recalibrating Perplexity by Uncertainty in Semantic Propagation and Language Generation for Explainable QA Hallucination Detection

Yiming Huang, Junyan Zhang, Zihao Wang, Biquan Bie, Yunzhong Qiu, Yi R. Fung, Xinlei He

Main category: cs.CL

TL;DR: RePPL is a method that recalibrates uncertainty measurement for hallucination detection in LLMs by analyzing semantic propagation and language generation uncertainties, providing token-level explanations and achieving state-of-the-art detection performance.

Details

Motivation: Large Language Models suffer from hallucinations that limit their trustworthy use. Existing hallucination detection methods lack the ability to explain why hallucinations occur and which parts of inputs trigger them, despite improvements in uncertainty measurement.

Method: RePPL recalibrates uncertainty measurement by analyzing two aspects: uncertainty in semantic propagation (how attention mechanisms fuse token information across layers) and uncertainty in language generation (probability-based selection of semantics). It dispatches explainable uncertainty scores to each token and aggregates them in Perplexity-style Log-Average form.

Result: The method achieves the best comprehensive detection performance across various QA datasets on advanced models with an average AUC of 0.833. It produces token-level uncertainty scores as explanations for hallucinations and reveals chaotic patterns in hallucination occurrence.

Conclusion: RePPL provides an effective approach for hallucination detection with explainable token-level uncertainty scores, offering insights into hallucination patterns and demonstrating promising practical applications for improving LLM trustworthiness.

Abstract: Large Language Models (LLMs) have become powerful, but hallucinations remain a vital obstacle to their trustworthy use. While previous works improved the capability of hallucination detection by measuring uncertainty, they all lack the ability to explain the provenance behind why hallucinations occur, i.e., which part of the inputs tends to trigger hallucinations. Recent works on the prompt attack indicate that uncertainty exists in semantic propagation, where attention mechanisms gradually fuse local token information into high-level semantics across layers. Meanwhile, uncertainty also emerges in language generation, due to its probability-based selection of high-level semantics for sampled generations. Based on that, we propose RePPL to recalibrate uncertainty measurement by these two aspects, which dispatches explainable uncertainty scores to each token and aggregates in Perplexity-style Log-Average form as total score. Experiments show that our method achieves the best comprehensive detection performance across various QA datasets on advanced models (average AUC of 0.833), and our method is capable of producing token-level uncertainty scores as explanations for the hallucination. Leveraging these scores, we preliminarily find the chaotic pattern of hallucination and showcase its promising usage.
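For reference, the sketch below shows standard perplexity next to one plausible reading of a "Perplexity-style Log-Average" aggregation: the geometric mean of positive per-token scores. The `log_average` function is an assumption about the aggregation's general shape, not RePPL's exact formula, and it does not reproduce how the paper derives the per-token scores.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Standard perplexity: exp of the mean negative log-probability."""
    if not token_probs or any(p <= 0 for p in token_probs):
        raise ValueError("probabilities must be non-empty and strictly positive")
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


def log_average(token_scores: Sequence[float]) -> float:
    """Geometric mean of positive per-token scores: exp(mean(log s_i)).

    Assumed aggregation shape; the per-token scores themselves would come
    from the method's uncertainty estimates.
    """
    if not token_scores or any(s <= 0 for s in token_scores):
        raise ValueError("scores must be non-empty and strictly positive")
    return math.exp(sum(math.log(s) for s in token_scores) / len(token_scores))


if __name__ == "__main__":
    print(perplexity([0.5, 0.5]))   # two equally likely binary choices
    print(log_average([1.0, 4.0]))  # geometric mean of the two scores
```

Averaging in log space keeps the total score comparable across answers of different lengths, while the individual per-token scores remain available as explanations.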

[74] ELSPR: Evaluator LLM Training Data Self-Purification on Non-Transitive Preferences via Tournament Graph Reconstruction

Yan Yu, Yilun Liu, Minggui He, Shimin Tao, Weibin Meng, Xinhua Yang, Li Zhang, Hongxia Ma, Dengye Li, Daimeng Wei, Boxing Chen, Fuliang Li

Main category: cs.CL

TL;DR: ELSPR is a graph-based framework that identifies and removes ambiguous preference pairs from LLM training data to reduce non-transitivity and improve ranking reliability.

Details

Motivation: Non-transitive preferences in pairwise LLM evaluation undermine ranking reliability, primarily due to low-quality ambiguous data that creates circular preference patterns.

Method: ELSPR models pairwise preferences as tournament graphs, uses SCC analysis to quantify non-transitivity, and employs normalized directed graph structural entropy to measure preference clarity for systematic data filtering.

Result: Models trained on ELSPR-filtered data show 13.8% reduction in non-transitivity, 0.088 decrease in structural entropy, and significantly improved discriminative power. Human validation shows discarded data has much lower inter-annotator agreement (34.4% vs 52.6%) and model-human consistency (51.2% vs 80.6%).

Conclusion: ELSPR provides an effective data self-purification approach for developing more robust, consistent, and human-aligned LLM evaluation systems by systematically removing problematic preference data.

Abstract: Pairwise evaluation of large language models (LLMs) has become the dominant paradigm for benchmarking open-ended tasks, yet non-transitive preferences, where evaluators prefer A over B, B over C, but C over A, fundamentally undermine ranking reliability. We show that this critical issue stems largely from low-quality data that contains inherently ambiguous preference pairs. To address this challenge, we propose ELSPR, a principled graph-theoretic framework that models pairwise preferences as tournament graphs and systematically identifies problematic training data. ELSPR quantifies non-transitivity through strongly connected components (SCCs) analysis and measures overall preference clarity using a novel normalized directed graph structural entropy metric. Our filtering methodology selectively removes preference data that induce non-transitivity while preserving transitive preferences. Extensive experiments on the AlpacaEval benchmark demonstrate that models fine-tuned on ELSPR-filtered data achieve substantial improvements: a 13.8% reduction in non-transitivity, a 0.088 decrease in structural entropy, and significantly enhanced discriminative power in real-world evaluation systems. Human validation confirms that discarded data exhibit dramatically lower inter-annotator agreement (34.4% vs. 52.6%) and model-human consistency (51.2% vs. 80.6%) compared to cleaned data. These findings establish ELSPR as an effective data self-purification approach for developing more robust, consistent, and human-aligned LLM evaluation systems.
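The SCC step can be sketched with Kosaraju's algorithm on a graph of (winner, loser) edges. Any component with more than one node contains a preference cycle and is therefore non-transitive. This is a generic illustration under that assumption, not ELSPR's code; the structural entropy metric and the filtering rule are not reproduced.

```python
from collections import defaultdict
from typing import Hashable, Iterable, List, Set, Tuple


def strongly_connected_components(
    edges: Iterable[Tuple[Hashable, Hashable]],
) -> List[Set[Hashable]]:
    """Kosaraju's algorithm on a directed graph given as (winner, loser) edges."""
    graph, reverse = defaultdict(list), defaultdict(list)
    nodes, known = [], set()
    for a, b in edges:
        graph[a].append(b)
        reverse[b].append(a)
        for n in (a, b):
            if n not in known:
                known.add(n)
                nodes.append(n)

    # Pass 1: iterative DFS on the graph, recording nodes in finish order.
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:  # no unvisited successor left: this node is finished
                order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph in reverse finish order.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nxt in reverse[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    component.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


if __name__ == "__main__":
    # A beats B, B beats C, C beats A (a cycle); everyone beats D.
    prefs = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D"), ("B", "D"), ("C", "D")]
    cyclic = [c for c in strongly_connected_components(prefs) if len(c) > 1]
    print(cyclic)
```

Here A, B, and C form one cyclic component while D sits alone, so only the pairs among A, B, and C would be candidates for closer inspection.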

[75] Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models

Chen Han, Wenzhen Zheng, Xijin Tang

Main category: cs.CL

TL;DR: D2D is a multi-agent debate framework that reformulates misinformation detection as structured adversarial debates with domain-specific agents and multi-dimensional evaluation, achieving significant improvements over baseline methods.

Details

Motivation: Traditional misinformation detection methods rely on static classification and fail to capture real-world fact-checking processes. LLMs show promise but suffer from logical inconsistency and superficial verification in misinformation detection.

Method: Debate-to-Detect (D2D) framework with multi-agent debate structure. Assigns domain-specific profiles to agents and orchestrates five-stage debate process: Opening Statement, Rebuttal, Free Debate, Closing Statement, and Judgment. Uses multi-dimensional evaluation across Factuality, Source Reliability, Reasoning Quality, Clarity, and Ethics.

Result: Experiments with GPT-4o on two datasets show significant improvements over baseline methods. Case studies demonstrate iterative evidence refinement and improved decision transparency.

Conclusion: D2D represents a substantial advancement towards interpretable misinformation detection by simulating real-world fact-checking workflows through structured adversarial debates and multi-dimensional evaluation.

Abstract: The proliferation of misinformation in digital platforms reveals the limitations of traditional detection methods, which mostly rely on static classification and fail to capture the intricate process of real-world fact-checking. Despite advancements in Large Language Models (LLMs) that enhance automated reasoning, their application to misinformation detection remains hindered by issues of logical inconsistency and superficial verification. In response, we introduce Debate-to-Detect (D2D), a novel Multi-Agent Debate (MAD) framework that reformulates misinformation detection as a structured adversarial debate. Inspired by fact-checking workflows, D2D assigns domain-specific profiles to each agent and orchestrates a five-stage debate process, including Opening Statement, Rebuttal, Free Debate, Closing Statement, and Judgment. To transcend traditional binary classification, D2D introduces a multi-dimensional evaluation mechanism that assesses each claim across five distinct dimensions: Factuality, Source Reliability, Reasoning Quality, Clarity, and Ethics. Experiments with GPT-4o on two datasets demonstrate significant improvements over baseline methods, and the case study highlights D2D’s capability to iteratively refine evidence while improving decision transparency, representing a substantial advancement towards interpretable misinformation detection. The code will be released publicly after the official publication.

[76] Measuring Sycophancy of Language Models in Multi-turn Dialogues

Jiseung Hong, Grace Byun, Seungone Kim, Kai Shu, Jinho D. Choi

Main category: cs.CL

TL;DR: SYCON Bench is a new benchmark for evaluating sycophantic behavior in LLMs during multi-turn conversations, measuring how quickly models conform to user beliefs and how often they flip positions under pressure.

Details

Motivation: Existing research on LLM sycophancy focuses only on single-turn factual correctness, ignoring the dynamics of real-world multi-turn interactions where models may gradually conform to user beliefs.

Method: Developed SYCON Bench to evaluate 17 LLMs across three real-world scenarios, measuring Turn of Flip (how quickly models conform) and Number of Flip (frequency of stance shifts under sustained pressure). Also tested four prompting strategies.

Result: Sycophancy remains prevalent; alignment tuning amplifies it, while model scaling and reasoning optimization help resist undesirable views. Reasoning models outperform instruction-tuned models but fail when over-indexing on logic. Third-person perspective prompting reduced sycophancy by up to 63.8%.

Conclusion: Multi-turn conversational settings reveal significant sycophantic behavior in LLMs that single-turn evaluations miss. Strategic prompting approaches like third-person perspective can substantially reduce sycophancy, and reasoning capabilities help models maintain factual integrity.

Abstract: Large Language Models (LLMs) are expected to provide helpful and harmless responses, yet they often exhibit sycophancy – conforming to user beliefs regardless of factual accuracy or ethical soundness. Prior research on sycophancy has primarily focused on single-turn factual correctness, overlooking the dynamics of real-world interactions. In this work, we introduce SYCON Bench, a novel benchmark for evaluating sycophantic behavior in multi-turn, free-form conversational settings. Our benchmark measures how quickly a model conforms to the user (Turn of Flip) and how frequently it shifts its stance under sustained user pressure (Number of Flip). Applying SYCON Bench to 17 LLMs across three real-world scenarios, we find that sycophancy remains a prevalent failure mode. Our analysis shows that alignment tuning amplifies sycophantic behavior, whereas model scaling and reasoning optimization strengthen the model’s ability to resist undesirable user views. Reasoning models generally outperform instruction-tuned models but often fail when they over-index on logical exposition instead of directly addressing the user’s underlying beliefs. Finally, we evaluate four additional prompting strategies and demonstrate that adopting a third-person perspective reduces sycophancy by up to 63.8% in the debate scenario. We release our code and data at https://github.com/JiseungHong/SYCON-Bench.
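The two metrics can be illustrated from a per-turn sequence of stance labels. This sketch assumes Turn of Flip is the 1-based turn at which the stance first departs from the initial one, and Number of Flip counts stance changes between consecutive turns. The function name and the stance labels are hypothetical, and the benchmark's exact definitions may differ.

```python
from typing import List, Optional, Tuple


def flip_metrics(stances: List[str]) -> Tuple[Optional[int], int]:
    """Return (turn_of_flip, number_of_flips) for a per-turn stance sequence.

    turn_of_flip: 1-based turn at which the stance first differs from the
        initial stance, or None if the model never flips.
    number_of_flips: how many times the stance changes between consecutive turns.
    """
    if not stances:
        raise ValueError("need at least one turn")
    turn_of_flip: Optional[int] = None
    number_of_flips = 0
    for i in range(1, len(stances)):
        if stances[i] != stances[i - 1]:
            number_of_flips += 1
            if turn_of_flip is None:
                turn_of_flip = i + 1  # convert 0-based index to 1-based turn
    return turn_of_flip, number_of_flips


if __name__ == "__main__":
    dialogue = ["disagree", "disagree", "agree", "disagree", "agree"]
    print(flip_metrics(dialogue))        # -> (3, 3)
    print(flip_metrics(["disagree"] * 3))  # -> (None, 0)
```

A later Turn of Flip and a lower Number of Flip both indicate a model that holds its position longer under repeated user pressure.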

[77] Subjective Perspectives within Learned Representations Predict High-Impact Innovation

Likun Cao, Rui Pan, James Evans

Main category: cs.CL

TL;DR: Machine learning reveals that subjective perspectives (how people view concepts) predict innovation success better than background diversity, with perspective diversity driving creative achievement while background diversity often hinders it.

Details

Motivation: To understand how innovators' personal perspectives and interpersonal opportunities, shaped by prior experience, influence creative output and innovation capacity using machine learning approaches.

Method: Used dynamic machine-learned language representations to model innovators’ subjective perspectives in geometric concept spaces. Analyzed millions of scientists, inventors, writers, entrepreneurs, and Wikipedia contributors. Conducted natural experiments and AI agent simulations with varying perspective and background diversity.

Result: Perspective diversity consistently predicts creative achievement across all domains and time periods, while background diversity tends to have the opposite effect. Successful collaborators use common language to integrate diverse experiences from prior work trajectories.

Conclusion: Subjective perspectives are more important than background diversity for innovation success. Teams should focus on perspective diversity rather than just demographic or experiential diversity, with implications for team formation and research policy.

Abstract: Existing studies of innovation emphasize the power of social structures to shape innovation capacity. Emerging machine learning approaches, however, enable us to model innovators’ personal perspectives and interpersonal innovation opportunities as a function of their prior experience. We theorize and then quantify subjective perspectives and their interaction based on innovator positions within the geometric space of concepts inscribed by dynamic machine-learned language representations. Using data on millions of scientists, inventors, screenplay writers, entrepreneurs, and Wikipedia contributors across their respective creative domains, here we show that measured subjective perspectives predict which ideas individuals and groups will creatively attend to and successfully combine in the future. Across all cases and time periods we examine, when perspective diversity is decomposed as the difference between collaborators’ perspectives on their creation, and background diversity as the difference between their experiences, the former consistently anticipates creative achievement while the latter portends its opposite. We analyze a natural experiment and simulate creative collaborations between AI agents designed with various perspective and background diversity, which support our observational findings. We explore mechanisms underlying these findings and identify how successful collaborators leverage common language to weave together diverse experiences obtained through trajectories of prior work. These perspectives converge and provoke one another to innovate. We examine the significance of these findings for team formation and research policy.

[78] Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings

Liyan Xu, Zhenlin Su, Mo Yu, Jiangnan Li, Fandong Meng, Jie Zhou

Main category: cs.CL

TL;DR: Text encoders struggle with fine-grained entity/event recognition in retrieval tasks. A new evaluation dataset CapRetrieval reveals these limitations, and proposed finetuning strategies enable small models to outperform much larger ones.

Details

Motivation: Address the limitation of text encoders in recognizing fine-grained entities and events within encoded semantics, which causes retrieval failures even in simple cases.

Method: Introduce CapRetrieval evaluation dataset with image captions as passages and entity/event phrases as queries. Propose data generation strategies for finetuning encoders to improve fine-grained matching capabilities.

Result: Zero-shot evaluation shows encoders struggle with fine-grained matching regardless of training sources or model size. After finetuning, a small 0.1B encoder outperforms the state-of-the-art 7B model. The work also uncovers the ‘granularity dilemma’: embeddings struggle to capture fine-grained salience while staying aligned with overall semantics.

Conclusion: Current text encoders have significant limitations in fine-grained semantic recognition. The proposed methods and dataset enable substantial improvements, with small models achieving superior performance through targeted finetuning approaches.

Abstract: This work stems from an observed limitation of text encoders: embeddings may not be able to recognize fine-grained entities or events within encoded semantics, resulting in failed retrieval even in simple cases. To examine such behaviors, we first introduce a new evaluation dataset, CapRetrieval, in which passages are image captions and queries are phrases targeting entity or event concepts in diverse forms. Zero-shot evaluation suggests that encoders often struggle with such fine-grained matching, regardless of training sources or model size. Aiming for enhancement, we proceed to finetune encoders with our proposed data generation strategies, enabling a small 0.1B encoder to outperform the state-of-the-art 7B model. Within this process, we further uncover the granularity dilemma, a challenge for embeddings to capture fine-grained salience while aligning with overall semantics. Our dataset, code and models in this work are publicly released at https://github.com/lxucs/CapRetrieval.

[79] Evaluating Scoring Bias in LLM-as-a-Judge

Qingquan Li, Shaoyu Dou, Kailai Shao, Chao Chen, Haixiang Hu

Main category: cs.CL

TL;DR: This paper investigates scoring bias in LLM-as-a-Judge systems, where LLM evaluators show inconsistent scores when bias-related perturbations are applied, and proposes a framework to evaluate and mitigate such biases.

Details

Motivation: While LLM-as-a-Judge is widely adopted across various domains, current research focuses mainly on comparison-based evaluations, leaving scoring-based evaluation biases largely unexplored despite their impact on fairness and reliability.

Method: The authors define scoring bias, augment existing LLM-as-a-Judge benchmarks through data synthesis, design multi-faceted evaluation metrics, and conduct experiments to assess scoring stability under bias-related perturbations.

Result: Experimental results show that existing judge models’ scoring stability is disrupted by scoring biases, with valuable insights provided on prompt template design and bias mitigation strategies.

Conclusion: The study highlights the importance of addressing scoring biases in LLM-as-a-Judge systems and provides practical guidance for improving scoring prompt templates and mitigation approaches through aspects like score rubrics, IDs, and reference answer selection.

Abstract: The remarkable performance of Large Language Models (LLMs) gives rise to “LLM-as-a-Judge”, where LLMs are employed as evaluators for complex tasks. Moreover, it has been widely adopted across fields such as Natural Language Processing (NLP), preference learning, and various specific domains. However, there are various biases within LLM-as-a-Judge, which adversely affect the fairness and reliability of judgments. Current research on evaluating or mitigating bias in LLM-as-a-Judge predominantly focuses on comparison-based evaluations, while systematic investigations into bias in scoring-based evaluations remain limited. Therefore, we define scoring bias in LLM-as-a-Judge as the change in scores that occurs when judge models are subjected to bias-related perturbations, and provide a well-designed framework to comprehensively evaluate scoring bias. We augment existing LLM-as-a-Judge benchmarks through data synthesis to construct our evaluation dataset and design multi-faceted evaluation metrics. Our experimental results demonstrate that the scoring stability of existing judge models is disrupted by scoring biases. Further exploratory experiments and discussions provide valuable insights into the design of scoring prompt templates and the mitigation of scoring biases on aspects such as score rubrics, score IDs, and reference answer selection.
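The idea of scores differing under perturbation can be made concrete with a small sketch. The two quantities below (share of items whose score changed, and mean absolute score shift) are illustrative stand-ins chosen for this example; the abstract does not specify the paper's own metrics, and all names and scores here are hypothetical.

```python
from statistics import mean
from typing import Sequence


def _check(original: Sequence[float], perturbed: Sequence[float]) -> None:
    # Both runs must score the same non-empty list of items, in the same order.
    if len(original) != len(perturbed) or not original:
        raise ValueError("score lists must be non-empty and of equal length")


def changed_rate(original: Sequence[float], perturbed: Sequence[float]) -> float:
    """Fraction of items whose score changed after the prompt perturbation."""
    _check(original, perturbed)
    return mean(1.0 if o != p else 0.0 for o, p in zip(original, perturbed))


def mean_abs_shift(original: Sequence[float], perturbed: Sequence[float]) -> float:
    """Average absolute difference between original and perturbed scores."""
    _check(original, perturbed)
    return mean(abs(o - p) for o, p in zip(original, perturbed))


# Hypothetical judge scores for four responses, before and after a perturbation
# such as a reworded score rubric.
original = [4, 3, 5, 2]
perturbed = [4, 4, 3, 2]
print(changed_rate(original, perturbed))    # 0.5
print(mean_abs_shift(original, perturbed))  # 0.75
```

A perfectly stable judge would give 0 on both quantities for every perturbation.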

[80] Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing

Junyi Wen, Junyuan Liang, Zicong Hong, Wuhui Chen, Ting Cai, Zibin Zheng

Main category: cs.CL

TL;DR: Krul is a dynamic KV cache compression system for LLMs that adapts compression strategies per conversation based on attention similarity, reducing TTFT by 1.5x-2.68x and storage by 1.33x-2.35x without quality loss.

Details

Motivation: Existing KV cache compression methods use fixed strategies across all conversations, ignoring conversation-specific attention dynamics and causing accuracy degradation.

Method: Dynamic compression strategy selection based on attention similarity, token-wise heterogeneous attention similarity estimator, and bubble-free restoration scheduler with recomputation-loading pipeline.

Result: 1.5x-2.68x reduction in time-to-first-token and 1.33x-2.35x reduction in KV cache storage compared to state-of-the-art methods while maintaining generation quality.

Conclusion: Krul enables efficient and accurate KV cache restoration by adapting compression strategies to conversation-specific attention patterns, significantly improving performance metrics.

Abstract: Efficient state restoration in multi-turn conversations with large language models (LLMs) remains a critical challenge, primarily due to the overhead of recomputing or loading full key-value (KV) caches for all historical tokens. To address this, existing approaches compress KV caches across adjacent layers with highly similar attention patterns. However, these methods often apply a fixed compression scheme across all conversations, selecting the same layer pairs for compression without considering conversation-specific attention dynamics. This static strategy overlooks variability in attention pattern similarity across different conversations, which can lead to noticeable accuracy degradation. We present Krul, a multi-turn LLM inference system that enables accurate and efficient KV cache restoration. Krul dynamically selects compression strategies based on attention similarity across layer pairs and uses a recomputation-loading pipeline to restore the KV cache. It introduces three key innovations: 1) a preemptive compression strategy selector that preserves critical context for future conversation turns and selects a customized strategy for each conversation; 2) a token-wise heterogeneous attention similarity estimator that mitigates the attention similarity computation and storage overhead during model generation; 3) a bubble-free restoration scheduler that reduces potential bubbles caused by the imbalance between the recomputing and loading streams due to compressed KV caches. Empirical evaluations on real-world tasks demonstrate that Krul achieves a 1.5x-2.68x reduction in time-to-first-token (TTFT) and a 1.33x-2.35x reduction in KV cache storage compared to state-of-the-art methods without compromising generation quality.
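To illustrate why a per-conversation choice of layer pairs can differ from a fixed one, here is a toy sketch. It assumes each layer's attention pattern has been summarized as a flat vector and that adjacent layers are paired greedily when their cosine similarity exceeds a threshold; the summary vectors, the threshold, and the greedy rule are assumptions for illustration and are not Krul's actual selector.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_layer_pairs(
    attn_per_layer: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[Tuple[int, int]]:
    """Greedily pair adjacent layers whose attention summaries are similar.

    Each layer joins at most one pair, so a paired layer is skipped.
    """
    pairs: List[Tuple[int, int]] = []
    i = 0
    while i + 1 < len(attn_per_layer):
        if cosine(attn_per_layer[i], attn_per_layer[i + 1]) >= threshold:
            pairs.append((i, i + 1))
            i += 2
        else:
            i += 1
    return pairs


# Two hypothetical conversations with different attention summaries per layer.
conversation_a = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.0, 1.0]]
conversation_b = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
print(select_layer_pairs(conversation_a))  # [(0, 1), (2, 3)]
print(select_layer_pairs(conversation_b))  # [(1, 2)]
```

The two conversations end up with different shareable pairs, which is the variability a static scheme would miss.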

[81] SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs

Zhiqiang Liu, Enpei Niu, Yin Hua, Mengshu Sun, Lei Liang, Huajun Chen, Wen Zhang

Main category: cs.CL

TL;DR: SKA-Bench is a comprehensive benchmark for evaluating LLMs’ structured knowledge understanding across four knowledge forms (KG, Table, KG+Text, Table+Text) with four fundamental ability testbeds.

Details

Motivation: Existing evaluations for structured knowledge understanding are non-rigorous and focus on single knowledge types, lacking comprehensive assessment of specific capabilities.

Method: Three-stage pipeline construction of benchmark instances with questions, answers, positive knowledge units, and noisy knowledge units. Expanded into four testbeds: Noise Robustness, Order Insensitivity, Information Integration, and Negative Rejection.

Result: Empirical evaluations on 8 LLMs (including DeepSeek-R1) show significant challenges in structured knowledge understanding, with performance affected by noise amount, knowledge unit order, and hallucination.

Conclusion: LLMs still face substantial difficulties in structured knowledge understanding, and SKA-Bench provides a rigorous benchmark for diagnosing these shortcomings across multiple knowledge forms and capabilities.

Abstract: Although large language models (LLMs) have made significant progress in understanding Structured Knowledge (SK) like KG and Table, existing evaluations for SK understanding are non-rigorous (i.e., lacking evaluations of specific capabilities) and focus on a single type of SK. Therefore, we aim to propose a more comprehensive and rigorous structured knowledge understanding benchmark to diagnose the shortcomings of LLMs. In this paper, we introduce SKA-Bench, a Structured Knowledge Augmented QA Benchmark that encompasses four widely used structured knowledge forms: KG, Table, KG+Text, and Table+Text. We utilize a three-stage pipeline to construct SKA-Bench instances, each of which includes a question, an answer, positive knowledge units, and noisy knowledge units. To evaluate the SK understanding capabilities of LLMs in a fine-grained manner, we expand the instances into four fundamental ability testbeds: Noise Robustness, Order Insensitivity, Information Integration, and Negative Rejection. Empirical evaluations on 8 representative LLMs, including the advanced DeepSeek-R1, indicate that existing LLMs still face significant challenges in understanding structured knowledge, and their performance is influenced by factors such as the amount of noise, the order of knowledge units, and the hallucination phenomenon. Our dataset and code are available at https://github.com/Lza12a/SKA-Bench.
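As a rough sketch of how one instance (question, answer, positive units, noisy units) might be expanded into testbed variants, consider the following. The `Instance` class, the function names, and the way each variant is built are assumptions made for illustration; the benchmark's real construction pipeline may differ, and Information Integration is omitted here.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    question: str
    answer: str
    positive_units: List[str]
    noisy_units: List[str]


def noise_robustness(inst: Instance, n_noise: int, rng: random.Random) -> List[str]:
    """Mix all positive units with the first n_noise noisy units."""
    units = inst.positive_units + inst.noisy_units[:n_noise]
    rng.shuffle(units)
    return units


def order_variant(inst: Instance, rng: random.Random) -> List[str]:
    """Same knowledge units in a new random order (order insensitivity)."""
    units = inst.positive_units + inst.noisy_units
    rng.shuffle(units)
    return units


def negative_rejection(inst: Instance) -> List[str]:
    """Only noisy units: the expected behavior is to decline to answer."""
    return list(inst.noisy_units)


# Hypothetical knowledge-graph style instance.
inst = Instance(
    question="Where was Marie Curie born?",
    answer="Warsaw",
    positive_units=["(Marie Curie, born_in, Warsaw)"],
    noisy_units=["(Pierre Curie, born_in, Paris)", "(Warsaw, capital_of, Poland)"],
)
rng = random.Random(0)  # fixed seed so the variants are reproducible
print(noise_robustness(inst, n_noise=1, rng=rng))
print(order_variant(inst, rng))
print(negative_rejection(inst))
```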

[82] DLLMQuant: Quantizing Diffusion-based Large Language Models

Chen Xu, Dawei Yang

Main category: cs.CL

TL;DR: DLLMQuant is a post-training quantization framework specifically designed for diffusion-based LLMs that addresses quantization challenges through temporal-mask adaptive sampling, interaction-aware activation quantization, and certainty-guided quantization.

Details

Motivation: Direct application of existing PTQ methods to diffusion-based LLMs causes severe accuracy degradation (e.g., 16% drop with AWQ) due to DLLMs' unique mechanisms like dynamic masking, iterative generation, and bidirectional attention that clash with quantization.

Method: Proposes DLLMQuant with three novel techniques: 1) Temporal-Mask Adaptive Sampling (TMAS) for capturing distributions across timesteps, 2) Interaction-Aware Activation Quantization (IA-AQ) using bidirectional attention signals for dynamic resource allocation, and 3) Certainty-Guided Quantization (CGQ) integrating mask status and token scores for error compensation.

Result: DLLMQuant achieves significant performance gains while enhancing efficiency compared to existing PTQ methods that suffer from severe accuracy degradation when applied to diffusion-based LLMs.

Conclusion: The proposed DLLMQuant framework successfully addresses the unique quantization challenges in diffusion-based LLMs through specialized techniques that account for temporal dynamics, interaction patterns, and certainty factors, enabling effective compression without performance degradation.

Abstract: Diffusion-based large language models (DLLMs) have shown promise for non-autoregressive text generation, but their deployment is constrained by large model sizes and heavy computational costs. Post-training quantization (PTQ), a widely used method for compressing and accelerating Large Language Models (LLMs), suffers from severe accuracy degradation and reduced generalization performance when directly applied to DLLMs (e.g., AWQ suffers a 16% accuracy drop on LLADA under W4A4). This paper explores how DLLMs’ key mechanisms (dynamic masking, iterative generation, and bidirectional attention) clash with quantization. We identify three core issues: 1) Iterative generation and dynamic masking ratios lead to distinct token distributions across decoding steps, which are not adequately captured by existing PTQ calibration methods; 2) Quantization errors are accumulated and amplified progressively during iteration in DLLMs, causing quantized models to perform worse as decoding steps progress; 3) Unmasked tokens stabilize while masked tokens remain probabilistic, making the overall feature distribution incompatible with existing PTQ methods. To address these issues, we propose DLLMQuant, a PTQ framework tailored for DLLMs, which incorporates three novel techniques: 1) Temporal-Mask Adaptive Sampling (TMAS), a calibration method that accounts for both time and mask factors, with the capacity to capture distributions across timesteps; 2) Interaction-Aware Activation Quantization (IA-AQ), which utilizes bidirectional attention’s interaction signals to dynamically allocate quantization resources; 3) Certainty-Guided Quantization (CGQ), which integrates mask status and token scores as key weighting criteria into error compensation, making weight quantization more suitable for DLLMs. Experiments show that DLLMQuant achieves significant performance gains while enhancing efficiency.

[83] NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

NVIDIA: Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, Alex Kondratenko, Alex Shaposhnikov, Alexander Bukharin, Ali Taghibakhshi, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amy Shen, Andrew Tao, Ann Guan, Anna Shors, Anubhav Mandarwal, Arham Mehta, Arun Venkatesan, Ashton Sharabiani, Ashwath Aithal, Ashwin Poojary, Ayush Dattagupta, Balaram Buddharaju, Banghua Zhu, Barnaby Simkin, Bilal Kartal, Bita Darvish Rouhani, Bobby Chen, Boris Ginsburg, Brandon Norick, Brian Yu, Bryan Catanzaro, Charles Wang, Charlie Truong, Chetan Mungekar, Chintan Patel, Chris Alexiuk, Christian Munley, Christopher Parisien, Dan Su, Daniel Afrimi, Daniel Korzekwa, Daniel Rohrer, Daria Gitman, David Mosallanezhad, Deepak Narayanan, Dima Rekesh, Dina Yared, Dmytro Pykhtar, Dong Ahn, Duncan Riach, Eileen Long, Elliott Ning, Eric Chung, Erick Galinkin, Evelina Bakhturina, Gargi Prasad, Gerald Shen, Haifeng Qian, Haim Elisha, Harsh Sharma, Hayley Ross, Helen Ngo, Herman Sahota, Hexin Wang, Hoo Chang Shin, Hua Huang, Iain Cunningham, Igor Gitman, Ivan Moshkov, Jaehun Jung, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jian Zhang, Jiaqi Zeng, Jimmy Zhang, Jinze Xue, Jocelyn Huang, Joey Conway, John Kamalu, Jonathan Cohen, Joseph Jennings, Julien Veron Vialard, Junkeun Yi, Jupinder Parmar, Kari Briski, Katherine Cheung, Katherine Luna, Keith Wyss, Keshav Santhanam, Kezhi Kong, Krzysztof Pawelec, Kumar Anik, Kunlun Li, Kushan Ahmadian, Lawrence McAfee, Laya Sleiman, Leon Derczynski, Luis Vega, Maer Rodrigues de Melo, Makesh Narsimhan Sreedhar, Marcin Chochowski, Mark Cai, Markus Kliegl, Marta Stepniewska-Dziubinska, Matvei Novikov, Mehrzad Samadi, Meredith Price, Meriem Boubdir, Michael Boone, Michael Evans, Michal Bien, Michal Zawalski, Miguel Martinez, Mike Chrzanowski, Mohammad Shoeybi, Mostofa Patwary, Namit Dhameja, Nave Assaf, Negar Habibi, Nidhi Bhatia, Nikki Pope, Nima Tajbakhsh, Nirmal Kumar Juluru, Oleg Rybakov, Oleksii Hrinchuk, Oleksii Kuchaiev, Oluwatobi Olabiyi, Pablo Ribalta, Padmavathy Subramanian, Parth Chadha, Pavlo Molchanov, Peter Dykas, Peter Jin, Piotr Bialecki, Piotr Januszewski, Pradeep Thalasta, Prashant Gaikwad, Prasoon Varshney, Pritam Gundecha, Przemek Tredak, Rabeeh Karimi Mahabadi, Rajen Patel, Ran El-Yaniv, Ranjit Rajan, Ria Cheruvu, Rima Shahbazyan, Ritika Borkar, Ritu Gala, Roger Waleffe, Ruoxi Zhang, Russell J. Hewett, Ryan Prenger, Sahil Jain, Samuel Kriman, Sanjeev Satheesh, Saori Kaji, Sarah Yurick, Saurav Muralidharan, Sean Narenthiran, Seonmyeong Bak, Sepehr Sameni, Seungju Han, Shanmugam Ramasamy, Shaona Ghosh, Sharath Turuvekere Sreenivas, Shelby Thomas, Shizhe Diao, Shreya Gopal, Shrimai Prabhumoye, Shubham Toshniwal, Shuoyang Ding, Siddharth Singh, Siddhartha Jain, Somshubra Majumdar, Soumye Singhal, Stefania Alborghetti, Syeda Nahida Akter, Terry Kong, Tim Moon, Tomasz Hliwiak, Tomer Asida, Tony Wang, Tugrul Konuk, Twinkle Vashishth, Tyler Poon, Udi Karpas, Vahid Noroozi, Venkat Srinivasan, Vijay Korthikanti, Vikram Fugro, Vineeth Kalluru, Vitaly Kurin, Vitaly Lavrukhin, Wasi Uddin Ahmad, Wei Du, Wonmin Byeon, Ximing Lu, Xin Dong, Yashaswi Karnati, Yejin Choi, Yian Zhang, Ying Lin, Yonggan Fu, Yoshi Suhara, Zhen Dong, Zhiyu Li, Zhongbo Zhu, Zijia Chen

Main category: cs.CL

TL;DR: Nemotron-Nano-9B-v2 is a hybrid Mamba-Transformer model that achieves state-of-the-art accuracy with 6x higher inference throughput for reasoning workloads compared to similarly-sized models.

Details

Motivation: To increase throughput for reasoning workloads while maintaining high accuracy by replacing most self-attention layers with Mamba-2 layers for improved inference speed on long thinking traces.

Method: Built on Nemotron-H architecture, pre-trained a 12B parameter model on 20T tokens using FP8 training, then aligned and compressed using Minitron strategy to enable inference on 128k tokens on a single A10G GPU.

Result: Achieves on-par or better accuracy than similarly-sized models (e.g., Qwen3-8B) with up to 6x higher inference throughput in reasoning settings (8k input, 16k output tokens).

Conclusion: Successfully demonstrates that hybrid Mamba-Transformer architecture can significantly improve reasoning throughput while maintaining competitive accuracy, with models and datasets released publicly.

Abstract: We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano-12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.

[84] SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models

Peng Ding, Wen Sun, Dailin Li, Wei Zou, Jiaming Wang, Jiajun Chen, Shujian Huang

Main category: cs.CL

TL;DR: SDGO is a reinforcement learning framework that uses LLMs’ own discrimination capabilities as reward signals to enhance generation safety without external data or models.

Details

Motivation: LLMs show safety inconsistency - they can better identify harmful content as discriminators than defend against it as generators, creating vulnerability to jailbreaking attacks.

Method: Proposed SDGO framework using self-discrimination-guided optimization through reinforcement learning, leveraging the model’s discrimination capability as reward signal for iterative self-improvement.

Result: SDGO significantly improves model safety compared to baseline methods while maintaining helpfulness, and shows robust performance against out-of-distribution jailbreaking attacks.

Conclusion: Aligning LLMs’ discrimination and generation capabilities through SDGO enables safer content generation with minimal discriminative samples, achieving tighter coupling between these capabilities.

Abstract: Large Language Models (LLMs) excel at various natural language processing tasks but remain vulnerable to jailbreaking attacks that induce harmful content generation. In this paper, we reveal a critical safety inconsistency: LLMs can more effectively identify harmful requests as discriminators than defend against them as generators. This insight inspires us to explore aligning the model’s inherent discrimination and generation capabilities. To this end, we propose SDGO (Self-Discrimination-Guided Optimization), a reinforcement learning framework that leverages the model’s own discrimination capabilities as a reward signal to enhance generation safety through iterative self-improvement. Our method does not require any additional annotated data or external models during the training phase. Extensive experiments demonstrate that SDGO significantly improves model safety compared to both prompt-based and training-based baselines while maintaining helpfulness on general benchmarks. By aligning LLMs’ discrimination and generation capabilities, SDGO brings robust performance against out-of-distribution (OOD) jailbreaking attacks. This alignment achieves tighter coupling between these two capabilities, enabling the model’s generation capability to be further enhanced with only a small amount of discriminative samples. Our code and datasets are available at https://github.com/NJUNLP/SDGO.

[85] From Confidence to Collapse in LLM Factual Robustness

Alina Fastowski, Bardh Prenkaj, Gjergji Kasneci

Main category: cs.CL

TL;DR: The paper introduces Factual Robustness Score (FRS) to measure knowledge stability in LLMs using token distribution entropy and temperature scaling sensitivity, showing smaller models are less robust than larger ones.

Details

Motivation: Existing evaluation methods focus on performance-based metrics and prompt perturbations, missing the internal generation process perspective of knowledge robustness.

Method: Developed FRS metric combining token distribution entropy and temperature scaling sensitivity to quantify fact stability against decoding perturbations. Tested on 5 LLMs across 3 QA datasets (SQuAD, TriviaQA, HotpotQA).

Result: Factual robustness varies significantly - smaller models have FRS of 0.76, larger ones 0.93. Accuracy degrades by ~60% under increased uncertainty.

Conclusion: Entropy and temperature scaling significantly impact factual accuracy, providing foundation for developing more robust knowledge retention and retrieval in future models.

Abstract: Ensuring the robustness of factual knowledge in LLMs is critical for reliable applications in tasks such as question answering and reasoning. However, existing evaluation methods predominantly focus on performance-based metrics, often investigating from the perspective of prompt perturbations, which captures only the externally triggered side of knowledge robustness. To bridge this gap, we introduce a principled approach to measure factual robustness from the perspective of the generation process by analyzing token distribution entropy in combination with temperature scaling sensitivity. These two factors build the Factual Robustness Score (FRS), a novel metric which quantifies the stability of a fact against perturbations in decoding conditions, given its initial uncertainty. To validate our approach, we conduct extensive experiments on 5 LLMs across 3 closed-book QA datasets (SQuAD, TriviaQA, and HotpotQA). We show that factual robustness varies significantly (smaller models report an FRS of 0.76, larger ones 0.93), with accuracy degrading by ~60% under increased uncertainty. These insights demonstrate how entropy and temperature scaling impact factual accuracy, and lay a foundation for developing more robust knowledge retention and retrieval in future models.
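The two ingredients named in this abstract, token distribution entropy and temperature scaling, can be sketched with standard definitions. This is not the FRS formula, which the abstract does not give; it only shows how raising the temperature flattens a token distribution and raises its entropy. The logits are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical logits for three candidate answer tokens.
logits = [4.0, 1.0, 0.5]
for temperature in (0.5, 1.0, 2.0):
    probs = softmax(logits, temperature)
    print(temperature, round(probs[0], 3), round(entropy(probs), 3))
```

As the temperature grows, the probability of the top token drops and the entropy rises, which is the kind of decoding perturbation the robustness score is described as probing.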

[86] Dream to Chat: Model-based Reinforcement Learning on Dialogues with User Belief Modeling

Yue Zhao, Xiaoyu Wang, Dan Wang, Zhonglin Jiang, Qingqing Gu, Teng Chen, Ningyuan Xi, Jinxian Qu, Yong Chen, Luo Ji

Main category: cs.CL

TL;DR: DreamCUB framework applies model-based reinforcement learning to dialogue systems using a POMDP approach to model user beliefs (emotion, sentiment, intention) through information bottleneck maximization, achieving SOTA performance on emotion/sentiment classification while improving dialogue quality.

Details

Motivation: World models are widely used in robotics and gaming but have limited applications in natural language tasks. The paper aims to extend world modeling to dialogue systems by predicting user states and future utterances.

Method: Constructed a dialogue world model using POMDP formulation, modeling emotion/sentiment/intention as user beliefs via information bottleneck maximization. Applied model-based RL framework (DreamCUB) with joint training of policy, critic, and world model components.

Result: Achieved state-of-the-art performance on emotion classification and sentiment identification. Dialogue quality was enhanced through joint training. The framework demonstrated good exploration-exploitation balance and transferred well to out-of-domain scenarios like empathetic dialogues.

Conclusion: The dialogue world model approach successfully extends world modeling concepts to natural language tasks, providing effective user belief modeling and improving both classification performance and dialogue quality through the DreamCUB framework.

Abstract: World models have been widely utilized in robotics, gaming, and autonomous driving. However, their applications to natural language tasks are relatively limited. In this paper, we construct a dialogue world model, which can predict the user’s emotion, sentiment, and intention, as well as future utterances. By defining a POMDP, we argue that emotion, sentiment and intention can be modeled as the user belief and solved by maximizing the information bottleneck. With this user belief modeling, we apply the model-based reinforcement learning framework to the dialogue system, and propose a framework called DreamCUB. Experiments show that the pretrained dialogue world model can achieve state-of-the-art performance on emotion classification and sentiment identification, while dialogue quality is also enhanced by joint training of the policy, critic and dialogue world model. Further analysis shows that this approach maintains a reasonable exploration-exploitation balance and also transfers well to out-of-domain scenarios such as empathetic dialogues.

[87] CausalSent: Interpretable Sentiment Classification with RieszNet

Daniel Frees, Martin Pollack

Main category: cs.CL

TL;DR: CausalSent framework improves treatment effect estimation accuracy in NLP models using RieszNet-based architecture, reducing MAE by 2-3x compared to previous work, and demonstrates causal effect of word “love” in movie reviews.

Details

Motivation: Despite high performance of NLP models, their decisions remain a black box. Causal NLP aims to combine causal inference with modern NLP to elucidate causal effects of text features and improve model interpretability.

Method: Developed a two-headed RieszNet-based neural network architecture for better treatment effect estimation. Replicated and extended previous work on regularizing text classifiers, focusing on semi-synthetic IMDB movie reviews data.

Result: CausalSent framework reduced MAE of effect estimates by 2-3x compared to previous work on synthetic Civil Comments data. Observational study showed presence of word “love” causes +2.9% increase in probability of positive sentiment.

Conclusion: The proposed CausalSent framework successfully improves treatment effect estimation accuracy and provides interpretable causal insights into text features, demonstrating practical application in analyzing sentiment causation in movie reviews.

Abstract: Despite the overwhelming performance improvements offered by recent natural language processing (NLP) models, the decisions made by these models are largely a black box. Towards closing this gap, the field of causal NLP combines causal inference literature with modern NLP models to elucidate causal effects of text features. We replicate and extend Bansal et al.’s work on regularizing text classifiers to adhere to estimated effects, focusing instead on model interpretability. Specifically, we focus on developing a two-headed RieszNet-based neural network architecture which achieves better treatment effect estimation accuracy. Our framework, CausalSent, accurately predicts treatment effects in semi-synthetic IMDB movie reviews, reducing MAE of effect estimates by 2-3x compared to Bansal et al.’s MAE on synthetic Civil Comments data. With an ensemble of validated models, we perform an observational case study on the causal effect of the word “love” in IMDB movie reviews, finding that the presence of the word “love” causes a +2.9% increase in the probability of a positive sentiment.

[88] Less Is More? Examining Fairness in Pruned Large Language Models for Summarising Opinions

Nannan Huang, Haytham Fayek, Xiuzhen Zhang

Main category: cs.CL

TL;DR: Pruning LLMs can negatively impact fairness in opinion summarization. Proposed HGLA pruning method better maintains fairness by removing parameters redundant for input but influential for output.

Details

Motivation: To investigate how post-training pruning affects fairness in LLM-generated opinion summaries, as biased outputs could influence public views, and existing methods' fairness impacts remain unexplored.

Method: Comprehensive empirical analysis of three pruning methods and calibration sets across three LLMs using four fairness metrics. Proposed HGLA pruning that identifies and removes parameters redundant for input processing but influential in output generation.

Result: Pruning methods have a greater impact on fairness than calibration sets. HGLA better maintains or improves fairness compared to existing methods across models and tasks. Human evaluation shows HGLA-generated outputs are fairer than those from existing state-of-the-art pruning methods.

Conclusion: HGLA pruning shows promise for maintaining fairness in compressed models, addressing limitations of traditional pruning methods in opinion summarization tasks.

Abstract: Model compression through post-training pruning offers a way to reduce model size and computational requirements without significantly impacting model performance. However, the effect of pruning on the fairness of LLM-generated summaries remains unexplored, particularly for opinion summarisation where biased outputs could influence public views. In this paper, we present a comprehensive empirical analysis of opinion summarisation, examining three state-of-the-art pruning methods and various calibration sets across three open-source LLMs using four fairness metrics. Our systematic analysis reveals that pruning methods have a greater impact on fairness than calibration sets. Building on these insights, we propose High Gradient Low Activation (HGLA) pruning, which identifies and removes parameters that are redundant for input processing but influential in output generation. Our experiments demonstrate that HGLA can better maintain or even improve fairness compared to existing methods, showing promise across models and tasks where traditional methods have limitations. Our human evaluation shows HGLA-generated outputs are fairer than existing state-of-the-art pruning methods. Code is available at: https://github.com/amberhuang01/HGLA.

cs.CV

[89] Towards Training-Free Underwater 3D Object Detection from Sonar Point Clouds: A Comparison of Traditional and Deep Learning Approaches

M. Salman Shaukat, Yannik Käckenmeister, Sebastian Bader, Thomas Kirste

Main category: cs.CV

TL;DR: This paper compares two approaches to underwater 3D object detection from sonar data that need no real-world training data: neural networks trained on synthetic data and model-based template matching. The neural networks achieved 98% mAP on synthetic data but dropped to 40% on real data due to domain shift, while template matching maintained 83% mAP on real data without any training.

Details

Motivation: Underwater 3D object detection is challenging due to harsh acoustic environments and scarcity of annotated training data, which is expensive and logistically complex to obtain. The research aims to achieve reliable detection without real-world training data.

Method: Developed two paradigms: 1) Physics-based sonar simulation pipeline generating synthetic training data for neural networks, and 2) Model-based template matching system leveraging geometric priors of target objects. Both approaches were evaluated on real bathymetry surveys from the Baltic Sea.

Result: Neural networks trained on synthetic data achieved 98% mAP on simulated scenes but dropped to 40% mAP on real sonar data due to domain shift. Template matching maintained 83% mAP on real data without requiring any training, showing robustness to acoustic noise and environmental variations.

Conclusion: The findings challenge conventional wisdom about data-hungry deep learning in underwater domains. Template matching demonstrates superior robustness and performance on real data, opening new possibilities for autonomous underwater applications in data-scarce environments where traditional ML approaches fail.

Abstract: Underwater 3D object detection remains one of the most challenging frontiers in computer vision, where traditional approaches struggle with the harsh acoustic environment and scarcity of training data. While deep learning has revolutionized terrestrial 3D detection, its application underwater faces a critical bottleneck: obtaining sufficient annotated sonar data is prohibitively expensive and logistically complex, often requiring specialized vessels, expert surveyors, and favorable weather conditions. This work addresses a fundamental question: Can we achieve reliable underwater 3D object detection without real-world training data? We tackle this challenge by developing and comparing two paradigms for training-free detection of artificial structures in multibeam echo-sounder point clouds. Our dual approach combines a physics-based sonar simulation pipeline that generates synthetic training data for state-of-the-art neural networks, with a robust model-based template matching system that leverages geometric priors of target objects. Evaluation on real bathymetry surveys from the Baltic Sea reveals surprising insights: while neural networks trained on synthetic data achieve 98% mean Average Precision (mAP) on simulated scenes, they drop to 40% mAP on real sonar data due to domain shift. Conversely, our template matching approach maintains 83% mAP on real data without requiring any training, demonstrating remarkable robustness to acoustic noise and environmental variations. Our findings challenge conventional wisdom about data-hungry deep learning in underwater domains and establish the first large-scale benchmark for training-free underwater 3D detection. This work opens new possibilities for autonomous underwater vehicle navigation, marine archaeology, and offshore infrastructure monitoring in data-scarce environments where traditional machine learning approaches fail.

DongHoon Lim, YoungChae Kim, Dong-Hyun Kim, Da-Hee Yang, Joon-Hyuk Chang

Main category: cs.CV

TL;DR: Novel AVSR framework with router-gated cross-modal fusion that dynamically adjusts audio-visual feature weighting based on acoustic corruption scores, achieving significant WER reduction compared to AV-HuBERT.

Details

Motivation: Existing audio-visual speech recognition systems struggle to estimate audio reliability and dynamically adjust modality reliance in noisy environments, limiting their robustness.

Method: Router-gated cross-modal feature fusion that adaptively reweights audio and visual features using token-level acoustic corruption scores. Uses audio-visual feature fusion-based router to down-weight unreliable audio tokens and reinforce visual cues through gated cross-attention in decoder layers.

Result: Achieves 16.51-42.67% relative reduction in word error rate compared to AV-HuBERT on LRS3 dataset. Ablation studies confirm both router and gating mechanism contribute to improved robustness under real-world acoustic noise.

Conclusion: The proposed framework enables models to pivot toward visual modality when audio quality deteriorates, significantly improving AVSR performance in noisy environments through adaptive modality weighting.

Abstract: Robust audio-visual speech recognition (AVSR) in noisy environments remains challenging, as existing systems struggle to estimate audio reliability and dynamically adjust modality reliance. We propose router-gated cross-modal feature fusion, a novel AVSR framework that adaptively reweights audio and visual features based on token-level acoustic corruption scores. Using an audio-visual feature fusion-based router, our method down-weights unreliable audio tokens and reinforces visual cues through gated cross-attention in each decoder layer. This enables the model to pivot toward the visual modality when audio quality deteriorates. Experiments on LRS3 demonstrate that our approach achieves a 16.51-42.67% relative reduction in word error rate compared to AV-HuBERT. Ablation studies confirm that both the router and gating mechanism contribute to improved robustness under real-world acoustic noise.
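
As a rough illustration of the reweighting idea only, the sketch below mixes per-token audio and visual features with a corruption score and computes a relative WER reduction. The function names, the scalar score in [0, 1], and the convex mixing rule are all assumptions made for this example; the paper's router and gated cross-attention are learned modules and are not reproduced here.

```python
def fuse_tokens(audio, visual, corruption):
    """Mix per-token audio and visual features by a corruption score.

    audio, visual: lists of equal-length feature vectors (lists of floats).
    corruption: per-token scores in [0, 1]; 1.0 means the audio is unusable.
    Hypothetical helper: a simple convex mix, not the paper's gating module.
    """
    fused = []
    for a, v, c in zip(audio, visual, corruption):
        if not 0.0 <= c <= 1.0:
            raise ValueError("corruption score must lie in [0, 1]")
        # Down-weight audio by (1 - c) and lean on the visual stream by c.
        fused.append([(1.0 - c) * ai + c * vi for ai, vi in zip(a, v)])
    return fused


def relative_wer_reduction(baseline_wer, new_wer):
    """Relative reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer
```

With a score of 0 the token keeps its audio features unchanged, and with a score of 1 it is replaced by the visual features, which is the "pivot toward the visual modality" behaviour in its simplest form.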

[91] MobileDenseAttn: A Dual-Stream Architecture for Accurate and Interpretable Brain Tumor Detection

Shudipta Banik, Muna Das, Trapa Banik, Md. Ehsanul Haque

Main category: cs.CV

TL;DR: MobileDenseAttn is a dual-stream fusion model combining MobileNetV2 and DenseNet201 for brain tumor detection in MRI, achieving 98.35% accuracy with improved efficiency and interpretability through GradCAM visualizations.

Details

Motivation: Manual brain tumor analysis in MRI is time-consuming and error-prone. Current automated approaches lack generalization, computational efficiency, interpretability, and transparency, limiting their clinical trustworthiness.

Method: Feature-level fusion of MobileNetV2 and DenseNet201 streams trained on augmented dataset of 6,020 MRI scans (glioma, meningioma, pituitary tumors, normal). Uses 5-fold cross-validation and GradCAM for visual explanations.

Result: 99.75% training accuracy, 98.35% testing accuracy, F1 score of 0.9835. 39.3% faster training than VGG19 with +3.67% accuracy improvement over baseline models. GradCAM heatmaps successfully localize tumor areas.

Conclusion: MobileDenseAttn is an efficient, high-performance, interpretable model with strong potential for clinical application in real-world brain tumor identification due to its accuracy, speed, and transparency.

Abstract: The detection of brain tumors in MRI is an important aspect of ensuring timely diagnostics and treatment; however, manual analysis is often slow and error-prone. Current approaches are not universal because they have limited generalization to heterogeneous tumors, are computationally inefficient, are not interpretable, and lack transparency, thus limiting trustworthiness. To overcome these issues, we introduce MobileDenseAttn, a fusion model of dual streams of MobileNetV2 and DenseNet201 that can help gradually improve the feature representation scale, computing efficiency, and visual explanations via GradCAM. Our model uses feature-level fusion and is trained on an augmented dataset of 6,020 MRI scans representing glioma, meningioma, pituitary tumors, and normal samples. Measured under strict 5-fold cross-validation protocols, MobileDenseAttn provides a training accuracy of 99.75%, a testing accuracy of 98.35%, and a stable F1 score of 0.9835 (95% CI: 0.9743 to 0.9920). The extensive validation shows the stability of the model, and the comparative analysis shows that it is a substantial advance over the baseline models (VGG19, DenseNet201, MobileNetV2), with a +3.67% accuracy increase and a 39.3% decrease in training time compared to VGG19. The GradCAM heatmaps clearly show tumor-affected areas, offering clinically significant localization and improving interpretability. These findings position MobileDenseAttn as an efficient, high-performance, interpretable model with a high probability of becoming a clinically practical tool in identifying brain tumors in the real world.

[92] Can VLMs Recall Factual Associations From Visual References?

Dhananjay Ashok, Ashutosh Chaubey, Hirona J. Arai, Jonathan May, Jesse Thomason

Main category: cs.CV

TL;DR: VLMs struggle to link visual representations with factual knowledge, showing systematic grounding deficiencies that can be detected with high accuracy using internal state probes.

Details

Motivation: To identify systematic deficiencies in multimodal grounding of Vision Language Models, particularly their ability to recall factual knowledge when references are visual rather than textual.

Method: Conducted controlled studies comparing VLM performance with textual vs visual references, analyzed internal state patterns, and developed probes to detect linking failures without retraining.

Result: VLMs’ factual recall ability halves when forced to rely on image representations; internal state probes achieve 92% accuracy in flagging unreliable responses; selective prediction improves coverage by 7.87% while reducing error risk by 0.9%.

Conclusion: VLMs have systematic, detectable deficiencies in linking visual representations with internal knowledge, and addressing this grounding issue is crucial for future multimodal AI development.

Abstract: Through a controlled study, we identify a systematic deficiency in the multimodal grounding of Vision Language Models (VLMs). While VLMs can recall factual associations when provided a textual reference to an entity, their ability to do so is significantly diminished when the reference is visual instead. Forcing VLMs to rely on image representations of an entity halves their ability to recall factual knowledge, suggesting that VLMs struggle to link their internal knowledge of an entity with its image representation. We show that such linking failures are correlated with the expression of distinct patterns in model internal states, and that probes on these internal states achieve over 92% accuracy at flagging cases where the VLM response is unreliable. These probes can be applied, without retraining, to identify when a VLM will fail to correctly answer a question that requires an understanding of multimodal input. When used to facilitate selective prediction on a visual question answering task, the probes increase coverage by 7.87% (absolute) while also reducing the risk of error by 0.9% (absolute). Addressing the systematic, detectable deficiency is an important avenue in language grounding, and we provide informed recommendations for future directions.
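
For readers unfamiliar with the coverage and risk figures quoted above, here is a minimal sketch of how the two quantities are usually computed in selective prediction. It assumes a hypothetical per-example reliability score (for instance, a probe's output) and a fixed threshold; the paper's actual probes and evaluation protocol may differ.

```python
def selective_prediction(reliability, correct, threshold):
    """Answer only when the reliability score is at least `threshold`.

    reliability: hypothetical per-example scores (higher = more trustworthy).
    correct: booleans saying whether the model's answer was right.
    Returns (coverage, risk): the fraction of questions answered, and the
    error rate among the answered questions (0.0 if nothing is answered).
    """
    if len(reliability) != len(correct) or not reliability:
        raise ValueError("inputs must be non-empty and of equal length")
    answered = [ok for score, ok in zip(reliability, correct) if score >= threshold]
    coverage = len(answered) / len(reliability)
    errors = sum(1 for ok in answered if not ok)
    risk = errors / len(answered) if answered else 0.0
    return coverage, risk
```

Raising the threshold typically lowers both coverage and risk; a better reliability signal lets a system answer more questions at the same or lower risk, which is the kind of improvement the abstract reports.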

[93] SERES: Semantic-aware neural reconstruction from sparse views

Bo Xu, Yuhu Guo, Yuchao Wang, Wenting Wang, Yeung Yam, Charlie C. L. Wang, Xinyi Le

Main category: cs.CV

TL;DR: Semantic-aware neural reconstruction method that improves 3D model generation from sparse images by adding patch-based semantic logits and geometric regularization to reduce shape ambiguity.

Details

Motivation: To address the challenge of severe radiance ambiguity caused by mismatched features in sparse image inputs for 3D reconstruction.

Method: Enriches neural implicit representations with patch-based semantic logits optimized together with signed distance field and radiance field. Introduces novel regularization based on geometric primitive masks to mitigate shape ambiguity.

Result: Average chamfer distances reduced by 44% for SparseNeuS and 20% for VolRecon on DTU dataset. When used as plugin for dense reconstruction baselines (NeuS and Neuralangelo), average error reduced by 69% and 68% respectively.

Conclusion: The proposed semantic-aware approach significantly improves 3D reconstruction quality from sparse images and works effectively both as standalone method and as enhancement plugin for existing dense reconstruction methods.

Abstract: We propose a semantic-aware neural reconstruction method to generate 3D high-fidelity models from sparse images. To tackle the challenge of severe radiance ambiguity caused by mismatched features in sparse input, we enrich neural implicit representations by adding patch-based semantic logits that are optimized together with the signed distance field and the radiance field. A novel regularization based on the geometric primitive masks is introduced to mitigate shape ambiguity. The performance of our approach has been verified in experimental evaluation. The average chamfer distances of our reconstruction on the DTU dataset can be reduced by 44% for SparseNeuS and 20% for VolRecon. When working as a plugin for those dense reconstruction baselines such as NeuS and Neuralangelo, the average error on the DTU dataset can be reduced by 69% and 68% respectively.
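
The chamfer distance used in these comparisons measures how far two point sets are from each other. The sketch below shows one common symmetric variant (the mean nearest-neighbour Euclidean distance in each direction, averaged); this choice is an assumption, since definitions differ in squaring and averaging, and the DTU benchmark has its own evaluation protocol.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric chamfer distance between two non-empty 3D point sets.

    Each set is a list of (x, y, z) tuples. Brute force, O(len(a) * len(b)),
    so it is only suitable for small illustrative inputs.
    """
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")

    def mean_nearest(src, dst):
        # Average distance from each source point to its closest target point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a))
```

A reconstruction that matches the reference exactly scores 0, and lower values are better, so a 44% reduction means the reconstructed surface lies substantially closer to the ground-truth points on average.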

[94] Automated Landfill Detection Using Deep Learning: A Comparative Study of Lightweight and Custom Architectures with the AerialWaste Dataset

Nowshin Sharmily, Rusab Sarmun, Muhammad E. H. Chowdhury, Mir Hamidul Hussain, Saad Bin Abul Kashem, Molla E Majid, Amith Khandakar

Main category: cs.CV

TL;DR: This paper presents a deep learning approach using lightweight models and ensemble techniques to detect illegal landfills from aerial imagery, achieving over 92% accuracy on the AerialWaste Dataset.

Details

Motivation: Illegal landfills pose significant environmental and health hazards worldwide, but manual detection is difficult and time-consuming. There's a lack of good quality public datasets for this problem due to security concerns.

Method: Used lightweight deep learning models (MobileNetV2, GoogLeNet, DenseNet, MobileViT) to avoid overfitting, then created an ensemble model combining the best performers. Applied binary classification on the AerialWaste Dataset containing 10,434 aerial images from multiple sources.

Result: Achieved 92.33% accuracy, 92.67% precision, 92.33% sensitivity, 92.41% F1 score, and 92.71% specificity using ensemble techniques on the dataset.

Conclusion: Lightweight models combined with ensemble techniques effectively detect illegal landfills from aerial imagery while avoiding overfitting, providing an efficient automated solution for environmental monitoring.

Abstract: Illegal landfills pose a hazardous threat to people all over the world. Because manually locating landfills is arduous, many go unnoticed by authorities and later cause serious harm to people and the environment. Deep learning can play a significant role in identifying these landfills while saving valuable time, manpower and resources. Despite this being a pressing concern, good-quality publicly released datasets for illegal landfill detection are hard to find due to security concerns. However, the AerialWaste Dataset is a large collection of 10,434 images of the Lombardy region of Italy. The images are of varying quality, collected from three different sources: AGEA Orthophotos, WorldView-3, and Google Earth. The dataset contains professionally curated, diverse and high-quality images, which makes it particularly suitable for scalable and impactful research. As we trained several models to compare results, we found complex and heavy models to be prone to overfitting and memorizing training data instead of learning patterns. Therefore, we chose lightweight, simpler models which could leverage general features from the dataset. In this study, MobileNetV2, GoogLeNet, DenseNet, MobileViT and other lightweight deep learning models were trained and validated on the dataset, as they achieved significant success with less overfitting. As we saw substantial improvement in performance using some of these models, we combined the best-performing models into an ensemble model. With the help of ensemble and fusion techniques, binary classification could be performed on this dataset with 92.33% accuracy, 92.67% precision, 92.33% sensitivity, 92.41% F1 score and 92.71% specificity.
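
The five figures reported above are standard binary-classification metrics. As a reminder of how they relate, here is a small sketch that derives them from confusion-matrix counts for a single positive class; the function name is hypothetical, and the paper may use a different averaging scheme across classes.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common metrics from confusion-matrix counts.

    tp/fp/fn/tn: true positives, false positives, false negatives,
    true negatives. Assumes every denominator below is nonzero.
    """
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        # F1 is the harmonic mean of precision and sensitivity.
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }
```

For example, `binary_metrics(tp=3, fp=1, fn=1, tn=3)` gives 0.75 for every metric, because the errors are spread evenly over both classes.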

[95] Structures Meet Semantics: Multimodal Fusion via Graph Contrastive Learning

Jiangfeng Sun, Sihao He, Zhonghong Ou, Meina Song

Main category: cs.CV

TL;DR: SSU is a novel multimodal sentiment analysis framework that integrates modality-specific structural dependencies and cross-modal semantic alignment through dynamic graph construction and semantic anchoring, achieving state-of-the-art performance with improved interpretability and reduced computational costs.

Details

Motivation: Existing multimodal fusion methods neglect modality-specific structural dependencies and suffer from semantic misalignment across different modalities, limiting their quality, interpretability, and robustness in sentiment analysis.

Method: Proposes Structural-Semantic Unifier (SSU) framework with: 1) dynamic modality-specific graph construction using linguistic syntax for text and text-guided attention for acoustic/visual modalities, 2) semantic anchor from global textual semantics for cross-modal alignment, and 3) multiview contrastive learning for discriminability and consistency.

Result: SSU achieves state-of-the-art performance on CMU-MOSI and CMU-MOSEI benchmarks while significantly reducing computational overhead compared to prior methods. Qualitative analyses validate improved interpretability and nuanced emotional pattern capture.

Conclusion: SSU effectively addresses structural dependency and semantic alignment challenges in multimodal sentiment analysis through systematic integration of modality-specific structures and cross-modal semantic grounding, demonstrating superior performance and interpretability.

Abstract: Multimodal sentiment analysis (MSA) aims to infer emotional states by effectively integrating textual, acoustic, and visual modalities. Despite notable progress, existing multimodal fusion methods often neglect modality-specific structural dependencies and semantic misalignment, limiting their quality, interpretability, and robustness. To address these challenges, we propose a novel framework called the Structural-Semantic Unifier (SSU), which systematically integrates modality-specific structural information and cross-modal semantic grounding for enhanced multimodal representations. Specifically, SSU dynamically constructs modality-specific graphs by leveraging linguistic syntax for text and a lightweight, text-guided attention mechanism for acoustic and visual modalities, thus capturing detailed intra-modal relationships and semantic interactions. We further introduce a semantic anchor, derived from global textual semantics, that serves as a cross-modal alignment hub, effectively harmonizing heterogeneous semantic spaces across modalities. Additionally, we develop a multiview contrastive learning objective that promotes discriminability, semantic consistency, and structural coherence across intra- and inter-modal views. Extensive evaluations on two widely used benchmark datasets, CMU-MOSI and CMU-MOSEI, demonstrate that SSU consistently achieves state-of-the-art performance while significantly reducing computational overhead compared to prior methods. Comprehensive qualitative analyses further validate SSU’s interpretability and its ability to capture nuanced emotional patterns through semantically grounded interactions.

[96] FastAvatar: Instant 3D Gaussian Splatting for Faces from Single Unconstrained Poses

Hao Liang, Zhixuan Ge, Ashish Tiwari, Soumendu Majee, G. M. Dilshan Godaliyadda, Ashok Veeraraghavan, Guha Balakrishnan

Main category: cs.CV

TL;DR: FastAvatar is a pose-invariant feed-forward framework that generates 3D Gaussian Splatting models from single face images in under 10ms, achieving superior reconstruction quality and 1000x speedup over optimization methods.

Details

Motivation: Existing methods for 3D face avatar generation either require slow per-face optimization or suffer from poor reconstruction quality in feed-forward approaches, limiting real-time applications.

Method: Uses encoder-decoder network to encode input face into pose-invariant latent embedding, then decodes to predict residuals to structural and appearance parameters of a pre-built 3DGS template model.

Result: Outperforms existing feed-forward methods in quality, runs 1000x faster than optimization methods, and supports real-time identity interpolation and attribute editing.

Conclusion: FastAvatar enables high-quality, real-time 3D avatar generation from single images, expanding practical applications of 3DGS in consumer and interactive systems.

Abstract: We present FastAvatar, a pose-invariant, feed-forward framework that can generate a 3D Gaussian Splatting (3DGS) model from a single face image from an arbitrary pose in near-instant time (<10ms). FastAvatar uses a novel encoder-decoder neural network design to achieve both fast fitting and identity preservation regardless of input pose. First, FastAvatar constructs a 3DGS face “template” model from a training dataset of faces with multi-view captures. Second, FastAvatar encodes the input face image into an identity-specific and pose-invariant latent embedding, and decodes this embedding to predict residuals to the structural and appearance parameters of each Gaussian in the template 3DGS model. By only inferring residuals in a feed-forward fashion, model inference is fast and robust. FastAvatar significantly outperforms existing feed-forward face 3DGS methods (e.g., GAGAvatar) in reconstruction quality, and runs 1000x faster than per-face optimization methods (e.g., FlashAvatar, GaussianAvatars and GASP). In addition, FastAvatar’s novel latent space design supports real-time identity interpolation and attribute editing which is not possible with any existing feed-forward 3DGS face generation framework. FastAvatar’s combination of excellent reconstruction quality and speed expands the scope of 3DGS for photorealistic avatar applications in consumer and interactive systems.

[97] Securing Face and Fingerprint Templates in Humanitarian Biometric Systems

Giuseppe Stragapede, Sam Merrick, Vedrana Krivokuća Hahn, Justin Sukaitis, Vincent Graf Narbel

Main category: cs.CV

TL;DR: A mobile biometric system with PolyProtect BTP scheme for humanitarian scenarios, evaluated on face and fingerprint data with promising results.

Details

Motivation: Biometrics improve efficiency in humanitarian operations but pose privacy risks for vulnerable populations, requiring secure template protection.

Method: Rigorous requirement formulation, comparative BTP analysis, implementation of PolyProtect on neural network face embeddings using EdgeFace feature extractor, extended evaluation to fingerprints.

Result: Promising experimental results showing effectiveness in verification/identification accuracy, irreversibility, and unlinkability for both face and fingerprint biometrics.

Conclusion: PolyProtect is suitable for humanitarian contexts due to effectiveness, modularity, and lightweight computation, with plans to release code for further development.

Abstract: In humanitarian and emergency scenarios, the use of biometrics can dramatically improve the efficiency of operations, but it poses risks for the data subjects, which are exacerbated in contexts of vulnerability. To address this, we present a mobile biometric system implementing a biometric template protection (BTP) scheme suitable for these scenarios. After rigorously formulating the functional, operational, and security and privacy requirements of these contexts, we perform a broad comparative analysis of the BTP landscape. PolyProtect, a method designed to operate on neural network face embeddings, is identified as the most suitable method due to its effectiveness, modularity, and lightweight computational burden. We evaluate PolyProtect in terms of verification and identification accuracy, irreversibility, and unlinkability, when this BTP method is applied to face embeddings extracted using EdgeFace, a novel state-of-the-art efficient feature extractor, on a real-world face dataset from a humanitarian field project in Ethiopia. Moreover, as PolyProtect promises to be modality-independent, we extend its evaluation to fingerprints. To the best of our knowledge, this is the first time that PolyProtect has been evaluated for the identification scenario and for fingerprint biometrics. Our experimental results are promising, and we plan to release our code.

[98] Why Relational Graphs Will Save the Next Generation of Vision Foundation Models?

Fatemeh Ziaeetabar

Main category: cs.CV

TL;DR: Vision foundation models need explicit relational reasoning capabilities through dynamic graph interfaces to handle tasks requiring entity, role, and spatio-temporal relationship understanding.

Details

Motivation: Current vision foundation models lack explicit reasoning capabilities for relational tasks in fine-grained activity recognition, egocentric video understanding, and medical image analysis where spatial, temporal, and semantic dependencies are critical.

Method: Augmenting foundation models with lightweight, context-adaptive graph-reasoning modules that create dynamic relational graphs whose topology and edge semantics are inferred from input and task context.

Result: Hybrid models with graph reasoning modules show improved fine-grained semantic fidelity, out-of-distribution robustness, interpretability, computational efficiency, and favorable memory/hardware efficiency compared to FM-only baselines.

Conclusion: Next-generation vision foundation models should incorporate explicit relational interfaces via dynamic graph reasoning, with future research focusing on learned dynamic graph construction, multi-level relational reasoning, cross-modal fusion, and specialized evaluation protocols.

Abstract: Vision foundation models (FMs) have become the predominant architecture in computer vision, providing highly transferable representations learned from large-scale, multimodal corpora. Nonetheless, they exhibit persistent limitations on tasks that require explicit reasoning over entities, roles, and spatio-temporal relations. Such relational competence is indispensable for fine-grained human activity recognition, egocentric video understanding, and multimodal medical image analysis, where spatial, temporal, and semantic dependencies are decisive for performance. We advance the position that next-generation FMs should incorporate explicit relational interfaces, instantiated as dynamic relational graphs (graphs whose topology and edge semantics are inferred from the input and task context). We illustrate this position with cross-domain evidence from recent systems in human manipulation action recognition and brain tumor segmentation, showing that augmenting FMs with lightweight, context-adaptive graph-reasoning modules improves fine-grained semantic fidelity, out-of-distribution robustness, interpretability, and computational efficiency relative to FM-only baselines. Importantly, by reasoning sparsely over semantic nodes, such hybrids also achieve favorable memory and hardware efficiency, enabling deployment under practical resource constraints. We conclude with a targeted research agenda for FM-graph hybrids, prioritizing learned dynamic graph construction, multi-level relational reasoning (e.g., part-object-scene in activity understanding, or region-organ in medical imaging), cross-modal fusion, and evaluation protocols that directly probe relational competence in structured vision tasks.

[99] LPLC: A Dataset for License Plate Legibility Classification

Lucas Wojcik, Gabriel E. Lima, Valfride Nascimento, Eduil Nascimento Jr., Rayson Laroca, David Menotti

Main category: cs.CV

TL;DR: A novel dataset (LPLC) with 10,210 vehicle images and 12,687 annotated license plates for legibility classification, addressing the challenge of recognizing illegible license plates through selective image pre-processing.

Details

Motivation: To address the core issue of recognizing low-quality license plates in ALPR systems by enabling selective application of image enhancement methods only when needed, optimizing both performance and computational efficiency.

Method: Created a comprehensive dataset with fine-grained annotations including occlusion levels, four legibility categories, and character labels. Proposed a classification benchmark using ViT, ResNet, and YOLO networks to categorize LP images into three conditions: good enough, requires super-resolution, or unrecoverable.

Result: All three baseline models (ViT, ResNet, YOLO) achieved F1 scores below 80%, demonstrating the difficulty of the legibility classification task and highlighting the need for further research in this area.

Conclusion: The introduced LPLC dataset provides a valuable resource for research on license plate legibility classification, and the poor performance of current baseline models underscores the complexity of the problem and the necessity for continued investigation into selective image pre-processing approaches.

Abstract: Automatic License Plate Recognition (ALPR) faces a major challenge when dealing with illegible license plates (LPs). While reconstruction methods such as super-resolution (SR) have emerged, the core issue of recognizing these low-quality LPs remains unresolved. To optimize model performance and computational efficiency, image pre-processing should be applied selectively to cases that require enhanced legibility. To support research in this area, we introduce a novel dataset comprising 10,210 images of vehicles with 12,687 annotated LPs for legibility classification (the LPLC dataset). The images span a wide range of vehicle types, lighting conditions, and camera/image quality levels. We adopt a fine-grained annotation strategy that includes vehicle- and LP-level occlusions, four legibility categories (perfect, good, poor, and illegible), and character labels for three categories (excluding illegible LPs). As a benchmark, we propose a classification task using three image recognition networks to determine whether an LP image is good enough, requires super-resolution, or is completely unrecoverable. The overall F1 score, which remained below 80% for all three baseline models (ViT, ResNet, and YOLO), together with the analyses of SR and LP recognition methods, highlights the difficulty of the task and reinforces the need for further research. The proposed dataset is publicly available at https://github.com/lmlwojcik/lplc-dataset.
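Illustrative sketch (not from the paper): the selective pre-processing idea can be pictured as a small routing step driven by the predicted legibility class. The class names mirror the three conditions in the benchmark; the step names `super_resolve` and `recognize` are hypothetical placeholders, not functions from the LPLC repository.

```python
from enum import Enum

class Legibility(Enum):
    GOOD_ENOUGH = "good_enough"
    NEEDS_SR = "needs_super_resolution"
    UNRECOVERABLE = "unrecoverable"

def route_plate(label):
    """Return the (hypothetical) processing steps for a plate given its predicted class."""
    if label is Legibility.GOOD_ENOUGH:
        return ["recognize"]                   # no enhancement needed
    if label is Legibility.NEEDS_SR:
        return ["super_resolve", "recognize"]  # enhance first, then read
    return []                                  # unrecoverable: skip both steps
```

Under this routing, the super-resolution cost is paid only for plates predicted to need it, which is the efficiency argument made in the abstract.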

[100] CLARIFY: A Specialist-Generalist Framework for Accurate and Lightweight Dermatological Visual Question Answering

Aranya Saha, Tanvir Ahmed Khan, Ismam Nur Swapnil, Mohammad Ariful Haque

Main category: cs.CV

TL;DR: CLARIFY is a Specialist-Generalist framework for dermatological VQA that combines a lightweight domain-trained image classifier with a compressed conversational VLM, achieving 18% higher diagnostic accuracy and significant efficiency improvements.

Details

Motivation: Vision-language models have limitations in specialized diagnostic accuracy and high inference costs for clinical deployment, requiring more efficient and accurate medical AI systems.

Method: Two-component framework: (1) Specialist - lightweight domain-trained image classifier for fast diagnostic predictions, (2) Generalist - compressed conversational VLM for natural language explanations guided by Specialist’s predictions, enhanced by knowledge graph-based retrieval for factual grounding.

Result: On the authors’ curated multimodal dermatology dataset, CLARIFY improves diagnostic accuracy by 18% over the strongest baseline while reducing the average VRAM requirement by at least 20% and latency by at least 5%.

Conclusion: A Specialist-Generalist system provides a practical paradigm for building lightweight, trustworthy, and clinically viable AI systems with improved accuracy and efficiency.

Abstract: Vision-language models (VLMs) have shown significant potential for medical tasks; however, their general-purpose nature can limit specialized diagnostic accuracy, and their large size poses substantial inference costs for real-world clinical deployment. To address these challenges, we introduce CLARIFY, a Specialist-Generalist framework for dermatological visual question answering (VQA). CLARIFY combines two components: (i) a lightweight, domain-trained image classifier (the Specialist) that provides fast and highly accurate diagnostic predictions, and (ii) a powerful yet compressed conversational VLM (the Generalist) that generates natural language explanations to user queries. In our framework, the Specialist’s predictions directly guide the Generalist’s reasoning, focusing it on the correct diagnostic path. This synergy is further enhanced by a knowledge graph-based retrieval module, which grounds the Generalist’s responses in factual dermatological knowledge, ensuring both accuracy and reliability. This hierarchical design not only reduces diagnostic errors but also significantly improves computational efficiency. Experiments on our curated multimodal dermatology dataset demonstrate that CLARIFY achieves an 18% improvement in diagnostic accuracy over the strongest baseline, a fine-tuned, uncompressed single-line VLM, while reducing the average VRAM requirement and latency by at least 20% and 5%, respectively. These results indicate that a Specialist-Generalist system provides a practical and powerful paradigm for building lightweight, trustworthy, and clinically viable AI systems.
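Illustrative sketch (not from the paper): the Specialist-to-Generalist hand-off can be pictured as prompt construction, where the classifier's prediction and retrieved facts are placed ahead of the user's question. Everything here is hypothetical: the knowledge graph is a plain dictionary with placeholder entries, and the prompt layout is an assumption, since the abstract does not describe the actual interface.

```python
# Hypothetical knowledge graph: diagnosis -> grounded facts (placeholders only).
KNOWLEDGE_GRAPH = {
    "psoriasis": ["fact placeholder A", "fact placeholder B", "fact placeholder C"],
}

def retrieve_facts(diagnosis, graph, limit=2):
    """Look up at most `limit` facts for the predicted diagnosis."""
    return graph.get(diagnosis, [])[:limit]

def build_generalist_prompt(question, diagnosis, confidence, facts):
    """Assemble the text handed to the conversational model."""
    lines = [
        "Specialist prediction: {} (confidence {:.2f})".format(diagnosis, confidence),
        "Retrieved facts:",
    ]
    lines += ["- " + fact for fact in facts]
    lines.append("User question: " + question)
    return "\n".join(lines)
```

Placing the prediction first is one way to steer the explanation toward the predicted diagnosis, in line with the abstract's description of the Specialist guiding the Generalist's reasoning.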

[101] Deshadow-Anything: When Segment Anything Model Meets Zero-shot shadow removal

Xiao Feng Zhang, Tian Yi Song, Jia Wei Yao

Main category: cs.CV

TL;DR: Deshadow-Anything improves SAM’s shadow removal capability using diffusion models with MSAG and DDPM-AIP for faster training and better image restoration.

Details

Motivation: SAM struggles with distinguishing shadows from backgrounds, limiting its effectiveness in shadow removal tasks.

Method: Fine-tuning on large-scale datasets with diffusion models that preserve image details, plus Multi-Self-Attention Guidance and adaptive input perturbation to accelerate training.

Result: Experiments show effective improvement in image restoration performance for shadow removal tasks.

Conclusion: The proposed methods successfully enhance SAM’s shadow removal capabilities while maintaining image details and accelerating training speed.

Abstract: Segment Anything (SAM), an advanced universal image segmentation model trained on an expansive visual dataset, has set a new benchmark in image segmentation and computer vision. However, it struggles to distinguish shadows from their backgrounds. To address this, we developed Deshadow-Anything: considering the generalization afforded by large-scale datasets, we fine-tune on large-scale datasets to achieve image shadow removal. The diffusion model can diffuse along the edges and textures of an image, helping to remove shadows while preserving the details of the image. Furthermore, we design Multi-Self-Attention Guidance (MSAG) and adaptive input perturbation (DDPM-AIP) to accelerate the iterative training speed of diffusion. Experiments on shadow removal tasks demonstrate that these methods can effectively improve image restoration performance.

[102] VQualA 2025 Challenge on Face Image Quality Assessment: Methods and Results

Sizhuo Ma, Wei-Ting Chen, Qiang Gao, Jian Wang, Chris Wei Zhou, Wei Sun, Weixia Zhang, Linhan Cao, Jun Jia, Xiangyang Zhu, Dandan Zhu, Xiongkuo Min, Guangtao Zhai, Baoying Chen, Xiongwei Xiao, Jishen Zeng, Wei Wu, Tiexuan Lou, Yuchen Tan, Chunyi Song, Zhiwei Xu, MohammadAli Hamidi, Hadi Amirpour, Mingyin Bai, Jiawang Du, Zhenyu Jiang, Zilong Lu, Ziguan Cui, Zongliang Gan, Xinpeng Li, Shiqi Jiang, Chenhui Li, Changbo Wang, Weijun Yuan, Zhan Li, Yihang Chen, Yifan Deng, Ruting Deng, Zhanglu Chen, Boyang Yao, Shuling Zheng, Feng Zhang, Zhiheng Fu, Abhishek Joshi, Aman Agarwal, Rakhil Immidisetti, Ajay Narasimha Mopidevi, Vishwajeet Shukla, Hao Yang, Ruikun Zhang, Liyuan Pan, Kaixin Deng, Hang Ouyang, Fan yang, Zhizun Luo, Zhuohang Shi, Songning Lai, Weilin Ruan, Yutao Yue

Main category: cs.CV

TL;DR: The VQualA 2025 Challenge on Face Image Quality Assessment focused on developing lightweight models to predict Mean Opinion Scores for face images with real-world degradations, attracting 127 participants with 1519 submissions.

Details

Motivation: Real-world face images often suffer from degradations like noise, blur, and compression artifacts that reduce image quality and hinder downstream applications, necessitating effective quality assessment methods.

Method: Participants created lightweight models (≤0.5 GFLOPs and 5M parameters) to predict MOS scores on face images with arbitrary resolutions and realistic degradations, evaluated through correlation metrics on in-the-wild datasets.

Result: The challenge successfully attracted 127 participants who submitted 1519 final entries, demonstrating strong interest and engagement in developing practical FIQA solutions.

Conclusion: The challenge advanced the development of practical face image quality assessment approaches by fostering lightweight, efficient models that can handle real-world degradations and arbitrary image resolutions.

Abstract: Face images play a crucial role in numerous applications; however, real-world conditions frequently introduce degradations such as noise, blur, and compression artifacts, affecting overall image quality and hindering subsequent tasks. To address this challenge, we organized the VQualA 2025 Challenge on Face Image Quality Assessment (FIQA) as part of the ICCV 2025 Workshops. Participants created lightweight and efficient models (limited to 0.5 GFLOPs and 5 million parameters) for the prediction of Mean Opinion Scores (MOS) on face images with arbitrary resolutions and realistic degradations. Submissions underwent comprehensive evaluations through correlation metrics on a dataset of in-the-wild face images. This challenge attracted 127 participants, with 1519 final submissions. This report summarizes the methodologies and findings for advancing the development of practical FIQA approaches.
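Illustrative sketch (not from the report): quality-assessment models are commonly scored by how well predicted scores correlate with MOS. The abstract only says "correlation metrics", so the choice of Pearson (linear) and Spearman (rank) correlation below is an assumption about typical practice, not a statement of the challenge's exact protocol.

```python
import math

def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

A model whose predictions preserve the MOS ordering reaches a Spearman correlation of 1 even if its scores are not linearly calibrated, which is why the two metrics are usually reported together.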

[103] Context-Aware Zero-Shot Anomaly Detection in Surveillance Using Contrastive and Predictive Spatiotemporal Modeling

Md. Rashid Shahriar Khan, Md. Abrar Hasan, Mohammod Tareq Aziz Justice

Main category: cs.CV

TL;DR: A zero-shot anomaly detection framework combining TimeSformer, DPC, and CLIP for surveillance footage, using temporal forecasting and semantic context without anomaly training data.

Details

Motivation: Detecting anomalies in surveillance footage is challenging because anomalies are unpredictable and context-dependent, which calls for methods that do not rely on previously seen anomaly examples.

Method: Hybrid architecture with TimeSformer for spatiotemporal features, DPC for future prediction, and CLIP for semantic context via text prompts. Uses InfoNCE and CPC losses with context-gating mechanism.

Result: Framework enables detection of unseen abnormal behaviors by integrating temporal reasoning with semantic understanding in complex environments.

Conclusion: Successfully bridges temporal reasoning and semantic context for zero-shot anomaly detection, with code made publicly available.

Abstract: Detecting anomalies in surveillance footage is inherently challenging due to their unpredictable and context-dependent nature. This work introduces a novel context-aware zero-shot anomaly detection framework that identifies abnormal events without exposure to anomaly examples during training. The proposed hybrid architecture combines TimeSformer, DPC, and CLIP to model spatiotemporal dynamics and semantic context. TimeSformer serves as the vision backbone to extract rich spatial-temporal features, while DPC forecasts future representations to identify temporal deviations. Furthermore, a CLIP-based semantic stream enables concept-level anomaly detection through context-specific text prompts. These components are jointly trained using InfoNCE and CPC losses, aligning visual inputs with their temporal and semantic representations. A context-gating mechanism further enhances decision-making by modulating predictions with scene-aware cues or global video features. By integrating predictive modeling with vision-language understanding, the system can generalize to previously unseen behaviors in complex environments. This framework bridges the gap between temporal reasoning and semantic context in zero-shot anomaly detection for surveillance. The code for this research has been made available at https://github.com/NK-II/Context-Aware-ZeroShot-Anomaly-Detection-in-Surveillance.
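Illustrative sketch (not from the paper): the InfoNCE loss named in the abstract scores one positive pair against a set of negatives. The version below is a minimal, framework-free form using dot-product similarity; the temperature value and the way embeddings are paired are assumptions, since the abstract does not give these details.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE for one query: -log( exp(s_pos / t) / sum_j exp(s_j / t) ).

    The sum runs over the positive and all negatives; a numerically stable
    log-sum-exp is used to avoid overflow for large similarities.
    """
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]
```

The loss is small when the query is closer to its positive (for example, a matching future representation or text prompt) than to any negative, and grows as negatives become more similar.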

[104] Enhancing Underwater Images via Deep Learning: A Comparative Study of VGG19 and ResNet50-Based Approaches

Aoqi Li, Yanghui Song, Jichao Dao, Chengfu Yang

Main category: cs.CV

TL;DR: Deep learning approach combining VGG19 and ResNet50 for underwater image enhancement, using multi-scale feature analysis and quantitative evaluation metrics.

Details

Motivation: To address the challenging problem of image enhancement in complex underwater scenes where traditional methods often fail due to poor visibility, color distortion, and low contrast conditions.

Method: Integrates VGG19 and ResNet50 convolutional neural networks to perform multi-scale and multi-level deep feature analysis of underwater images, creating a unified model that leverages the complementary advantages of both architectures.

Result: Achieves comprehensive and accurate image enhancement effects as quantitatively evaluated using PSNR, UCIQE, and UIQM metrics, showing improved performance across different underwater scenarios.

Conclusion: The proposed deep learning framework effectively enhances underwater images and provides practical implementation guidance including model optimization, multi-model fusion strategies, and hardware selection recommendations for real-world underwater visual enhancement systems.

Abstract: This paper addresses the challenging problem of image enhancement in complex underwater scenes by proposing a solution based on deep learning. The proposed method skillfully integrates two deep convolutional neural network models, VGG19 and ResNet50, leveraging their powerful feature extraction capabilities to perform multi-scale and multi-level deep feature analysis of underwater images. By constructing a unified model, the complementary advantages of the two models are effectively integrated, achieving a more comprehensive and accurate image enhancement effect. To objectively evaluate the enhancement effect, this paper introduces image quality assessment metrics such as PSNR, UCIQE, and UIQM to quantitatively compare images before and after enhancement and deeply analyzes the performance of different models in different scenarios. Furthermore, to improve the practicality and stability of the underwater visual enhancement system, this paper also provides practical suggestions from aspects such as model optimization, multi-model fusion, and hardware selection, aiming to provide strong technical support for visual enhancement tasks in complex underwater environments.
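Illustrative sketch (not from the paper): of the three metrics named, PSNR is the one with a simple closed form, 10 * log10(MAX^2 / MSE), computed against a reference image. The function below assumes 8-bit grayscale images stored as nested lists; UCIQE and UIQM are no-reference metrics with more involved definitions and are not reproduced here.

```python
import math

def psnr(reference, enhanced, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized grayscale images."""
    flat_ref = [p for row in reference for p in row]
    flat_enh = [p for row in enhanced for p in row]
    if len(flat_ref) != len(flat_enh):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(flat_ref, flat_enh)) / len(flat_ref)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)
```

Higher values indicate that the enhanced image is closer to the reference.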

Ajinkya Khoche, Qingwen Zhang, Yixi Cai, Sina Sharif Mansouri, Patric Jensfelt

Main category: cs.CV

TL;DR: DoGFlow is a self-supervised framework that uses 4D radar Doppler measurements to generate motion pseudo-labels for LiDAR scene flow estimation, eliminating the need for manual annotations while achieving near-supervised performance.

Details

Motivation: Manual annotation of 3D scene flow data is expensive and limits scalability, while current self-supervised methods underperform in challenging scenarios like long-range and adverse weather conditions.

Method: Cross-modal label transfer approach that computes motion pseudo-labels from 4D radar Doppler measurements in real-time and transfers them to LiDAR domain using dynamic-aware association and ambiguity-resolved propagation.

Result: Substantially outperforms existing self-supervised methods on MAN TruckScenes dataset and achieves over 90% of fully supervised performance with only 10% of ground truth data.

Conclusion: DoGFlow provides an effective self-supervised solution for 3D scene flow estimation that addresses the annotation bottleneck while maintaining high performance comparable to supervised methods.

Abstract: Accurate 3D scene flow estimation is critical for autonomous systems to navigate dynamic environments safely, but creating the necessary large-scale, manually annotated datasets remains a significant bottleneck for developing robust perception models. Current self-supervised methods struggle to match the performance of fully supervised approaches, especially in challenging long-range and adverse weather scenarios, while supervised methods are not scalable due to their reliance on expensive human labeling. We introduce DoGFlow, a novel self-supervised framework that recovers full 3D object motions for LiDAR scene flow estimation without requiring any manual ground truth annotations. This paper presents our cross-modal label transfer approach, where DoGFlow computes motion pseudo-labels in real-time directly from 4D radar Doppler measurements and transfers them to the LiDAR domain using dynamic-aware association and ambiguity-resolved propagation. On the challenging MAN TruckScenes dataset, DoGFlow substantially outperforms existing self-supervised methods and improves label efficiency by enabling LiDAR backbones to achieve over 90% of fully supervised performance with only 10% of the ground truth data. For more details, please visit https://ajinkyakhoche.github.io/DogFlow/
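Illustrative sketch (not from the paper): a Doppler measurement gives only the velocity component along the line of sight from the sensor to the point. The snippet below turns that radial speed into a 3D vector and copies it to nearby LiDAR points by nearest-neighbour search. The nearest-neighbour rule and the `max_distance` parameter are simplifying assumptions; they stand in for, and do not reproduce, the dynamic-aware association and ambiguity-resolved propagation described in the abstract, and they do not recover full 3D motion.

```python
import math

def radial_velocity_vector(position, radial_speed):
    """Velocity vector along the sensor-to-point direction (sensor at the origin)."""
    norm = math.sqrt(sum(c * c for c in position))
    if norm == 0:
        raise ValueError("point coincides with the sensor origin")
    return tuple(radial_speed * c / norm for c in position)

def transfer_labels(radar_points, lidar_points, max_distance=1.0):
    """Assign each LiDAR point the radial velocity of its nearest radar point.

    radar_points: list of (position, radial_speed) pairs.
    Returns one velocity tuple per LiDAR point, or None when no radar point
    lies within max_distance.
    """
    labels = []
    for lidar_point in lidar_points:
        best_label, best_distance = None, max_distance
        for position, speed in radar_points:
            distance = math.dist(lidar_point, position)
            if distance <= best_distance:
                best_label = radial_velocity_vector(position, speed)
                best_distance = distance
        labels.append(best_label)
    return labels
```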

[106] SAT-SKYLINES: 3D Building Generation from Satellite Imagery and Coarse Geometric Priors

Zhangyu Jin, Andrew Feng

Main category: cs.CV

TL;DR: SatSkylines is a 3D building generation method that uses satellite imagery and coarse geometric priors to create detailed building models, overcoming limitations of existing image-based and detailization approaches.

Details

Motivation: Existing methods struggle with accurate building structure recovery from satellite images alone or require highly detailed inputs, creating a need for an approach that works with simple geometric priors while maintaining flexibility and low computational cost.

Method: The approach models the transformation from interpolated noisy coarse priors to detailed geometries, enabling geometric control without additional computational cost. A large-scale dataset (Skylines-50K) with over 50,000 stylized 3D building assets was developed to support detailed building model generation.

Result: Extensive evaluations demonstrate the model’s effectiveness and strong generalization ability, showing it can produce satisfying results from simple priors like cuboids.

Conclusion: SatSkylines successfully addresses the limitations of existing 3D building generation methods by providing accurate structure recovery from satellite imagery with flexible geometric control and low computational overhead.

Abstract: We present SatSkylines, a 3D building generation approach that takes satellite imagery and coarse geometric priors. Without proper geometric guidance, existing image-based 3D generation methods struggle to recover accurate building structures from the top-down views of satellite images alone. On the other hand, 3D detailization methods tend to rely heavily on highly detailed voxel inputs and fail to produce satisfying results from simple priors such as cuboids. To address these issues, our key idea is to model the transformation from interpolated noisy coarse priors to detailed geometries, enabling flexible geometric control without additional computational cost. We have further developed Skylines-50K, a large-scale dataset of over 50,000 unique and stylized 3D building assets, to support the generation of detailed building models. Extensive evaluations indicate the effectiveness of our model and its strong generalization ability.

Kaijie Xu, Clark Verbrugge

Main category: cs.CV

TL;DR: A novel approach for detecting Spatial Transition Points (STPs) and Main STPs in 3D games using a two-stage deep learning pipeline with parameter-efficient adapters, validated on a custom dataset from five Action RPG titles.

Details

Motivation: To enable efficient identification of map transition points for client-side auto-mapping and provide objective evaluation of map cue presentation in complex 3D game environments.

Method: Two-stage pipeline: 1) Faster R-CNN for STP detection, 2) lightweight MSTP selector fusing local and global visual features with parameter-efficient adapters and optional retrieval-augmented fusion.

Result: Full-network fine-tuning achieves superior STP detection with sufficient data, while adapter-only transfer is more robust and effective in low-data scenarios and for MSTP selection tasks.

Conclusion: Establishes feasibility of STP/MSTP detection problem, provides baseline pipeline and dataset, and offers insights into efficient model adaptation for AI-driven navigation aids and level-design tools.

Abstract: In complex 3D game environments, players rely on visual affordances to spot map transition points. Efficient identification of such points is important to client-side auto-mapping, and provides an objective basis for evaluating map cue presentation. In this work, we formalize the task of detecting traversable Spatial Transition Points (STPs), i.e., connectors between two sub-regions, and of selecting the singular Main STP (MSTP), the unique STP that lies on the designer-intended critical path toward the player’s current macro-objective, from a single game frame, proposing this as a new research focus. We introduce a two-stage deep-learning pipeline that first detects potential STPs using Faster R-CNN and then ranks them with a lightweight MSTP selector that fuses local and global visual features. Both stages benefit from parameter-efficient adapters, and we further introduce an optional retrieval-augmented fusion step. Our primary goal is to establish the feasibility of this problem and set baseline performance metrics. We validate our approach on a custom-built, diverse dataset collected from five Action RPG titles. Our experiments reveal a key trade-off: while full-network fine-tuning produces superior STP detection with sufficient data, adapter-only transfer is significantly more robust and effective in low-data scenarios and for the MSTP selection task. By defining this novel problem, providing a baseline pipeline and dataset, and offering initial insights into efficient model adaptation, we aim to contribute to future AI-driven navigation aids and data-informed level-design tools.
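Illustrative sketch (not from the paper): the second stage of the pipeline ranks detected candidates using both per-candidate (local) and whole-frame (global) features. The snippet shows the simplest possible form of that idea, concatenation followed by a linear score. The feature layout, the `weights` vector, and the linear scorer are all hypothetical; the actual selector, adapters, and retrieval-augmented fusion are not described in enough detail to reproduce.

```python
def select_main_stp(candidates, global_feature, weights):
    """Return the index of the highest-scoring candidate, or None if there are none.

    candidates: list of dicts, each with a 'local_feature' list
                (for example, pooled features of a detected box).
    global_feature: list describing the whole frame.
    weights: one weight per fused feature dimension (local + global).
    """
    if not candidates:
        return None

    def score(candidate):
        fused = list(candidate["local_feature"]) + list(global_feature)
        if len(fused) != len(weights):
            raise ValueError("weights must match the fused feature length")
        return sum(w * f for w, f in zip(weights, fused))

    return max(range(len(candidates)), key=lambda i: score(candidates[i]))
```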

[108] Wan-S2V: Audio-Driven Cinematic Video Generation

Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, Dechao Meng, Jinwei Qi, Penchong Qiao, Zhen Shen, Yafei Song, Ke Sun, Linrui Tian, Guangyuan Wang, Qi Wang, Zhongjian Wang, Jiayu Xiao, Sheng Xu, Bang Zhang, Peng Zhang, Xindi Zhang, Zhe Zhang, Jingren Zhou, Lian Zhuo

Main category: cs.CV

TL;DR: Wan-S2V is a new audio-driven character animation model that significantly outperforms current SOTA methods in film-level animation quality, handling complex cinematic elements like nuanced interactions, realistic body movements, and dynamic camera work.

Details

Motivation: Current audio-driven animation methods work well for speech and singing but fail to meet the sophisticated demands of film and television productions that require complex character interactions, realistic movements, and dynamic camera techniques.

Method: Proposed Wan-S2V model built upon the Wan framework, designed specifically for cinematic character animation with enhanced expressiveness and fidelity in complex production scenarios.

Result: Extensive experiments show Wan-S2V significantly outperforms cutting-edge models like Hunyuan-Avatar and Omnihuman. The method also demonstrates versatility in long-form video generation and precise video lip-sync editing applications.

Conclusion: Wan-S2V successfully addresses the long-standing challenge of achieving film-level character animation through audio-driven methods, offering superior performance and broader applicability compared to existing solutions.

Abstract: Current state-of-the-art (SOTA) methods for audio-driven character animation demonstrate promising performance for scenarios primarily involving speech and singing. However, they often fall short in more complex film and television productions, which demand sophisticated elements such as nuanced character interactions, realistic body movements, and dynamic camera work. To address this long-standing challenge of achieving film-level character animation, we propose an audio-driven model, which we refer to as Wan-S2V, built upon Wan. Our model achieves significantly enhanced expressiveness and fidelity in cinematic contexts compared to existing approaches. We conducted extensive experiments, benchmarking our method against cutting-edge models such as Hunyuan-Avatar and Omnihuman. The experimental results consistently demonstrate that our approach significantly outperforms these existing solutions. Additionally, we explore the versatility of our method through its applications in long-form video generation and precise video lip-sync editing.

[109] Decouple, Reorganize, and Fuse: A Multimodal Framework for Cancer Survival Prediction

Huayi Wang, Haochao Ying, Yuyang Xu, Qibo Qiu, Cheng Zhang, Danny Z. Chen, Ying Sun, Jian Wu

Main category: cs.CV

TL;DR: Proposes DeReF framework with random feature reorganization and dynamic MoE fusion to address limitations in cancer survival analysis multimodal fusion methods.

Details

Motivation: Existing multimodal fusion methods for cancer survival analysis have limitations: fixed fusion schemes limit dynamic feature combinations, and MoE-based methods restrict information interaction between decoupled features.

Method: Decoupling-Reorganization-Fusion (DeReF) framework with random feature reorganization strategy between modality decoupling and dynamic MoE fusion modules, plus regional cross-attention network for improved feature representation.

Result: Extensive experiments on Liver Cancer dataset and three TCGA public datasets confirm the method’s effectiveness.

Conclusion: DeReF enhances feature combination diversity and information interaction, improving generalization and performance in cancer survival prediction.

Abstract: Cancer survival analysis commonly integrates information across diverse medical modalities to make survival-time predictions. Existing methods primarily focus on extracting different decoupled features of modalities and performing fusion operations such as concatenation, attention, and MoE-based (Mixture-of-Experts) fusion. However, these methods still face two key challenges: i) fixed fusion schemes (concatenation and attention) can lead to model over-reliance on predefined feature combinations, limiting the dynamic fusion of decoupled features; ii) in MoE-based fusion methods, each expert network handles separate decoupled features, which limits information interaction among the decoupled features. To address these challenges, we propose a novel Decoupling-Reorganization-Fusion framework (DeReF), which devises a random feature reorganization strategy between the modality decoupling and dynamic MoE fusion modules. Its advantages are: i) it increases the diversity of feature combinations and granularity, enhancing the generalization ability of the subsequent expert networks; ii) it overcomes the problem of information closure and helps expert networks better capture information among decoupled features. Additionally, we incorporate a regional cross-attention network within the modality decoupling module to improve the representation quality of decoupled features. Extensive experimental results on our in-house Liver Cancer (LC) and three widely used TCGA public datasets confirm the effectiveness of our proposed method. The code will be made publicly available.
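Illustrative sketch (not from the paper): one reading of "random feature reorganization" followed by MoE fusion is to cut the decoupled feature vectors into chunks, shuffle the chunks so each expert sees a mix, and combine expert outputs with softmax gate weights. The chunking scheme, the function names, and the gating form are assumptions; the abstract does not specify how DeReF implements either step.

```python
import math
import random

def reorganize(features, chunk_size, rng):
    """Shuffle fixed-size chunks across equally long feature vectors.

    Every value is preserved; only its position (and host vector) changes.
    """
    length = len(features[0])
    if any(len(vec) != length for vec in features) or length % chunk_size:
        raise ValueError("vectors must share a length divisible by chunk_size")
    chunks = [vec[i:i + chunk_size] for vec in features
              for i in range(0, length, chunk_size)]
    rng.shuffle(chunks)
    per_vector = length // chunk_size
    return [sum(chunks[k * per_vector:(k + 1) * per_vector], [])
            for k in range(len(features))]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_fuse(expert_outputs, gate_logits):
    """Weighted sum of expert output vectors using softmax gate weights."""
    gates = softmax(gate_logits)
    dim = len(expert_outputs[0])
    return [sum(g * out[d] for g, out in zip(gates, expert_outputs))
            for d in range(dim)]
```

Passing a seeded `random.Random` keeps the shuffle reproducible across runs.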

[110] ROSE: Remove Objects with Side Effects in Videos

Chenxuan Miao, Yutong Feng, Jianshu Zeng, Zixiang Gao, Hantang Liu, Yunfeng Yan, Donglian Qi, Xi Chen, Bin Wang, Hengshuang Zhao

Main category: cs.CV

TL;DR: ROSE is a video object removal framework that specifically addresses side effects like shadows, reflections, and other environmental impacts, using synthetic data and diffusion transformers for superior performance.

Details

Motivation: Existing video object removal methods struggle with eliminating side effects (shadows, reflections, light, translucency, mirror effects) due to lack of paired video data for supervision.

Method: Uses a 3D rendering engine to generate a synthetic paired dataset, implements a video inpainting model built on a diffusion transformer, and adds reference-based erasing plus extra supervision that predicts the affected areas through differential masks.

Result: ROSE achieves superior performance compared to existing video object erasing models and generalizes well to real-world video scenarios, with comprehensive evaluation on the new ROSE-Bench benchmark.

Conclusion: The framework successfully addresses object side effect removal through synthetic data generation and specialized model architecture, demonstrating effective generalization to real-world applications.

Abstract: Video object removal has achieved advanced performance due to the recent success of video generative models. However, when addressing the side effects of objects, e.g., their shadows and reflections, existing works struggle to eliminate these effects owing to the scarcity of paired video data as supervision. This paper presents ROSE, termed Remove Objects with Side Effects, a framework that systematically studies the object’s effects on the environment, which can be categorized into five common cases: shadows, reflections, light, translucency and mirror. Given the challenges of curating paired videos exhibiting the aforementioned effects, we leverage a 3D rendering engine for synthetic data generation. We carefully construct a fully-automatic pipeline for data preparation, which simulates a large-scale paired dataset with diverse scenes, objects, shooting angles, and camera trajectories. ROSE is implemented as a video inpainting model built on a diffusion transformer. To localize all object-correlated areas, the entire video is fed into the model for reference-based erasing. Moreover, additional supervision is introduced to explicitly predict the areas affected by side effects, which can be revealed through the differential mask between the paired videos. To fully investigate the model performance on various side effect removal, we present a new benchmark, dubbed ROSE-Bench, incorporating both common scenarios and the five special side effects for comprehensive evaluation. Experimental results demonstrate that ROSE achieves superior performance compared to existing video object erasing models and generalizes well to real-world video scenarios. The project page is https://rose2025-inpaint.github.io/.
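Illustrative sketch (not from the paper): with a rendered pair of frames, one containing the object and one without it, the region touched by the object and its side effects can be approximated by thresholding the per-pixel difference. The grayscale nested-list representation and the `threshold` value are assumptions for illustration; the abstract does not state how ROSE computes its differential masks.

```python
def differential_mask(frame_with, frame_without, threshold=10):
    """Binary mask marking pixels that differ noticeably between paired frames.

    Both frames are equally sized grayscale images given as nested lists.
    A pixel is marked 1 when the absolute difference exceeds `threshold`.
    """
    return [
        [1 if abs(a - b) > threshold else 0 for a, b in zip(row_with, row_without)]
        for row_with, row_without in zip(frame_with, frame_without)
    ]
```

Because shadows and reflections change pixels outside the object's own silhouette, such a mask is typically larger than the object mask, which is what makes it useful as extra supervision.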

[111] OwlCap: Harmonizing Motion-Detail for Video Captioning via HMD-270K and Caption Set Equivalence Reward

Chunlin Zhong, Qiuxia Hou, Zhangjun Zhou, Shuang Hao, Haonan Lu, Yanhao Zhang, He Tang, Xiang Bai

Main category: cs.CV

TL;DR: OwlCap is a video captioning MLLM that addresses motion-detail imbalance through a new dataset (HMD-270K) and optimization method (CSER with GRPO), achieving significant improvements on both detail-focused and motion-focused benchmarks.

Details

Motivation: Existing video captioning methods suffer from motion-detail imbalance, where models overemphasize one aspect while neglecting the other, resulting in incomplete captions and lack of consistency in video understanding/generation.

Method: Two-pronged approach: 1) Data: Constructed HMD-270K dataset using Motion-Detail Fusion and Fine-Grained Examination pipeline; 2) Optimization: Introduced Caption Set Equivalence Reward based on Group Relative Policy Optimization for unit-to-set matching and bidirectional validation.

Result: OwlCap achieves significant improvements: +4.2 Acc on detail-focused VDC benchmark and +4.6 F1 on motion-focused DREAM-1K benchmark compared to baseline models.

Conclusion: The proposed solutions effectively address motion-detail imbalance in video captioning. OwlCap demonstrates superior performance, and the HMD-270K dataset will be publicly released to advance video captioning research.

Abstract: Video captioning aims to generate comprehensive and coherent descriptions of the video content, contributing to the advancement of both video understanding and generation. However, existing methods often suffer from motion-detail imbalance, as models tend to overemphasize one aspect while neglecting the other. This imbalance results in incomplete captions, which in turn leads to a lack of consistency in video understanding and generation. To address this issue, we propose solutions from two aspects: 1) Data aspect: We constructed the Harmonizing Motion-Detail 270K (HMD-270K) dataset through a two-stage pipeline: Motion-Detail Fusion (MDF) and Fine-Grained Examination (FGE). 2) Optimization aspect: We introduce the Caption Set Equivalence Reward (CSER) based on Group Relative Policy Optimization (GRPO). CSER enhances completeness and accuracy in capturing both motion and details through unit-to-set matching and bidirectional validation. Based on the HMD-270K supervised fine-tuning and GRPO post-training with CSER, we developed OwlCap, a powerful video captioning multi-modal large language model (MLLM) with motion-detail balance. Experimental results demonstrate that OwlCap achieves significant improvements compared to baseline models on two benchmarks: the detail-focused VDC (+4.2 Acc) and the motion-focused DREAM-1K (+4.6 F1). The HMD-270K dataset and OwlCap model will be publicly released to facilitate video captioning research community advancements.
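Illustrative sketch (not from the paper): GRPO scores each sampled caption relative to the other samples in its group, using rewards normalized by the group mean and standard deviation. The reward below is a stand-in for CSER: it treats a caption as a set of units and checks the match in both directions (precision and recall) with exact string matching. The actual unit extraction and matching used by CSER are not described in the abstract, so that part is an assumption.

```python
import statistics

def set_equivalence_reward(predicted_units, reference_units):
    """F1-style reward between two sets of caption units (exact-match stand-in).

    Precision checks predicted units against the reference set; recall checks
    reference units against the predicted set (the two validation directions).
    """
    pred, ref = set(predicted_units), set(reference_units)
    if not pred or not ref:
        return 0.0
    overlap = len(pred & ref)
    precision, recall = overlap / len(pred), overlap / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

A caption that covers only motion or only detail loses recall under this kind of reward, so it receives a lower advantage than a balanced caption sampled in the same group.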

[112] Clustering-based Feature Representation Learning for Oracle Bone Inscriptions Detection

Ye Tao, Xinran Fu, Honglin Pang, Xi Yang, Chuntao Li

Main category: cs.CV

TL;DR: A novel clustering-based feature space representation learning method for automated Oracle Bone Inscriptions detection that leverages font library prior knowledge to overcome degradation challenges in rubbing images.

Details

Motivation: Oracle Bone Inscriptions are crucial for understanding ancient Chinese civilization, but automated detection is challenging due to various degradation factors like noise and cracks that limit conventional detection networks.

Method: Proposes a clustering-based feature space representation learning method that uses Oracle Bones Character font library as prior knowledge. Incorporates a specialized loss function from clustering results to optimize feature representation, integrated into total network loss.

Result: Validated on two OBIs detection datasets using three mainstream frameworks (Faster R-CNN, DETR, Sparse R-CNN). All frameworks demonstrated significant performance improvements through extensive experimentation.

Conclusion: The proposed clustering-based representation learning method effectively enhances Oracle Bone Inscriptions detection by leveraging font library prior knowledge, overcoming degradation challenges in archaeological image analysis.

Abstract: Oracle Bone Inscriptions (OBIs) play a crucial role in understanding ancient Chinese civilization. The automated detection of OBIs from rubbing images represents a fundamental yet challenging task in digital archaeology, primarily due to various degradation factors including noise and cracks that limit the effectiveness of conventional detection networks. To address these challenges, we propose a novel clustering-based feature space representation learning method. Our approach uniquely leverages the Oracle Bones Character (OBC) font library dataset as prior knowledge to enhance feature extraction in the detection network through clustering-based representation learning. The method incorporates a specialized loss function derived from clustering results to optimize feature representation, which is then integrated into the total network loss. We validate the effectiveness of our method by conducting experiments on two OBI detection datasets using three mainstream detection frameworks: Faster R-CNN, DETR, and Sparse R-CNN. Through extensive experimentation, all frameworks demonstrate significant performance improvements.

[113] SFormer: SNR-guided Transformer for Underwater Image Enhancement from the Frequency Domain

Xin Tian, Yingtie Lei, Xiujun Zhang, Zimeng Li, Chi-Man Pun, Xuhang Chen

Main category: cs.CV

TL;DR: SFormer uses frequency domain SNR priors with Fourier attention to enhance underwater images, achieving state-of-the-art performance with 3.1 dB PSNR improvement.

Details

Motivation: Existing spatial domain SNR priors fail to separate cross-channel interference and provide limited noise suppression, requiring a frequency domain approach for better underwater image enhancement.

Method: Proposes Fourier Attention SNR-prior Transformer (FAST) with frequency domain decomposition and Frequency Adaptive Transformer (FAT) bottleneck using gated attention in a U-shaped architecture combining RGB and SNR-guided branches.

Result: Achieves 3.1 dB gain in PSNR and 0.08 improvement in SSIM on 4,800 paired images from UIEB, EUVP, and LSUI datasets, successfully restoring colors, textures, and contrast.

Conclusion: Frequency domain SNR priors with spectral decomposition and attention mechanisms significantly outperform spatial domain approaches for underwater image enhancement.

Abstract: Recent learning-based underwater image enhancement (UIE) methods have advanced by incorporating physical priors into deep neural networks, particularly using the signal-to-noise ratio (SNR) prior to reduce wavelength-dependent attenuation. However, spatial domain SNR priors have two limitations: (i) they cannot effectively separate cross-channel interference, and (ii) they provide limited help in amplifying informative structures while suppressing noise. To overcome these, we propose using the SNR prior in the frequency domain, decomposing features into amplitude and phase spectra for better channel modulation. We introduce the Fourier Attention SNR-prior Transformer (FAST), combining spectral interactions with SNR cues to highlight key spectral components. Additionally, the Frequency Adaptive Transformer (FAT) bottleneck merges low- and high-frequency branches using a gated attention mechanism to enhance perceptual quality. Embedded in a unified U-shaped architecture, these modules integrate a conventional RGB stream with an SNR-guided branch, forming SFormer. Trained on 4,800 paired images from UIEB, EUVP, and LSUI, SFormer surpasses recent methods with a 3.1 dB gain in PSNR and 0.08 in SSIM, successfully restoring colors, textures, and contrast in underwater scenes.
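
Illustrative sketch: the frequency-domain decomposition mentioned in the abstract (splitting features into amplitude and phase spectra) can be shown with a plain 2D FFT. This is only the decomposition step, not FAST or FAT; the helper names and the uniform gain of 1.5 are hypothetical.

```python
import numpy as np

def split_amplitude_phase(feature_map):
    """Return the amplitude and phase spectra of a 2D feature map."""
    spectrum = np.fft.fft2(feature_map)
    return np.abs(spectrum), np.angle(spectrum)

def merge_amplitude_phase(amplitude, phase):
    """Recombine the two spectra and return to the spatial domain."""
    spectrum = amplitude * np.exp(1j * phase)
    return np.fft.ifft2(spectrum).real

rng = np.random.default_rng(0)
channel = rng.random((8, 8))          # stand-in for one feature channel
amplitude, phase = split_amplitude_phase(channel)

# Round trip: recombining the untouched spectra recovers the input.
restored = merge_amplitude_phase(amplitude, phase)

# Hypothetical modulation: rescale the amplitude and keep the phase fixed.
gain = 1.5
modulated = merge_amplitude_phase(gain * amplitude, phase)
```

Because the transform is linear, a uniform amplitude gain simply rescales the channel; a learned, frequency-dependent modulation would instead reweight individual spectral components.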

[114] Hierarchical Spatio-temporal Segmentation Network for Ejection Fraction Estimation in Echocardiography Videos

Dongfang Wang, Jian Yang, Yizhe Zhang, Tao Zhou

Main category: cs.CV

TL;DR: Proposed Hierarchical Spatio-temporal Segmentation Network (HSSN) for echocardiography video segmentation to improve Ejection Fraction estimation accuracy by combining local detail modeling with global dynamic perception.

Details

Motivation: Existing echocardiography segmentation methods achieve good segmentation performance but perform poorly in EF estimation due to issues like local error accumulation from single-frame processing or detail neglect from multi-frame approaches.

Method: Hierarchical network design with low-level stages using CNNs for single-frame detail preservation and high-level stages using Mamba architecture for spatio-temporal relationship capture. Introduces Spatio-temporal Cross Scan (STCS) module for long-range context integration across frames and positions.

Result: The proposed approach addresses EF calculation biases caused by ultrasound image noise and other factors through balanced single-frame and multi-frame processing.

Conclusion: The hierarchical spatio-temporal segmentation network synergizes local detail modeling with global dynamic perception to improve EF estimation accuracy in echocardiography video analysis.

Abstract: Automated segmentation of the left ventricular endocardium in echocardiography videos is a key research area in cardiology. It aims to provide accurate assessment of cardiac structure and function through Ejection Fraction (EF) estimation. Although existing studies have achieved good segmentation performance, their results do not perform well in EF estimation. In this paper, we propose a Hierarchical Spatio-temporal Segmentation Network (HSSN) for echocardiography video, aiming to improve EF estimation accuracy by synergizing local detail modeling with global dynamic perception. The network employs a hierarchical design, with low-level stages using convolutional networks to process single-frame images and preserve details, while high-level stages utilize the Mamba architecture to capture spatio-temporal relationships. The hierarchical design balances single-frame and multi-frame processing, avoiding issues such as local error accumulation when relying solely on single frames or neglecting details when using only multi-frame data. To overcome local spatio-temporal limitations, we propose the Spatio-temporal Cross Scan (STCS) module, which integrates long-range context through skip scanning across frames and positions. This approach helps mitigate EF calculation biases caused by ultrasound image noise and other factors.
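
Illustrative sketch: the abstract does not spell out the EF computation, so the snippet uses the standard clinical definition, EF = (EDV - ESV) / EDV, where EDV and ESV are the end-diastolic and end-systolic left-ventricular volumes. It assumes per-frame volumes have already been derived from the segmentation masks (that conversion is not described in the abstract), and the sample values are made up.

```python
def ejection_fraction(volumes_ml):
    """EF in percent from per-frame left-ventricular volumes over one cycle."""
    edv = max(volumes_ml)  # end-diastolic volume: largest volume in the cycle
    esv = min(volumes_ml)  # end-systolic volume: smallest volume in the cycle
    if edv <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return 100.0 * (edv - esv) / edv

# Hypothetical per-frame volumes (mL) for a single cardiac cycle.
volumes = [120.0, 110.0, 85.0, 60.0, 50.0, 70.0, 100.0, 118.0]
ef = ejection_fraction(volumes)
```

Since EF depends only on the two extreme frames, a segmentation error on either one shifts the estimate directly, which is consistent with the abstract's concern about noise-induced EF bias.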

[115] Feature-Space Planes Searcher: A Universal Domain Adaptation Framework for Interpretability and Computational Efficiency

Zhitong Cheng, Yiran Jiang, Yulong Ge, Yufeng Li, Zhongheng Qin, Rongzhi Lin, Jianwei Ma

Main category: cs.CV

TL;DR: FPS is a novel domain adaptation framework that optimizes decision boundaries while keeping feature encoders frozen, leveraging domain-invariant geometric patterns in pre-trained models to address domain shift more efficiently than fine-tuning approaches.

Details

Motivation: Current UDA methods rely on inefficient fine-tuning of feature extractors, which has limitations in interpretability and scalability. Domain shifts primarily manifest as boundary misalignment rather than feature degradation in pre-trained models.

Method: Feature-space Planes Searcher (FPS) optimizes decision boundaries by leveraging geometric patterns (intra-class clustering and inter-class separation) in frozen pre-trained feature encoders, enabling offline feature extraction and full-dataset optimization in a single computation cycle.

Result: FPS achieves competitive or superior performance to state-of-the-art methods on public benchmarks, scales efficiently with multimodal large models, and shows versatility across diverse domains including protein structure prediction, remote sensing classification, and earthquake detection.

Conclusion: FPS provides a simple, effective, and generalizable paradigm for transfer learning that reduces memory and computational costs while maintaining interpretability, making it particularly suitable for domain adaptation tasks.

Abstract: Domain shift, characterized by degraded model performance during transition from labeled source domains to unlabeled target domains, poses a persistent challenge for deploying deep learning systems. Current unsupervised domain adaptation (UDA) methods predominantly rely on fine-tuning feature extractors - an approach limited by inefficiency, reduced interpretability, and poor scalability to modern architectures. Our analysis reveals that models pretrained on large-scale data exhibit domain-invariant geometric patterns in their feature space, characterized by intra-class clustering and inter-class separation, thereby preserving transferable discriminative structures. These findings indicate that domain shifts primarily manifest as boundary misalignment rather than feature degradation. Unlike fine-tuning entire pre-trained models - which risks introducing unpredictable feature distortions - we propose the Feature-space Planes Searcher (FPS): a novel domain adaptation framework that optimizes decision boundaries by leveraging these geometric patterns while keeping the feature encoder frozen. This streamlined approach enables interpretative analysis of adaptation while substantially reducing memory and computational costs through offline feature extraction, permitting full-dataset optimization in a single computation cycle. Evaluations on public benchmarks demonstrate that FPS achieves competitive or superior performance to state-of-the-art methods. FPS scales efficiently with multimodal large models and shows versatility across diverse domains including protein structure prediction, remote sensing classification, and earthquake detection. We anticipate FPS will provide a simple, effective, and generalizable paradigm for transfer learning, particularly in domain adaptation tasks.
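
Illustrative sketch (a stand-in, not the FPS algorithm): the abstract's core idea is to keep features fixed and only move the decision boundary. The toy below treats synthetic 2D points as pre-extracted frozen features, builds class centroids on the source domain, and then re-estimates them on unlabeled target features by nearest-centroid self-training. The boundary between two centroids is a hyperplane, so moving the centroids moves the plane. All names, the shift, and the noise level are assumptions.

```python
import numpy as np

def nearest_centroid(features, centroids):
    """Assign each feature vector to its closest centroid."""
    distances = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def adapt_centroids(target_features, source_centroids, steps=10):
    """Re-estimate centroids on unlabeled target features (encoder untouched)."""
    centroids = source_centroids.copy()
    for _ in range(steps):
        pseudo = nearest_centroid(target_features, centroids)
        for c in range(len(centroids)):
            members = target_features[pseudo == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return centroids

rng = np.random.default_rng(0)
class_means = np.array([[-2.0, 0.0], [2.0, 0.0]])
labels = np.repeat([0, 1], 200)
source = class_means[labels] + 0.5 * rng.standard_normal((400, 2))
# Target domain: same clusters, shifted, so the source boundary is misaligned.
target = class_means[labels] + np.array([1.5, 0.0]) + 0.5 * rng.standard_normal((400, 2))

source_centroids = np.stack([source[labels == c].mean(axis=0) for c in (0, 1)])
source_acc = (nearest_centroid(target, source_centroids) == labels).mean()
adapted = adapt_centroids(target, source_centroids)
adapted_acc = (nearest_centroid(target, adapted) == labels).mean()  # labels used for evaluation only
```

The features are computed once and reused across every adaptation step, which is the kind of offline, full-dataset optimization the abstract describes.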

[116] A Novel Deep Hybrid Framework with Ensemble-Based Feature Optimization for Robust Real-Time Human Activity Recognition

Wasi Ullah, Yasir Noman Khalid, Saddam Hussain Khan

Main category: cs.CV

TL;DR: An optimized hybrid deep learning framework for Human Activity Recognition that combines customized InceptionV3, LSTM, and ensemble feature selection to achieve high accuracy with minimal features for real-time deployment.

Details

Motivation: HAR systems face challenges with high computational costs, redundant features, and limited scalability in real-time scenarios, requiring optimized solutions for practical deployment.

Method: Integrates customized InceptionV3 for spatial feature extraction, LSTM for temporal modeling, and ensemble-based genetic algorithm with ADFSA for feature selection to balance accuracy, redundancy, and complexity.

Result: Achieves 99.65% recognition accuracy on UCF-YouTube dataset, reduces features to as few as 7, and enhances inference time for real-time deployment on edge devices.

Conclusion: The lightweight and scalable framework enables practical HAR applications in resource-aware environments like public safety, assistive technology, and autonomous monitoring systems.

Abstract: Human Activity Recognition (HAR) plays a pivotal role in various applications, including smart surveillance, healthcare, assistive technologies, sports analytics, etc. However, HAR systems still face critical challenges, including high computational costs, redundant features, and limited scalability in real-time scenarios. An optimized hybrid deep learning framework is introduced that integrates a customized InceptionV3, an LSTM architecture, and a novel ensemble-based feature selection strategy. The proposed framework first extracts spatial descriptors using the customized InceptionV3 model, which captures multilevel contextual patterns, region homogeneity, and fine-grained localization cues. The temporal dependencies across frames are then modeled using LSTMs to effectively encode motion dynamics. Finally, an ensemble-based genetic algorithm with Adaptive Dynamic Fitness Sharing and Attention (ADFSA) is employed to select a compact and optimized feature set by dynamically balancing objectives such as accuracy, redundancy, uniqueness, and complexity reduction. Consequently, the selected feature subsets, which are both diverse and discriminative, enable various lightweight machine learning classifiers to achieve accurate and robust HAR in heterogeneous environments. Experimental results on the robust UCF-YouTube dataset, which presents challenges such as occlusion, cluttered backgrounds, motion dynamics, and poor illumination, demonstrate good performance. The proposed approach achieves 99.65% recognition accuracy, reduces features to as few as 7, and enhances inference time. The lightweight and scalable nature of the HAR system supports real-time deployment on edge devices such as Raspberry Pi, enabling practical applications in intelligent, resource-aware environments, including public safety, assistive technology, and autonomous monitoring systems.

[117] ColorGS: High-fidelity Surgical Scene Reconstruction with Colored Gaussian Splatting

Qun Ji, Peng Li, Mingqiang Wei

Main category: cs.CV

TL;DR: ColorGS improves surgical scene reconstruction by introducing adaptive color encoding and enhanced deformation modeling, achieving state-of-the-art performance with 39.85 PSNR while maintaining real-time rendering.

Details

Motivation: Existing methods struggle with capturing subtle color variations and modeling global deformations in endoscopic videos, limiting reconstruction fidelity for surgical applications.

Method: Proposes ColorGS framework with: 1) Colored Gaussian Primitives using dynamic anchors with learnable color parameters for adaptive texture encoding, and 2) Enhanced Deformation Model combining time-aware basis functions with time-independent deformations.

Result: Achieves PSNR of 39.85 (1.5 higher than prior 3DGS methods) and SSIM of 97.25% on DaVinci robotic surgery videos and benchmark datasets (EndoNeRF, StereoMIS), with real-time rendering efficiency.

Conclusion: ColorGS advances surgical scene reconstruction by balancing high fidelity with computational practicality, making it suitable for intraoperative guidance and AR/VR applications.

Abstract: High-fidelity reconstruction of deformable tissues from endoscopic videos remains challenging due to the limitations of existing methods in capturing subtle color variations and modeling global deformations. While 3D Gaussian Splatting (3DGS) enables efficient dynamic reconstruction, its fixed per-Gaussian color assignment struggles with intricate textures, and linear deformation modeling fails to model consistent global deformation. To address these issues, we propose ColorGS, a novel framework that integrates spatially adaptive color encoding and enhanced deformation modeling for surgical scene reconstruction. First, we introduce Colored Gaussian Primitives, which employ dynamic anchors with learnable color parameters to adaptively encode spatially varying textures, significantly improving color expressiveness under complex lighting and tissue similarity. Second, we design an Enhanced Deformation Model (EDM) that combines time-aware Gaussian basis functions with learnable time-independent deformations, enabling precise capture of both localized tissue deformations and global motion consistency caused by surgical interactions. Extensive experiments on DaVinci robotic surgery videos and benchmark datasets (EndoNeRF, StereoMIS) demonstrate that ColorGS achieves state-of-the-art performance, attaining a PSNR of 39.85 (1.5 higher than prior 3DGS-based methods) and superior SSIM (97.25%) while maintaining real-time rendering efficiency. Our work advances surgical scene reconstruction by balancing high fidelity with computational practicality, critical for intraoperative guidance and AR/VR applications.
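
Illustrative sketch: one plausible reading of "time-aware Gaussian basis functions with learnable time-independent deformations" is a displacement made of a weighted sum of Gaussian bumps in time plus a constant offset. The exact EDM formulation is not given in the abstract; the function name, shapes, and values below are hypothetical.

```python
import numpy as np

def deformation(t, weights, centers, widths, static_offset):
    """Displacement of one primitive at time t.

    weights: (K, 3) per-basis displacement vectors
    centers, widths: (K,) location and scale of each Gaussian basis in time
    static_offset: (3,) time-independent displacement
    """
    basis = np.exp(-0.5 * ((t - centers) / widths) ** 2)  # shape (K,)
    return basis @ weights + static_offset

weights = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
centers = np.array([0.2, 0.8])
widths = np.array([0.05, 0.05])
static_offset = np.array([0.1, 0.1, 0.1])

at_first_peak = deformation(0.2, weights, centers, widths, static_offset)
between_peaks = deformation(0.5, weights, centers, widths, static_offset)
```

At t = 0.2 only the first basis is active, giving roughly [1.1, 0.1, 0.1]; at t = 0.5 both bases have decayed and the displacement falls back to the static offset.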

[118] Class-wise Flooding Regularization for Imbalanced Image Classification

Hiroaki Aizawa, Yuta Naito, Kohei Fukuda

Main category: cs.CV

TL;DR: Class-wise flooding regularization improves minority class performance in imbalanced datasets by applying class-specific flooding levels based on frequency.

Details

Motivation: Neural networks trained on imbalanced datasets tend to favor majority classes, leading to poor performance on minority classes due to overfitting and memorization.

Method: Extends flooding regularization to class level by assigning class-specific flooding thresholds based on class frequencies - higher thresholds for majority classes to suppress overfitting, lower thresholds for minority classes to allow sufficient learning.

Result: Validated on imbalanced image classification, the method improves minority class performance and achieves better overall generalization compared to conventional flooding regularizations.

Conclusion: Class-wise flooding regularization effectively addresses class imbalance by tailoring regularization strength per class, preventing majority class overfitting while enabling minority class learning.

Abstract: The purpose of training neural networks is to achieve high generalization performance on unseen inputs. However, when trained on imbalanced datasets, a model’s prediction tends to favor majority classes over minority classes, leading to significant degradation in the recognition performance of minority classes. To address this issue, we propose class-wise flooding regularization, an extension of flooding regularization applied at the class level. Flooding is a regularization technique that mitigates overfitting by preventing the training loss from falling below a predefined threshold, known as the flooding level, thereby discouraging memorization. Our proposed method assigns a class-specific flooding level based on class frequencies. By doing so, it suppresses overfitting in majority classes while allowing sufficient learning for minority classes. We validate our approach on imbalanced image classification. Compared to conventional flooding regularizations, our method improves the classification performance of minority classes and achieves better overall generalization.
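
Illustrative sketch: standard flooding replaces a training loss L with |L - b| + b, where b is the flooding level. The class-wise variant below applies that transform to each class's mean loss with its own level. The abstract only says levels are based on class frequencies, so the linear mapping from frequency to level, the `max_level` value, and the count-weighted aggregation are assumptions.

```python
import numpy as np

def flooding_levels(class_counts, max_level=0.1):
    """Hypothetical mapping: more frequent classes get a higher flooding level."""
    counts = np.asarray(class_counts, dtype=float)
    return max_level * counts / counts.max()

def classwise_flooded_loss(per_sample_loss, labels, levels):
    """Apply |L_c - b_c| + b_c to each class's mean loss, then average by count."""
    per_sample_loss = np.asarray(per_sample_loss, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        mask = labels == c
        class_loss = per_sample_loss[mask].mean()
        flooded = abs(class_loss - levels[c]) + levels[c]
        total += mask.sum() * flooded
    return total / len(labels)

levels = flooding_levels([900, 100])   # majority class 0, minority class 1
losses = [0.02, 0.04, 0.30, 0.50]      # per-sample losses in one batch
labels = [0, 0, 1, 1]
loss = classwise_flooded_loss(losses, labels, levels)
```

Here the majority class's mean loss (0.03) is below its level (0.1), so its term becomes 0.17 and, in an autodiff framework, its gradient would flip sign. The minority class's loss (0.40) is above its level and passes through unchanged.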

[119] Flatness-aware Curriculum Learning via Adversarial Difficulty

Hiroaki Aizawa, Yoshikazu Hayashi

Main category: cs.CV

TL;DR: Proposes Adversarial Difficulty Measure (ADM) to combine Curriculum Learning with Sharpness-Aware Minimization, overcoming the challenge of evaluating sample difficulty in flat minima regions.

Details

Motivation: Neural networks suffer from overfitting and poor generalization. While Curriculum Learning (CL) addresses this by selecting samples based on difficulty, and Sharpness-Aware Minimization (SAM) improves robustness through flat minima, combining them is challenging because flat regions make traditional difficulty measures ineffective.

Method: Developed Adversarial Difficulty Measure (ADM) that quantifies adversarial vulnerability by measuring normalized loss gap between original and adversarial examples. Incorporated ADM into CL-based training with SAM to dynamically assess sample difficulty.

Result: Evaluated on image classification, fine-grained recognition, and domain generalization tasks. Outperformed existing curriculum-based and flatness-aware training strategies while preserving strengths of both CL and SAM.

Conclusion: ADM successfully bridges the gap between Curriculum Learning and Sharpness-Aware Minimization, providing an effective difficulty measure that remains informative in flat regions and improves generalization performance.

Abstract: Neural networks trained by empirical risk minimization often suffer from overfitting, especially to specific samples or domains, which leads to poor generalization. Curriculum Learning (CL) addresses this issue by selecting training samples based on the difficulty. From the optimization perspective, methods such as Sharpness-Aware Minimization (SAM) improve robustness and generalization by seeking flat minima. However, combining CL with SAM is not straightforward. In flat regions, both the loss values and the gradient norms tend to become uniformly small, which makes it difficult to evaluate sample difficulty and design an effective curriculum. To overcome this problem, we propose the Adversarial Difficulty Measure (ADM), which quantifies adversarial vulnerability by leveraging the robustness properties of models trained toward flat minima. Unlike loss- or gradient-based measures, which become ineffective as training progresses into flatter regions, ADM remains informative by measuring the normalized loss gap between original and adversarial examples. We incorporate ADM into CL-based training with SAM to dynamically assess sample difficulty. We evaluated our approach on image classification tasks, fine-grained recognition, and domain generalization. The results demonstrate that our method preserves the strengths of both CL and SAM while outperforming existing curriculum-based and flatness-aware training strategies.
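
Illustrative sketch: the abstract defines ADM as a normalized loss gap between original and adversarial examples but does not give the attack or the normalization. The toy below assumes a logistic-regression model, a single FGSM step as the attack, and division by the adversarial loss as the normalization. All three choices, and every name, are hypothetical.

```python
import numpy as np

def bce_loss(w, b, x, y):
    """Per-sample binary cross-entropy of a logistic model."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    eps = 1e-12
    return -(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))

def fgsm(w, b, x, y, epsilon):
    """One signed-gradient step on the inputs that increases the loss."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    grad_x = (p - y)[:, None] * w[None, :]  # d(loss)/d(x) for the logistic model
    return x + epsilon * np.sign(grad_x)

def adversarial_difficulty(w, b, x, y, epsilon=0.1):
    """Assumed normalization: (adversarial loss - clean loss) / adversarial loss."""
    clean = bce_loss(w, b, x, y)
    adv = bce_loss(w, b, fgsm(w, b, x, y, epsilon), y)
    return (adv - clean) / (adv + 1e-12)

rng = np.random.default_rng(0)
w, b = np.array([1.0, -2.0]), 0.0
x = rng.standard_normal((5, 2))
y = (x @ w + b > 0).astype(float)

scores = adversarial_difficulty(w, b, x, y)
curriculum_order = np.argsort(scores)  # lowest score first
```

Because the score is a relative gap, it can stay informative even when the clean loss itself is uniformly small, which is the situation the abstract describes for flat regions.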

[120] Are All Marine Species Created Equal? Performance Disparities in Underwater Object Detection

Melanie Wille, Tobias Fischer, Scarlett Raine

Main category: cs.CV

TL;DR: This paper investigates performance disparities in underwater object detection, particularly for scallop species, finding that localization challenges (foreground-background discrimination) are the primary issue rather than data quantity or classification problems.

Details

Motivation: Underwater object detection faces unique challenges including degraded image quality and imbalanced class distribution, with unclear underlying causes for why some marine species are detected better than others.

Method: The researchers manipulated the DUO dataset to separate object detection into localization and classification tasks, using YOLO11 and TIDE for localization analysis, and conducted classification experiments with balanced data to isolate performance factors.

Result: Localization analysis revealed foreground-background discrimination as the most problematic stage regardless of data quantity. Classification experiments showed persistent precision gaps even with balanced data, indicating intrinsic feature-based challenges beyond data scarcity.

Conclusion: The study recommends using imbalanced distributions when prioritizing precision and balanced distributions when prioritizing recall. Improving under-performing classes should focus on algorithmic advances, especially within localization modules.

Abstract: Underwater object detection is critical for monitoring marine ecosystems but poses unique challenges, including degraded image quality, imbalanced class distribution, and distinct visual characteristics. Not every species is detected equally well, yet underlying causes remain unclear. We address two key research questions: 1) What factors beyond data quantity drive class-specific performance disparities? 2) How can we systematically improve detection of under-performing marine species? We manipulate the DUO dataset to separate the object detection task into localization and classification and investigate the under-performance of the scallop class. Localization analysis using YOLO11 and TIDE finds that foreground-background discrimination is the most problematic stage regardless of data quantity. Classification experiments reveal persistent precision gaps even with balanced data, indicating intrinsic feature-based challenges beyond data scarcity and inter-class dependencies. We recommend imbalanced distributions when prioritizing precision, and balanced distributions when prioritizing recall. Improving under-performing classes should focus on algorithmic advances, especially within localization modules. We publicly release our code and datasets.

[121] Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vectorized Drawings

Feiwei Qin, Shichao Lu, Junhao Hou, Changmiao Wang, Meie Fang, Ligang Liu

Main category: cs.CV

TL;DR: Drawing2CAD is a framework that generates parametric CAD models from 2D engineering drawings using sequence-to-sequence learning with transformer architecture.

Details

Motivation: Traditional CAD generative methods diverge from industrial workflows that start with 2D engineering drawings, and automatic generation from these drawings remains underexplored despite being critical for engineering design.

Method: Uses a dual-decoder transformer architecture with network-friendly vector primitive representation, decoupling command type and parameter generation while maintaining precise correspondence, and employs a soft target distribution loss function.

Result: The method effectively transforms 2D vector drawings into parametric CAD models while preserving geometric precision and design intent, as demonstrated through experiments on the created CAD-VGDrawing dataset.

Conclusion: Drawing2CAD successfully bridges the gap between traditional 2D engineering workflows and modern CAD generation, providing an effective solution for converting engineering drawings into parametric CAD models with maintained precision.

Abstract: Computer-Aided Design (CAD) generative modeling is driving significant innovations across industrial applications. Recent works have shown remarkable progress in creating solid models from various inputs such as point clouds, meshes, and text descriptions. However, these methods fundamentally diverge from traditional industrial workflows that begin with 2D engineering drawings. The automatic generation of parametric CAD models from these 2D vector drawings remains underexplored despite being a critical step in engineering design. To address this gap, our key insight is to reframe CAD generation as a sequence-to-sequence learning problem where vector drawing primitives directly inform the generation of parametric CAD operations, preserving geometric precision and design intent throughout the transformation process. We propose Drawing2CAD, a framework with three key technical components: a network-friendly vector primitive representation that preserves precise geometric information, a dual-decoder transformer architecture that decouples command type and parameter generation while maintaining precise correspondence, and a soft target distribution loss function accommodating inherent flexibility in CAD parameters. To train and evaluate Drawing2CAD, we create CAD-VGDrawing, a dataset of paired engineering drawings and parametric CAD models, and conduct thorough experiments to demonstrate the effectiveness of our method. Code and dataset are available at https://github.com/lllssc/Drawing2CAD.
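
Illustrative sketch: a "soft target distribution loss" for quantized CAD parameters can be read as cross-entropy against a distribution that also gives some mass to bins near the true one. The abstract does not give the exact form; the tolerance window, the decay factor, and the function names below are assumptions.

```python
import numpy as np

def soft_target(true_bin, num_bins, tolerance=1, decay=0.5):
    """Spread target mass over bins within `tolerance` of the true bin."""
    distance = np.abs(np.arange(num_bins) - true_bin)
    weights = np.where(distance <= tolerance, decay ** distance, 0.0)
    return weights / weights.sum()

def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target and softmax(logits)."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -(target * log_probs).sum()

target = soft_target(true_bin=3, num_bins=8)   # mass 0.25 / 0.5 / 0.25 on bins 2, 3, 4

near_logits = np.zeros(8)
near_logits[3] = 5.0                            # prediction peaked on the true bin
far_logits = np.zeros(8)
far_logits[7] = 5.0                             # prediction peaked far away

near_loss = soft_cross_entropy(near_logits, target)
far_loss = soft_cross_entropy(far_logits, target)
```

A prediction one bin off is penalized less than one that is far away, which matches the abstract's point that CAD parameters have some inherent flexibility.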

[122] Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods

Qinqian Lei, Bo Wang, Robby T. Tan

Main category: cs.CV

TL;DR: New benchmark reformulates HOI detection as multiple-choice task to better evaluate VLMs and specialized methods, avoiding penalization of valid predictions in ambiguous cases.

Details

Motivation: Existing HOI benchmarks use exact class matching that penalizes VLMs' multiple valid interpretations of ambiguous human-object interactions, requiring a more flexible evaluation protocol.

Method: Introduces a new benchmark that frames HOI detection as multiple-answer multiple-choice task with ground-truth positives and curated negatives to reduce ambiguity.

Result: Enables direct comparison between general-purpose VLMs and specialized HOI methods while accommodating VLMs’ generative nature and reducing false penalization.

Conclusion: The proposed benchmark provides fair evaluation for both VLMs and HOI-specific methods, offering new insights into HOI understanding progress.

Abstract: Prior human-object interaction (HOI) detection methods have integrated early vision-language models (VLMs) such as CLIP, but only as supporting components within their frameworks. In contrast, recent advances in large, generative VLMs suggest that these models may already possess strong ability to understand images involving HOI. This naturally raises an important question: can general-purpose standalone VLMs effectively solve HOI detection, and how do they compare with specialized HOI methods? Answering this requires a benchmark that can accommodate both paradigms. However, existing HOI benchmarks such as HICO-DET were developed before the emergence of modern VLMs, and their evaluation protocols require exact matches to annotated HOI classes. This is poorly aligned with the generative nature of VLMs, which often yield multiple valid interpretations in ambiguous cases. For example, a static image may capture a person mid-motion with a frisbee, which can plausibly be interpreted as either “throwing” or “catching”. When only “catching” is annotated, the other, though equally plausible for the image, is marked incorrect when exact matching is used. As a result, correct predictions might be penalized, affecting both VLMs and HOI-specific methods. To avoid penalizing valid predictions, we introduce a new benchmark that reformulates HOI detection as a multiple-answer multiple-choice task, where each question includes only ground-truth positive options and a curated set of negatives that are constructed to reduce ambiguity (e.g., when “catching” is annotated, “throwing” is not selected as a negative to avoid penalizing valid predictions). The proposed evaluation protocol is the first of its kind for both VLMs and HOI methods, enabling direct comparison and offering new insight into the current state of progress in HOI understanding.
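
Illustrative sketch: the question construction described in the abstract (ground-truth positives plus negatives that exclude plausible alternatives) can be mocked up as below. The verb list, the `plausible_alternatives` map, and the precision/recall scoring are hypothetical; the abstract does not state how negatives are curated or which metric the benchmark reports.

```python
import random

def build_question(annotated, all_verbs, plausible_alternatives, num_negatives, seed=0):
    """Return shuffled options and the set of correct answers."""
    excluded = set(annotated)
    for verb in annotated:
        # Never offer an equally plausible interaction as a negative.
        excluded |= plausible_alternatives.get(verb, set())
    candidates = sorted(set(all_verbs) - excluded)
    rng = random.Random(seed)
    negatives = rng.sample(candidates, min(num_negatives, len(candidates)))
    options = sorted(set(annotated)) + negatives
    rng.shuffle(options)
    return options, set(annotated)

def score(selected, positives):
    """Set-based precision and recall for a multiple-answer response."""
    selected = set(selected)
    hits = len(selected & positives)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(positives) if positives else 0.0
    return precision, recall

all_verbs = ["catching", "throwing", "holding", "riding", "kicking", "eating"]
plausible_alternatives = {"catching": {"throwing"}}
options, positives = build_question(["catching"], all_verbs, plausible_alternatives, num_negatives=3)
```

With "catching" annotated, "throwing" never appears among the options, so a model that would have answered "throwing" is not penalized for a defensible reading of the image.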

[123] Beyond the Textual: Generating Coherent Visual Options for MCQs

Wanqiang Wang, Longzhu He, Wei Zheng

Main category: cs.CV

TL;DR: CmOS is a novel framework for generating educational multiple-choice questions with visual options using multimodal reasoning and retrieval-augmented generation to create plausible visual distractors.

Details

Motivation: Previous research focused on textual MCQ options but overlooked visual options, and generating high-quality distractors manually is costly and not scalable.

Method: Integrates Multimodal Chain-of-Thought reasoning and Retrieval-Augmented Generation to produce semantically plausible and visually similar answer/distractors, plus a discrimination module for identifying content suitable for visual options.

Result: Experimental results show CmOS outperforms existing methods in content discrimination, question generation, and visual option generation across various subjects and educational levels.

Conclusion: The proposed CmOS framework effectively addresses the challenge of generating educational MCQs with high-quality visual options, demonstrating superior performance over previous approaches.

Abstract: Multiple-choice questions (MCQs) play a crucial role in fostering deep thinking and knowledge integration in education. However, previous research has primarily focused on generating MCQs with textual options and has largely overlooked visual options. Moreover, generating high-quality distractors remains a major challenge due to the high cost and limited scalability of manual authoring. To tackle these problems, we propose Cross-modal Options Synthesis (CmOS), a novel framework for generating educational MCQs with visual options. Our framework integrates a Multimodal Chain-of-Thought (MCoT) reasoning process and Retrieval-Augmented Generation (RAG) to produce semantically plausible and visually similar answers and distractors. It also includes a discrimination module to identify content suitable for visual options. Experimental results on test tasks demonstrate the superiority of CmOS in content discrimination, question generation, and visual option generation over existing methods across various subjects and educational levels.

[124] Design, Implementation and Evaluation of a Real-Time Remote Photoplethysmography (rPPG) Acquisition System for Non-Invasive Vital Sign Monitoring

Constantino Álvarez Casado, Sasan Sharifipour, Manuel Lage Cañellas, Nhi Nguyen, Le Nguyen, Miguel Bordallo López

Main category: cs.CV

TL;DR: Real-time remote photoplethysmography system optimized for low-power devices that extracts heart rate, respiratory rate, and oxygen saturation from facial video streams at 30fps using hybrid programming model.

Details

Motivation: Address challenges in deploying real-time physiological monitoring systems on resource-constrained platforms, including scalability, interoperability, and performance issues in smart environments.

Method: Built on Face2PPG pipeline with multithreaded architecture for concurrent video capture, processing, network communication, and GUI updates. Uses hybrid programming model combining Functional Reactive Programming (FRP) and Actor Model for event-driven processing and task parallelization.

Result: System achieves continuous reliable operation at 30fps with adaptive feedback through collaborative user interface. Includes HTTP server for video streaming and RESTful API for vital sign retrieval.

Conclusion: Addresses key challenges in real-time biosignal monitoring and offers practical solutions for optimizing performance in healthcare and human-computer interaction applications on low-power devices.

Abstract: The growing integration of smart environments and low-power computing devices, coupled with mass-market sensor technologies, is driving advancements in remote and non-contact physiological monitoring. However, deploying these systems in real-time on resource-constrained platforms introduces significant challenges related to scalability, interoperability, and performance. This paper presents a real-time remote photoplethysmography (rPPG) system optimized for low-power devices, designed to extract physiological signals, such as heart rate (HR), respiratory rate (RR), and oxygen saturation (SpO2), from facial video streams. The system is built on the Face2PPG pipeline, which processes video frames sequentially for rPPG signal extraction and analysis, while leveraging a multithreaded architecture to manage video capture, real-time processing, network communication, and graphical user interface (GUI) updates concurrently. This design ensures continuous, reliable operation at 30 frames per second (fps), with adaptive feedback through a collaborative user interface to guide optimal signal capture conditions. The network interface includes both an HTTP server for continuous video streaming and a RESTful API for on-demand vital sign retrieval. To ensure accurate performance despite the limitations of low-power devices, we use a hybrid programming model combining Functional Reactive Programming (FRP) and the Actor Model, allowing event-driven processing and efficient task parallelization. The system is evaluated under real-time constraints, demonstrating robustness while minimizing computational overhead. Our work addresses key challenges in real-time biosignal monitoring, offering practical solutions for optimizing performance in modern healthcare and human-computer interaction applications.
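To make the heart-rate step concrete, here is a minimal sketch of how a pulse rate can be read off an rPPG trace by scanning a plausible frequency band for the strongest component. It is not the Face2PPG pipeline: the function name `estimate_heart_rate_bpm`, the 0.7 to 4.0 Hz band, and the synthetic trace are all assumptions made for illustration.

```python
import math

def estimate_heart_rate_bpm(signal, fps, lo_hz=0.7, hi_hz=4.0, step_hz=0.01):
    """Return the dominant frequency of `signal` within [lo_hz, hi_hz], in beats per minute."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the constant (DC) component
    best_freq, best_power = lo_hz, -1.0
    steps = int(round((hi_hz - lo_hz) / step_hz))
    for k in range(steps + 1):
        freq = lo_hz + k * step_hz
        # Project the signal onto a cosine/sine pair at this candidate frequency.
        re = sum(x * math.cos(2 * math.pi * freq * i / fps) for i, x in enumerate(centered))
        im = sum(x * math.sin(2 * math.pi * freq * i / fps) for i, x in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_freq, best_power = freq, power
    return best_freq * 60.0

# Synthetic 10-second trace at 30 fps containing a 1.2 Hz (72 bpm) pulse (assumed data).
fps = 30
trace = [0.5 + 0.05 * math.sin(2 * math.pi * 1.2 * i / fps) for i in range(10 * fps)]
bpm = estimate_heart_rate_bpm(trace, fps)
```

A real system would first extract the trace from skin regions of the face and filter it; the band limits here simply exclude frequencies that are implausible for a resting or exercising heart.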

[125] PseudoMapTrainer: Learning Online Mapping without HD Maps

Christian Löwens, Thorben Funke, Jingchao Xie, Alexandru Paul Condurache

Main category: cs.CV

TL;DR: PseudoMapTrainer enables online mapping model training without ground-truth HD maps by using pseudo-labels from unlabeled sensor data via Gaussian splatting and 2D segmentation.

Details

Motivation: Existing online mapping approaches require expensive ground-truth HD maps for training, which are costly and lack geographic diversity for reliable generalization.

Method: Generates pseudo-labels by reconstructing road surface from multi-camera imagery using Gaussian splatting and pre-trained 2D segmentation network. Introduces mask-aware assignment algorithm and loss function to handle partially masked pseudo-labels.

Result: Enables training of online mapping models without any ground-truth maps and allows semi-supervised pre-training using large-scale unlabeled crowdsourced data.

Conclusion: Proposes a novel approach that eliminates dependency on expensive ground-truth HD maps while enabling effective training and generalization for online mapping models.

Abstract: Online mapping models show remarkable results in predicting vectorized maps from multi-view camera images only. However, all existing approaches still rely on ground-truth high-definition maps during training, which are expensive to obtain and often not geographically diverse enough for reliable generalization. In this work, we propose PseudoMapTrainer, a novel approach to online mapping that uses pseudo-labels generated from unlabeled sensor data. We derive those pseudo-labels by reconstructing the road surface from multi-camera imagery using Gaussian splatting and semantics of a pre-trained 2D segmentation network. In addition, we introduce a mask-aware assignment algorithm and loss function to handle partially masked pseudo-labels, allowing for the first time the training of online mapping models without any ground-truth maps. Furthermore, our pseudo-labels can be effectively used to pre-train an online model in a semi-supervised manner to leverage large-scale unlabeled crowdsourced data. The code is available at github.com/boschresearch/PseudoMapTrainer.
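The idea of handling partially masked pseudo-labels can be illustrated with a loss that simply ignores unobserved cells. This is only a sketch of the general principle; the paper's mask-aware assignment algorithm and loss are not specified in the summary, and `masked_l1_loss` and the toy grids are hypothetical.

```python
def masked_l1_loss(pred, target, valid_mask):
    """Mean absolute error over cells where valid_mask is truthy; masked cells contribute nothing."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, valid_mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += abs(p - t)
                count += 1
    # Avoid dividing by zero when the whole pseudo-label is masked.
    return total / count if count else 0.0

pred = [[0.9, 0.2], [0.4, 0.7]]
target = [[1.0, 0.0], [0.0, 1.0]]
mask = [[1, 1], [0, 1]]  # bottom-left cell is unobserved in the pseudo-label
loss = masked_l1_loss(pred, target, mask)
```

Normalizing by the number of valid cells rather than the grid size keeps the loss scale comparable between heavily and lightly masked samples.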

[126] Robust and Label-Efficient Deep Waste Detection

Hassan Abid, Khan Muhammad, Muhammad Haris Khan

Main category: cs.CV

TL;DR: This paper establishes strong baselines for AI-driven waste detection, benchmarks OVOD models on ZeroWaste dataset, introduces ensemble-based semi-supervised learning with soft pseudo-labeling, and achieves performance surpassing fully supervised training.

Details

Motivation: AI research in waste sorting lags behind commercial systems due to limited datasets and reliance on legacy object detectors, highlighting the need for advanced detection methods and scalable annotation pipelines.

Method: Benchmarked state-of-the-art Open-Vocabulary Object Detection models, fine-tuned transformer-based detectors, and proposed ensemble-based semi-supervised learning with spatial and consensus-aware weighting for soft pseudo-labeling.

Result: Achieved new baseline of 51.6 mAP with fine-tuned detectors, demonstrated that LLM-optimized prompts significantly enhance zero-shot accuracy, and achieved performance gains surpassing fully supervised training on unlabeled ZeroWaste-s subset.

Conclusion: The work establishes rigorous baselines, introduces robust pseudo-labeling pipeline, generates high-quality annotations, and systematically evaluates OVOD models, contributing significantly to waste detection research with available code for community use.

Abstract: Effective waste sorting is critical for sustainable recycling, yet AI research in this domain continues to lag behind commercial systems due to limited datasets and reliance on legacy object detectors. In this work, we advance AI-driven waste detection by establishing strong baselines and introducing an ensemble-based semi-supervised learning framework. We first benchmark state-of-the-art Open-Vocabulary Object Detection (OVOD) models on the real-world ZeroWaste dataset, demonstrating that while class-only prompts perform poorly, LLM-optimized prompts significantly enhance zero-shot accuracy. Next, to address domain-specific limitations, we fine-tune modern transformer-based detectors, achieving a new baseline of 51.6 mAP. We then propose a soft pseudo-labeling strategy that fuses ensemble predictions using spatial and consensus-aware weighting, enabling robust semi-supervised training. Applied to the unlabeled ZeroWaste-s subset, our pseudo-annotations achieve performance gains that surpass fully supervised training, underscoring the effectiveness of scalable annotation pipelines. Our work contributes to the research community by establishing rigorous baselines, introducing a robust ensemble-based pseudo-labeling pipeline, generating high-quality annotations for the unlabeled ZeroWaste-s subset, and systematically evaluating OVOD models under real-world waste sorting conditions. Our code is available at: https://github.com/h-abid97/robust-waste-detection.
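One way to picture soft pseudo-labeling from an ensemble is sketched below: boxes from several detectors are clustered by overlap, averaged with confidence weights, and scored by how many models agree. The exact spatial and consensus-aware weighting used in the paper is not given in the summary, so `fuse_detections`, the IoU threshold, and the scoring rule are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def fuse_detections(per_model_dets, iou_thr=0.5):
    """Greedily cluster boxes across models, then emit one soft pseudo-label per cluster."""
    num_models = len(per_model_dets)
    flat = [(m, box, score) for m, dets in enumerate(per_model_dets) for box, score in dets]
    flat.sort(key=lambda d: -d[2])  # most confident boxes seed the clusters
    clusters = []
    for det in flat:
        for cluster in clusters:
            if iou(cluster[0][1], det[1]) >= iou_thr:
                cluster.append(det)
                break
        else:
            clusters.append([det])
    fused = []
    for cluster in clusters:
        weight = sum(s for _, _, s in cluster)
        # Confidence-weighted average of the box coordinates.
        box = tuple(sum(b[i] * s for _, b, s in cluster) / weight for i in range(4))
        # Fraction of ensemble members that contributed to this cluster.
        consensus = len({m for m, _, _ in cluster}) / num_models
        fused.append((box, (weight / len(cluster)) * consensus))
    return fused

dets = [
    [((10, 10, 50, 50), 0.9), ((100, 100, 120, 120), 0.9)],  # model 0
    [((12, 10, 52, 50), 0.8)],                               # model 1
    [((10, 12, 50, 52), 0.7)],                               # model 2
]
fused = fuse_detections(dets)
```

In this toy input the box seen by all three models keeps its mean confidence, while the box seen by a single model is down-weighted to a third of its score.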

[127] Embedding Font Impression Word Tags Based on Co-occurrence

Yugo Kubota, Seiichi Uchida

Main category: cs.CV

TL;DR: Novel graph-based embedding method for font impression tags that outperforms standard word embeddings (BERT, CLIP) in impression-guided font generation by leveraging font shape-impression relationships.

Details

Motivation: Different font styles convey distinct impressions, and there's a close relationship between font shapes and impression tags. Standard word embedding methods fail to capture these specific shape-impression relationships effectively for font-related tasks.

Method: Construct a graph with nodes as impression tags and edges encoding co-occurrence relationships. Apply spectral embedding to obtain impression vectors that assign similar vectors to tags that frequently co-occur in font impressions.

Result: The proposed method performs better than BERT and CLIP in both qualitative and quantitative evaluations for impression-guided font generation and font retrieval tasks.

Conclusion: The graph-based spectral embedding approach effectively captures font impression relationships and is particularly useful for impression-based font generation and retrieval applications.

Abstract: Different font styles (i.e., font shapes) convey distinct impressions, indicating a close relationship between font shapes and word tags describing those impressions. This paper proposes a novel embedding method for impression tags that leverages these shape-impression relationships. For instance, our method assigns similar vectors to impression tags that frequently co-occur in order to represent impressions of fonts, whereas standard word embedding methods (e.g., BERT and CLIP) yield very different vectors. This property is particularly useful for impression-based font generation and font retrieval. Technically, we construct a graph whose nodes represent impression tags and whose edges encode co-occurrence relationships. Then, we apply spectral embedding to obtain the impression vectors for each tag. We compare our method with BERT and CLIP in qualitative and quantitative evaluations, demonstrating that our approach performs better in impression-guided font generation.
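The graph construction and spectral embedding described above can be sketched on a toy tag set. The example below is a dependency-free illustration, not the authors' code: it recovers only a single embedding dimension (the second eigenvector of the normalized adjacency, via power iteration with deflation), and the tag names and co-occurrence counts are invented.

```python
import math

def fiedler_embedding(tags, cooccurrence, iters=200):
    """One-dimensional spectral embedding of a weighted co-occurrence graph."""
    n = len(tags)
    index = {t: i for i, t in enumerate(tags)}
    adj = [[0.0] * n for _ in range(n)]
    for (a, b), w in cooccurrence.items():
        adj[index[a]][index[b]] += w
        adj[index[b]][index[a]] += w
    deg = [sum(row) for row in adj]  # every tag must co-occur with at least one other
    # Normalized adjacency shifted by the identity, so all eigenvalues lie in [0, 2].
    mat = [[adj[i][j] / math.sqrt(deg[i] * deg[j]) + (1.0 if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    # The top eigenvector is known in closed form: proportional to sqrt(degree).
    norm = math.sqrt(sum(deg))
    trivial = [math.sqrt(d) / norm for d in deg]
    vec = [float(i + 1) for i in range(n)]
    for _ in range(iters):
        proj = sum(v * t for v, t in zip(vec, trivial))
        vec = [v - proj * t for v, t in zip(vec, trivial)]  # deflate the trivial direction
        vec = [sum(mat[i][j] * vec[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(v * v for v in vec))
        vec = [v / length for v in vec]
    # Rescale by 1/sqrt(degree), as is common for normalized spectral embeddings.
    return {t: vec[index[t]] / math.sqrt(deg[index[t]]) for t in tags}

tags = ["elegant", "formal", "playful", "cute"]
cooc = {("elegant", "formal"): 5, ("formal", "playful"): 1, ("playful", "cute"): 5}
emb = fiedler_embedding(tags, cooc)
```

Tags that co-occur often end up on the same side of zero, which is the property the abstract highlights. In practice a symmetric eigensolver from a linear-algebra library would replace the power iteration and return several dimensions at once.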

[128] Deep Pre-trained Time Series Features for Tree Species Classification in the Dutch Forest Inventory

Takayuki Ishikawa, Carmelo Bonannella, Bas J. W. Lerink, Marc Rußwurm

Main category: cs.CV

TL;DR: Using pre-trained remote sensing foundation models with deep features significantly outperforms traditional hand-designed features for tree species classification in National Forest Inventories, achieving up to 10% higher accuracy with limited annotated data.

Details

Motivation: Traditional National Forest Inventory methods are labor-intensive and rely on hand-designed features. Remote sensing combined with machine learning offers opportunities for more frequent updates at larger scales, but current approaches use basic Random Forest classifiers with manual feature engineering.

Method: Extracted time-series data from Sentinel-1, Sentinel-2, ERA5, and SRTM sources using Google Earth Engine. Fine-tuned a publicly available remote sensing time series foundation model with deep features instead of traditional hand-designed harmonic features.

Result: Fine-tuning the pre-trained foundation model outperformed current state-of-the-art NFI classification in the Netherlands by up to 10% across all datasets, demonstrating superior performance with limited annotated data.

Conclusion: Deep AI features from pre-trained models significantly outperform classic hand-defined features for tree species classification, highlighting their potential for data-limited applications like NFI classification and offering an effective complement to existing forest inventory processes.

Abstract: National Forest Inventories (NFIs) serve as the primary source of forest information, providing crucial tree species distribution data. However, maintaining these inventories requires labor-intensive on-site campaigns. Remote sensing approaches, particularly when combined with machine learning, offer opportunities to update NFIs more frequently and at larger scales. While the use of Satellite Image Time Series has proven effective for distinguishing tree species through seasonal canopy reflectance patterns, current approaches rely primarily on Random Forest classifiers with hand-designed features and phenology-based metrics. Using deep features from available pre-trained remote sensing foundation models offers a complementary strategy. These pre-trained models leverage unannotated global data, are meant to be used for general-purpose applications, and can then be efficiently fine-tuned with smaller labeled datasets for specific classification tasks. This work systematically investigates how deep features improve tree species classification accuracy in the Netherlands with few annotated data. Data-wise, we extracted time-series data from Sentinel-1, Sentinel-2, ERA5, and SRTM sources using Google Earth Engine. Our results demonstrate that fine-tuning a publicly available remote sensing time series foundation model outperforms the current state-of-the-art in NFI classification in the Netherlands by a large margin of up to 10% across all datasets. This demonstrates that classic hand-defined harmonic features are too simple for this task and highlights the potential of using deep AI features for data-limited applications like NFI classification. By leveraging openly available satellite data and pre-trained models, this approach significantly improves classification accuracy compared to traditional methods and can effectively complement existing forest inventory processes.
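For readers unfamiliar with the hand-designed baseline, the sketch below shows what a simple harmonic feature set can look like: the mean, amplitude, and phase of the first annual harmonic of a reflectance series. The summary only names "harmonic features", so this formulation, the function `first_harmonic_features`, and the synthetic series are assumptions.

```python
import math

def first_harmonic_features(values, period=None):
    """Mean, amplitude and phase of the first harmonic of an evenly sampled series."""
    n = len(values)
    period = period or n  # by default, assume the series spans exactly one cycle
    mean = sum(values) / n
    # Fourier projections onto one cosine and one sine per cycle.
    a = 2.0 / n * sum(v * math.cos(2 * math.pi * i / period) for i, v in enumerate(values))
    b = 2.0 / n * sum(v * math.sin(2 * math.pi * i / period) for i, v in enumerate(values))
    return {"mean": mean, "amplitude": math.hypot(a, b), "phase": math.atan2(b, a)}

# Synthetic vegetation-index series: 36 observations over one year, peaking mid-season.
series = [0.5 + 0.2 * math.cos(2 * math.pi * (i - 18) / 36) for i in range(36)]
feats = first_harmonic_features(series)
```

Three numbers per band summarize an entire season, which hints at why such features may be too coarse compared with learned deep features.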

[129] Automated Classification of Normal and Atypical Mitotic Figures Using ConvNeXt V2: MIDOG 2025 Track 2

Yosuke Yamagishi, Shouhei Hanaoka

Main category: cs.CV

TL;DR: A ConvNeXt V2-based solution for binary classification of normal vs atypical mitotic figures in histopathology images, using center cropping and ensemble learning to address class imbalance and domain heterogeneity.

Details

Motivation: To develop an effective method for distinguishing normal mitotic figures (NMFs) from atypical mitotic figures (AMFs) in histopathological images, addressing challenges of severe class imbalance, high morphological variability, and domain heterogeneity across different tumor types, species, and scanners.

Method: Leverages ConvNeXt V2 base model with 60% center cropping preprocessing and 5-fold cross-validation ensemble strategy. Uses mixed precision training to optimize performance and computational efficiency.

Result: Achieved robust performance on the diverse MIDOG 2025 dataset, demonstrating the model’s effectiveness in handling the classification task despite the challenging domain variations.

Conclusion: The solution shows that modern convolutional architectures like ConvNeXt V2, combined with strategic preprocessing and ensemble strategies, are effective for mitotic figure subtyping while maintaining computational efficiency through careful architectural choices and training optimizations.

Abstract: This paper presents our solution for the MIDOG 2025 Challenge Track 2, which focuses on binary classification of normal mitotic figures (NMFs) versus atypical mitotic figures (AMFs) in histopathological images. Our approach leverages a ConvNeXt V2 base model with center cropping preprocessing and 5-fold cross-validation ensemble strategy. The method addresses key challenges including severe class imbalance, high morphological variability, and domain heterogeneity across different tumor types, species, and scanners. Through strategic preprocessing with 60% center cropping and mixed precision training, our model achieved robust performance on the diverse MIDOG 2025 dataset. The solution demonstrates the effectiveness of modern convolutional architectures for mitotic figure subtyping while maintaining computational efficiency through careful architectural choices and training optimizations.
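The preprocessing and ensembling steps can be sketched as follows. This assumes that "60% center cropping" means keeping 60% of each side length and that fold predictions are combined by plain averaging; neither detail is confirmed by the summary, and the probabilities are invented.

```python
def center_crop(image, fraction=0.6):
    """Keep the central `fraction` of each side of a row-major 2D image."""
    height, width = len(image), len(image[0])
    crop_h, crop_w = round(height * fraction), round(width * fraction)
    top, left = (height - crop_h) // 2, (width - crop_w) // 2
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def ensemble_probability(fold_probs):
    """Average the atypical-class probability predicted by each fold's model."""
    return sum(fold_probs) / len(fold_probs)

# A 10x10 dummy patch whose pixel value encodes its position (row * 10 + column).
patch = [[r * 10 + c for c in range(10)] for r in range(10)]
cropped = center_crop(patch)

# Hypothetical outputs of the five cross-validation models for one patch.
prob = ensemble_probability([0.62, 0.55, 0.71, 0.48, 0.64])
is_atypical = prob >= 0.5
```

Cropping toward the center focuses the model on the mitotic figure itself, and averaging over folds smooths out the disagreement of any single model.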

[130] Boosting Micro-Expression Analysis via Prior-Guided Video-Level Regression

Zizheng Guo, Bochao Zou, Yinuo Jia, Xiangyu Li, Huimin Ma

Main category: cs.CV

TL;DR: A prior-guided video-level regression method for micro-expression analysis that introduces scalable interval selection and synergistic optimization between spotting and recognition tasks, achieving state-of-the-art performance.

Details

Motivation: Existing micro-expression analysis methods rely on fixed window sizes and hard decisions, limiting their ability to capture complex temporal dynamics. Current video-level regression approaches still depend on manually predefined window-based interval decoding.

Method: Proposes a prior-guided video-level regression method with scalable interval selection strategy that considers temporal evolution, duration, and class distribution. Uses synergistic optimization framework where spotting and recognition tasks share parameters except classification heads.

Result: Achieves state-of-the-art performance with STRS of 0.0562 on CAS(ME)^3 and 0.2000 on SAMMLV datasets.

Conclusion: The method effectively addresses limitations of existing approaches by enabling precise spotting of onset, apex, and offset phases while making efficient use of limited data through complementary information sharing.

Abstract: Micro-expressions (MEs) are involuntary, low-intensity, and short-duration facial expressions that often reveal an individual’s genuine thoughts and emotions. Most existing ME analysis methods rely on window-level classification with fixed window sizes and hard decisions, which limits their ability to capture the complex temporal dynamics of MEs. Although recent approaches have adopted video-level regression frameworks to address some of these challenges, interval decoding still depends on manually predefined, window-based methods, leaving the issue only partially mitigated. In this paper, we propose a prior-guided video-level regression method for ME analysis. We introduce a scalable interval selection strategy that comprehensively considers the temporal evolution, duration, and class distribution characteristics of MEs, enabling precise spotting of the onset, apex, and offset phases. In addition, we introduce a synergistic optimization framework, in which the spotting and recognition tasks share parameters except for the classification heads. This fully exploits complementary information, makes more efficient use of limited data, and enhances the model’s capability. Extensive experiments on multiple benchmark datasets demonstrate the state-of-the-art performance of our method, with an STRS of 0.0562 on CAS(ME)$^3$ and 0.2000 on SAMMLV. The code is available at https://github.com/zizheng-guo/BoostingVRME.

[131] Quantitative Outcome-Oriented Assessment of Microsurgical Anastomosis

Luyin Hu, Soheil Gholami, George Dindelegan, Torstein R. Meling, Aude Billard

Main category: cs.CV

TL;DR: A quantitative image-processing framework for objective assessment of microsurgical anastomosis proficiency, replacing subjective evaluation with geometric error modeling.

Details

Motivation: Current microsurgical anastomosis assessment methods rely on subjective judgment which introduces biases and affects reliability of competence evaluation.

Method: Uses image-processing techniques and geometric modeling of errors with detection and scoring mechanism across three hospital datasets with participants at various skill levels.

Result: Geometric metrics effectively replicate expert raters’ scoring for the errors considered, enhancing efficiency and reliability of assessment.

Conclusion: The quantitative framework advances microsurgical training protocols by providing objective, reliable proficiency assessment.

Abstract: Microsurgical anastomosis demands exceptional dexterity and visuospatial skills, underscoring the importance of comprehensive training and precise outcome assessment. Currently, methods such as the outcome-oriented anastomosis lapse index are used to evaluate this procedure. However, they often rely on subjective judgment, which can introduce biases that affect the reliability and efficiency of the assessment of competence. Leveraging three datasets from hospitals with participants at various levels, we introduce a quantitative framework that uses image-processing techniques for objective assessment of microsurgical anastomoses. The approach uses geometric modeling of errors along with a detection and scoring mechanism, enhancing the efficiency and reliability of microsurgical proficiency assessment and advancing training protocols. The results show that the geometric metrics effectively replicate expert raters’ scoring for the errors considered in this work.

[132] Harnessing Meta-Learning for Controllable Full-Frame Video Stabilization

Muhammad Kashif Ali, Eun Woo Im, Dongjin Kim, Tae Hyun Kim, Vivek Gupta, Haonan Luo, Tianrui Li

Main category: cs.CV

TL;DR: A novel test-time adaptation method for video stabilization that rapidly adapts pixel-level synthesis models to each input video using low-level visual cues, improving stability and visual quality with minimal adaptation steps.

Details

Motivation: Pixel-level synthesis video stabilization methods struggle with robust generalization due to diverse motion profiles and visual content in different videos. Fixed parameters make it difficult to handle this variability effectively.

Method: Proposes rapid adaptation at test time using low-level visual cues, a jerk localization module to identify unstable segments, and targeted adaptation strategy focusing on high-jerk areas for efficient stabilization with fewer adaptation steps.

Result: Significant performance gains even with single adaptation pass, consistently improves various full-frame synthesis models in both qualitative and quantitative metrics across diverse real-world datasets.

Conclusion: The method enables modern stabilizers to surpass state-of-the-art approaches while maintaining full-frame output and providing user control similar to classical methods, demonstrating versatility and effectiveness.

Abstract: Video stabilization remains a fundamental problem in computer vision, and pixel-level synthesis solutions, which synthesize full-frame outputs, add to the complexity of this task. These methods aim to enhance stability while synthesizing full-frame videos, but the inherent diversity in motion profiles and visual content present in each video sequence makes robust generalization with fixed parameters difficult. To address this, we present a novel method that improves pixel-level synthesis video stabilization methods by rapidly adapting models to each input video at test time. The proposed approach takes advantage of low-level visual cues available during inference to improve both the stability and visual quality of the output. Notably, the proposed rapid adaptation achieves significant performance gains even with a single adaptation pass. We further propose a jerk localization module and a targeted adaptation strategy, which focuses the adaptation on high-jerk segments to maximize stability with fewer adaptation steps. The proposed methodology enables modern stabilizers to surpass longstanding SOTA approaches while maintaining the full-frame nature of modern methods and offering users control mechanisms akin to classical approaches. Extensive experiments on diverse real-world datasets demonstrate the versatility of the proposed method. Our approach consistently improves the performance of various full-frame synthesis models in both qualitative and quantitative terms, including results on downstream applications.
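Jerk is the third time derivative of position, so high-jerk segments can be located with a third finite difference of the camera trajectory. The sketch below illustrates only that definition on a 1D path; the paper's jerk localization module is not described in detail, and the function names, threshold, and synthetic path are assumptions.

```python
def jerk_magnitudes(positions, fps):
    """Absolute third finite difference of a 1D camera trajectory, in units per second cubed."""
    dt = 1.0 / fps
    return [
        abs(positions[i + 3] - 3 * positions[i + 2] + 3 * positions[i + 1] - positions[i]) / dt ** 3
        for i in range(len(positions) - 3)
    ]

def high_jerk_frames(positions, fps, threshold):
    """Start indices of the 4-frame windows whose jerk exceeds `threshold`."""
    return [i for i, j in enumerate(jerk_magnitudes(positions, fps)) if j > threshold]

# Smooth pan with constant acceleration, plus a single shake injected at frame 20.
fps = 30
path = [0.001 * i * i for i in range(40)]
path[20] += 0.5
flagged = high_jerk_frames(path, fps, threshold=1000.0)
```

A constant-acceleration pan has zero jerk, so only the windows touching the injected shake are flagged; adaptation effort could then be concentrated on those frames.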

Yuexuan Xia, Benteng Ma, Jiang He, Zhiyong Wang, Qi Dou, Yong Xia

Main category: cs.CV

TL;DR: DualFairVL is a multimodal prompt-learning framework that jointly debiases and aligns vision-language representations for fair medical diagnosis across demographic groups under distribution shifts.

Details

Motivation: Existing debiasing approaches address vision and text modalities independently, leaving residual cross-modal misalignment and fairness gaps in medical imaging diagnosis.

Method: Uses parallel dual-branch architecture to separate sensitive and target attributes, constructs orthogonal text anchors via linear projections, employs hypernetwork for instance-aware visual prompts, and applies prototype-based regularization for feature separation.

Result: Achieves state-of-the-art fairness and accuracy on eight medical imaging datasets across four modalities under both in- and out-of-distribution settings, outperforming baselines with only 3.6M trainable parameters.

Conclusion: DualFairVL effectively addresses cross-modal misalignment in fairness-aware medical diagnosis and demonstrates strong generalization capabilities with minimal parameter overhead.

Abstract: Ensuring fairness across demographic groups in medical diagnosis is essential for equitable healthcare, particularly under distribution shifts caused by variations in imaging equipment and clinical practice. Vision-language models (VLMs) exhibit strong generalization, and text prompts encode identity attributes, enabling explicit identification and removal of sensitive directions. However, existing debiasing approaches typically address vision and text modalities independently, leaving residual cross-modal misalignment and fairness gaps. To address this challenge, we propose DualFairVL, a multimodal prompt-learning framework that jointly debiases and aligns cross-modal representations. DualFairVL employs a parallel dual-branch architecture that separates sensitive and target attributes, enabling disentangled yet aligned representations across modalities. Approximately orthogonal text anchors are constructed via linear projections, guiding cross-attention mechanisms to produce fused features. A hypernetwork further disentangles attribute-related information and generates instance-aware visual prompts, which encode dual-modal cues for fairness and robustness. Prototype-based regularization is applied in the visual branch to enforce separation of sensitive features and strengthen alignment with textual anchors. Extensive experiments on eight medical imaging datasets across four modalities show that DualFairVL achieves state-of-the-art fairness and accuracy under both in- and out-of-distribution settings, outperforming full fine-tuning and parameter-efficient baselines with only 3.6M trainable parameters. Code will be released upon publication.
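The notion of removing a sensitive direction through a linear projection can be shown in a few lines. This sketch uses an exact orthogonal projection on toy 3D vectors as one possible instance; the paper describes approximately orthogonal anchors and does not give its projection in the summary, so `project_out` and the vectors are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def project_out(vector, direction):
    """Remove the component of `vector` that lies along `direction`."""
    scale = dot(vector, direction) / dot(direction, direction)
    return [v - scale * d for v, d in zip(vector, direction)]

# Toy embeddings: a direction encoding a sensitive attribute and a diagnosis prompt embedding.
sensitive_dir = [1.0, 1.0, 0.0]
target_emb = [2.0, 0.0, 1.0]
anchor = project_out(target_emb, sensitive_dir)
```

After the projection the anchor carries no component along the sensitive direction, so similarity to it cannot be driven by that attribute alone.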

[134] DQEN: Dual Query Enhancement Network for DETR-based HOI Detection

Zhehao Li, Chong Wang, Yi Chen, Yinghao Lu, Jiangbo Qian, Jiong Wang, Jiafei Wu

Main category: cs.CV

TL;DR: DQEN enhances HOI detection by improving object and interaction queries with object-aware features and CLIP-based semantic fusion, achieving competitive results on standard datasets.

Details

Motivation: Randomly initialized queries in DETR-based HOI detection models lead to vague representations that limit effectiveness, while humans in HOI categories are fixed but objects and interactions are variable.

Method: Proposes Dual Query Enhancement Network (DQEN) with object-aware encoder features for object queries, Interaction Semantic Fusion module using CLIP for interaction queries, and Auxiliary Prediction Unit for better interaction feature representation.

Result: Achieves competitive performance on both HICO-Det and V-COCO datasets.

Conclusion: The proposed query enhancement approach effectively improves HOI detection by providing more meaningful query representations through object-aware features and semantic fusion.

Abstract: Human-Object Interaction (HOI) detection focuses on localizing human-object pairs and recognizing their interactions. Recently, the DETR-based framework has been widely adopted in HOI detection. In DETR-based HOI models, queries with clear meaning are crucial for accurately detecting HOIs. However, prior works have typically relied on randomly initialized queries, leading to vague representations that limit the model’s effectiveness. Meanwhile, humans in the HOI categories are fixed, while objects and their interactions are variable. Therefore, we propose a Dual Query Enhancement Network (DQEN) to enhance object and interaction queries. Specifically, object queries are enhanced with object-aware encoder features, enabling the model to focus more effectively on humans interacting with objects in an object-aware way. On the other hand, we design a novel Interaction Semantic Fusion module to exploit the HOI candidates that are promoted by the CLIP model. Semantic features are extracted to enhance the initialization of interaction queries, thereby improving the model’s ability to understand interactions. Furthermore, we introduce an Auxiliary Prediction Unit aimed at improving the representation of interaction features. Our proposed method achieves competitive performance on both the HICO-Det and the V-COCO datasets. The source code is available at https://github.com/lzzhhh1019/DQEN.

[135] Interpretable Decision-Making for End-to-End Autonomous Driving

Mona Mirzaie, Bodo Rosenhahn

Main category: cs.CV

TL;DR: This paper presents a method to enhance interpretability in end-to-end autonomous driving models by using loss functions that generate sparse and localized feature maps, allowing visualization of which image regions influence control decisions.

Details

Motivation: End-to-end autonomous driving approaches are challenging to interpret due to deep neural networks with non-linear decision boundaries, making it difficult to understand AI-driven decisions in complex urban scenarios.

Method: Proposed loss functions that promote interpretability by generating sparse and localized feature maps, allowing identification of image regions contributing to control commands. Conducted ablation studies on feature extraction and validated on CARLA benchmarks.

Result: The approach improves interpretability while reducing infractions, achieving safer high-performance driving. The monocular non-ensemble model surpassed top CARLA Leaderboard approaches with lower infraction scores and highest route completion rate.

Conclusion: The method successfully enhances interpretability in autonomous driving models while maintaining or improving performance, demonstrating that interpretable AI can yield safer driving behavior without sacrificing route completion capabilities.

Abstract: Trustworthy AI is mandatory for the broad deployment of autonomous vehicles. Although end-to-end approaches derive control commands directly from raw data, interpreting these decisions remains challenging, especially in complex urban scenarios. This is mainly attributed to very deep neural networks with non-linear decision boundaries, making it challenging to grasp the logic behind AI-driven decisions. This paper presents a method to enhance interpretability while optimizing control commands in autonomous driving. To address this, we propose loss functions that promote the interpretability of our model by generating sparse and localized feature maps. The feature activations allow us to explain which image regions contribute to the predicted control command. We conduct comprehensive ablation studies on the feature extraction step and validate our method on the CARLA benchmarks. We also demonstrate that our approach improves interpretability, which correlates with reducing infractions, yielding a safer, high-performance driving model. Notably, our monocular, non-ensemble model surpasses the top-performing approaches from the CARLA Leaderboard by achieving lower infraction scores and the highest route completion rate, all while ensuring interpretability.

[136] Event-Enriched Image Analysis Grand Challenge at ACM Multimedia 2025

Thien-Phuc Tran, Minh-Quang Nguyen, Minh-Triet Tran, Tam V. Nguyen, Trong-Le Do, Duy-Nam Ly, Viet-Tham Huynh, Khanh-Duy Le, Mai-Khiem Tran, Trung-Nghia Le

Main category: cs.CV

TL;DR: EVENTA Grand Challenge introduces first large-scale benchmark for event-level multimodal understanding, addressing limitations of traditional captioning/retrieval by integrating contextual, temporal and semantic information.

Details

Motivation: Traditional captioning and retrieval tasks focus on surface-level recognition but overlook contextual and semantic dimensions that define real-world events.

Method: Built on OpenEvents V1 dataset with two tracks: Event-Enriched Image Retrieval and Captioning, and Event-Based Image Retrieval. Evaluation through Public and Private Test phases with 45 teams from 6 countries.

Result: Successful challenge with top three teams presenting solutions at ACM Multimedia 2025. Established foundation for context-aware, narrative-driven multimedia AI.

Conclusion: EVENTA provides comprehensive benchmark for event-level understanding with applications in journalism, media analysis, cultural archiving, and accessibility.

Abstract: The Event-Enriched Image Analysis (EVENTA) Grand Challenge, hosted at ACM Multimedia 2025, introduces the first large-scale benchmark for event-level multimodal understanding. Traditional captioning and retrieval tasks largely focus on surface-level recognition of people, objects, and scenes, often overlooking the contextual and semantic dimensions that define real-world events. EVENTA addresses this gap by integrating contextual, temporal, and semantic information to capture the who, when, where, what, and why behind an image. Built upon the OpenEvents V1 dataset, the challenge features two tracks: Event-Enriched Image Retrieval and Captioning, and Event-Based Image Retrieval. A total of 45 teams from six countries participated, with evaluation conducted through Public and Private Test phases to ensure fairness and reproducibility. The top three teams were invited to present their solutions at ACM Multimedia 2025. EVENTA establishes a foundation for context-aware, narrative-driven multimedia AI, with applications in journalism, media analysis, cultural archiving, and accessibility. Further details about the challenge are available at the official homepage: https://ltnghia.github.io/eventa/eventa-2025.

[137] Preliminary Study on Space Utilization and Emergent Behaviors of Group vs. Single Pedestrians in Real-World Trajectories

Amartaivan Sanjjamts, Morita Hiroshi

Main category: cs.CV

TL;DR: A framework for classifying pedestrian groups vs individuals using Transformer models and analyzing spatial/behavioral differences through various metrics.

Details

Motivation: To analyze differences in space utilization and behavioral patterns between group and single pedestrians in real-world trajectory data.

Method: Segment trajectories into time bins, use Transformer-based pair classification to identify groups, and apply spatial/behavioral metrics including convex hull area, velocity changes, and encounter typology.

Result: Establishes a classification pipeline and dataset structure for scalable analysis across different sequence lengths (60, 100, 200 frames).

Conclusion: This initial framework provides groundwork for future quantitative analysis of pedestrian interactions, with implications for crowd simulation and space design validation.

Abstract: This study presents an initial framework for distinguishing group and single pedestrians based on real-world trajectory data, with the aim of analyzing their differences in space utilization and emergent behavioral patterns. By segmenting pedestrian trajectories into fixed time bins and applying a Transformer-based pair classification model, we identify cohesive groups and isolate single pedestrians through a structured sequence-based filtering process. To prepare for deeper analysis, we establish a comprehensive metric framework incorporating both spatial and behavioral dimensions. Spatial utilization metrics include convex hull area, smallest enclosing circle radius, and heatmap-based spatial densities to characterize how different pedestrian types occupy and interact with space. Behavioral metrics such as velocity change, motion angle deviation, clearance radius, and trajectory straightness are designed to capture local adaptations and responses during interactions. Furthermore, we introduce a typology of encounter types (single-to-single, single-to-group, and group-to-group) to categorize and later quantify different interaction scenarios. Although this version focuses primarily on the classification pipeline and dataset structuring, it establishes the groundwork for scalable analysis across sequence lengths of 60, 100, and 200 frames. Future versions will incorporate complete quantitative analysis of the proposed metrics and their implications for pedestrian simulation and space design validation in crowd dynamics research.
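Two of the metrics named above, convex hull area and trajectory straightness, can be sketched in a few lines. This is only an illustration under common textbook definitions (hull area via the shoelace formula, straightness as net displacement divided by path length); the paper's exact formulations are not given here, and the function names `convex_hull`, `hull_area`, and `straightness` are hypothetical.

```python
from math import hypot


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive means a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def hull_area(points):
    """Area of the convex hull of 2D points (shoelace formula)."""
    hull = convex_hull(points)
    if len(hull) < 3:
        return 0.0
    twice_area = 0.0
    for i in range(len(hull)):
        x1, y1 = hull[i]
        x2, y2 = hull[(i + 1) % len(hull)]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def straightness(trajectory):
    """Net displacement divided by travelled path length, in [0, 1]."""
    path = sum(
        hypot(bx - ax, by - ay)
        for (ax, ay), (bx, by) in zip(trajectory, trajectory[1:])
    )
    if path == 0:
        return 0.0  # assumption: a stationary track is scored 0.0 by convention
    (x0, y0), (xn, yn) = trajectory[0], trajectory[-1]
    return hypot(xn - x0, yn - y0) / path
```

For a group, `hull_area` would be called on the members' positions in one frame (as `(x, y)` tuples), while `straightness` takes one pedestrian's positions over a time bin.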

[138] The point is the mask: scaling coral reef segmentation with weak supervision

Matteo Contini, Victor Illien, Sylvain Poulain, Serge Bernard, Julien Barde, Sylvain Bonhommeau, Alexis Joly

Main category: cs.CV

TL;DR: A multi-scale weakly supervised semantic segmentation framework that transfers fine-scale ecological information from underwater imagery to drone-based aerial data for large-scale coral reef mapping with minimal manual annotation.

Details

Motivation: Monitoring coral reefs at large spatial scales is essential for ecosystem health assessment but challenging due to limited resolution of drone imagery and high cost of pixel-level annotations, which limits scalability of deep learning methods.

Method: Combines classification-based supervision, spatial interpolation and self-distillation techniques to transfer fine-scale ecological information from underwater imagery to aerial data with minimal manual annotation.

Result: Enables large-area segmentation of coral morphotypes from drone imagery and demonstrates flexibility for integrating new classes.

Conclusion: Presents a scalable, cost-effective methodology for high-resolution reef monitoring that combines low-cost data collection, weakly supervised deep learning and multi-scale remote sensing.

Abstract: Monitoring coral reefs at large spatial scales remains an open challenge, essential for assessing ecosystem health and informing conservation efforts. While drone-based aerial imagery offers broad spatial coverage, its limited resolution makes it difficult to reliably distinguish fine-scale classes, such as coral morphotypes. At the same time, obtaining pixel-level annotations over large spatial extents is costly and labor-intensive, limiting the scalability of deep learning-based segmentation methods for aerial imagery. We present a multi-scale weakly supervised semantic segmentation framework that addresses this challenge by transferring fine-scale ecological information from underwater imagery to aerial data. Our method enables large-scale coral reef mapping from drone imagery with minimal manual annotation, combining classification-based supervision, spatial interpolation and self-distillation techniques. We demonstrate the efficacy of the approach, enabling large-area segmentation of coral morphotypes and demonstrating flexibility for integrating new classes. This study presents a scalable, cost-effective methodology for high-resolution reef monitoring, combining low-cost data collection, weakly supervised deep learning and multi-scale remote sensing.

[139] Generative AI in Map-Making: A Technical Exploration and Its Implications for Cartographers

Claudio Affolter, Sidi Wu, Yizi Chen, Lorenz Hurni

Main category: cs.CV

TL;DR: First AI model that generates accurate maps in controlled styles using vector data guidance and text prompts, integrated into a web application for democratized map-making.

Details

Motivation: Traditional GIS-based map-making requires domain expertise and is time-consuming, especially for repetitive tasks. Generative AI offers automation potential but struggles with spatial accuracy and semantic control.

Method: Integration of vector data to guide map generation with textual prompts for style specification. Developed web application for usability and conducted user study with professional cartographers.

Result: Successfully generated accurate maps in controlled styles. User study showed potential for helping both non-experts and professionals create maps more efficiently.

Conclusion: Demonstrates the potential of GenAI models in democratizing map-making while outlining technical improvements needed and emphasizing the evolving role of cartographers in AI-assisted workflows.

Abstract: Traditional map-making relies heavily on Geographic Information Systems (GIS), requiring domain expertise and being time-consuming, especially for repetitive tasks. Recent advances in generative AI (GenAI), particularly image diffusion models, offer new opportunities for automating and democratizing the map-making process. However, these models struggle with accurate map creation due to limited control over spatial composition and semantic layout. To address this, we integrate vector data to guide map generation in different styles, specified by the textual prompts. Our model is the first to generate accurate maps in controlled styles, and we have integrated it into a web application to improve its usability and accessibility. We conducted a user study with professional cartographers to assess the fidelity of generated maps, the usability of the web application, and the implications of ever-emerging GenAI in map-making. The findings have suggested the potential of our developed application and, more generally, the GenAI models in helping both non-expert users and professionals in creating maps more efficiently. We have also outlined further technical improvements and emphasized the new role of cartographers to advance the paradigm of AI-assisted map-making.

[140] Enhancing compact convolutional transformers with super attention

Simpenzwe Honore Leandre, Natenaile Asmamaw Shiferaw, Dillip Rout

Main category: cs.CV

TL;DR: A vision model using token mixing, sequence-pooling, and convolutional tokenizers achieves SOTA performance on CIFAR100 with 46.29% top-1 accuracy, while being more efficient than SDPA transformers at 60% model size.

Details

Motivation: To create an efficient vision model that performs well in fixed context-length tasks without relying on complex techniques like data augmentation, positional embeddings, or learning rate scheduling.

Method: Proposes a vision architecture combining token mixing, sequence-pooling, and convolutional tokenizers to process visual data efficiently while maintaining high performance.

Result: Achieved 46.29% top-1 accuracy (vs 36.50% baseline) and 76.31% top-5 accuracy (vs 66.33% baseline) on CIFAR100. More efficient than SDPA transformers when context length < embedding dimension, with only 60% model size. High training stability without additional techniques.

Conclusion: The proposed architecture demonstrates superior performance and efficiency for fixed context-length vision tasks while maintaining training stability without complex auxiliary methods.

Abstract: In this paper, we propose a vision model that adopts token mixing, sequence-pooling, and convolutional tokenizers to achieve state-of-the-art performance and efficient inference in fixed context-length tasks. On the CIFAR100 benchmark, our model significantly improves the baseline top-1 and top-5 validation accuracy from 36.50% to 46.29% and from 66.33% to 76.31%, respectively, while being only 60% the size of Scaled Dot Product Attention (SDPA) transformers and more efficient than them when the context length is less than the embedding dimension. In addition, the architecture demonstrates high training stability and does not rely on techniques such as data augmentation (e.g., mixup), positional embeddings, or learning rate scheduling. We make our code available on GitHub.
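As a rough illustration of the sequence-pooling step, the sketch below assumes it refers to the attention-style pooling used in Compact Convolutional Transformers: each token gets a scalar score from a learned linear map, the scores are softmax-normalized over the sequence, and the output is the weighted sum of tokens. The function name `seq_pool` and the parameters `w` and `b` are hypothetical, and plain Python lists stand in for tensors.

```python
from math import exp


def seq_pool(tokens, w, b=0.0):
    """Pool a sequence of token vectors into one vector.

    tokens: list of n token vectors, each of length d
    w, b:   weights (length d) and bias of the scoring layer (d -> 1)
    """
    # One scalar importance score per token.
    scores = [sum(wi * xi for wi, xi in zip(w, tok)) + b for tok in tokens]

    # Numerically stable softmax over the sequence dimension.
    top = max(scores)
    exps = [exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of tokens: (1 x n) @ (n x d) -> (1 x d).
    dim = len(tokens[0])
    return [sum(wt * tok[j] for wt, tok in zip(weights, tokens)) for j in range(dim)]
```

With an all-zero `w` the weights are uniform and the result reduces to mean pooling; a trained `w` lets the model emphasize informative tokens instead of relying on a class token.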

[141] USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning

Shaojin Wu, Mengqi Huang, Yufeng Cheng, Wenxu Wu, Jiahe Tian, Yiming Luo, Fei Ding, Qian He

Main category: cs.CV

TL;DR: USO is a unified framework that combines style-driven and subject-driven image generation through disentangled learning of content and style features, achieving state-of-the-art performance in both style similarity and subject consistency.

Details

Motivation: Existing approaches treat style-driven and subject-driven generation as separate tasks, creating an artificial antagonism between style similarity and subject consistency. The authors argue both objectives can be unified under a single framework through content-style disentanglement.

Method: USO uses a large-scale triplet dataset, a disentangled learning scheme with style-alignment training and content-style disentanglement training, and incorporates a style reward-learning paradigm (SRL) to enhance performance. They also release USO-Bench for joint evaluation.

Result: Extensive experiments show USO achieves state-of-the-art performance among open-source models in both subject consistency and style similarity metrics.

Conclusion: Style-driven and subject-driven generation can be effectively unified through proper content-style disentanglement and re-composition, with USO demonstrating superior performance across both dimensions simultaneously.

Abstract: Existing literature typically treats style-driven and subject-driven generation as two disjoint tasks: the former prioritizes stylistic similarity, whereas the latter insists on subject consistency, resulting in an apparent antagonism. We argue that both objectives can be unified under a single framework because they ultimately concern the disentanglement and re-composition of content and style, a long-standing theme in style-driven research. To this end, we present USO, a Unified Style-Subject Optimized customization model. First, we construct a large-scale triplet dataset consisting of content images, style images, and their corresponding stylized content images. Second, we introduce a disentangled learning scheme that simultaneously aligns style features and disentangles content from style through two complementary objectives, style-alignment training and content-style disentanglement training. Third, we incorporate a style reward-learning paradigm denoted as SRL to further enhance the model’s performance. Finally, we release USO-Bench, the first benchmark that jointly evaluates style similarity and subject fidelity across multiple metrics. Extensive experiments demonstrate that USO achieves state-of-the-art performance among open-source models along both dimensions of subject consistency and style similarity. Code and model: https://github.com/bytedance/USO

[142] Can we make NeRF-based visual localization privacy-preserving?

Maxime Pietrantoni, Martin Humenberger, Torsten Sattler, Gabriela Csurka

Main category: cs.CV

TL;DR: Privacy-preserving NeRF for visual localization using segmentation supervision instead of RGB images to protect sensitive scene details while maintaining localization accuracy.

Details

Motivation: NeRF-based visual localization methods encode fine scene details in their geometry representations, creating privacy vulnerabilities when deployed in cloud-based services where sensitive information could be recovered.

Method: Proposes ppNeSF (Privacy-Preserving Neural Segmentation Field), a NeRF variant trained with self-supervised segmentation supervision instead of RGB images, making segmentation labels coarse enough to obscure identifiable details while maintaining discriminative 3D features.

Result: The method shows that standard NeRFs trained with photometric losses are vulnerable to privacy attacks, while ppNeSF achieves state-of-the-art visual localization results while preserving privacy.

Conclusion: Segmentation-based supervision provides an effective approach for privacy-preserving visual localization with NeRF-based representations, balancing privacy protection with localization performance.

Abstract: Visual localization (VL) is the task of estimating the camera pose in a known scene. VL methods can be distinguished, among other criteria, by how they represent the scene, e.g., explicitly through a (sparse) point cloud or a collection of images, or implicitly through the weights of a neural network. Recently, NeRF-based methods have become popular for VL. While NeRFs offer high-quality novel view synthesis, they inadvertently encode fine scene details, raising privacy concerns when deployed in cloud-based localization services as sensitive information could be recovered. In this paper, we tackle this challenge on two fronts. First, we propose a new protocol to assess privacy-preservation of NeRF-based representations. We show that NeRFs trained with photometric losses store fine-grained details in their geometry representations, making them vulnerable to privacy attacks, even if the head that predicts colors is removed. Second, we propose ppNeSF (Privacy-Preserving Neural Segmentation Field), a NeRF variant trained with segmentation supervision instead of RGB images. These segmentation labels are learned in a self-supervised manner, ensuring they are coarse enough to obscure identifiable scene details while remaining discriminative in 3D. The segmentation space of ppNeSF can be used for accurate visual localization, yielding state-of-the-art results.

[143] Enhancing Document VQA Models via Retrieval-Augmented Generation

Eric López, Artemis Llabrés, Ernest Valveny

Main category: cs.CV

TL;DR: RAG significantly improves Document VQA performance by retrieving relevant segments before answer generation, with text-based retrieval achieving +22.5 ANLS improvement over baseline methods.

Details

Motivation: Current Document VQA systems either concatenate all pages (memory-intensive) or rely on large VLMs, creating efficiency problems for multi-page documents. RAG offers a memory-efficient alternative through selective evidence retrieval.

Method: Systematically evaluated RAG variants: text-based retrieval using OCR tokens and purely visual retrieval without OCR. Tested across multiple models and benchmarks (MP-DocVQA, DUDE, InfographicVQA) with ablation studies on retrieval components.

Result: Text-centric RAG improved baseline by up to +22.5 ANLS, visual variant achieved +5.0 ANLS without text extraction. Retrieval and reranking drove most gains, while layout-guided chunking strategies proved ineffective.

Conclusion: Careful evidence selection through RAG consistently boosts accuracy across model sizes and benchmarks, demonstrating practical value for real-world Document VQA applications.

Abstract: Document Visual Question Answering (Document VQA) must cope with documents that span dozens of pages, yet leading systems still concatenate every page or rely on very large vision-language models, both of which are memory-hungry. Retrieval-Augmented Generation (RAG) offers an attractive alternative, first retrieving a concise set of relevant segments before generating answers from this selected evidence. In this paper, we systematically evaluate the impact of incorporating RAG into Document VQA through different retrieval variants - text-based retrieval using OCR tokens and purely visual retrieval without OCR - across multiple models and benchmarks. Evaluated on the multi-page datasets MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the “concatenate-all-pages” baseline by up to +22.5 ANLS, while the visual variant achieves +5.0 ANLS improvement without requiring any text extraction. An ablation confirms that retrieval and reranking components drive most of the gain, whereas the layout-guided chunking strategy - proposed in several recent works to leverage page structure - fails to help on these datasets. Our experiments demonstrate that careful evidence selection consistently boosts accuracy across multiple model sizes and multi-page benchmarks, underscoring its practical value for real-world Document VQA.
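The gains above are reported in ANLS (Average Normalized Levenshtein Similarity), the usual Document VQA metric. The sketch below follows its standard definition: per question, take the best score over the accepted answers, where a score is one minus the normalized edit distance, or zero once that distance reaches a threshold of 0.5. Lowercasing and stripping whitespace are assumed preprocessing steps, and the function names are illustrative. The result lies in [0, 1], whereas the figures above are on a 0-100 scale.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Average Normalized Levenshtein Similarity.

    predictions: list of predicted answer strings
    references:  list of lists of accepted ground-truth answers
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            # Answers that are too far off receive no partial credit.
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

The threshold makes the metric tolerant of small OCR-style errors while still scoring unrelated answers as zero.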

[144] Ask Me Again Differently: GRAS for Measuring Bias in Vision Language Models on Gender, Race, Age, and Skin Tone

Shaivi Malik, Hasnat Md Abdullah, Sriparna Saha, Amit Sheth

Main category: cs.CV

TL;DR: GRAS benchmark reveals significant demographic biases in Vision Language Models across gender, race, age, and skin tone, with the least biased model attaining a GRAS Bias Score of only 2 out of 100.

Details

Motivation: As Vision Language Models become critical for real-world applications, understanding and quantifying their demographic biases is essential to ensure fair and equitable performance across diverse populations.

Method: Introduced GRAS benchmark with diverse demographic coverage and proposed GRAS Bias Score metric. Evaluated five state-of-the-art VLMs using visual question answering, emphasizing the need to consider multiple question formulations for accurate bias assessment.

Result: All five benchmarked VLMs showed concerning bias levels, with the least biased model achieving only 2 out of 100 on the GRAS Bias Score. The study also revealed that proper bias evaluation requires testing multiple question formulations in VQA tasks.

Conclusion: Current VLMs exhibit substantial demographic biases that need to be addressed. The GRAS benchmark provides a comprehensive framework for quantifying and understanding these biases, with code, data, and results made publicly available to facilitate further research.

Abstract: As Vision Language Models (VLMs) become integral to real-world applications, understanding their demographic biases is critical. We introduce GRAS, a benchmark for uncovering demographic biases in VLMs across gender, race, age, and skin tone, offering the most diverse coverage to date. We further propose the GRAS Bias Score, an interpretable metric for quantifying bias. We benchmark five state-of-the-art VLMs and reveal concerning bias levels, with the least biased model attaining a GRAS Bias Score of only 2 out of 100. Our findings also reveal a methodological insight: evaluating bias in VLMs with visual question answering (VQA) requires considering multiple formulations of a question. Our code, data, and evaluation results are publicly available.

[145] RoofSeg: An edge-aware transformer-based network for end-to-end roof plane segmentation

Siyuan You, Guozheng Xu, Pengwei Zhou, Qiwen Jin, Jian Yao, Li Li

Main category: cs.CV

TL;DR: RoofSeg - an edge-aware transformer network for end-to-end roof plane segmentation from LiDAR point clouds, addressing feature discriminability at edges and incorporating geometric constraints.

Details

Motivation: Current deep learning approaches for roof plane segmentation have three main problems: not truly end-to-end, poor feature discriminability near edges, and insufficient consideration of planar geometric characteristics during training.

Method: A transformer encoder-decoder framework with learnable plane queries, Edge-Aware Mask Module (EAMM) for edge refinement, adaptive weighting strategy in mask loss, and plane geometric loss for training constraints.

Result: The proposed RoofSeg network achieves improved segmentation accuracy, particularly in edge regions, through geometric-aware feature enhancement and end-to-end optimization.

Conclusion: RoofSeg provides a comprehensive solution for roof plane segmentation by addressing edge discriminability issues and incorporating geometric constraints in a truly end-to-end deep learning framework.

Abstract: Roof plane segmentation is one of the key procedures for reconstructing three-dimensional (3D) building models at levels of detail (LoD) 2 and 3 from airborne light detection and ranging (LiDAR) point clouds. The majority of current approaches for roof plane segmentation rely on manually designed or learned features followed by specifically designed geometric clustering strategies. Because the learned features are more powerful than the manually designed features, the deep learning-based approaches usually perform better than the traditional approaches. However, the current deep learning-based approaches have three unsolved problems. The first is that most of them are not truly end-to-end, so the plane segmentation results may not be optimal. The second is that the point feature discriminability near the edges is relatively low, leading to inaccurate planar edges. The third is that the planar geometric characteristics are not sufficiently considered to constrain the network training. To solve these issues, a novel edge-aware transformer-based network, named RoofSeg, is developed for segmenting roof planes from LiDAR point clouds in a truly end-to-end manner. In RoofSeg, we leverage a transformer encoder-decoder-based framework to hierarchically predict the plane instance masks with the use of a set of learnable plane queries. To further improve the segmentation accuracy of edge regions, we also design an Edge-Aware Mask Module (EAMM) that sufficiently incorporates the planar geometric prior of edges to enhance its discriminability for plane instance mask refinement. In addition, we propose an adaptive weighting strategy in the mask loss to reduce the influence of misclassified points, and also propose a new plane geometric loss to constrain the network training.

[146] MicroDetect-Net (MDN): Leveraging Deep Learning to Detect Microplastics in Clam Blood, a Step Towards Human Blood Analysis

Riju Marwah, Riya Arora, Navneet Yadav, Himank Arora

Main category: cs.CV

TL;DR: MicroDetect-Net (MDN) is a deep learning model that uses fluorescence microscopy with Nile Red dye staining to detect microplastics in blood samples with 92% accuracy.

Details

Motivation: Microplastic pollution is widespread and harmful to human health, causing liver infection, intestinal injuries, and gut flora imbalance. Current detection methods need improvement for accurate identification in biological samples.

Method: The MDN model combines fluorescence microscopy with Nile Red dye staining and convolutional neural networks for segmentation and detection of microplastic fragments in blood samples.

Result: MDN achieved 92% accuracy on 276 Nile Red-stained fluorescent blood images, with IoU of 87.4%, F1 score of 92.1%, Precision of 90.6%, and Recall of 93.7%.

Conclusion: The approach shows strong performance in microplastic detection and provides a foundation for applying this method to human blood samples for more comprehensive health impact studies.

Abstract: With the prevalence of plastics exceeding 368 million tons yearly, microplastic pollution has grown to an extent where air, water, soil, and living organisms have all tested positive for microplastic presence. These particles, which are smaller than 5 millimeters in size, are no less harmful to humans than to the environment. Toxicity research on microplastics has shown that exposure may cause liver infection, intestinal injuries, and gut flora imbalance, leading to numerous potential health hazards. This paper presents a new model, MicroDetect-Net (MDN), which applies fluorescence microscopy with Nile Red dye staining and deep learning to scan blood samples for microplastics. Although clam blood has certain limitations in replicating real human blood, this study opens avenues for applying the approach to human samples, which are more consistent for preliminary data collection. The MDN model integrates dataset preparation, fluorescence imaging, and segmentation using a convolutional neural network to localize and count microplastic fragments. The combination of convolutional networks and Nile Red dye for segmentation produced strong image detection and accuracy. MDN was evaluated on a dataset of 276 Nile Red-stained fluorescent blood images and achieved an accuracy of 92 percent. Robust performance was observed with an Intersection over Union of 87.4 percent, F1 score of 92.1 percent, Precision of 90.6 percent, and Recall of 93.7 percent. These metrics demonstrate the effectiveness of MDN in the detection of microplastics.
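For reference, the reported metrics have standard definitions in terms of true positives, false positives, and false negatives. The sketch below is not the authors' evaluation code, and the function names are illustrative. It computes the metrics from raw counts and checks that the reported F1 score is consistent with the reported precision and recall.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, F1 and IoU from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = f1_from_precision_recall(precision, recall)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Consistency check on the figures quoted in the abstract:
# precision 90.6% and recall 93.7% give an F1 of about 92.1%.
reported_f1 = f1_from_precision_recall(0.906, 0.937)
```

Here `reported_f1` evaluates to roughly 0.921, matching the 92.1 percent F1 score stated in the abstract.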

[147] ProPy: Building Interactive Prompt Pyramids upon CLIP for Partially Relevant Video Retrieval

Yi Pan, Yujia Zhang, Michael Kampffmeyer, Xiaoguang Zhao

Main category: cs.CV

TL;DR: ProPy adapts CLIP for Partially Relevant Video Retrieval using a Prompt Pyramid structure and Ancestor-Descendant Interaction Mechanism to capture multi-granularity event semantics, achieving state-of-the-art performance.

Details

Motivation: Existing PRVR approaches focus on unimodal features while powerful pretrained vision-language models like CLIP remain underexplored for this task, creating a gap in leveraging advanced multimodal capabilities.

Method: ProPy introduces two innovations: 1) Prompt Pyramid structure organizing event prompts at multiple granularity levels, and 2) Ancestor-Descendant Interaction Mechanism enabling dynamic semantic interaction among events, systematically adapting CLIP for PRVR.

Result: ProPy achieves state-of-the-art performance on three public datasets, significantly outperforming previous models by substantial margins.

Conclusion: The systematic architectural adaptation of CLIP with multi-granularity event semantics through ProPy effectively bridges the gap in PRVR, demonstrating the potential of pretrained vision-language models for this challenging task.

Abstract: Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task that involves retrieving videos based on queries relevant to only specific segments. While existing works follow the paradigm of developing models to process unimodal features, powerful pretrained vision-language models like CLIP remain underexplored in this field. To bridge this gap, we propose ProPy, a model with systematic architectural adaptation of CLIP specifically designed for PRVR. Drawing insights from the semantic relevance of multi-granularity events, ProPy introduces two key innovations: (1) a Prompt Pyramid structure that organizes event prompts to capture semantics at multiple granularity levels, and (2) an Ancestor-Descendant Interaction Mechanism built on the pyramid that enables dynamic semantic interaction among events. With these designs, ProPy achieves SOTA performance on three public datasets, outperforming previous models by significant margins. Code is available at https://github.com/BUAAPY/ProPy.

[148] GReAT: leveraging geometric artery data to improve wall shear stress assessment

Julian Suk, Jolanda J. Wentzel, Patryk Rygiel, Joost Daemen, Daniel Rueckert, Jelmer M. Wolterink

Main category: cs.CV

TL;DR: Self-supervised learning on large geometric artery datasets improves wall shear stress assessment in coronary arteries with limited clinical data.

Details

Motivation: Big data for patient care shows promise in cardiovascular health, but training machine learning models for hemodynamic biomarkers like wall shear stress requires large datasets that are challenging to obtain. Data scarcity can be addressed through self-supervised pre-training using large geometric artery datasets.

Method: Used a large dataset of 8449 geometric 3D blood vessel models for self-supervised pre-training. Created self-supervised targets by computing heat kernel signature via Laplacian eigenvectors to capture shape essence. Applied learned geometric representations to improve wall shear stress segmentation in coronary arteries from a small clinical trial (49 patients).

Result: Geometric representations learned from the large dataset boosted segmentation of coronary arteries into regions of low, mid, and high time-averaged wall shear stress, even when trained on limited data.

Conclusion: Self-supervised pre-training on large geometric datasets effectively addresses data scarcity and improves hemodynamic biomarker assessment in coronary arteries, demonstrating the value of foundation models for medical applications with limited clinical data.

Abstract: Leveraging big data for patient care is promising in many medical fields such as cardiovascular health. For example, hemodynamic biomarkers like wall shear stress could be assessed from patient-specific medical images via machine learning algorithms, bypassing the need for time-intensive computational fluid simulation. However, it is extremely challenging to amass large-enough datasets to effectively train such models. We could address this data scarcity by means of self-supervised pre-training and foundation models given large datasets of geometric artery models. In the context of coronary arteries, leveraging learned representations to improve hemodynamic biomarker assessment has not yet been well studied. In this work, we address this gap by investigating whether a large dataset (8449 shapes) consisting of geometric models of 3D blood vessels can benefit wall shear stress assessment in coronary artery models from a small-scale clinical trial (49 patients). We create a self-supervised target for the 3D blood vessels by computing the heat kernel signature, a quantity obtained via Laplacian eigenvectors, which captures the very essence of the shapes. We show how geometric representations learned from this dataset can boost segmentation of coronary arteries into regions of low, mid and high (time-averaged) wall shear stress even when trained on limited data.
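
As a rough illustration of the self-supervised target, the heat kernel signature at vertex x and diffusion time t is commonly written as HKS(x, t) = sum_i exp(-lambda_i t) phi_i(x)^2, where (lambda_i, phi_i) are eigenpairs of the Laplacian. The sketch below evaluates that formula on a toy 4-cycle graph whose eigenpairs are known in closed form. The graph, the function name, and the time values are assumptions chosen for illustration; the authors work on 3D vessel meshes and their pipeline is not reproduced here.

```python
import math

def heat_kernel_signature(eigenvalues, eigenvectors, times):
    """Return hks[v][k] = sum_i exp(-lambda_i * t_k) * phi_i[v] ** 2.

    eigenvectors[i] is the i-th orthonormal eigenvector, listed per vertex.
    """
    n_vertices = len(eigenvectors[0])
    return [
        [
            sum(math.exp(-lam * t) * phi[v] ** 2
                for lam, phi in zip(eigenvalues, eigenvectors))
            for t in times
        ]
        for v in range(n_vertices)
    ]

# Toy example (assumption): graph Laplacian of a 4-cycle, eigenpairs in closed form.
s = 1 / math.sqrt(2)
eigenvalues = [0.0, 2.0, 2.0, 4.0]
eigenvectors = [
    [0.5, 0.5, 0.5, 0.5],
    [s, 0.0, -s, 0.0],
    [0.0, s, 0.0, -s],
    [0.5, -0.5, 0.5, -0.5],
]
times = [0.0, 0.5, 2.0]
hks = heat_kernel_signature(eigenvalues, eigenvectors, times)
```

Because the 4-cycle is fully symmetric, every vertex receives the same signature; on an irregular vessel surface the values would differ from vertex to vertex, which is what makes the signature usable as a per-vertex regression target.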

[149] No Label Left Behind: A Unified Surface Defect Detection Model for all Supervision Regimes

Blaž Rolih, Matic Fučka, Danijel Skočaj

Main category: cs.CV

TL;DR: SuperSimpleNet is a highly efficient surface defect detection model that works across all supervision scenarios (unsupervised, weakly supervised, mixed supervision, and fully supervised) with inference under 10ms.

Details

Motivation: Existing surface defect detection methods fail to meet industrial demands for high performance, efficiency, and adaptability across diverse data annotation scenarios encountered in real-world manufacturing.

Method: Built on SimpleNet foundation, incorporates novel synthetic anomaly generation, enhanced classification head, and improved learning procedure to enable efficient training in all four supervision scenarios.

Result: Sets new performance standards across all supervision scenarios on four benchmark datasets while achieving inference time below 10ms.

Conclusion: SuperSimpleNet represents a significant advancement in unifying diverse supervision paradigms while maintaining outstanding speed and reliability, bridging the gap between academic research and industrial applications.

Abstract: Surface defect detection is a critical task across numerous industries, aimed at efficiently identifying and localising imperfections or irregularities on manufactured components. While numerous methods have been proposed, many fail to meet industrial demands for high performance, efficiency, and adaptability. Existing approaches are often constrained to specific supervision scenarios and struggle to adapt to the diverse data annotations encountered in real-world manufacturing processes, such as unsupervised, weakly supervised, mixed supervision, and fully supervised settings. To address these challenges, we propose SuperSimpleNet, a highly efficient and adaptable discriminative model built on the foundation of SimpleNet. SuperSimpleNet incorporates a novel synthetic anomaly generation process, an enhanced classification head, and an improved learning procedure, enabling efficient training in all four supervision scenarios, making it the first model capable of fully leveraging all available data annotations. SuperSimpleNet sets a new standard for performance across all scenarios, as demonstrated by its results on four challenging benchmark datasets. Beyond accuracy, it is very fast, achieving an inference time below 10 ms. With its ability to unify diverse supervision paradigms while maintaining outstanding speed and reliability, SuperSimpleNet represents a promising step forward in addressing real-world manufacturing challenges and bridging the gap between academic research and industrial applications. Code: https://github.com/blaz-r/SuperSimpleNet

[150] Learning Binary Sampling Patterns for Single-Pixel Imaging using Bilevel Optimisation

Serban C. Tudosie, Alexander Denker, Zeljko Kereta, Simon Arridge

Main category: cs.CV

TL;DR: Bilevel optimization method for learning task-specific binary illumination patterns in single-pixel imaging, achieving superior reconstruction performance especially in undersampled regimes.

Details

Motivation: To improve single-pixel imaging reconstruction performance for applications like fluorescence microscopy by optimizing task-specific binary illumination patterns, addressing the non-differentiable nature of binary pattern optimization.

Method: Proposes a bilevel optimization method using Straight-Through Estimator to handle binary pattern non-differentiability and Total Deep Variation regulariser in the bilevel formulation.

Result: Demonstrated on CytoImageNet microscopy dataset, showing learned patterns achieve superior reconstruction performance compared to baseline methods, particularly in highly undersampled regimes.

Conclusion: The proposed bilevel optimization approach successfully learns effective binary illumination patterns for single-pixel imaging, significantly outperforming baseline methods especially when using limited samples.

Abstract: Single-Pixel Imaging enables reconstructing objects using a single detector through sequential illuminations with structured light patterns. We propose a bilevel optimisation method for learning task-specific, binary illumination patterns, optimised for applications like single-pixel fluorescence microscopy. We address the non-differentiable nature of binary pattern optimisation using the Straight-Through Estimator and leveraging a Total Deep Variation regulariser in the bilevel formulation. We demonstrate our method on the CytoImageNet microscopy dataset and show that learned patterns achieve superior reconstruction performance compared to baseline methods, especially in highly undersampled regimes.
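
The Straight-Through Estimator can be sketched in a few lines: the forward pass thresholds real-valued latent weights into a 0/1 pattern, and the backward pass pretends the threshold was the identity, so the gradient with respect to the binary pattern is applied directly to the latent weights. The example below is a minimal sketch under strong assumptions: a single measurement, a squared-error loss, and hypothetical names (`binarize`, `ste_step`). It leaves out the reconstruction step, the Total Deep Variation regulariser, and the bilevel structure described in the abstract.

```python
def binarize(weights):
    """Forward pass: threshold real-valued latent weights to a 0/1 pattern."""
    return [1.0 if w > 0 else 0.0 for w in weights]

def ste_step(weights, scene, target, lr=0.01):
    """One gradient step on (pattern . scene - target)^2 using the STE."""
    pattern = binarize(weights)
    measurement = sum(p * x for p, x in zip(pattern, scene))
    residual = measurement - target
    # Gradient of the loss with respect to the *binary* pattern.
    grad_pattern = [2.0 * residual * x for x in scene]
    # Straight-through: treat d(pattern)/d(weights) as 1 and reuse the gradient.
    grad_weights = grad_pattern
    return [w - lr * g for w, g in zip(weights, grad_weights)]

# Toy data (assumption): a 3-pixel scene and one scalar detector reading.
weights = [0.1, 0.1, 0.1]
scene = [1.0, 2.0, 3.0]
target = 4.0
new_weights = ste_step(weights, scene, target)
new_pattern = binarize(new_weights)
```

The latent weights stay continuous throughout; only the pattern that would be displayed is binary, which is why the update can flip a pixel once its weight crosses zero.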

[151] VibES: Induced Vibration for Persistent Event-Based Sensing

Vincenzo Polizzi, Stephen Yang, Quentin Clark, Jonathan Kelly, Igor Gilitschenski, David B. Lindell

Main category: cs.CV

TL;DR: A lightweight vibration-based approach to sustain persistent event generation in static scenes using a rotating unbalanced mass, combined with motion compensation for clean event data.

Details

Motivation: Event cameras cannot generate events in static or low-motion scenes under fixed illumination, limiting their usefulness for computer vision tasks that require continuous event data.

Method: Uses a simple rotating unbalanced mass to induce periodic vibrational motion, combined with a motion-compensation pipeline that removes the injected motion to yield clean events for downstream tasks.

Result: The method reliably recovers motion parameters and improves both image reconstruction and edge detection compared to event-based sensing without motion induction, as demonstrated with a hardware prototype and real-world datasets.

Conclusion: The proposed vibration-based approach provides an effective and lightweight solution to sustain event generation in static conditions, enabling reliable event-based perception without complex hardware or additional optical components.

Abstract: Event cameras are a bio-inspired class of sensors that asynchronously measure per-pixel intensity changes. Under fixed illumination conditions in static or low-motion scenes, rigidly mounted event cameras are unable to generate any events, becoming unsuitable for most computer vision tasks. To address this limitation, recent work has investigated motion-induced event stimulation that often requires complex hardware or additional optical components. In contrast, we introduce a lightweight approach to sustain persistent event generation by employing a simple rotating unbalanced mass to induce periodic vibrational motion. This is combined with a motion-compensation pipeline that removes the injected motion and yields clean, motion-corrected events for downstream perception tasks. We demonstrate our approach with a hardware prototype and evaluate it on real-world captured datasets. Our method reliably recovers motion parameters and improves both image reconstruction and edge detection over event-based sensing without motion induction.

[152] Few-Shot Connectivity-Aware Text Line Segmentation in Historical Documents

Rafael Sterzinger, Tingyu Lin, Robert Sablatnig

Main category: cs.CV

TL;DR: Lightweight UNet++ with topology-aware loss achieves state-of-the-art text line segmentation using only 3 annotated pages per manuscript, significantly outperforming complex models in data efficiency and accuracy.

Details

Motivation: Automating text line segmentation for historical documents is challenging due to limited annotated datasets and high annotation costs requiring expert knowledge, making few-shot learning essential.

Method: Pairs a lightweight UNet++ architecture with a connectivity-aware loss function that explicitly penalizes structural errors like line fragmentation and unintended merges. Trains on small patches from just 3 annotated pages per manuscript.

Result: 200% increase in Recognition Accuracy and 75% increase in Line Intersection over Union on U-DIADS-TL dataset. Achieves F-Measure score on par with or exceeding DIVA-HisDB competition winner while using only 3 annotated pages.

Conclusion: Small, simple architectures with topology-aware loss functions are more accurate and data-efficient than complex alternatives for few-shot historical document text line segmentation, demonstrating superior performance with minimal annotation requirements.

Abstract: A foundational task for the digital analysis of documents is text line segmentation. However, automating this process with deep learning models is challenging because it requires large, annotated datasets that are often unavailable for historical documents. Additionally, the annotation process is a labor- and cost-intensive task that requires expert knowledge, which makes few-shot learning a promising direction for reducing data requirements. In this work, we demonstrate that small and simple architectures, coupled with a topology-aware loss function, are more accurate and data-efficient than more complex alternatives. We pair a lightweight UNet++ with a connectivity-aware loss, initially developed for neuron morphology, which explicitly penalizes structural errors like line fragmentation and unintended line merges. To increase our limited data, we train on small patches extracted from a mere three annotated pages per manuscript. Our methodology significantly improves upon the current state-of-the-art on the U-DIADS-TL dataset, with a 200% increase in Recognition Accuracy and a 75% increase in Line Intersection over Union. Our method also achieves an F-Measure score on par with or even exceeding that of the competition winner of the DIVA-HisDB baseline detection task, all while requiring only three annotated pages, exemplifying the efficacy of our approach. Our implementation is publicly available at: https://github.com/RafaelSterzinger/acpr_few_shot_hist.

[153] Dual Enhancement on 3D Vision-Language Perception for Monocular 3D Visual Grounding

Yuzhen Li, Min Liu, Yuan Bian, Xueping Wang, Zhaoyang Li, Gen Li, Yaonan Wang

Main category: cs.CV

TL;DR: The paper addresses the issue of pre-trained language models being sensitive to numerical value magnitudes but ignoring measurement units in monocular 3D visual grounding, proposing two enhancement methods to improve 3D-text comprehension.

Details

Motivation: Current monocular 3D visual grounding methods suffer from poor 3D comprehension in text embeddings, where language models are sensitive to numerical magnitudes but ignore measurement units, leading to performance degradation even when physical lengths remain equivalent.

Method: Two methods: 1) 3D-text Enhancement (3DTE) - pre-processing that augments distance descriptor diversity in text queries to improve unit mapping comprehension; 2) Text-Guided Geometry Enhancement (TGE) module - projects text features into geometrically consistent space to guide geometry feature attention.

Result: Achieves state-of-the-art results on Mono3DRefer dataset with 11.94% accuracy gain in the “Far” scenario, demonstrating substantial improvements over previous methods.

Conclusion: The proposed 3D-text enhancement methods effectively address the weak 3D comprehension in language models, significantly improving monocular 3D visual grounding performance by better handling measurement units and numerical values in text descriptions.

Abstract: Monocular 3D visual grounding is a novel task that aims to locate 3D objects in RGB images using text descriptions with explicit geometry information. Despite the inclusion of geometry details in the text, we observe that the text embeddings are sensitive to the magnitude of numerical values but largely ignore the associated measurement units. For example, simply converting a length given in “meters” to the equivalent value in “decimeters” or “centimeters” leads to severe performance degradation, even though the physical length remains the same. This observation reveals the weak 3D comprehension of pre-trained language models, which generate misleading text features that hinder 3D perception. Therefore, we propose to enhance the model’s 3D perception of text embeddings and geometry features with two simple and effective methods. First, we introduce a pre-processing method named 3D-text Enhancement (3DTE), which enhances the comprehension of mapping relationships between different units by augmenting the diversity of distance descriptors in text queries. Next, we propose a Text-Guided Geometry Enhancement (TGE) module to further enhance the 3D-text information by projecting the basic text features into a geometrically consistent space. These 3D-enhanced text features are then leveraged to precisely guide the attention of geometry features. We evaluate the proposed method through extensive comparisons and ablation studies on the Mono3DRefer dataset. Experimental results demonstrate substantial improvements over previous methods, achieving new state-of-the-art results with a notable accuracy gain of 11.94% in the “Far” scenario. Our code will be made publicly available.
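
The unit-sensitivity observation suggests an augmentation in which the same physical distance is rewritten in several units. The sketch below shows one way this could look: a regular expression finds distances given in meters and emits one query variant per unit. The pattern, the unit table, the function name, and the example query are all assumptions made for illustration; this is not the authors' 3DTE implementation.

```python
import re

# Conversion factors from meters (illustrative subset of units).
UNIT_FACTORS = {"meters": 1, "decimeters": 10, "centimeters": 100}
DISTANCE = re.compile(r"(\d+(?:\.\d+)?)\s*meters?\b")

def augment_units(query):
    """Return one variant of the query per unit, all describing the same length."""
    variants = []
    for unit, factor in UNIT_FACTORS.items():
        def convert(match, unit=unit, factor=factor):
            value = float(match.group(1)) * factor
            return f"{value:g} {unit}"
        variants.append(DISTANCE.sub(convert, query))
    return variants

variants = augment_units("the car about 12.5 meters ahead")
# variants[1] == "the car about 125 decimeters ahead"
```

Every variant refers to the same physical length, so a model trained on all of them is pushed to attend to the unit as well as the number.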

[154] Beyond flattening: a geometrically principled positional encoding for vision transformers with Weierstrass elliptic functions

Zhihang Xin, Xitong Hu, Rui Wang

Main category: cs.CV

TL;DR: WEF-PE is a novel positional encoding method using elliptic functions to preserve 2D spatial structure in Vision Transformers, achieving superior performance across multiple benchmarks.

Details

Motivation: Traditional positional embeddings in Vision Transformers disrupt 2D spatial structure and lack geometric constraints, failing to properly encode spatial proximity relationships in images.

Method: Proposes Weierstrass Elliptic Function Positional Encoding (WEF-PE) that uses complex domain representation and elliptic functions’ doubly periodic properties to naturally encode 2D coordinates and spatial distances.

Result: Achieves 63.78% accuracy on CIFAR-100 with ViT-Tiny, 93.28% on CIFAR-100 with ViT-Base, and consistent improvements on VTAB-1k benchmark. Theoretical analysis confirms distance-decay property and attention visualization shows enhanced geometric bias.

Conclusion: WEF-PE provides mathematically principled 2D positional encoding that better preserves spatial structure and improves Vision Transformer performance across diverse scenarios.

Abstract: Vision Transformers have demonstrated remarkable success in computer vision tasks, yet their reliance on learnable one-dimensional positional embeddings fundamentally disrupts the inherent two-dimensional spatial structure of images through patch flattening procedures. Traditional positional encoding approaches lack geometric constraints and fail to establish monotonic correspondence between Euclidean spatial distances and sequential index distances, thereby limiting the model’s capacity to leverage spatial proximity priors effectively. We propose Weierstrass Elliptic Function Positional Encoding (WEF-PE), a mathematically principled approach that directly addresses two-dimensional coordinates through natural complex domain representation, where the doubly periodic properties of elliptic functions align remarkably with translational invariance patterns commonly observed in visual data. Our method exploits the non-linear geometric nature of elliptic functions to encode spatial distance relationships naturally, while the algebraic addition formula enables direct derivation of relative positional information between arbitrary patch pairs from their absolute encodings. Comprehensive experiments demonstrate that WEF-PE achieves superior performance across diverse scenarios, including 63.78% accuracy on CIFAR-100 from-scratch training with ViT-Tiny architecture, 93.28% on CIFAR-100 fine-tuning with ViT-Base, and consistent improvements on VTAB-1k benchmark tasks. Theoretical analysis confirms the distance-decay property through rigorous mathematical proof, while attention visualization reveals enhanced geometric inductive bias and more coherent semantic focus compared to conventional approaches. The source code implementing the methods described in this paper is publicly available on GitHub.
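
To make the ingredients concrete, the sketch below maps each patch centre to a complex number z and evaluates the Weierstrass function through its standard lattice sum, ℘(z) = 1/z^2 + sum over nonzero lattice points w of [1/(z - w)^2 - 1/w^2], truncated to a finite window. The square lattice with periods 1 and i, the truncation size, the mapping of patches into the unit cell, and the choice of (Re, Im) as features are all assumptions; the paper's actual lattice, normalisation, and use of the addition formula are not reproduced.

```python
def weierstrass_p(z, n_terms=20, omega1=1.0, omega2=1.0j):
    """Truncated lattice sum for the Weierstrass elliptic function."""
    total = 1.0 / z ** 2
    for m in range(-n_terms, n_terms + 1):
        for n in range(-n_terms, n_terms + 1):
            if m == 0 and n == 0:
                continue  # the origin is handled by the leading 1/z^2 term
            w = m * omega1 + n * omega2
            total += 1.0 / (z - w) ** 2 - 1.0 / w ** 2
    return total

def patch_encoding(row, col, grid_size):
    """Map a patch centre into the open unit cell and return (Re, Im) of wp(z)."""
    z = complex((col + 0.5) / grid_size, (row + 0.5) / grid_size)
    value = weierstrass_p(z)
    return value.real, value.imag

# Encodings for a hypothetical 4 x 4 patch grid.
encodings = [[patch_encoding(r, c, 4) for c in range(4)] for r in range(4)]
```

Patch centres are offset by half a cell so that no patch lands on a lattice point, where ℘ has a pole. The truncated sum is only approximately periodic; a production implementation would more likely use a rapidly converging series.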

[155] SoccerNet 2025 Challenges Results

Silvio Giancola, Anthony Cioppa, Marc Gutiérrez-Pérez, Jan Held, Carlos Hinojosa, Victor Joos, Arnaud Leduc, Floriane Magera, Karen Sanchez, Vladimir Somers, Artur Xarles, Antonio Agudo, Alexandre Alahi, Olivier Barnich, Albert Clapés, Christophe De Vleeschouwer, Sergio Escalera, Bernard Ghanem, Thomas B. Moeslund, Marc Van Droogenbroeck, Tomoki Abe, Saad Alotaibi, Faisal Altawijri, Steven Araujo, Xiang Bai, Xiaoyang Bi, Jiawang Cao, Vanyi Chao, Kamil Czarnogórski, Fabian Deuser, Mingyang Du, Tianrui Feng, Patrick Frenzel, Mirco Fuchs, Jorge García, Konrad Habel, Takaya Hashiguchi, Sadao Hirose, Xinting Hu, Yewon Hwang, Ririko Inoue, Riku Itsuji, Kazuto Iwai, Hongwei Ji, Yangguang Ji, Licheng Jiao, Yuto Kageyama, Yuta Kamikawa, Yuuki Kanasugi, Hyungjung Kim, Jinwook Kim, Takuya Kurihara, Bozheng Li, Lingling Li, Xian Li, Youxing Lian, Dingkang Liang, Hongkai Lin, Jiadong Lin, Jian Liu, Liang Liu, Shuaikun Liu, Zhaohong Liu, Yi Lu, Federico Méndez, Huadong Ma, Wenping Ma, Jacek Maksymiuk, Henry Mantilla, Ismail Mathkour, Daniel Matthes, Ayaha Motomochi, Amrulloh Robbani Muhammad, Haruto Nakayama, Joohyung Oh, Yin May Oo, Marcelo Ortega, Norbert Oswald, Rintaro Otsubo, Fabian Perez, Mengshi Qi, Cristian Rey, Abel Reyes-Angulo, Oliver Rose, Hoover Rueda-Chacón, Hideo Saito, Jose Sarmiento, Kanta Sawafuji, Atom Scott, Xi Shen, Pragyan Shrestha, Jae-Young Sim, Long Sun, Yuyang Sun, Tomohiro Suzuki, Licheng Tang, Masato Tonouchi, Ikuma Uchida, Henry O. Velesaca, Tiancheng Wang, Rio Watanabe, Jay Wu, Yongliang Wu, Shunzo Yamagishi, Di Yang, Xu Yang, Yuxin Yang, Hao Ye, Xinyu Ye, Calvin Yeung, Xuanlong Yu, Chao Zhang, Dingyuan Zhang, Kexing Zhang, Zhe Zhao, Xin Zhou, Wenbo Zhu, Julian Ziegler

Main category: cs.CV

TL;DR: SoccerNet 2025 Challenges present four computer vision tasks for football video analysis: team ball action spotting, monocular depth estimation, multi-view foul recognition, and game state reconstruction, with standardized datasets and evaluation protocols.

Details

Motivation: To advance computer vision research in football video understanding through reproducible, open benchmarking and drive progress at the intersection of computer vision, AI, and sports analytics.

Method: Organized four distinct vision-based challenges with large-scale annotated datasets, unified evaluation protocols, and strong baseline models for participants to build upon and compare solutions.

Result: The challenges produced top-performing solutions across all four tasks, demonstrating community progress in football video analysis through standardized benchmarking and open research collaboration.

Conclusion: SoccerNet Challenges successfully serve as a driving force for reproducible research in sports video analysis, providing comprehensive frameworks for evaluating computer vision techniques in football contexts with available datasets and development kits.

Abstract: The SoccerNet 2025 Challenges mark the fifth annual edition of the SoccerNet open benchmarking effort, dedicated to advancing computer vision research in football video understanding. This year’s challenges span four vision-based tasks: (1) Team Ball Action Spotting, focused on detecting ball-related actions in football broadcasts and assigning actions to teams; (2) Monocular Depth Estimation, targeting the recovery of scene geometry from single-camera broadcast clips through relative depth estimation for each pixel; (3) Multi-View Foul Recognition, requiring the analysis of multiple synchronized camera views to classify fouls and their severity; and (4) Game State Reconstruction, aimed at localizing and identifying all players from a broadcast video to reconstruct the game state on a 2D top-view of the field. Across all tasks, participants were provided with large-scale annotated datasets, unified evaluation protocols, and strong baselines as starting points. This report presents the results of each challenge, highlights the top-performing solutions, and provides insights into the progress made by the community. The SoccerNet Challenges continue to serve as a driving force for reproducible, open research at the intersection of computer vision, artificial intelligence, and sports. Detailed information about the tasks, challenges, and leaderboards can be found at https://www.soccer-net.org, with baselines and development kits available at https://github.com/SoccerNet.

[156] FastMesh: Efficient Artistic Mesh Generation via Component Decoupling

Jeonghwan Kim, Yushi Lan, Armando Fortes, Yongwei Chen, Xingang Pan

Main category: cs.CV

TL;DR: A novel mesh generation framework that generates vertices and faces separately, cutting the token count to roughly 23% of that of the most compact existing tokenizer and achieving more than 8x faster generation with higher-quality meshes than state-of-the-art methods.

Details

Motivation: Existing mesh generation approaches suffer from vertex redundancy in token sequences, leading to inefficient generation processes due to vertices being reused multiple times for manifold meshes.

Method: Autoregressive model for vertex generation only, bidirectional transformer for face completion in single step, fidelity enhancer for vertex refinement, and post-processing for edge connection cleanup.

Result: Achieves 8x faster generation speed and higher mesh quality compared to state-of-the-art approaches, with token count reduced to approximately 23% of existing compact tokenizers.

Conclusion: Separating vertex and face generation while leveraging specialized transformers and refinement techniques significantly improves mesh generation efficiency and quality.

Abstract: Recent mesh generation approaches typically tokenize triangle meshes into sequences of tokens and train autoregressive models to generate these tokens sequentially. Despite substantial progress, such token sequences inevitably reuse vertices multiple times to fully represent manifold meshes, as each vertex is shared by multiple faces. This redundancy leads to excessively long token sequences and inefficient generation processes. In this paper, we propose an efficient framework that generates artistic meshes by treating vertices and faces separately, significantly reducing redundancy. We employ an autoregressive model solely for vertex generation, decreasing the token count to approximately 23% of that required by the most compact existing tokenizer. Next, we leverage a bidirectional transformer to complete the mesh in a single step by capturing inter-vertex relationships and constructing the adjacency matrix that defines the mesh faces. To further improve the generation quality, we introduce a fidelity enhancer to refine vertex positioning into more natural arrangements and propose a post-processing framework to remove undesirable edge connections. Experimental results show that our method achieves more than 8× faster speed on mesh generation compared to state-of-the-art approaches, while producing higher mesh quality.
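
The redundancy argument and the role of the adjacency matrix can be illustrated on a toy mesh. The sketch below reads triangles off a given 0/1 adjacency matrix by listing every triple of mutually connected vertices, then compares how many coordinates a face-by-face listing writes against a vertex-only listing. Treating every such triple as a face is a simplifying assumption, the adjacency matrix is supplied by hand rather than predicted by a transformer, and the counts are raw coordinates rather than tokens from the paper's tokenizer.

```python
from itertools import combinations

def faces_from_adjacency(adj):
    """List triangles (i, j, k), i < j < k, whose three edges are all present."""
    n = len(adj)
    return [
        (i, j, k)
        for i, j, k in combinations(range(n), 3)
        if adj[i][j] and adj[j][k] and adj[i][k]
    ]

# Toy example (assumption): a tetrahedron, where every pair of vertices is connected.
adj = [[int(i != j) for j in range(4)] for i in range(4)]
faces = faces_from_adjacency(adj)

coords_per_face_listing = 9 * len(faces)   # 3 vertices x 3 coordinates per face
coords_vertex_only = 3 * len(adj)          # each vertex written once
```

For the tetrahedron the face-by-face listing writes 36 coordinates while the vertex list needs 12, because each vertex is shared by three faces.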

[157] All-in-One Slider for Attribute Manipulation in Diffusion Models

Weixin Ye, Hongguang Zhu, Wei Wang, Yahui Liu, Mengyu Wang

Main category: cs.CV

TL;DR: All-in-One Slider enables fine-grained control over multiple image attributes using a single lightweight module, eliminating the need for separate sliders per attribute and supporting zero-shot manipulation of unseen attributes.

Details

Motivation: Existing text-to-image diffusion models struggle with progressive attribute manipulation, requiring separate trained sliders for each attribute which leads to parameter redundancy and limited scalability.

Method: Decomposes text embedding space into sparse, semantically meaningful attribute directions using a lightweight module that functions as a general-purpose slider for interpretable control.

Result: Achieves accurate and scalable attribute manipulation with notable improvements over previous methods, supports zero-shot manipulation of unseen attributes, and works with real images through inversion framework integration.

Conclusion: The All-in-One Slider provides an efficient, flexible solution for fine-grained attribute control in text-to-image generation, overcoming limitations of previous one-slider-per-attribute approaches and enabling broader real-world applications.

Abstract: Text-to-image (T2I) diffusion models have made significant strides in generating high-quality images. However, progressively manipulating certain attributes of generated images to meet the desired user expectations remains challenging, particularly for content with rich details, such as human faces. Some studies have attempted to address this by training slider modules. However, they follow a One-for-One manner, where an independent slider is trained for each attribute, requiring additional training whenever a new attribute is introduced. This not only results in parameter redundancy accumulated by sliders but also restricts the flexibility of practical applications and the scalability of attribute manipulation. To address this issue, we introduce the All-in-One Slider, a lightweight module that decomposes the text embedding space into sparse, semantically meaningful attribute directions. Once trained, it functions as a general-purpose slider, enabling interpretable and fine-grained continuous control over various attributes. Moreover, by recombining the learned directions, the All-in-One Slider supports zero-shot manipulation of unseen attributes (e.g., races and celebrities) and the composition of multiple attributes. Extensive experiments demonstrate that our method enables accurate and scalable attribute manipulation, achieving notable improvements compared to previous methods. Furthermore, our method can be extended to integrate with the inversion framework to perform attribute manipulation on real images, broadening its applicability to various real-world scenarios. The code and trained model will be released at: https://github.com/ywxsuperstar/KSAE-FaceSteer.
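
One way to picture a general-purpose slider is as a shift of the text embedding along one or more attribute directions, each scaled by a continuous strength; composing attributes then amounts to summing several scaled directions. The sketch below shows only that arithmetic. The 4-dimensional vectors, the attribute names, and the function `apply_slider` are hypothetical, and how the sparse directions are actually learned is not shown.

```python
def apply_slider(embedding, directions, strengths):
    """Shift an embedding by sum_k strengths[k] * directions[k]."""
    shifted = list(embedding)
    for name, alpha in strengths.items():
        for d, component in enumerate(directions[name]):
            shifted[d] += alpha * component
    return shifted

# Hypothetical sparse attribute directions in a 4-dimensional embedding space.
directions = {
    "smile": [1.0, 0.0, 0.0, 0.0],
    "age":   [0.0, 0.0, 2.0, 0.0],
}
embedding = [0.5, 0.5, 0.5, 0.5]

mild_smile = apply_slider(embedding, directions, {"smile": 0.5})
older_and_smiling = apply_slider(embedding, directions, {"smile": 1.0, "age": 1.5})
```

Because each direction here touches a single coordinate, changing one attribute leaves the others untouched, which is the intuition behind asking for sparse directions.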

[158] LSD-3D: Large-Scale 3D Driving Scene Generation with Geometry Grounding

Julian Ost, Andrea Ramazzina, Amogh Joshi, Maximilian Bömer, Mario Bijelic, Felix Heide

Main category: cs.CV

TL;DR: A method for generating large-scale 3D driving scenes with accurate geometry and controllability, bridging neural reconstruction and diffusion models.

Details

Motivation: Existing neural reconstruction methods create static environments with limited control, while diffusion models lack geometry grounding and causality. This work aims to combine controllability with accurate 3D geometry for driving scenes.

Method: Combines proxy geometry generation with environment representation and score distillation from learned 2D image priors. Allows prompt-guided geometry and conditioning on map layouts.

Result: Produces realistic and geometrically consistent 3D generations of complex driving scenes with high-fidelity texture and structure, enabling causal novel view synthesis with object permanence.

Conclusion: The approach successfully bridges the gap between controllable generation and accurate 3D geometry, offering high controllability for large-scale driving scene synthesis.

Abstract: Large-scale scene data is essential for training and testing in robot learning. Neural reconstruction methods have promised the capability of reconstructing large physically-grounded outdoor scenes from captured sensor data. However, these methods have baked-in static environments and only allow for limited scene control – they are functionally constrained in scene and trajectory diversity by the captures from which they are reconstructed. In contrast, generating driving data with recent image or video diffusion models offers control, but at the cost of geometry grounding and causality. In this work, we aim to bridge this gap and present a method that directly generates large-scale 3D driving scenes with accurate geometry, allowing for causal novel view synthesis with object permanence and explicit 3D geometry estimation. The proposed method combines the generation of a proxy geometry and environment representation with score distillation from learned 2D image priors. We find that this approach allows for high controllability, enabling prompt-guided geometry and high-fidelity texture and structure that can be conditioned on map layouts – producing realistic and geometrically consistent 3D generations of complex driving scenes.

[159] OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation

Jianwen Jiang, Weihong Zeng, Zerong Zheng, Jiaqi Yang, Chao Liang, Wang Liao, Han Liang, Yuan Zhang, Mingyuan Gao

Main category: cs.CV

TL;DR: OmniHuman-1.5 is a video avatar framework that generates semantically coherent and expressive character animations using multimodal guidance and a specialized DiT architecture.

Details

Motivation: Existing video avatar models focus on physical likeness but lack semantic understanding of emotion, intent, and context, producing motions that only synchronize with low-level cues like audio rhythm.

Method: Leverages Multimodal Large Language Models for structured textual representation and introduces a specialized Multimodal DiT architecture with Pseudo Last Frame design to fuse multimodal inputs and mitigate inter-modality conflicts.

Result: Achieves leading performance across comprehensive metrics including lip-sync accuracy, video quality, motion naturalness, and semantic consistency with textual prompts. Shows remarkable extensibility to complex scenarios like multi-person and non-human subjects.

Conclusion: The proposed framework successfully bridges the gap between physical plausibility and semantic coherence, generating character animations that are contextually and emotionally resonant through effective multimodal input fusion.

Abstract: Existing video avatar models can produce fluid human animations, yet they struggle to move beyond mere physical likeness to capture a character’s authentic essence. Their motions typically synchronize with low-level cues like audio rhythm, lacking a deeper semantic understanding of emotion, intent, or context. To bridge this gap, we propose a framework designed to generate character animations that are not only physically plausible but also semantically coherent and expressive. Our model, OmniHuman-1.5, is built upon two key technical contributions. First, we leverage Multimodal Large Language Models to synthesize a structured textual representation of conditions that provides high-level semantic guidance. This guidance steers our motion generator beyond simplistic rhythmic synchronization, enabling the production of actions that are contextually and emotionally resonant. Second, to ensure the effective fusion of these multimodal inputs and mitigate inter-modality conflicts, we introduce a specialized Multimodal DiT architecture with a novel Pseudo Last Frame design. The synergy of these components allows our model to accurately interpret the joint semantics of audio, images, and text, thereby generating motions that are deeply coherent with the character, scene, and linguistic content. Extensive experiments demonstrate that our model achieves leading performance across a comprehensive set of metrics, including lip-sync accuracy, video quality, motion naturalness and semantic consistency with textual prompts. Furthermore, our approach shows remarkable extensibility to complex scenarios, such as those involving multi-person and non-human subjects. Homepage: https://omnihuman-lab.github.io/v1_5/

[160] Automated Feature Tracking for Real-Time Kinematic Analysis and Shape Estimation of Carbon Nanotube Growth

Kaveh Safavigerdini, Ramakrishna Surya, Jaired Collins, Prasad Calyam, Filiz Bunyak, Matthew R. Maschmann, Kannappan Palaniappan

Main category: cs.CV

TL;DR: VFTrack is a real-time particle tracking framework that automatically detects and tracks carbon nanotube particles in SEM image sequences, enabling kinematic analysis of CNT growth with optimal feature detection and matching.

Details

Motivation: Existing methods for CNT growth characterization are limited to static analysis or require manual initialization, lacking continuous per-particle trajectory decomposition needed for dynamic growth analysis.

Method: VFTrack integrates handcrafted or deep feature detectors and matchers within a particle tracking framework, systematically evaluating combinations using 13,540 manually annotated trajectories to identify optimal detector-matcher pairs.

Result: ALIKED detector with LightGlue matcher achieved optimal performance (F1-score: 0.78, α-score: 0.89). Motion vectors decomposed into axial growth, lateral drift, and oscillations enabled calculation of heterogeneous regional growth rates and CNT pillar morphology reconstruction.

Conclusion: VFTrack bridges the gap between physics-based models and experimental observation, enabling automated nano-material characterization and real-time optimization of CNT synthesis through continuous particle tracking.

Abstract: Carbon nanotubes (CNTs) are critical building blocks in nanotechnology, yet the characterization of their dynamic growth is limited by the experimental challenges in nanoscale motion measurement using scanning electron microscopy (SEM) imaging. Existing ex situ methods offer only static analysis, while in situ techniques often require manual initialization and lack continuous per-particle trajectory decomposition. We present Visual Feature Tracking (VFTrack), an in situ real-time particle tracking framework that automatically detects and tracks individual CNT particles in SEM image sequences. VFTrack integrates handcrafted or deep feature detectors and matchers within a particle tracking framework to enable kinematic analysis of CNT micropillar growth. A systematic evaluation using 13,540 manually annotated trajectories identifies the ALIKED detector with the LightGlue matcher as an optimal combination (F1-score of 0.78, α-score of 0.89). VFTrack motion vectors, decomposed into axial growth, lateral drift, and oscillations, facilitate the calculation of heterogeneous regional growth rates and the reconstruction of evolving CNT pillar morphologies. This work enables advancement in automated nano-material characterization, bridging the gap between physics-based models and experimental observation to enable real-time optimization of CNT synthesis.
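
Illustrative sketch (not from the paper): the abstract describes decomposing motion vectors into axial growth, lateral drift, and oscillations, but does not give the formulas. One plausible reading is a projection of each per-frame displacement onto an assumed growth axis. The function name `decompose_motion`, the `growth_axis` parameter, and the choice to treat oscillation as the lateral residual around the mean drift are all assumptions made for illustration.

```python
import math

def decompose_motion(track, growth_axis=(0.0, -1.0)):
    """Split per-frame displacements of one tracked particle into axial and
    lateral parts relative to an assumed growth axis (hypothetical helper)."""
    ax, ay = growth_axis
    norm = math.hypot(ax, ay)
    ax, ay = ax / norm, ay / norm      # unit vector along the growth direction
    lx, ly = -ay, ax                   # unit vector perpendicular to it
    axial, lateral = [], []
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        dx, dy = x1 - x0, y1 - y0
        axial.append(dx * ax + dy * ay)      # projection onto growth axis
        lateral.append(dx * lx + dy * ly)    # projection onto the normal
    n = len(lateral)
    drift = sum(lateral) / n if n else 0.0   # assumption: drift = mean lateral step
    oscillation = [v - drift for v in lateral]  # assumption: residual around drift
    return axial, drift, oscillation
```

With image coordinates where y increases downward, the default axis `(0.0, -1.0)` corresponds to upward growth; a real pipeline would presumably estimate the axis from the data.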

[161] Autoregressive Universal Video Segmentation Model

Miran Heo, Sukjun Hwang, Min-Hung Chen, Yu-Chiang Frank Wang, Albert Gu, Seon Joo Kim, Ryo Hachiuma

Main category: cs.CV

TL;DR: AUSM is a unified autoregressive model for both prompted and unprompted video segmentation that treats segmentation as sequential mask prediction, achieving state-of-the-art performance with faster training.

Details

Motivation: Current video segmentation landscape is fragmented with task-specific models, lacking a unified approach for both prompted and unprompted segmentation in streaming videos.

Method: Introduces Autoregressive Universal Segmentation Model (AUSM) based on state-space models, maintaining fixed-size spatial state for arbitrary-length videos with parallel training across frames.

Result: Outperforms prior universal streaming video segmentation methods on multiple benchmarks (DAVIS17, YouTube-VOS, MOSE, YouTube-VIS, OVIS) and achieves up to 2.5x faster training on 16-frame sequences.

Conclusion: AUSM successfully unifies prompted and unprompted video segmentation through autoregressive mask prediction, offering superior performance and efficiency compared to existing approaches.

Abstract: Recent video foundation models such as SAM2 excel at prompted video segmentation by treating masks as a general-purpose primitive. However, many real-world settings require unprompted segmentation that aims to detect and track all objects in a video without external cues, leaving today’s landscape fragmented across task-specific models and pipelines. We recast streaming video segmentation as sequential mask prediction, analogous to language modeling, and introduce the Autoregressive Universal Segmentation Model (AUSM), a single architecture that unifies both prompted and unprompted video segmentation. Built on recent state-space models, AUSM maintains a fixed-size spatial state and scales to video streams of arbitrary length. Furthermore, all components of AUSM are designed for parallel training across frames, yielding substantial speedups over iterative training. On standard benchmarks (DAVIS17, YouTube-VOS 2018 & 2019, MOSE, YouTube-VIS 2019 & 2021, and OVIS) AUSM outperforms prior universal streaming video segmentation methods and achieves up to 2.5x faster training on 16-frame sequences.

[162] Style4D-Bench: A Benchmark Suite for 4D Stylization

Beiqi Chen, Shuai Shao, Haitang Feng, Jianhuang Lai, Jianlou Si, Guangcong Wang

Main category: cs.CV

TL;DR: Style4D-Bench is the first benchmark for 4D stylization, featuring evaluation metrics, a baseline method (Style4D), and curated 4D scenes. Style4D uses 4D Gaussian Splatting with specialized components for temporal/spatial appearance control and geometry preservation, achieving SOTA performance.

Details

Motivation: To standardize evaluation and facilitate progress in the emerging field of 4D stylization by providing the first comprehensive benchmark suite with proper evaluation protocols and baseline methods.

Method: Style4D framework built on 4D Gaussian Splatting with three components: basic 4DGS scene representation, Style Gaussian Representation using per-Gaussian MLPs for appearance control, and Holistic Geometry-Preserved Style Transfer module with contrastive coherence learning.

Result: Extensive experiments show Style4D achieves state-of-the-art performance in 4D stylization, producing fine-grained stylistic details with stable temporal dynamics and consistent multi-view rendering.

Conclusion: Style4D-Bench serves as a valuable resource for benchmarking and advancing research in stylized rendering of dynamic 3D scenes, with Style4D establishing a strong baseline for future work.

Abstract: We introduce Style4D-Bench, the first benchmark suite specifically designed for 4D stylization, with the goal of standardizing evaluation and facilitating progress in this emerging area. Style4D-Bench comprises: 1) a comprehensive evaluation protocol measuring spatial fidelity, temporal coherence, and multi-view consistency through both perceptual and quantitative metrics, 2) a strong baseline that makes an initial attempt at 4D stylization, and 3) a curated collection of high-resolution dynamic 4D scenes with diverse motions and complex backgrounds. To establish a strong baseline, we present Style4D, a novel framework built upon 4D Gaussian Splatting. It consists of three key components: a basic 4DGS scene representation to capture reliable geometry, a Style Gaussian Representation that leverages lightweight per-Gaussian MLPs for temporally and spatially aware appearance control, and a Holistic Geometry-Preserved Style Transfer module designed to enhance spatio-temporal consistency via contrastive coherence learning and structural content preservation. Extensive experiments on Style4D-Bench demonstrate that Style4D achieves state-of-the-art performance in 4D stylization, producing fine-grained stylistic details with stable temporal dynamics and consistent multi-view rendering. We expect Style4D-Bench to become a valuable resource for benchmarking and advancing research in stylized rendering of dynamic 3D scenes. Project page: https://becky-catherine.github.io/Style4D . Code: https://github.com/Becky-catherine/Style4D-Bench .

[163] Articulate3D: Zero-Shot Text-Driven 3D Object Posing

Oishi Deb, Anjun Hu, Ashkan Khakzar, Philip Torr, Christian Rupprecht

Main category: cs.CV

TL;DR: Articulate3D is a training-free method that uses language instructions to pose 3D assets through a two-step process involving image generation and mesh alignment.

Details

Motivation: Despite advances in vision and language models, controlling 3D asset poses through language remains challenging, requiring a method that can manipulate poses while maintaining mesh identity.

Method: Two-step approach: 1) Modify image generator with self-attention rewiring (RSActrl) to create target images from input image + text instruction, 2) Use keypoint-based multi-view pose optimization to align mesh to target images instead of differentiable rendering.

Result: Successfully manipulates poses across diverse 3D objects and free-form text prompts while maintaining original mesh identity. Preferred over existing approaches more than 85% of the time in a comparative user study.

Conclusion: Articulate3D provides an effective training-free solution for language-controlled 3D pose manipulation using image generation and keypoint-based optimization, outperforming existing methods.

Abstract: We propose a training-free method, Articulate3D, to pose a 3D asset through language control. Despite advances in vision and language models, this task remains surprisingly challenging. To achieve this goal, we decompose the problem into two steps. We modify a powerful image-generator to create target images conditioned on the input image and a text instruction. We then align the mesh to the target images through a multi-view pose optimisation step. In detail, we introduce a self-attention rewiring mechanism (RSActrl) that decouples the source structure from pose within an image generative model, allowing it to maintain a consistent structure across varying poses. We observed that differentiable rendering is an unreliable signal for articulation optimisation; instead, we use keypoints to establish correspondences between input and target images. The effectiveness of Articulate3D is demonstrated across a diverse range of 3D objects and free-form text prompts, successfully manipulating poses while maintaining the original identity of the mesh. Quantitative evaluations and a comparative user study, in which our method was preferred over 85% of the time, confirm its superiority over existing approaches. Project page: https://odeb1.github.io/articulate3d_page_deb/

[164] VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space

Lin Li, Zehuan Huang, Haoran Feng, Gengxiong Zhuang, Rui Chen, Chunchao Guo, Lu Sheng

Main category: cs.CV

TL;DR: VoxHammer is a training-free 3D editing method that performs precise local editing in 3D latent space by preserving contextual features of unedited regions, achieving superior consistency and quality compared to existing approaches.

Details

Motivation: 3D local editing is crucial for the game industry and robot interaction, but current methods, which edit rendered multi-view images before reconstructing the 3D model, struggle to preserve unedited regions and maintain overall coherence.

Method: VoxHammer predicts inversion trajectory of 3D models to obtain inverted latents and key-value tokens, then replaces denoising features of preserved regions with these cached features during the denoising and editing phase to ensure consistent reconstruction.

Result: The method significantly outperforms existing approaches in 3D consistency of preserved regions and overall quality, as demonstrated on the Edit3D-Bench dataset with human-annotated 3D editing regions.

Conclusion: VoxHammer enables high-quality 3D editing with precise preservation of unedited areas and coherent integration of edited parts, promising to facilitate synthesis of edited paired data for in-context 3D generation.

Abstract: 3D local editing of specified regions is crucial for the game industry and robot interaction. Recent methods typically edit rendered multi-view images and then reconstruct 3D models, but they face challenges in precisely preserving unedited regions and overall coherence. Inspired by structured 3D generative models, we propose VoxHammer, a novel training-free approach that performs precise and coherent editing in 3D latent space. Given a 3D model, VoxHammer first predicts its inversion trajectory and obtains its inverted latents and key-value tokens at each timestep. Subsequently, in the denoising and editing phase, we replace the denoising features of preserved regions with the corresponding inverted latents and cached key-value tokens. By retaining these contextual features, this approach ensures consistent reconstruction of preserved areas and coherent integration of edited parts. To evaluate the consistency of preserved regions, we constructed Edit3D-Bench, a human-annotated dataset comprising hundreds of samples, each with carefully labeled 3D editing regions. Experiments demonstrate that VoxHammer significantly outperforms existing methods in terms of both 3D consistency of preserved regions and overall quality. Our method holds promise for synthesizing high-quality edited paired data, thereby laying the data foundation for in-context 3D generation. See our project page at https://huanngzh.github.io/VoxHammer-Page/.
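
Illustrative sketch (not from the paper): the feature-replacement step described above can be pictured as a masked overwrite, where positions marked as preserved take the cached inversion feature and edited positions keep the freshly denoised one. The real method operates on 3D latents and key-value tokens at each timestep of a generative model; here each position is just one value in a flat list, and the name `replace_preserved` and its arguments are hypothetical.

```python
def replace_preserved(denoised, cached, keep_mask):
    """Overwrite features at preserved positions with cached inversion features.

    denoised  : features produced by the current denoising step
    cached    : features stored during inversion at the same timestep
    keep_mask : True where the region must stay unedited
    (All three are toy stand-ins for the latents and tokens in the abstract.)
    """
    if not (len(denoised) == len(cached) == len(keep_mask)):
        raise ValueError("inputs must have the same length")
    return [c if keep else d for d, c, keep in zip(denoised, cached, keep_mask)]
```

For example, `replace_preserved([0.1, 0.2, 0.3], [9.0, 8.0, 7.0], [True, False, True])` keeps only the middle denoised value and restores the cached values on either side.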

[165] Weakly-Supervised 3D Visual Grounding based on Visual Language Alignment

Xiaoxu Xu, Yitian Yuan, Qiudan Zhang, Wenhui Wu, Zequn Jie, Lin Ma, Xu Wang

Main category: cs.CV

TL;DR: 3D-VLA is a weakly supervised approach for 3D visual grounding that uses vision-language models to align text with 3D point clouds without requiring bounding box annotations.

Details

Motivation: Existing 3D visual grounding methods require extensive bounding box annotations which are time-consuming and labor-intensive to obtain, creating a need for weakly supervised approaches.

Method: Exploits large-scale vision-language models’ ability to align text with 2D images, combined with natural 2D-3D correspondences, to implicitly construct text-3D relationships without box annotations.

Result: Achieves comparable and even superior performance to fully supervised methods on ReferIt3D and ScanRefer datasets.

Conclusion: This is the first weakly supervised 3D visual grounding method using vision-language models, demonstrating that high performance can be achieved without costly bounding box annotations.

Abstract: Learning to ground natural language queries to target objects or regions in 3D point clouds is quite essential for 3D scene understanding. Nevertheless, existing 3D visual grounding approaches require a substantial number of bounding box annotations for text queries, which is time-consuming and labor-intensive to obtain. In this paper, we propose 3D-VLA, a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment. Our 3D-VLA exploits the superior ability of current large-scale vision-language models (VLMs) on aligning the semantics between texts and 2D images, as well as the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between texts and 3D point clouds with no need for fine-grained box annotations in the training procedure. During the inference stage, the learned text-3D correspondence will help us ground the text queries to the 3D target objects even without 2D images. To the best of our knowledge, this is the first work to investigate 3D visual grounding in a weakly supervised manner by involving large scale vision-language models, and extensive experiments on ReferIt3D and ScanRefer datasets demonstrate that our 3D-VLA achieves comparable and even superior results over the fully supervised methods.

[166] VAGUE: Visual Contexts Clarify Ambiguous Expressions

Heejeong Nam, Jinwoo Ahn, Keummin Ka, Jiwan Chung, Youngjae Yu

Main category: cs.CV

TL;DR: VAGUE benchmark evaluates multimodal AI’s ability to use visual context for intent disambiguation, showing current models struggle significantly compared to humans despite visual cue improvements.

Details

Motivation: Human communication relies on visual cues to resolve ambiguity, but AI systems find it challenging to perform sophisticated multimodal reasoning.

Method: Created VAGUE benchmark with 1.6K ambiguous textual expressions paired with images and multiple-choice interpretations, spanning staged (Visual Commonsense Reasoning) and natural (Ego4D) scenes.

Result: Existing multimodal AI models struggle to infer speaker’s true intent, with performance far below human levels despite improvements from visual cues. Models fail to distinguish true intent from superficial correlations.

Conclusion: There’s a critical gap in multimodal reasoning where current models perceive images but don’t effectively reason with them, highlighting the need for better visual context integration.

Abstract: Human communication often relies on visual cues to resolve ambiguity. While humans can intuitively integrate these cues, AI systems often find it challenging to engage in sophisticated multimodal reasoning. We introduce VAGUE, a benchmark evaluating multimodal AI systems’ ability to integrate visual context for intent disambiguation. VAGUE consists of 1.6K ambiguous textual expressions, each paired with an image and multiple-choice interpretations, where the correct answer is only apparent with visual context. The dataset spans both staged, complex (Visual Commonsense Reasoning) and natural, personal (Ego4D) scenes, ensuring diversity. Our experiments reveal that existing multimodal AI models struggle to infer the speaker’s true intent. While performance consistently improves from the introduction of more visual cues, the overall accuracy remains far below human performance, highlighting a critical gap in multimodal reasoning. Analysis of failure cases demonstrates that current models fail to distinguish true intent from superficial correlations in the visual scene, indicating that they perceive images but do not effectively reason with them. We release our code and data at https://hazel-heejeong-nam.github.io/vague/.

[167] PointFix: Learning to Fix Domain Bias for Robust Online Stereo Adaptation

Kwonyoung Kim, Jungin Park, Jiyoung Lee, Dongbo Min, Kwanghoon Sohn

Main category: cs.CV

TL;DR: PointFix: A meta-learning framework with auxiliary point-selective network for robust initialization of stereo models to handle domain shift in online stereo adaptation, particularly addressing dynamic objects in challenging environments.

Details

Motivation: Online stereo adaptation faces domain shift problems between synthetic training and real test datasets, especially failing in regions with dynamic objects and severe environmental changes in applications like autonomous driving.

Method: Incorporates an auxiliary point-selective network into meta-learning framework to provide robust initialization. The network learns to fix local variants by back-propagating local information through meta-gradients, and is model-agnostic for plug-and-play use with any architecture.

Result: Extensive experiments across short-, mid-, and long-term sequence adaptations show state-of-the-art performance at inference through proper initialization of the base stereo model.

Conclusion: The auxiliary network enables an effective learning paradigm that achieves superior performance in online stereo adaptation by addressing domain shift, particularly for dynamic objects in real-world applications.

Abstract: Online stereo adaptation tackles the domain shift problem, caused by different environments between synthetic (training) and real (test) datasets, to promptly adapt stereo models in dynamic real-world applications such as autonomous driving. However, previous methods often fail to counteract particular regions related to dynamic objects with more severe environmental changes. To mitigate this issue, we propose to incorporate an auxiliary point-selective network into a meta-learning framework, called PointFix, to provide a robust initialization of stereo models for online stereo adaptation. In a nutshell, our auxiliary network learns to fix local variants intensively by effectively back-propagating local information through the meta-gradient for the robust initialization of the baseline model. This network is model-agnostic, so it can be used with any architecture in a plug-and-play manner. We conduct extensive experiments to verify the effectiveness of our method under three adaptation settings: short-, mid-, and long-term sequences. Experimental results show that the proper initialization of the base stereo model by the auxiliary network enables our learning paradigm to achieve state-of-the-art performance at inference.

[168] DiffBlender: Composable and Versatile Multimodal Text-to-Image Diffusion Models

Sungnyun Kim, Junsoo Lee, Kibeom Hong, Daesik Kim, Namhyuk Ahn

Main category: cs.CV

TL;DR: DiffBlender enhances diffusion-based text-to-image generation by integrating structure, layout, and attribute modalities in a unified framework without modifying pre-trained model parameters.

Details

Motivation: To improve text-to-image generation by incorporating diverse conditional inputs beyond just textual descriptions within a single framework.

Method: Proposes a multimodal T2I diffusion model that processes three modality types (structure, layout, attribute) using only a small subset of updated components while keeping pre-trained diffusion model parameters unchanged.

Result: Sets new benchmarks in multimodal generation through extensive quantitative and qualitative comparisons with existing conditional generation methods.

Conclusion: DiffBlender effectively integrates multiple information sources and supports diverse applications in detailed image synthesis, demonstrating superior multimodal generation capabilities.

Abstract: In this study, we aim to enhance the capabilities of diffusion-based text-to-image (T2I) generation models by integrating diverse modalities beyond textual descriptions within a unified framework. To this end, we categorize widely used conditional inputs into three modality types: structure, layout, and attribute. We propose a multimodal T2I diffusion model, which is capable of processing all three modalities within a single architecture without modifying the parameters of the pre-trained diffusion model, as only a small subset of components is updated. Our approach sets new benchmarks in multimodal generation through extensive quantitative and qualitative comparisons with existing conditional generation methods. We demonstrate that DiffBlender effectively integrates multiple sources of information and supports diverse applications in detailed image synthesis. The code and demo are available at https://github.com/sungnyun/diffblender.

[169] Memory augment is All You Need for image restoration

Xiao Feng Zhang, Chao Chen Gu, Shan Ying Zhu

Main category: cs.CV

TL;DR: MemoryNet proposes a three-granularity memory layer with contrastive learning for image restoration tasks, achieving improved performance on deraining, deshadowing, and deblurring.

Details

Motivation: Most CNN-based image restoration methods lack transparency and internal aesthetics, while existing optimization-DNN hybrid approaches have limitations.

Method: Uses a three-granularity memory layer to preserve deep image features and contrastive learning with positive, negative, and actual samples to balance learned features.

Result: Achieves significant PSNR and SSIM gains on three datasets with different degradation types, demonstrating perceptual realism in recovered images.

Conclusion: MemoryNet effectively improves restoration performance through its memory layer and contrastive learning approach, proving successful across multiple image restoration tasks.

Abstract: Image restoration is a low-level vision task, yet most CNN methods are designed as black boxes, lacking transparency and internal aesthetics. Although some methods combining traditional optimization algorithms with DNNs have been proposed, they all have some limitations. In this paper, we propose MemoryNet, which combines a three-granularity memory layer with contrastive learning. Specifically, the samples are divided into positive, negative, and actual samples for contrastive learning; the memory layer preserves the deep features of the image, and the contrastive learning drives the learned features toward balance. Experiments on deraining, deshadowing, and deblurring tasks demonstrate that these methods are effective in improving restoration performance. In addition, the model obtains significant PSNR and SSIM gains on three datasets with different degradation types, which is strong evidence that the recovered images are perceptually realistic. The source code of MemoryNet can be obtained from https://github.com/zhangbaijin/MemoryNet

[170] Learning county from pixels: corn yield prediction with attention-weighted multiple instance learning

Xiaoyu Wang, Yuchi Ma, Qunying Huang, Zhengwei Yang, Zhou Zhang

Main category: cs.CV

TL;DR: Pixel-level corn yield prediction using attention-based multiple instance learning to address mixed pixel issues, achieving R²=0.84 and outperforming other ML models.

Details

Motivation: Existing county-level yield prediction methods aggregate pixels into single values, losing granular information and suffering from mixed pixel noise due to inconsistent resolution between feature datasets and crop masks.

Method: Uses multiple instance learning with attention mechanism to analyze counties at pixel level, automatically assigning weights to different pixels to mitigate mixed pixel influence and filter out noise.

Result: Outperformed four other machine learning models over five years in US corn belt, with best performance in 2022 (R²=0.84, RMSE=0.83). Verified ability to capture critical features while filtering mixed pixel noise.

Conclusion: Pixel-level analysis with attention mechanism provides superior yield prediction by leveraging detailed spatial information and effectively addressing mixed pixel challenges, demonstrating advantages from both spatial and temporal perspectives.

Abstract: Remote sensing technology has become a promising tool in yield prediction. Most prior work employs satellite imagery for county-level corn yield prediction by spatially aggregating all pixels within a county into a single value, potentially overlooking the detailed information and valuable insights offered by more granular data. To this end, this research examines each county at the pixel level and applies multiple instance learning to leverage detailed information within a county. In addition, our method addresses the “mixed pixel” issue caused by the inconsistent resolution between feature datasets and crop mask, which may introduce noise into the model and therefore hinder accurate yield prediction. Specifically, the attention mechanism is employed to automatically assign weights to different pixels, which can mitigate the influence of mixed pixels. The experimental results show that the developed model outperforms four other machine learning models over the past five years in the U.S. corn belt and demonstrates its best performance in 2022, achieving a coefficient of determination (R²) value of 0.84 and a root mean square error (RMSE) of 0.83. This paper demonstrates the advantages of our approach from both spatial and temporal perspectives. Furthermore, through an in-depth study of the relationship between mixed pixels and attention, it is verified that our approach can capture critical feature information while filtering out noise from mixed pixels.
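
Illustrative sketch (not from the paper): attention-weighted multiple instance learning treats a county as a bag of pixels and replaces the plain spatial average with a softmax-weighted average, so pixels with low attention scores (for example, mixed pixels) contribute little. In the actual model the scores would come from a learned network; here they are passed in directly, and the name `attention_pool` is hypothetical.

```python
import math

def attention_pool(pixel_features, scores):
    """Pool a bag of pixel feature vectors into one county-level vector.

    pixel_features : list of equal-length feature lists, one per pixel
    scores         : one unnormalized attention score per pixel
                     (assumed given; a trained model would predict them)
    """
    m = max(scores)                                # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]            # softmax over pixels
    dim = len(pixel_features[0])
    pooled = [
        sum(w * f[d] for w, f in zip(weights, pixel_features))
        for d in range(dim)
    ]
    return pooled, weights
```

With equal scores the result reduces to the ordinary per-county mean, which is the aggregation the abstract attributes to prior work; unequal scores shift the pooled vector toward the higher-scoring pixels.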

Hu Wang, Salma Hassan, Yuyuan Liu, Congbo Ma, Yuanhong Chen, Qing Li, Jiahui Geng, Bingjie Wang, Yu Tian, Yutong Xie, Jodie Avery, Louise Hull, Ian Reid, Mohammad Yaqub, Gustavo Carneiro

Main category: cs.CV

TL;DR: MetaKD uses meta-learning to weight modalities and enable knowledge transfer between them, maintaining high accuracy even when key modalities are missing across multiple tasks.

Details

Motivation: Some modalities in multi-modal learning are more influential than others, and their absence significantly impacts classification/segmentation accuracy. Existing methods are often task-specific and require major modifications.

Method: Meta-learned Modality-weighted Knowledge Distillation (MetaKD) adaptively estimates modality importance weights through meta-learning, then guides pairwise modality-weighted knowledge distillation to transfer knowledge from high-importance to lower-importance modalities.

Result: Outperforms compared models by a large margin on five datasets including three Brain Tumor Segmentation datasets (BraTS2018-2020), ADNI classification, and Audiovision-MNIST classification.

Conclusion: MetaKD provides a robust approach that works across multiple tasks with minimal adaptation, maintaining high performance despite missing modalities through adaptive modality weighting and knowledge distillation.

Abstract: In multi-modal learning, some modalities are more influential than others, and their absence can have a significant impact on classification/segmentation accuracy. Addressing this challenge, we propose a novel approach called Meta-learned Modality-weighted Knowledge Distillation (MetaKD), which enables multi-modal models to maintain high accuracy even when key modalities are missing. MetaKD adaptively estimates the importance weight of each modality through a meta-learning process. These learned importance weights guide a pairwise modality-weighted knowledge distillation process, allowing high-importance modalities to transfer knowledge to lower-importance ones, resulting in robust performance despite missing inputs. Unlike previous methods in the field, which are often task-specific and require significant modifications, our approach is designed to work in multiple tasks (e.g., segmentation and classification) with minimal adaptation. Experimental results on five prevalent datasets, including three Brain Tumor Segmentation datasets (BraTS2018, BraTS2019 and BraTS2020), the Alzheimer’s Disease Neuroimaging Initiative (ADNI) classification dataset and the Audiovision-MNIST classification dataset, demonstrate the proposed model is able to outperform the compared models by a large margin. The code is available at https://github.com/billhhh/MetaKD.

[172] Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs

Somraj Gautam, Abhirama Subramanyam Penamakuri, Abhishek Bhandari, Gaurav Harit

Main category: cs.CV

TL;DR: MMCRICBENCH-3K is a benchmark for evaluating large vision-language models on cricket scorecard VQA tasks, featuring English and Hindi scorecards with English QA pairs to test numerical reasoning and cross-lingual generalization.

Details

Motivation: To address the limitations of current LVLMs in handling complex numerical reasoning, structured data understanding, and cross-lingual generalization in semi-structured tabular images like cricket scorecards.

Method: Created a benchmark with 1,463 synthetic cricket scorecard images (ODI, T20, Test formats) and 1,500 English QA pairs, divided into English (MMCRICBENCH-E-1.5K) and visually similar Hindi (MMCRICBENCH-H-1.5K) subsets for controlled cross-script evaluation.

Result: State-of-the-art LVLMs (GPT-4o, Qwen2.5VL) struggle on the English subset even though English is their primary training language, and performance drops further on the Hindi subset.

Conclusion: The benchmark reveals critical limitations in LVLMs’ structure-aware visual text understanding, numerical reasoning capabilities, and cross-lingual generalization, providing a valuable resource for future research in this direction.

Abstract: We introduce MMCRICBENCH-3K, a benchmark for Visual Question Answering (VQA) on cricket scorecards, designed to evaluate large vision-language models (LVLMs) on complex numerical and cross-lingual reasoning over semi-structured tabular images. MMCRICBENCH-3K comprises 1,463 synthetically generated scorecard images from ODI, T20, and Test formats, accompanied by 1,500 English QA pairs. It includes two subsets: MMCRICBENCH-E-1.5K, featuring English scorecards, and MMCRICBENCH-H-1.5K, containing visually similar Hindi scorecards, with all questions and answers kept in English to enable controlled cross-script evaluation. The task demands reasoning over structured numerical data, multi-image context, and implicit domain knowledge. Empirical results show that even state-of-the-art LVLMs, such as GPT-4o and Qwen2.5VL, struggle on the English subset despite it being their primary training language and exhibit a further drop in performance on the Hindi subset. This reveals key limitations in structure-aware visual text understanding, numerical reasoning, and cross-lingual generalization. The dataset is publicly available via Hugging Face at https://huggingface.co/datasets/DIALab/MMCricBench, to promote LVLM research in this direction.

[173] MicroMIL: Graph-Based Multiple Instance Learning for Context-Aware Diagnosis with Microscopic Images

Jongwoo Kim, Bryan Wong, Huazhu Fu, Willmer Rafell Quiñones, Youngsin Ko, Mun Yong Yi

Main category: cs.CV

TL;DR: MicroMIL is a weakly-supervised multiple instance learning framework designed for conventional light microscope images that reduces redundancy and eliminates the need for spatial coordinates while maintaining diagnostic accuracy.

Details

Motivation: Whole-slide images require significant computational resources, limiting accessibility in resource-constrained settings. Conventional light microscopes offer a cost-effective alternative but present challenges for graph-based MIL due to redundant images and missing spatial coordinates.

Method: Uses a representative image extractor (RIE) with deep cluster embedding and hard Gumbel-Softmax to dynamically reduce redundancy and select representative images. These images serve as graph nodes with edges computed via cosine similarity, eliminating the need for spatial coordinates.

Result: Achieves state-of-the-art performance on real-world colon cancer and BreakHis datasets, improving both diagnostic accuracy and robustness to redundancy.

Conclusion: MicroMIL successfully addresses the challenges of applying GNN-MIL to conventional microscope images, providing a cost-effective alternative to whole-slide imaging while maintaining high diagnostic performance.

Abstract: Cancer diagnosis has greatly benefited from the integration of whole-slide images (WSIs) with multiple instance learning (MIL), enabling high-resolution analysis of tissue morphology. Graph-based MIL (GNN-MIL) approaches have emerged as powerful solutions for capturing contextual information in WSIs, thereby improving diagnostic accuracy. However, WSIs require significant computational and infrastructural resources, limiting accessibility in resource-constrained settings. Conventional light microscopes offer a cost-effective alternative, but applying GNN-MIL to such data is challenging due to extensive redundant images and missing spatial coordinates, which hinder contextual learning. To address these issues, we introduce MicroMIL, the first weakly-supervised MIL framework specifically designed for images acquired from conventional light microscopes. MicroMIL leverages a representative image extractor (RIE) that employs deep cluster embedding (DCE) and hard Gumbel-Softmax to dynamically reduce redundancy and select representative images. These images serve as graph nodes, with edges computed via cosine similarity, eliminating the need for spatial coordinates while preserving contextual information. Extensive experiments on a real-world colon cancer dataset and the BreakHis dataset demonstrate that MicroMIL achieves state-of-the-art performance, improving both diagnostic accuracy and robustness to redundancy. The code is available at https://github.com/kimjongwoo-cell/MicroMIL
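
The edge construction described above (graph nodes connected by cosine similarity rather than by spatial coordinates) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy feature vectors, and the similarity threshold are all assumptions made for illustration, and the abstract does not state how edges are selected once similarities are computed.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def build_similarity_edges(features, threshold=0.5):
    """Connect image nodes whose features are similar enough.

    `threshold` is a hypothetical cutoff chosen for this sketch; no image
    coordinates are used anywhere, only the feature vectors.
    """
    edges = []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            sim = cosine_similarity(features[i], features[j])
            if sim >= threshold:
                edges.append((i, j, sim))
    return edges


# Toy embeddings for three representative images (assumed values).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
edges = build_similarity_edges(features, threshold=0.5)
print(edges)
```

With these toy vectors only the first two images are linked, since the third points in a nearly orthogonal direction in feature space.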

[174] FUSELOC: Fusing Global and Local Descriptors to Disambiguate 2D-3D Matching in Visual Localization

Son Tung Nguyen, Alejandro Fontan, Michael Milford, Tobias Fischer

Main category: cs.CV

TL;DR: A novel method that fuses local and global descriptors using weighted average to improve direct 2D-3D matching accuracy while reducing memory usage and increasing speed compared to hierarchical methods.

Details

Motivation: Hierarchical visual localization methods achieve high accuracy but require substantial memory storage for all database images, while direct 2D-3D matching uses less memory but suffers from lower accuracy due to ambiguous search spaces.

Method: Fuses local and global descriptors using a weighted average operator that rearranges the local descriptor space so geographically nearby descriptors are closer in feature space according to global descriptors, reducing irrelevant competing descriptors.

Result: Consistently improved accuracy over local-only systems, achieving performance close to hierarchical methods while using 43% less memory and running 1.6 times faster on four challenging datasets (Cambridge Landmarks, Aachen Day/Night, RobotCar Seasons, Extended CMU Seasons).

Conclusion: For the first time, direct matching algorithms can benefit from global descriptors without compromising computational efficiency, demonstrating a breakthrough in balancing accuracy, memory usage, and speed in visual localization.

Abstract: Hierarchical visual localization methods achieve state-of-the-art accuracy but require substantial memory as they need to store all database images. Direct 2D-3D matching requires significantly less memory but suffers from lower accuracy due to the larger and more ambiguous search space. We address this ambiguity by fusing local and global descriptors using a weighted average operator. This operator rearranges the local descriptor space so that geographically nearby local descriptors are closer in the feature space according to the global descriptors. This decreases the number of irrelevant competing descriptors, especially if they are geographically distant, thus increasing the correct matching likelihood. We consistently improve the accuracy over local-only systems, and we achieve performance close to hierarchical methods while using 43% less memory and running 1.6 times faster. Extensive experiments on four challenging datasets – Cambridge Landmarks, Aachen Day/Night, RobotCar Seasons, and Extended CMU Seasons – demonstrate that, for the first time, direct matching algorithms can benefit from global descriptors without compromising computational efficiency. Our code is available at https://github.com/sontung/descriptor-disambiguation.
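
To make the weighted-average idea concrete, here is a small sketch of how fusing a global descriptor into a local one can break a tie between two visually identical local descriptors. It is an illustration under stated assumptions, not the released code: the names `fuse` and `nearest`, the weight of 0.5, and the assumption that local and global descriptors share the same dimension are all hypothetical.

```python
def fuse(local_desc, global_desc, weight=0.5):
    """Weighted average of a local descriptor and its image's global descriptor.

    Assumes both vectors already have the same length; `weight` is an
    illustrative value, not one reported in the paper.
    """
    return [weight * l + (1.0 - weight) * g for l, g in zip(local_desc, global_desc)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query, database):
    """Index of the database descriptor closest to the query."""
    return min(range(len(database)), key=lambda i: squared_distance(query, database[i]))


# Two 3D points with identical local appearance, seen in different places.
local_a, global_a = [1.0, 0.0], [0.0, 1.0]  # place A
local_b, global_b = [1.0, 0.0], [1.0, 0.0]  # place B
database = [fuse(local_a, global_a), fuse(local_b, global_b)]

# The query keypoint looks the same, but its image resembles place A.
query = fuse([1.0, 0.0], [0.1, 0.9])
match = nearest(query, database)
print(match)
```

On local descriptors alone the two database entries would be indistinguishable; after fusion the query lands closer to the entry from the place whose global descriptor it resembles.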

[175] MCGS: Multiview Consistency Enhancement for Sparse-View 3D Gaussian Radiance Fields

Yuru Xiao, Deming Zhai, Wenbo Zhao, Kui Jiang, Junjun Jiang, Xianming Liu

Main category: cs.CV

TL;DR: MCGS is a 3D Gaussian Splatting framework that improves sparse view synthesis by using matching priors for Gaussian initialization and multi-view consistency-guided pruning for robust scene reconstruction.

Details

Motivation: Existing methods using 3D Gaussians struggle with sparse input views due to poor initialization and lack of multi-view consistency constraints, leading to suboptimal performance and inefficient scene representation.

Method: Uses sparse matcher priors to initialize Gaussians on textured regions with random distribution in low-texture areas, plus multi-view consistency-guided progressive pruning to dynamically remove inconsistent Gaussians during optimization.

Result: Achieves photorealistic scene reconstruction from sparse views with enhanced robustness, faster rendering, and reduced memory consumption compared to existing methods.

Conclusion: MCGS provides a practical framework for 3D Gaussian Splatting that effectively addresses sparse view challenges through improved initialization and consistency-constrained optimization strategies.

Abstract: Radiance fields represented by 3D Gaussians excel at synthesizing novel views, offering both high training efficiency and fast rendering. However, with sparse input views, the lack of multi-view consistency constraints results in poorly initialized Gaussians and unreliable heuristics for optimization, leading to suboptimal performance. Existing methods often incorporate depth priors from dense estimation networks but overlook the inherent multi-view consistency in input images. Additionally, they rely on dense initialization, which limits the efficiency of scene representation. To overcome these challenges, we propose a view synthesis framework based on 3D Gaussian Splatting, named MCGS, enabling photorealistic scene reconstruction from sparse views. The key innovations of MCGS in enhancing multi-view consistency are as follows: i) We leverage matching priors from a sparse matcher to initialize Gaussians primarily on textured regions, while low-texture areas are populated with randomly distributed Gaussians. This yields a compact yet sufficient set of initial Gaussians. ii) We propose a multi-view consistency-guided progressive pruning strategy to dynamically eliminate inconsistent Gaussians. This approach confines their optimization to a consistency-constrained space, which ensures robust and coherent scene reconstruction. These strategies enhance robustness to sparse views, accelerate rendering, and reduce memory consumption, making MCGS a practical framework for 3D Gaussian Splatting.

[176] Benchmarking XAI Explanations with Human-Aligned Evaluations

Rémi Kazmierczak, Steve Azzolin, Eloïse Berthier, Anna Hedström, Patricia Delhomme, David Filliat, Nicolas Bousquet, Goran Frehse, Massimiliano Mancini, Baptiste Caramiaux, Andrea Passerini, Gianni Franchi

Main category: cs.CV

TL;DR: PASTA is a human-centric framework for evaluating XAI techniques in computer vision, featuring a large-scale benchmark dataset and automated scoring system that predicts human preferences.

Details

Motivation: Current XAI evaluation lacks standardized human-centric benchmarks that can compare different explanation modalities (saliency-based vs concept-based) and provide scalable assessment aligned with human perception.

Method: Created the PASTA-dataset, the first large-scale benchmark spanning diverse models and both saliency-based and concept-based explanation methods. Developed the automated PASTA-score method, which predicts human preferences using this dataset and enables cross-modality comparisons.

Result: The framework provides robust comparative analysis of XAI techniques based on human judgment, offering scalable and consistent evaluation that aligns with human perception across different explanation modalities.

Conclusion: PASTA enables systematic evaluation of XAI interpretability and can be used to build more human-interpretable AI methods, addressing previous gaps in cross-modality comparison and human-aligned assessment.

Abstract: We introduce PASTA (Perceptual Assessment System for explanaTion of Artificial Intelligence), a novel human-centric framework for evaluating eXplainable AI (XAI) techniques in computer vision. Our first contribution is the creation of the PASTA-dataset, the first large-scale benchmark that spans a diverse set of models and both saliency-based and concept-based explanation methods. This dataset enables robust, comparative analysis of XAI techniques based on human judgment. Our second contribution is an automated, data-driven benchmark that predicts human preferences using the PASTA-dataset. This scoring method, called PASTA-score, offers scalable, reliable, and consistent evaluation aligned with human perception. Additionally, our benchmark allows for comparisons between explanations across different modalities, an aspect previously unaddressed. We then propose to apply our scoring method to probe the interpretability of existing models and to build more human-interpretable XAI methods.

[177] ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration

Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, Jianwei Yin

Main category: cs.CV

TL;DR: Zoom Eye is a tree search algorithm that enables MLLMs to navigate high-resolution images hierarchically, improving detailed object recognition without additional training.

Details

Motivation: High-resolution images contain numerous visual elements, but MLLMs struggle with restricted input resolution and cluttered contexts, often missing fine-grained details while focusing only on primary objects.

Method: Proposes a model-agnostic, training-free tree search algorithm that treats images as hierarchical trees, where each child node represents a zoomed sub-patch of its parent. The algorithm searches from root (full image) to leaf nodes to locate relevant information.

Result: Significant performance improvements: LLaVA-v1.5-7B increased by 34.57% on V* Bench and 17.88% on HR-Bench. Small 7B MLLMs outperformed large models like GPT-4o on high-resolution benchmarks.

Conclusion: Zoom Eye effectively enables MLLMs to simulate human zooming actions, capturing detailed visual information in high-resolution images without requiring model retraining, demonstrating strong generalization across different MLLM architectures.

Abstract: An image, especially a high-resolution one, typically consists of numerous visual elements, ranging from dominant large objects to fine-grained detailed objects. When perceiving such images, multimodal large language models (MLLMs) face limitations due to the restricted input resolution of the pretrained vision encoder and the cluttered, dense context of the image, resulting in a focus on primary objects while easily overlooking detailed ones. In this paper, we propose Zoom Eye, a tree search algorithm designed to navigate the hierarchical and visual nature of images to capture relevant information. Zoom Eye conceptualizes an image as a tree, with each child node representing a zoomed sub-patch of its parent node and the root representing the overall image. Moreover, Zoom Eye is model-agnostic and training-free, so it enables any MLLM to simulate human zooming actions by searching along the image tree from root to leaf nodes, seeking out pertinent information, and accurately responding to related queries. We experiment on a series of elaborate high-resolution benchmarks and the results demonstrate that Zoom Eye not only consistently improves the performance of a series of base MLLMs by a large margin (e.g., LLaVA-v1.5-7B increases by 34.57% on V* Bench and 17.88% on HR-Bench), but also enables small 7B MLLMs to outperform strong large models such as GPT-4o. Our code is available at https://github.com/om-ai-lab/ZoomEye.
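
The image-as-tree idea can be sketched as a greedy root-to-leaf descent over quadrants. This is a simplified illustration rather than the Zoom Eye algorithm itself: the quadrant split, the greedy choice of a single best child, the `min_size` stopping rule, and the toy scoring function are all assumptions. In the paper's setting the relevance score would come from the MLLM; here a stand-in that checks for a known target pixel is used so the sketch runs on its own.

```python
def split_into_quadrants(box):
    """Split an (x, y, width, height) patch into four child patches."""
    x, y, w, h = box
    hw, hh = w // 2, h // 2
    return [
        (x, y, hw, hh),
        (x + hw, y, w - hw, hh),
        (x, y + hh, hw, h - hh),
        (x + hw, y + hh, w - hw, h - hh),
    ]


def zoom_search(root, score, min_size=64):
    """Greedily zoom from the full image toward the most relevant sub-patch.

    `score` is a hypothetical callable that rates how relevant a patch is to
    the query. Returns the list of patches visited from root to leaf.
    """
    box = root
    path = [box]
    while box[2] > min_size and box[3] > min_size:
        box = max(split_into_quadrants(box), key=score)
        path.append(box)
    return path


# Toy scorer: a patch is relevant if it contains a known target pixel.
target = (700, 100)


def contains_target(box):
    x, y, w, h = box
    inside = x <= target[0] < x + w and y <= target[1] < y + h
    return 1.0 if inside else 0.0


path = zoom_search((0, 0, 1024, 1024), contains_target, min_size=128)
for patch in path:
    print(patch)
```

A fuller search would likely keep several candidate nodes and backtrack when a branch turns out to be unhelpful; the greedy version is used here only to keep the example short.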

[178] Human Vision Constrained Super-Resolution

Volodymyr Karpenko, Taimoor Tariq, Jorge Condor, Piotr Didyk

Main category: cs.CV

TL;DR: A human vision-inspired framework that dynamically guides super-resolution methods based on visual sensitivity and viewing conditions, reducing FLOPS by a factor of 2 or more without sacrificing perceived quality.

Details

Motivation: Current SR methods process images/videos independently of human visual system limitations and viewing conditions, wasting computational resources on details that viewers cannot perceive.

Method: Proposes a Human Visual Processing Framework (HVPF) that dynamically and locally guides SR methods according to human sensitivity to image details (spatial frequency, luminance, color, contrast, motion) and viewing conditions (lighting, distance). Combined with network branching for computational efficiency.

Result: Achieves 2x or greater reduction in FLOPS without sacrificing perceived quality, as demonstrated through quantitative/qualitative evaluations and user studies.

Conclusion: The framework successfully optimizes computational resources in SR by aligning processing with human visual capabilities, delivering visually optimal results while significantly reducing computational complexity.

Abstract: Modern deep-learning super-resolution (SR) techniques process images and videos independently of the underlying content and viewing conditions. However, the sensitivity of the human visual system (HVS) to image details changes depending on the underlying image characteristics, such as spatial frequency, luminance, color, contrast, or motion, as well as viewing-condition aspects such as ambient lighting and distance to the display. This observation suggests that computational resources spent on up-sampling images/videos may be wasted whenever a viewer cannot resolve the synthesized details, i.e., when the resolution of details exceeds the resolving capability of human vision. Motivated by this observation, we propose a human vision-inspired and architecture-agnostic approach for controlling SR techniques to deliver visually optimal results while limiting computational complexity. Its core is an explicit Human Visual Processing Framework (HVPF) that dynamically and locally guides SR methods according to human sensitivity to specific image details and viewing conditions. We demonstrate the application of our framework in combination with network branching to improve the computational efficiency of SR methods. Quantitative and qualitative evaluations, including user studies, demonstrate the effectiveness of our approach in reducing FLOPS by factors of 2× and greater, without sacrificing perceived quality.

[179] Incremental Multi-Scene Modeling via Continual Neural Graphics Primitives

Prajwal Singh, Ashish Tiwari, Gautam Vashishtha, Shanmuganathan Raman

Main category: cs.CV

TL;DR: C-NGP enables multiple 3D scenes to be incrementally encoded into a single NeRF model without increasing parameters, using generative replay to avoid catastrophic forgetting.

Details

Motivation: NeRFs require separate models per scene with cumulative training time increases, limiting scalability for multiple scenes.

Method: Continual-Neural Graphics Primitives (C-NGP) framework using generative replay approach to integrate multiple scenes incrementally without access to old data.

Result: C-NGP models all 8 scenes from Real-LLFF dataset with only 2.2% PSNR drop compared to vanilla NeRF per-scene modeling, and enables multiple style edits.

Conclusion: C-NGP successfully addresses NeRF scalability challenges by enabling continual learning of multiple scenes in a single model with minimal quality degradation.

Abstract: Neural radiance fields (NeRF) have revolutionized photorealistic rendering of novel views for 3D scenes. Despite their growing popularity and efficiency as 3D resources, NeRFs face scalability challenges due to the need for separate models per scene and the cumulative increase in training time for multiple scenes. The potential for incrementally encoding multiple 3D scenes into a single NeRF model remains largely unexplored. To address this, we introduce Continual-Neural Graphics Primitives (C-NGP), a novel continual learning framework that integrates multiple scenes incrementally into a single neural radiance field. Using a generative replay approach, C-NGP adapts to new scenes without requiring access to old data. We demonstrate that C-NGP can accommodate multiple scenes without increasing the parameter count, producing high-quality novel-view renderings on synthetic and real datasets. Notably, C-NGP models all 8 scenes from the Real-LLFF dataset together, with only a 2.2% drop in PSNR compared to vanilla NeRF, which models each scene independently. Further, C-NGP allows multiple style edits in the same network.

[180] Inspiring the Next Generation of Segment Anything Models: Comprehensively Evaluate SAM and SAM 2 with Diverse Prompts Towards Context-Dependent Concepts under Different Scenes

Xiaoqi Zhao, Youwei Pang, Shijie Chang, Yuan Zhao, Lihe Zhang, Chenyang Yu, Hanqi Liu, Jiaming Zuo, Jinsong Ouyang, Weisi Lin, Georges El Fakhri, Huchuan Lu, Xiaofeng Liu

Main category: cs.CV

TL;DR: Comprehensive evaluation of SAM and SAM 2 foundation models on 11 context-dependent concepts across 2D/3D images and videos, revealing limitations in handling context-sensitive segmentation tasks.

Details

Motivation: SAM and SAM 2 have shown strong open-world segmentation capabilities but overlook challenging context-dependent concepts like visual saliency, camouflage, defects, and lesions that require strong contextual understanding.

Method: Developed unified evaluation framework with manual/automatic/self-prompting strategies, tested on 11 CD concepts across multiple visual modalities, explored in-context learning, and conducted prompt robustness testing.

Result: The evaluation reveals SAMs’ limitations in handling context-dependent concepts that require strong discriminative capabilities and contextual understanding across different domains.

Conclusion: While SAMs excel at context-independent segmentation, they struggle with context-dependent concepts, highlighting the need for improved contextual understanding in future segmentation models.

Abstract: As large-scale foundation models trained on billions of image–mask pairs covering a vast diversity of scenes, objects, and contexts, SAM and its upgraded version, SAM 2, have significantly influenced multiple fields within computer vision. Leveraging such unprecedented data diversity, they exhibit strong open-world segmentation capabilities, with SAM 2 further enhancing these capabilities to support high-quality video segmentation. While SAMs (SAM and SAM 2) have demonstrated excellent performance in segmenting context-independent concepts like people, cars, and roads, they overlook more challenging context-dependent (CD) concepts, such as visual saliency, camouflage, industrial defects, and medical lesions. CD concepts rely heavily on global and local contextual information, making them susceptible to shifts in different contexts, which requires strong discriminative capabilities from the model. The lack of comprehensive evaluation of SAMs limits understanding of their performance boundaries, which may hinder the design of future models. In this paper, we conduct a thorough evaluation of SAMs on 11 CD concepts across 2D and 3D images and videos in various visual modalities within natural, medical, and industrial scenes. We develop a unified evaluation framework for SAM and SAM 2 that supports manual, automatic, and intermediate self-prompting, aided by our specific prompt generation and interaction strategies. We further explore the potential of SAM 2 for in-context learning and introduce prompt robustness testing to simulate real-world imperfect prompts. Finally, we analyze the benefits and limitations of SAMs in understanding CD concepts and discuss their future development in segmentation tasks.

[181] StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models

Yunzhi Yan, Zhen Xu, Haotong Lin, Haian Jin, Haoyu Guo, Yida Wang, Kun Zhan, Xianpeng Lang, Hujun Bao, Xiaowei Zhou, Sida Peng

Main category: cs.CV

TL;DR: StreetCrafter is a controllable video diffusion model that uses LiDAR point cloud renderings as pixel-level conditions for photorealistic novel view synthesis in autonomous driving scenes, enabling precise camera control and pixel-level editing.

Details

Motivation: Existing neural scene representation methods for autonomous driving scenes suffer from performance degradation when viewpoints deviate from training trajectories, limiting their practical application for novel view synthesis.

Method: The authors introduce StreetCrafter, a novel controllable video diffusion model that utilizes LiDAR point cloud renderings as pixel-level conditions to exploit generative priors while preserving precise camera control.

Result: Experiments on Waymo Open Dataset and PandaSet demonstrate that StreetCrafter enables flexible viewpoint control, expands view synthesis regions, and outperforms existing methods in rendering quality.

Conclusion: StreetCrafter effectively addresses viewpoint deviation issues in autonomous driving scene synthesis through LiDAR-based pixel-level conditioning and generative priors, achieving superior performance and enabling real-time rendering when incorporated into dynamic scene representations.

Abstract: This paper aims to tackle the problem of photorealistic view synthesis from vehicle sensor data. Recent advancements in neural scene representation have achieved notable success in rendering high-quality autonomous driving scenes, but the performance significantly degrades as the viewpoint deviates from the training trajectory. To mitigate this problem, we introduce StreetCrafter, a novel controllable video diffusion model that utilizes LiDAR point cloud renderings as pixel-level conditions, which fully exploits the generative prior for novel view synthesis, while preserving precise camera control. Moreover, the utilization of pixel-level LiDAR conditions allows us to make accurate pixel-level edits to target scenes. In addition, the generative prior of StreetCrafter can be effectively incorporated into dynamic scene representations to achieve real-time rendering. Experiments on Waymo Open Dataset and PandaSet demonstrate that our model enables flexible control over viewpoint changes and enlarges the region over which view synthesis yields satisfactory renderings, outperforming existing methods.

[182] CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers

Dimitrios Mallis, Ahmet Serdar Karadeniz, Sebastian Cavada, Danila Rukhovich, Niki Foteinopoulou, Kseniya Cherenkova, Anis Kacem, Djamila Aouada

Main category: cs.CV

TL;DR: CAD-Assistant is a general-purpose CAD agent that uses a Vision and Large Language Model with CAD-specific tools to process multimodal user queries and generate executable CAD commands through iterative execution on FreeCAD.

Details

Motivation: To create an AI assistant that can help with CAD design by understanding multimodal inputs and generating appropriate CAD operations, addressing the need for intelligent design assistance.

Method: Uses a Vision and Large Language Model as planner with tool-augmentation paradigm, equipped with CAD-specific tools including sketch parameterizer, rendering modules, cross-section generator, and executes actions iteratively on FreeCAD via Python API.

Result: Outperforms VLLM baselines and supervised task-specific methods on multiple CAD benchmarks, demonstrating strong performance in CAD design tasks.

Conclusion: Tool-augmented VLLMs show significant potential as general-purpose CAD solvers capable of handling diverse design workflows through iterative command generation and state adaptation.

Abstract: We propose CAD-Assistant, a general-purpose CAD agent for AI-assisted design. Our approach is based on a powerful Vision and Large Language Model (VLLM) as a planner and a tool-augmentation paradigm using CAD-specific tools. CAD-Assistant addresses multimodal user queries by generating actions that are iteratively executed on a Python interpreter equipped with the FreeCAD software, accessed via its Python API. Our framework is able to assess the impact of generated CAD commands on geometry and adapts subsequent actions based on the evolving state of the CAD design. We consider a wide range of CAD-specific tools including a sketch image parameterizer, rendering modules, a 2D cross-section generator, and other specialized routines. CAD-Assistant is evaluated on multiple CAD benchmarks, where it outperforms VLLM baselines and supervised task-specific methods. Beyond existing benchmarks, we qualitatively demonstrate the potential of tool-augmented VLLMs as general-purpose CAD solvers across diverse workflows.

[183] Survey on Monocular Metric Depth Estimation

Jiuling Zhang

Main category: cs.CV

TL;DR: Survey paper on Monocular Metric Depth Estimation (MMDE) that reviews evolution from geometry-based methods to deep learning models, analyzes key datasets and benchmarks, and evaluates methodological advances for producing depth maps with absolute scale.

Details

Motivation: Deep learning approaches for monocular depth estimation often predict only relative depth without consistent metric scale, reducing reliability in applications like visual SLAM, 3D modeling, and view synthesis. MMDE overcomes this by producing depth maps with absolute scale.

Method: Comprehensive review and analysis of MMDE methods, covering geometry-based approaches to state-of-the-art deep models. Examines key datasets (KITTI, NYU-D, ApolloScape, TartanAir), methodological advances including domain generalization, boundary preservation, synthetic-real data integration, unsupervised/semi-supervised learning, patch-based inference, architectural innovations, and generative modeling.

Result: Synthesizes current progress in MMDE, highlights importance of high-quality datasets, and identifies open challenges. Provides structured reference for advancing MMDE and supporting real-world computer vision applications.

Conclusion: This survey provides a comprehensive overview of MMDE evolution, methodological advances, and dataset analysis, serving as a valuable reference for researchers and practitioners to advance metric depth estimation and enable reliable deployment in real-world computer vision systems.

Abstract: Monocular Depth Estimation (MDE) enables spatial understanding, 3D reconstruction, and autonomous navigation, yet deep learning approaches often predict only relative depth without a consistent metric scale. This limitation reduces reliability in applications such as visual SLAM, precise 3D modeling, and view synthesis. Monocular Metric Depth Estimation (MMDE) overcomes this challenge by producing depth maps with absolute scale, ensuring geometric consistency and enabling deployment without additional calibration. This survey reviews the evolution of MMDE, from geometry-based methods to state-of-the-art deep models, with emphasis on the datasets that drive progress. Key benchmarks, including KITTI, NYU-D, ApolloScape, and TartanAir, are examined in terms of modality, scene type, and application domain. Methodological advances are analyzed, covering domain generalization, boundary preservation, and the integration of synthetic and real data. Techniques such as unsupervised and semi-supervised learning, patch-based inference, architectural innovations, and generative modeling are evaluated for their strengths and limitations. By synthesizing current progress, highlighting the importance of high-quality datasets, and identifying open challenges, this survey provides a structured reference for advancing MMDE and supporting its adoption in real-world computer vision systems.

[184] Single-Domain Generalized Object Detection by Balancing Domain Diversity and Invariance

Zhenwei He, Hongsu Ni

Main category: cs.CV

TL;DR: Proposes DIDM model that balances domain-specific diversity and invariance for single-domain generalization in object detection, addressing limitations of previous invariance-only approaches.

Details

Motivation: Existing single-domain generalization methods focus only on feature invariance, which causes loss of domain-specific information and incomplete feature representations. They also ignore domain-specific discrepancies, increasing training complexity.

Method: DIDM model with Diversity Learning Module (DLM) that preserves invariant semantics while enhancing domain-specific features using feature diversity loss, and Weighted Aligning Module (WAM) for cross-domain feature alignment while maintaining discriminative domain-specific information.

Result: Extensive experiments on multiple diverse datasets show superior performance compared to existing methods.

Conclusion: The proposed DIDM model successfully achieves harmonious integration of domain-specific diversity and domain invariance, overcoming limitations of previous invariance-driven approaches in single-domain generalization for object detection.

Abstract: Single-domain generalization for object detection (S-DGOD) seeks to transfer learned representations from a single source domain to unseen target domains. While recent approaches have primarily focused on achieving feature invariance, they ignore that domain diversity also presents significant challenges for the task. First, such invariance-driven strategies often lead to the loss of domain-specific information, resulting in incomplete feature representations. Second, cross-domain feature alignment forces the model to overlook domain-specific discrepancies, thereby increasing the complexity of the training process. To address these limitations, this paper proposes the Diversity Invariant Detection Model (DIDM), which achieves a harmonious integration of domain-specific diversity and domain invariance. Our key idea is to learn the invariant representations by keeping the inherent domain-specific features. Specifically, we introduce the Diversity Learning Module (DLM). This module limits the invariant semantics while explicitly enhancing domain-specific feature representation through a proposed feature diversity loss. Furthermore, to ensure cross-domain invariance without sacrificing diversity, we incorporate the Weighted Aligning Module (WAM) to enable feature alignment while maintaining the discriminative domain-specific information. Extensive experiments on multiple diverse datasets demonstrate the effectiveness of the proposed model, achieving superior performance compared to existing methods.

[185] PromptGAR: Flexible Promptive Group Activity Recognition

Zhangyu Jin, Andrew Feng, Ankur Chemburkar, Celso M. De Melo

Main category: cs.CV

TL;DR: PromptGAR is a flexible Group Activity Recognition framework that handles diverse visual prompts (boxes, keypoints, IDs) as point prompts without retraining, achieving high accuracy with actor consistency.

Details

Motivation: Existing GAR approaches have limited real-world applicability due to reliance on full prompt annotations, fixed frame/instance numbers, and lack of actor consistency for extended activities.

Method: Unifies diverse visual prompts as point prompts, uses recognition decoder for cross-updating class/prompt tokens, and introduces relative instance attention mechanism to encode instance identities for actor consistency.

Result: Achieves competitive performance on both full and partial prompt inputs, demonstrating effectiveness in input flexibility and generalization for real-world applications.

Conclusion: PromptGAR successfully bridges the gap in GAR by providing input flexibility across prompts, frames, and instances without retraining while maintaining high recognition accuracy and actor consistency.

Abstract: We present PromptGAR, a novel framework for Group Activity Recognition (GAR) that offers both input flexibility and high recognition accuracy. Existing approaches suffer from limited real-world applicability due to their reliance on full prompt annotations, fixed numbers of frames and instances, and the lack of actor consistency. To bridge the gap, we propose PromptGAR, which is the first GAR model to provide input flexibility across prompts, frames, and instances without the need for retraining. We leverage diverse visual prompts, like bounding boxes, skeletal keypoints, and instance identities, by unifying them as point prompts. A recognition decoder then cross-updates class and prompt tokens for enhanced performance. To ensure actor consistency for extended activity durations, we also introduce a relative instance attention mechanism that directly encodes instance identities. Comprehensive evaluations demonstrate that PromptGAR achieves competitive performance on both full and partial prompt inputs, establishing its effectiveness in input flexibility and generalization ability for real-world applications.

[186] Project-Probe-Aggregate: Efficient Fine-Tuning for Group Robustness

Beier Zhu, Jiequan Cui, Hanwang Zhang, Chi Zhang

Main category: cs.CV

TL;DR: PPA is a parameter-efficient fine-tuning method that addresses spurious correlations in image-text foundation models without needing group annotations, improving minority sample identification and robust training.

Details

Motivation: Image-text foundation models struggle with spurious correlations between inputs and labels, and existing debiasing methods often require group annotations which are costly to obtain.

Method: A three-step approach: 1) Train biased classifiers by projecting image features onto text encoder nullspace, 2) Infer group labels using biased classifier with prior correction, 3) Aggregate group weights to produce debiased classifier.

Result: Outperforms state-of-the-art methods in average worst-group accuracy while tuning less than 0.01% of the parameters and requiring no group labels during training.

Conclusion: PPA effectively mitigates spurious correlations in foundation models through improved minority group identification and achieves Bayes optimal performance for balanced group error minimization.

Abstract: While image-text foundation models have succeeded across diverse downstream tasks, they still face challenges in the presence of spurious correlations between the input and label. To address this issue, we propose a simple three-step approach, Project-Probe-Aggregate (PPA), that enables parameter-efficient fine-tuning for foundation models without relying on group annotations. Building upon the failure-based debiasing scheme, our method, PPA, improves its two key components: minority sample identification and the robust training algorithm. Specifically, we first train biased classifiers by projecting image features onto the nullspace of class proxies from text encoders. Next, we infer group labels using the biased classifier and probe group targets with prior correction. Finally, we aggregate group weights of each class to produce the debiased classifier. Our theoretical analysis shows that our PPA enhances minority group identification and is Bayes optimal for minimizing the balanced group error, mitigating spurious correlations. Extensive experimental results confirm the effectiveness of our PPA: it outperforms the state-of-the-art in average worst-group accuracy while requiring less than 0.01% tunable parameters and no group labels during training.
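
The projection step in the abstract can be illustrated with a small sketch. It assumes the class proxies are plain vectors and removes from an image feature every component lying in their span, which is one standard way to project onto the complement of a set of directions. The function names and toy vectors are hypothetical and are not taken from the paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, eps=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis

def project_out_proxies(feature, class_proxies):
    """Remove from `feature` every component in the span of the proxies."""
    out = list(feature)
    for b in orthonormal_basis(class_proxies):
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out

# Toy data (hypothetical): two class proxies spanning the x-y plane.
proxies = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
feature = [0.5, -2.0, 3.0]
projected = project_out_proxies(feature, proxies)
```

After projection the feature is orthogonal to every proxy, so a classifier trained on it can only rely on information the class proxies do not carry, which is the intuition behind training a deliberately biased classifier.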

[187] Neuro Symbolic Knowledge Reasoning for Procedural Video Question Answering

Thanh-Son Nguyen, Hong Yang, Tzeh Yuan Neoh, Hao Zhang, Ee Yeo Keat, Basura Fernando

Main category: cs.CV

TL;DR: PKR-QA is a new benchmark for procedural knowledge reasoning QA, built using a procedural knowledge graph from instructional videos and enriched with commonsense knowledge, with a neurosymbolic approach for interpretable reasoning.

Details

Motivation: To address the need for structured reasoning over procedural tasks and enable interpretable question answering that requires step-by-step procedural knowledge.

Method: Semi-automatic construction of procedural knowledge graph (PKG) from COIN dataset, ConceptNet, and LLM outputs with manual verification. Uses graph traversal templates for QA generation and proposes Knowledge Module Learning (KML) - a neurosymbolic approach combining neural modules with LLM-based structured reasoning.

Result: The approach improves reasoning performance on PKR-QA benchmark and enables step-by-step reasoning traces that facilitate interpretability.

Conclusion: PKR-QA provides a valuable benchmark for procedural knowledge reasoning, and the neurosymbolic KML approach effectively combines neural learning with symbolic reasoning for interpretable procedural task understanding.

Abstract: We introduce PKR-QA (Procedural Knowledge Reasoning Question Answering), a new benchmark for question answering over procedural tasks that require structured reasoning. PKR-QA is constructed semi-automatically using a procedural knowledge graph (PKG), which encodes task-specific knowledge across diverse domains. The PKG is built by curating and linking information from the COIN instructional video dataset and the ontology, enriched with commonsense knowledge from ConceptNet and structured outputs from Large Language Models (LLMs), followed by manual verification. To generate question-answer pairs, we design graph traversal templates where each template is applied systematically over the PKG. To enable interpretable reasoning, we propose a neurosymbolic approach called Knowledge Module Learning (KML), which learns procedural relations via neural modules and composes them for structured reasoning with LLMs. Experiments demonstrate that this paradigm improves reasoning performance on PKR-QA and enables step-by-step reasoning traces that facilitate interpretability. Code and dataset will be released soon at https://github.com/LUNAProject22/KML.
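
A minimal sketch of how a graph traversal template might turn knowledge-graph edges into question-answer pairs. The triples, the relation name `next_step`, and the question wording are all hypothetical; the abstract does not specify the PKG schema or the actual templates.

```python
# Hypothetical PKG fragment stored as (head, relation, tail) triples.
pkg = [
    ("make coffee", "has_step", "boil water"),
    ("make coffee", "has_step", "grind beans"),
    ("boil water", "next_step", "grind beans"),
    ("grind beans", "next_step", "pour water"),
]

def next_step_questions(triples):
    """Apply one template to every edge with the (assumed) `next_step` relation."""
    qa_pairs = []
    for head, relation, tail in triples:
        if relation == "next_step":
            question = f"Which step follows '{head}'?"
            qa_pairs.append((question, tail))
    return qa_pairs

qa = next_step_questions(pkg)
```

Because each pair is generated from an explicit edge, the answer can be traced back to the graph, which is what makes template-based generation easy to verify manually.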

[188] Faster Parameter-Efficient Tuning with Token Redundancy Reduction

Kwonyoung Kim, Jungin Park, Jin Kim, Hyeongjun Kwon, Kwanghoon Sohn

Main category: cs.CV

TL;DR: FPET is a parameter-efficient tuning method that improves inference speed and training efficiency while maintaining storage efficiency through token redundancy reduction.

Details

Motivation: Current PET methods inherit the inference latency of large backbone models and often introduce additional computational overhead from extra modules, limiting their practicality for compute-intensive applications.

Method: Introduces a plug-and-play token redundancy reduction module that refines tokens from self-attention layers using an adapter to learn token similarity, and uses a fully-differentiable token merging strategy with straight-through estimator for optimal token reduction.

Result: FPET achieves faster inference and higher memory efficiency than the pre-trained backbone while maintaining competitive performance comparable to state-of-the-art PET methods.

Conclusion: FPET successfully addresses the computational overhead limitations of traditional PET methods by enhancing inference speed and training efficiency without sacrificing storage efficiency or performance.

Abstract: Parameter-efficient tuning (PET) aims to transfer pre-trained foundation models to downstream tasks by learning a small number of parameters. Compared to traditional fine-tuning, which updates the entire model, PET significantly reduces storage and transfer costs for each task regardless of exponentially increasing pre-trained model capacity. However, most PET methods inherit the inference latency of their large backbone models and often introduce additional computational overhead due to additional modules (e.g. adapters), limiting their practicality for compute-intensive applications. In this paper, we propose Faster Parameter-Efficient Tuning (FPET), a novel approach that enhances inference speed and training efficiency while maintaining high storage efficiency. Specifically, we introduce a plug-and-play token redundancy reduction module delicately designed for PET. This module refines tokens from the self-attention layer using an adapter to learn the accurate similarity between tokens and cuts off the tokens through a fully-differentiable token merging strategy, which uses a straight-through estimator for optimal token reduction. Experimental results prove that our FPET achieves faster inference and higher memory efficiency than the pre-trained backbone while keeping competitive performance on par with state-of-the-art PET methods.
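
To make the idea of token reduction concrete, here is a simplified sketch that greedily merges the most similar pair of tokens by averaging them. It is an illustration only: it uses raw cosine similarity, whereas the abstract describes an adapter that learns the similarity and a fully differentiable merging strategy with a straight-through estimator, neither of which is reproduced here. All names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def merge_most_similar(tokens, num_merges):
    """Greedily average the most similar token pair, `num_merges` times."""
    tokens = [list(t) for t in tokens]
    for _ in range(num_merges):
        if len(tokens) < 2:
            break
        best = None  # (similarity, i, j) of the best pair found so far
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens)):
                s = cosine(tokens[i], tokens[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        merged = [(a + b) / 2 for a, b in zip(tokens[i], tokens[j])]
        # Drop the two originals and append their average.
        tokens = [t for k, t in enumerate(tokens) if k not in (i, j)] + [merged]
    return tokens

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
reduced = merge_most_similar(tokens, num_merges=1)
```

Each merge shortens the sequence by one token, so later layers process fewer tokens; that is the source of the inference speed-up the abstract targets.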

[189] LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification

Xiang Hu, Yuhao Wang, Pingping Zhang, Huchuan Lu

Main category: cs.CV

TL;DR: LATex framework uses prompt-tuning with CLIP to leverage attribute-based text knowledge for aerial-ground person re-identification, improving performance while reducing training costs.

Details

Motivation: Existing AG-ReID methods overlook semantic attribute information and rely on expensive full fine-tuning of large models, creating a need for more efficient and semantically-aware approaches.

Method: Uses CLIP as backbone with Attribute-aware Image Encoder (AIE) to extract features, Prompted Attribute Classifier Group (PACG) for attribute prediction, and Coupled Prompt Template (CPT) to generate structured sentences from attributes and view information.

Result: Extensive experiments on three AG-ReID benchmarks demonstrate the effectiveness of the proposed methods in improving re-identification performance.

Conclusion: The LATex framework successfully leverages attribute-based text knowledge through prompt-tuning strategies to achieve better AG-ReID performance with reduced training costs compared to full fine-tuning approaches.

Abstract: As an important task in intelligent transportation systems, Aerial-Ground person Re-IDentification (AG-ReID) aims to retrieve specific persons across heterogeneous cameras in different viewpoints. Previous methods typically adopt deep learning-based models, focusing on extracting view-invariant features. However, they usually overlook the semantic information in person attributes. In addition, existing training strategies often rely on full fine-tuning large-scale models, which significantly increases training costs. To address these issues, we propose a novel framework named LATex for AG-ReID, which adopts prompt-tuning strategies to leverage attribute-based text knowledge. More specifically, we first introduce the Contrastive Language-Image Pre-training (CLIP) model as the backbone, and propose an Attribute-aware Image Encoder (AIE) to extract both global semantic features and attribute-aware features from input images. Then, with these features, we propose a Prompted Attribute Classifier Group (PACG) to predict person attributes and obtain attribute representations. Finally, we design a Coupled Prompt Template (CPT) to transform attribute representations and view information into structured sentences. These sentences are processed by the text encoder of CLIP to generate more discriminative features. As a result, our framework can fully leverage attribute-based text knowledge to improve AG-ReID performance. Extensive experiments on three AG-ReID benchmarks demonstrate the effectiveness of our proposed methods. The source code will be available.

[190] M$^2$IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering

Yanshu Li, Yi Cao, Hongyang He, Qisen Cheng, Xiang Fu, Xi Xiao, Tianyang Wang, Ruixiang Tang

Main category: cs.CV

TL;DR: M²IV is a representation engineering approach that replaces token-intensive multimodal demonstrations with learnable vectors injected into LVLMs, improving performance while reducing computational overhead.

Details

Motivation: Multimodal in-context learning is constrained by token-intensive inputs and complex cross-modal reasoning, hindering LVLMs from effectively extracting patterns from demonstrations.

Method: Proposes M²IV - learnable Multimodal In-context Vectors injected into residual streams, with training strategy analyzing MHA and MLP roles for semantic distillation and cross-modal representation learning.

Result: Achieves average 3.74% accuracy gain over vanilla ICL, reduces token overhead, enables many-shot scaling, and outperforms prior representation engineering baselines.

Conclusion: M²IV provides an efficient and effective alternative to traditional multimodal ICL, with VLibrary enabling flexible customization of pre-trained LVLMs for diverse requirements.

Abstract: Multimodal in-context learning (ICL) equips Large Vision-language Models (LVLMs) with the ability to adapt to new tasks via multiple user-provided demonstrations, without requiring any model parameter updates. However, its effectiveness is constrained by the token-intensive nature of multimodal inputs and the complexity of cross-modal few-shot reasoning, which together hinder LVLMs from extracting useful patterns from demonstrations. To address these challenges, we propose M$^2$IV, a novel representation engineering approach that replaces explicit token-level demonstrations with a set of learnable Multimodal In-context Vectors directly injected into the residual streams of LVLMs. By analyzing the distinct roles of multi-head attention (MHA) and multi-layer perceptrons (MLP) in the ICL process, we design a training strategy that enables M$^2$IV to perform fine-grained semantic distillation and robust cross-modal representation learning. M$^2$IV not only improves performance across diverse tasks and LVLMs but also significantly reduces token overhead, enabling graceful scaling to many-shot scenarios. To further enhance usability, we introduce VLibrary, a repository that stores trained M$^2$IVs for flexible retrieval and injection. With VLibrary, users can steer pre-trained LVLMs in a customized manner that meets diverse requirements. Extensive experiments demonstrate that M$^2$IV consistently outperforms vanilla ICL and prior representation engineering baselines, achieving an average accuracy gain of 3.74% with substantial improvements in overall efficiency.
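
The core operation, adding a vector to the residual stream instead of prepending demonstration tokens, can be sketched as follows. The scaling factor `alpha` and the function name are assumptions for illustration; the abstract does not state how the vectors are weighted or at which positions they are applied.

```python
def inject(hidden_states, vector, alpha=1.0):
    """Add `alpha * vector` to the hidden state of every token position.

    hidden_states: list of per-token hidden vectors for one layer.
    vector: one in-context vector for that layer (learned in the paper,
            hard-coded here).
    alpha: hypothetical scaling factor.
    """
    return [[h + alpha * v for h, v in zip(token, vector)]
            for token in hidden_states]

hidden = [[1.0, 2.0], [3.0, 4.0]]   # two tokens, hidden size 2
steering = [0.5, -1.0]              # toy in-context vector
shifted = inject(hidden, steering, alpha=2.0)
```

The sequence length is unchanged by the injection, which is why this style of steering avoids the token overhead of explicit demonstrations.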

[191] A Hybrid Fully Convolutional CNN-Transformer Model for Inherently Interpretable Disease Detection from Retinal Fundus Images

Kerol Djoumessi, Samuel Ofosu Mensah, Philipp Berens

Main category: cs.CV

TL;DR: Interpretable hybrid CNN-Transformer architecture for retinal disease detection that generates faithful evidence maps directly reflecting model decisions, achieving state-of-the-art performance.

Details

Motivation: Hybrid CNN-Transformer models combine local feature extraction and global dependencies but lack interpretability, which is crucial for medical imaging applications where understanding model decisions is essential.

Method: Developed an interpretable-by-design hybrid fully convolutional CNN-Transformer architecture that generates class-specific sparse evidence maps in a single forward pass, unlike post-hoc saliency methods.

Result: Achieves state-of-the-art predictive performance on retinal disease detection tasks using color fundus images, outperforming both black-box and interpretable models while providing faithful localized evidence maps.

Conclusion: The proposed architecture successfully combines the strengths of CNNs and Transformers while maintaining interpretability, making it suitable for medical imaging applications where both performance and transparency are critical.

Abstract: In many medical imaging tasks, convolutional neural networks (CNNs) efficiently extract local features hierarchically. More recently, vision transformers (ViTs) have gained popularity, using self-attention mechanisms to capture global dependencies, but lacking the inherent spatial localization of convolutions. Therefore, hybrid models combining CNNs and ViTs have been developed to combine the strengths of both architectures. However, such hybrid models are difficult to interpret, which hinders their application in medical imaging. In this work, we introduce an interpretable-by-design hybrid fully convolutional CNN-Transformer architecture for retinal disease detection. Unlike widely used post-hoc saliency methods for ViTs, our approach generates faithful and localized evidence maps that directly reflect the model’s decision process. We evaluated our method on two medical tasks focused on disease detection using color fundus images. Our model achieves state-of-the-art predictive performance compared to black-box and interpretable models and provides class-specific sparse evidence maps in a single forward pass. The code is available at: https://github.com/kdjoumessi/Self-Explainable-CNN-Transformer.

[192] ForgetMe: Evaluating Selective Forgetting in Generative Models

Zhenyu Yu, Mohd Yamani Inda Idris, Pei Wang

Main category: cs.CV

TL;DR: Proposes ForgetMe dataset and Entangled metric for evaluating selective unlearning in diffusion models, addressing privacy concerns by removing sensitive information while preserving non-sensitive regions.

Details

Motivation: Growing need for privacy-compliant unlearning in diffusion models due to their widespread use in image generation, with existing methods struggling to selectively remove sensitive information without affecting non-sensitive areas.

Method: Automatic Dataset Creation Framework using prompt-based layered editing and training-free local feature removal to construct ForgetMe dataset, with LoRA fine-tuning on Stable Diffusion for selective unlearning and Entangled metric for evaluation.

Result: Created diverse ForgetMe dataset covering real and synthetic scenarios (CUB-200-2011, Stanford-Dogs, ImageNet, synthetic cats) and validated effectiveness of both dataset and Entangled metric as benchmarks for selective unlearning.

Conclusion: Provides scalable and adaptable solution for privacy-preserving generative AI through comprehensive dataset and evaluation metric for selective unlearning in diffusion models.

Abstract: The widespread adoption of diffusion models in image generation has increased the demand for privacy-compliant unlearning. However, due to the high-dimensional nature and complex feature representations of diffusion models, achieving selective unlearning remains challenging, as existing methods struggle to remove sensitive information while preserving the consistency of non-sensitive regions. To address this, we propose an Automatic Dataset Creation Framework based on prompt-based layered editing and training-free local feature removal, constructing the ForgetMe dataset and introducing the Entangled evaluation metric. The Entangled metric quantifies unlearning effectiveness by assessing the similarity and consistency between the target and background regions and supports both paired (Entangled-D) and unpaired (Entangled-S) image data, enabling unsupervised evaluation. The ForgetMe dataset encompasses a diverse set of real and synthetic scenarios, including CUB-200-2011 (Birds), Stanford-Dogs, ImageNet, and a synthetic cat dataset. We apply LoRA fine-tuning on Stable Diffusion to achieve selective unlearning on this dataset and validate the effectiveness of both the ForgetMe dataset and the Entangled metric, establishing them as benchmarks for selective unlearning. Our work provides a scalable and adaptable solution for advancing privacy-preserving generative AI.
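
For readers unfamiliar with LoRA, the fine-tuning technique named in the abstract, the sketch below shows the standard low-rank update on a single linear layer: the frozen weight W is augmented by a scaled product of two small matrices B and A. The shapes and values are toy assumptions and are unrelated to the Stable Diffusion configuration used in the paper.

```python
def matvec(matrix, x):
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / r) * B (A x), where r is the LoRA rank.

    W: frozen weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    """
    rank = len(A)
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]            # rank 1
x = [2.0, 3.0]

# With B initialised to zero the adapted layer equals the frozen layer.
y_init = lora_forward(W, A, [[0.0], [0.0]], x)
# After (hypothetical) training, B is non-zero and shifts the output.
y_tuned = lora_forward(W, A, [[0.5], [-0.5]], x)
```

Only A and B are trained, so the number of tunable parameters scales with the rank rather than with the size of W.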

[193] Video CLIP Model for Multi-View Echocardiography Interpretation

Ryo Takizawa, Satoshi Kodera, Tempei Kabayama, Ryo Matsuoka, Yuta Ando, Yuto Nakamura, Haruki Settai, Norihiko Takeda

Main category: cs.CV

TL;DR: A video-language model for echocardiography that processes full video sequences from five standard views, trained on 60k+ video-report pairs to improve diagnostic accuracy through motion analysis and multi-view support.

Details

Motivation: Existing medical vision-language models rely on single-frame inputs, which reduces diagnostic accuracy for conditions requiring cardiac motion analysis. Echocardiographic videos from multiple views vary in suitability for detecting specific conditions.

Method: Developed a video-language model that processes full video sequences from five standard echocardiographic views, trained on 60,747 video-report pairs. Evaluated gains from video input and multi-view support with various pretrained models.

Result: The model demonstrates improved retrieval performance by leveraging video sequences and multi-view inputs compared to single-frame approaches.

Conclusion: Processing full video sequences from multiple standard views yields gains in retrieval performance, consistent with the premise that cardiac motion and complementary view-specific information aid echocardiographic interpretation.

Abstract: Echocardiography records ultrasound videos of the heart, enabling clinicians to assess cardiac function. Recent advances in large-scale vision-language models (VLMs) have spurred interest in automating echocardiographic interpretation. However, most existing medical VLMs rely on single-frame (image) inputs, which can reduce diagnostic accuracy for conditions identifiable only through cardiac motion. In addition, echocardiographic videos are captured from multiple views, each varying in suitability for detecting specific conditions. Leveraging multiple views may therefore improve diagnostic performance. We developed a video-language model that processes full video sequences from five standard views, trained on 60,747 echocardiographic video-report pairs. We evaluated the gains in retrieval performance from video input and multi-view support, including the contributions of various pretrained models.
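
Retrieval performance for a CLIP-style model is commonly summarised with Recall@k over a video-to-report similarity matrix. The sketch below assumes that metric and a toy similarity matrix; the abstract does not say which retrieval metric the authors report.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching item is among the top-k results.

    similarity[i][j] is the score between video i and report j; the
    correct report for video i is assumed to be report i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Toy scores (hypothetical): video 1 is confused with report 0.
similarity = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
]
r1 = recall_at_k(similarity, 1)
r2 = recall_at_k(similarity, 2)
```

Comparing such scores for single-frame versus video input, or single-view versus multi-view input, is one way to quantify the gains the abstract refers to.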

[194] WMKA-Net: A Weighted Multi-Kernel Attention Network for Retinal Vessel Segmentation

Xinran Xu, Yuliang Ma, Sifu Cai, Ruoyan Shi

Main category: cs.CV

TL;DR: A dual-stage retinal vessel segmentation model called WMKA-Net addresses multi-scale feature fusion, contextual continuity, and noise issues using reversible multi-scale fusion and vascular-oriented attention mechanisms, achieving state-of-the-art performance on benchmark datasets.

Details

Motivation: Retinal vessel segmentation faces challenges including insufficient multi-scale feature fusion, disruption of contextual continuity, and noise interference, which are crucial for intelligent ophthalmic diagnosis and early screening of diabetic retinopathy.

Method: Two-stage approach: 1) Reversible Multi-Scale Fusion Module (RMS) with hierarchical adaptive convolution for dynamic cross-scale feature merging and bias calibration; 2) Vascular-Oriented Attention Mechanism with axial pathway for long-distance continuity modeling and bifurcation attention pathway for topological key node capture.

Result: Achieved accuracy of 0.9909, sensitivity of 0.9198, and specificity of 0.9953 on DRIVE, STARE, and CHASE-DB1 datasets, significantly outperforming existing methods.

Conclusion: WMKA-Net provides an efficient, precise, and robust intelligent solution for retinal vessel segmentation, particularly beneficial for early diabetic retinopathy screening through improved vascular structure continuity restoration and complex network segmentation.

Abstract: Retinal vessel segmentation is crucial for intelligent ophthalmic diagnosis, yet it faces three major challenges: insufficient multi-scale feature fusion, disruption of contextual continuity, and noise interference. This study proposes a dual-stage solution to address these issues. The first stage employs a Reversible Multi-Scale Fusion Module (RMS) that uses hierarchical adaptive convolution to dynamically merge cross-scale features from capillaries to main vessels, self-adaptively calibrating feature biases. The second stage introduces a Vascular-Oriented Attention Mechanism, which models long-distance vascular continuity through an axial pathway and enhances the capture of topological key nodes, such as bifurcation points, via a dedicated bifurcation attention pathway. The synergistic operation of these two pathways effectively restores the continuity of vascular structures and improves the segmentation accuracy of complex vascular networks. Systematic experiments on the DRIVE, STARE, and CHASE-DB1 datasets demonstrate that WMKA-Net achieves an accuracy of 0.9909, sensitivity of 0.9198, and specificity of 0.9953, significantly outperforming existing methods. This model provides an efficient, precise, and robust intelligent solution for the early screening of diabetic retinopathy.
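
The three reported metrics follow directly from pixel-level confusion counts. The sketch below computes them for binary masks; the toy masks are hypothetical and only show how the numbers are derived.

```python
def segmentation_metrics(pred, truth):
    """Accuracy, sensitivity and specificity for binary masks (1 = vessel)."""
    tp = fp = tn = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif not p and t:
                fn += 1
            else:
                tn += 1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # recall on vessel pixels
    specificity = tn / (tn + fp) if tn + fp else 0.0   # recall on background
    return accuracy, sensitivity, specificity

truth = [[1, 1, 0, 0],
         [0, 0, 0, 0]]
pred = [[1, 0, 0, 0],
        [0, 1, 0, 0]]
accuracy, sensitivity, specificity = segmentation_metrics(pred, truth)
```

Because vessels occupy a small fraction of a fundus image, accuracy and specificity tend to be high even for weak models, so sensitivity is usually the most informative of the three.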

[195] Decoupled Global-Local Alignment for Improving Compositional Understanding

Xiaoxing Hu, Kaicheng Yang, Jun Wang, Haoran Xu, Ziyong Feng, Yupei Wang

Main category: cs.CV

TL;DR: DeGLA framework improves CLIP’s compositional understanding while preserving general capabilities through self-distillation and novel contrastive losses using LLM-generated negative samples.

Details

Motivation: CLIP's global contrastive learning limits its ability to understand compositional concepts like relations and attributes, and existing methods that use hard negative samples compromise the model's general capabilities.

Method: Proposes Decoupled Global-Local Alignment (DeGLA) with self-distillation to retain pretrained knowledge, uses LLMs to generate 2M high-quality negative captions, and introduces Image-Grounded Contrast (IGC) and Text-Grounded Contrast (TGC) losses for compositional enhancement.

Result: Achieves 3.5% average improvement on VALSE, SugarCrepe, and ARO benchmarks, and 13.0% average improvement on zero-shot classification across 11 datasets compared to previous state-of-the-art methods.

Conclusion: DeGLA effectively enhances compositional understanding while mitigating catastrophic forgetting of pretrained knowledge, demonstrating superior performance on both compositional reasoning and general zero-shot tasks.

Abstract: Contrastive Language-Image Pre-training (CLIP) has achieved success on multiple downstream tasks by aligning image and text modalities. However, the nature of global contrastive learning limits CLIP’s ability to comprehend compositional concepts, such as relations and attributes. Although recent studies employ global hard negative samples to improve compositional understanding, these methods significantly compromise the model’s inherent general capabilities by forcibly distancing textual negative samples from images in the embedding space. To overcome this limitation, we introduce a Decoupled Global-Local Alignment (DeGLA) framework that improves compositional understanding while substantially mitigating losses in general capabilities. To optimize the retention of the model’s inherent capabilities, we incorporate a self-distillation mechanism within the global alignment process, aligning the learnable image-text encoder with a frozen teacher model derived from an exponential moving average. Under the constraint of self-distillation, it effectively mitigates the catastrophic forgetting of pretrained knowledge during fine-tuning. To improve compositional understanding, we first leverage the in-context learning capability of Large Language Models (LLMs) to construct about 2M high-quality negative captions across five types. Subsequently, we propose the Image-Grounded Contrast (IGC) loss and Text-Grounded Contrast (TGC) loss to enhance vision-language compositionality. Extensive experimental results demonstrate the effectiveness of the DeGLA framework. Compared to previous state-of-the-art methods, DeGLA achieves an average enhancement of 3.5% across the VALSE, SugarCrepe, and ARO benchmarks. Concurrently, it obtains an average performance improvement of 13.0% on zero-shot classification tasks across eleven datasets. Our code will be released at https://github.com/xiaoxing2001/DeGLA
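
The teacher in the self-distillation step is described as an exponential moving average (EMA) of the learnable encoder. A minimal sketch of the standard EMA update follows; the momentum value and the flat parameter dictionary are assumptions for illustration.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return teacher parameters moved slightly toward the student.

    teacher, student: dicts mapping parameter names to floats.
    momentum: fraction of the old teacher value that is kept (assumed).
    """
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}

teacher = {"w": 1.0}
student = {"w": 0.0}
step1 = ema_update(teacher, student, momentum=0.9)
step2 = ema_update(step1, student, momentum=0.9)
```

Because the teacher changes slowly, it stays close to the pretrained weights early in fine-tuning, which is what lets it act as an anchor against catastrophic forgetting.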

Kai Cui, Jia Li, Yu Liu, Xuesong Zhang, Zhenzhen Hu, Meng Wang

Main category: cs.CV

TL;DR: PhysioSync is a novel pre-training framework that uses temporal and cross-modal contrastive learning to improve EEG-based emotion recognition by modeling dynamic synchronization between EEG and peripheral physiological signals at different time resolutions.

Details

Motivation: EEG signals are noisy and vary across individuals, making emotion recognition challenging. Existing multimodal approaches often overlook dynamic synchronization and consistent semantics between modalities, and temporal dynamics of emotional fluctuations in peripheral physiological signals remain underexplored.

Method: Proposes PhysioSync framework with Cross-Modal Consistency Alignment (CM-CA) to model dynamic relationships between EEG and PPS, and Long- and Short-Term Temporal Contrastive Learning (LS-TCL) to capture emotional synchronization at different temporal resolutions. After pre-training, features are hierarchically fused and fine-tuned.

Result: Experiments on DEAP and DREAMER datasets demonstrate PhysioSync’s advanced performance under both uni-modal and cross-modal conditions for EEG-centered emotion recognition.

Conclusion: PhysioSync effectively addresses the challenges of EEG-based emotion recognition by leveraging physiological synchronization phenomena through contrastive learning, showing superior performance in capturing cross-modal and temporal emotional dynamics.

Abstract: Electroencephalography (EEG) signals provide a promising and involuntary reflection of brain activity related to emotional states, offering significant advantages over behavioral cues like facial expressions. However, EEG signals are often noisy, affected by artifacts, and vary across individuals, complicating emotion recognition. While multimodal approaches have used Peripheral Physiological Signals (PPS) like GSR to complement EEG, they often overlook the dynamic synchronization and consistent semantics between the modalities. Additionally, the temporal dynamics of emotional fluctuations across different time resolutions in PPS remain underexplored. To address these challenges, we propose PhysioSync, a novel pre-training framework leveraging temporal and cross-modal contrastive learning, inspired by physiological synchronization phenomena. PhysioSync incorporates Cross-Modal Consistency Alignment (CM-CA) to model dynamic relationships between EEG and complementary PPS, enabling emotion-related synchronizations across modalities. Besides, it introduces Long- and Short-Term Temporal Contrastive Learning (LS-TCL) to capture emotional synchronization at different temporal resolutions within modalities. After pre-training, cross-resolution and cross-modal features are hierarchically fused and fine-tuned to enhance emotion recognition. Experiments on DEAP and DREAMER datasets demonstrate PhysioSync’s advanced performance under uni-modal and cross-modal conditions, highlighting its effectiveness for EEG-centered emotion recognition.
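
Cross-modal contrastive learning is often implemented with an InfoNCE-style loss that pulls paired embeddings together and pushes mismatched ones apart. The sketch below assumes that loss, cosine similarity, and a temperature of 0.1; the abstract does not give the exact formulation of CM-CA or LS-TCL.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; positives[i] is the match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings (hypothetical): EEG and peripheral-signal windows.
eeg = [[1.0, 0.0], [0.0, 1.0]]
pps_aligned = [[1.0, 0.0], [0.0, 1.0]]
pps_shuffled = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(eeg, pps_aligned)
loss_shuffled = info_nce(eeg, pps_shuffled)
```

The same function could be applied within one modality by pairing long and short windows of the same recording, which is one plausible reading of the temporal contrastive component.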

[197] RAFT: Robust Augmentation of FeaTures for Image Segmentation

Edward Humes, Xiaomin Lin, Uttej Kallakuri, Tinoosh Mohsenin

Main category: cs.CV

TL;DR: RAFT is a novel framework that improves synthetic-to-real image segmentation performance using minimal labeled real data through data/feature augmentations and active learning, achieving state-of-the-art results on multiple benchmarks.

Details

Motivation: Real-world deployment of image segmentation models is hindered by the need for high-quality labeled datasets and the Syn2Real problem where models trained on synthetic data perform poorly on real data.

Method: Proposes RAFT framework combining data augmentation, feature augmentation, and active learning to adapt image segmentation models using minimal labeled real-world data.

Result: Achieved improvements of 2.1%/79.9% mIoU on SYNTHIA->Cityscapes, 0.4%/78.2% on GTAV->Cityscapes, and 1.3%/73.2% on Cityscapes->ACDC benchmarks, surpassing previous state-of-the-art HALO.

Conclusion: RAFT effectively bridges the synthetic-to-real gap in image segmentation with minimal labeled real data and outperforms existing methods across multiple domain adaptation benchmarks.

Abstract: Image segmentation is a powerful computer vision technique for scene understanding. However, real-world deployment is stymied by the need for high-quality, meticulously labeled datasets. Synthetic data provides high-quality labels while reducing the need for manual data collection and annotation. However, deep neural networks trained on synthetic data often face the Syn2Real problem, leading to poor performance in real-world deployments. To mitigate the aforementioned gap in image segmentation, we propose RAFT, a novel framework for adapting image segmentation models using minimal labeled real-world data through data and feature augmentations, as well as active learning. To validate RAFT, we perform experiments on the synthetic-to-real “SYNTHIA->Cityscapes” and “GTAV->Cityscapes” benchmarks. We managed to surpass the previous state of the art, HALO. SYNTHIA->Cityscapes experiences an improvement in mIoU* upon domain adaptation of 2.1%/79.9%, and GTAV->Cityscapes experiences a 0.4%/78.2% improvement in mIoU. Furthermore, we test our approach on the real-to-real benchmark of “Cityscapes->ACDC”, and again surpass HALO, with a gain in mIoU upon adaptation of 1.3%/73.2%. Finally, we examine the effect of the allocated annotation budget and various components of RAFT upon the final transfer mIoU.
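
The results above are reported in mIoU (mean intersection over union). As a reminder of how that metric is formed, here is a minimal sketch over flat label lists. The function name and the toy labels are assumptions, and benchmark-specific conventions such as ignored classes are not modelled.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in either pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Toy per-pixel labels (assumed), flattened to one dimension.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

On these toy labels the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.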

[198] EVM-Fusion: An Explainable Vision Mamba Architecture with Neural Algorithmic Fusion

Zichuan Yang, Yongzhi Wang

Main category: cs.CV

TL;DR: EVM-Fusion is an explainable Vision Mamba architecture with Neural Algorithmic Fusion that achieves 99.75% accuracy on multi-organ medical image classification while providing interpretable decision-making insights.

Details

Motivation: Medical image classification requires high accuracy, interpretability, and generalizability for clinical decision-making, but existing methods struggle to meet all these demands simultaneously.

Method: Uses a multipath design with DenseNet and U-Net pathways enhanced by Vision Mamba modules, combined with traditional features. Features are integrated through cross-modal attention and iterative Neural Algorithmic Fusion block for adaptive fusion. Includes multiple explainability mechanisms: path-specific spatial attention, Vim delta-value maps, SE-attention, and cross-modal attention weights.

Result: Achieved 99.75% test accuracy on a diverse 9-class multi-organ medical image dataset, demonstrating strong classification performance with multi-faceted interpretability.

Conclusion: EVM-Fusion shows strong potential for trustworthy AI in medical diagnostics by combining high accuracy with intrinsic explainability through its novel fusion mechanism and multiple interpretability features.

Abstract: Medical image classification is critical for clinical decision-making, yet demands for accuracy, interpretability, and generalizability remain challenging. This paper introduces EVM-Fusion, an Explainable Vision Mamba architecture featuring a novel Neural Algorithmic Fusion (NAF) mechanism for multi-organ medical image classification. EVM-Fusion leverages a multipath design, where DenseNet and U-Net based pathways, enhanced by Vision Mamba (Vim) modules, operate in parallel with a traditional feature pathway. These diverse features are dynamically integrated via a two-stage fusion process: cross-modal attention followed by the iterative NAF block, which learns an adaptive fusion algorithm. Intrinsic explainability is embedded through path-specific spatial attention, Vim Δ-value maps, traditional feature SE-attention, and cross-modal attention weights. Experiments on a diverse 9-class multi-organ medical image dataset demonstrate EVM-Fusion’s strong classification performance, achieving 99.75% test accuracy, and provide multi-faceted insights into its decision-making process, highlighting its potential for trustworthy AI in medical diagnostics.

[199] MonoCoP: Chain-of-Prediction for Monocular 3D Object Detection

Zhihao Zhang, Abhinav Kumar, Girish Chandar Ganesan, Xiaoming Liu

Main category: cs.CV

TL;DR: MonoCoP introduces a Chain-of-Prediction approach for monocular 3D object detection, sequentially predicting 3D attributes with conditioning on previous predictions to improve depth accuracy and overall performance.

Details

Motivation: Existing monocular 3D detection methods overlook the inter-correlation between 3D attributes through 3D-to-2D projection, limiting accuracy and stability. Accurate depth prediction requires conditioning on other 3D attributes.

Method: Proposes MonoCoP with three key designs: 1) Lightweight AttributeNet for each 3D attribute, 2) Explicit chain to propagate learned features between attributes, 3) Residual connections to aggregate features ensuring later predictions condition on all previous attributes.

Result: Achieves state-of-the-art performance on KITTI leaderboard without additional data, and surpasses existing methods on Waymo and nuScenes frontal datasets.

Conclusion: The Chain-of-Prediction approach effectively addresses the inter-correlation of 3D attributes, significantly improving monocular 3D object detection accuracy and stability.

Abstract: Accurately predicting 3D attributes is crucial for monocular 3D object detection (Mono3D), with depth estimation posing the greatest challenge due to the inherent ambiguity in mapping 2D images to 3D space. While existing methods leverage multiple depth cues (e.g., estimating depth uncertainty, modeling depth error) to improve depth accuracy, they overlook that accurate depth prediction requires conditioning on other 3D attributes, as these attributes are intrinsically inter-correlated through the 3D to 2D projection, which ultimately limits overall accuracy and stability. Inspired by Chain-of-Thought (CoT) in large language models (LLMs), this paper proposes MonoCoP, which leverages a Chain-of-Prediction (CoP) to predict attributes sequentially and conditionally via three key designs. First, it employs a lightweight AttributeNet (AN) for each 3D attribute to learn attribute-specific features. Next, MonoCoP constructs an explicit chain to propagate these learned features from one attribute to the next. Finally, MonoCoP uses a residual connection to aggregate features for each attribute along the chain, ensuring that later attribute predictions are conditioned on all previously processed attributes without forgetting the features of earlier ones. Experimental results show that our MonoCoP achieves state-of-the-art (SoTA) performance on the KITTI leaderboard without requiring additional data and further surpasses existing methods on the Waymo and nuScenes frontal datasets.
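
The chain described above, with sequential prediction and residual aggregation, can be sketched as follows. This is only a structural illustration: the attribute order, the `half` stand-in for an AttributeNet, and the list-based features are hypothetical and not taken from the paper.

```python
def chain_of_prediction(shared_feature, attribute_nets):
    """Run attribute nets in order; each sees the shared feature plus all
    features produced earlier in the chain (residual aggregation)."""
    aggregated = list(shared_feature)
    outputs = {}
    for name, net in attribute_nets:
        feature = net(aggregated)      # attribute-specific feature
        outputs[name] = feature
        # Residual sum: later attributes still carry the earlier features.
        aggregated = [a + f for a, f in zip(aggregated, feature)]
    return outputs

def half(features):
    """Hypothetical stand-in for a learned AttributeNet."""
    return [0.5 * v for v in features]

# Assumed attribute order, with depth last so it conditions on the others.
outputs = chain_of_prediction(
    [1.0, 2.0], [("size", half), ("angle", half), ("depth", half)]
)
```

Because each step adds its feature to the running aggregate, the input to the last net depends on every earlier attribute, which is the conditioning property the abstract emphasizes.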

[200] Generative Data Augmentation for Object Point Cloud Segmentation

Dekai Zhu, Stefan Gavranovic, Flavien Boussuge, Benjamin Busam, Slobodan Ilic

Main category: cs.CV

TL;DR: A generative data augmentation method using part-aware diffusion models to create labeled 3D point clouds for segmentation tasks, outperforming traditional augmentation and semi-supervised approaches.

Details

Motivation: Traditional data augmentation for 3D point clouds is limited to simple geometric transformations, while advanced diffusion models generate realistic shapes but lack semantic labels needed for segmentation training.

Method: Extends the Lion 3D diffusion model to a part-aware generative model that creates point clouds conditioned on segmentation masks, with a 3-step pipeline including generated variants, pseudo-labeling, and diffusion-based filtering.

Result: Outperforms traditional data augmentation and related semi-supervised/self-supervised methods on two large-scale synthetic datasets and a real-world medical dataset.

Conclusion: The proposed generative data augmentation approach effectively bridges the gap between advanced diffusion models and practical data augmentation needs for 3D point cloud segmentation tasks.

Abstract: Data augmentation is widely used to train deep learning models to address data scarcity. However, traditional data augmentation (TDA) typically relies on simple geometric transformation, such as random rotation and rescaling, resulting in minimal data diversity enrichment and limited model performance improvement. State-of-the-art generative models for 3D shape generation rely on the denoising diffusion probabilistic models and manage to generate realistic novel point clouds for 3D content creation and manipulation. Nevertheless, the generated 3D shapes lack associated point-wise semantic labels, restricting their usage in enlarging the training data for point cloud segmentation tasks. To bridge the gap between data augmentation techniques and the advanced diffusion models, we extend the state-of-the-art 3D diffusion model, Lion, to a part-aware generative model that can generate high-quality point clouds conditioned on given segmentation masks. Leveraging the novel generative model, we introduce a 3-step generative data augmentation (GDA) pipeline for point cloud segmentation training. Our GDA approach requires only a small amount of labeled samples but enriches the training data with generated variants and pseudo-labeled samples, which are validated by a novel diffusion-based pseudo-label filtering method. Extensive experiments on two large-scale synthetic datasets and a real-world medical dataset demonstrate that our GDA method outperforms TDA approach and related semi-supervised and self-supervised methods.

[201] WetCat: Enabling Automated Skill Assessment in Wet-Lab Cataract Surgery Videos

Negin Ghamsarian, Raphael Sznitman, Klaus Schoeffmann, Jens Kowal

Main category: cs.CV

TL;DR: WetCat is the first dataset of wetlab cataract surgery videos for automated skill assessment, featuring phase annotations and semantic segmentations to enable AI-driven evaluation tools for surgical training.

Details

Motivation: Traditional wetlab training relies on manual performance evaluations that are labor-intensive, time-consuming, and subjective. There's a need for automated, objective skill assessment in controlled wetlab settings to improve surgical education efficiency.

Method: Created WetCat dataset with high-resolution recordings of cataract surgeries performed by trainees on artificial eyes. Includes comprehensive phase annotations and semantic segmentations of key anatomical structures, focusing on capsulorhexis and phacoemulsification phases aligned with standardized surgical skill assessment frameworks.

Result: Developed a publicly available dataset that enables the creation of interpretable, AI-driven evaluation tools for surgical skill assessment. The dataset provides a foundation for objective and scalable surgical education in ophthalmology.

Conclusion: WetCat sets a new benchmark for automated workflow analysis and skill assessment in ophthalmology training, addressing the limitations of existing datasets and supporting the advancement of objective surgical education through computer vision technologies.

Abstract: To meet the growing demand for systematic surgical training, wetlab environments have become indispensable platforms for hands-on practice in ophthalmology. Yet, traditional wetlab training depends heavily on manual performance evaluations, which are labor-intensive, time-consuming, and often subject to variability. Recent advances in computer vision offer promising avenues for automated skill assessment, enhancing both the efficiency and objectivity of surgical education. Despite notable progress in ophthalmic surgical datasets, existing resources predominantly focus on real surgeries or isolated tasks, falling short of supporting comprehensive skill evaluation in controlled wetlab settings. To address these limitations, we introduce WetCat, the first dataset of wetlab cataract surgery videos specifically curated for automated skill assessment. WetCat comprises high-resolution recordings of surgeries performed by trainees on artificial eyes, featuring comprehensive phase annotations and semantic segmentations of key anatomical structures. These annotations are meticulously designed to facilitate skill assessment during the critical capsulorhexis and phacoemulsification phases, adhering to standardized surgical skill assessment frameworks. By focusing on these essential phases, WetCat enables the development of interpretable, AI-driven evaluation tools aligned with established clinical metrics. This dataset lays a strong foundation for advancing objective, scalable surgical education and sets a new benchmark for automated workflow analysis and skill assessment in ophthalmology training. The dataset and annotations are publicly available in Synapse https://www.synapse.org/Synapse:syn66401174/files.

[202] Solar Altitude Guided Scene Illumination

Samed Doğan, Maximilian Hoh, Nico Leuze, Nicolas Rodriguez Peña, Alfred Schöttl

Main category: cs.CV

TL;DR: Using solar altitude as a conditioning variable for synthetic camera data generation to address daytime variation without manual labeling.

Details

Motivation: Real-world autonomous driving data acquisition is costly, labor-intensive, and limited by safety and coverage constraints, with a research gap in daytime variation due to label scarcity.

Method: Proposes solar altitude, computed from latitude-longitude coordinates and local time, as a global conditioning variable, with a tailored normalization to handle the sensitivity of daylight to small numeric changes in altitude.

Result: Demonstrates accurate capture of lighting characteristics and illumination-dependent image noise in diffusion models.

Conclusion: Solar altitude provides an effective, readily computable conditioning approach for synthetic camera data generation that eliminates manual labeling needs for daytime variation modeling.

Abstract: The development of safe and robust autonomous driving functions is heavily dependent on large-scale, high-quality sensor data. However, real-world data acquisition requires extensive human labor and is strongly limited by factors such as labeling cost, driver safety protocols and scenario coverage. Thus, multiple lines of work focus on the conditional generation of synthetic camera sensor data. We identify a significant gap in research regarding daytime variation, presumably caused by the scarcity of available labels. Consequently, we present solar altitude as a global conditioning variable. It is readily computable from latitude-longitude coordinates and local time, eliminating the need for manual labeling. Our work is complemented by a tailored normalization approach, targeting the sensitivity of daylight towards small numeric changes in altitude. We demonstrate its ability to accurately capture lighting characteristics and illumination-dependent image noise in the context of diffusion models.
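
To show why solar altitude is readily computable, here is a minimal sketch using a common textbook approximation: a cosine model for solar declination and mean solar time derived from longitude. It is not the authors' implementation, it ignores the equation of time and atmospheric refraction, and it does not reproduce their normalization. The function name and parameters are assumptions.

```python
import math

def solar_altitude_deg(lat_deg, lon_deg, day_of_year, utc_hour):
    """Approximate solar altitude in degrees (textbook approximation)."""
    # Approximate solar declination for the given day of the year.
    decl = math.radians(
        -23.44 * math.cos(math.radians(360.0 / 365.0 * (day_of_year + 10)))
    )
    # Mean solar time from UTC and longitude (15 degrees per hour).
    solar_time = (utc_hour + lon_deg / 15.0) % 24.0
    hour_angle = math.radians(15.0 * (solar_time - 12.0))
    lat = math.radians(lat_deg)
    sin_alt = (math.sin(lat) * math.sin(decl)
               + math.cos(lat) * math.cos(decl) * math.cos(hour_angle))
    return math.degrees(math.asin(max(-1.0, min(1.0, sin_alt))))

# Equator at longitude 0 near the March equinox (day 80).
noon = solar_altitude_deg(0.0, 0.0, 80, 12.0)
sunrise = solar_altitude_deg(0.0, 0.0, 80, 6.0)
midnight = solar_altitude_deg(0.0, 0.0, 80, 0.0)
```

Near sunrise and sunset a change of a few degrees in altitude corresponds to a large change in scene brightness, which is presumably the sensitivity the tailored normalization targets.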

[203] Egocentric Human-Object Interaction Detection: A New Benchmark and Method

Kunyuan Deng, Yi Wang, Lap-Pui Chau

Main category: cs.CV

TL;DR: This paper introduces Ego-HOIBench, a new dataset for egocentric human-object interaction detection, and proposes HGIR, a lightweight method that uses hand geometry and pose cues to improve interaction detection under severe occlusion.

Details

Motivation: Progress in egocentric human-object interaction detection has been hindered by the lack of benchmarks and methods specifically designed to handle egocentric challenges like severe hand-object occlusion.

Method: Proposed Hand Geometry and Interactivity Refinement (HGIR) - a plug-and-play scheme that extracts global hand geometric features from hand pose proposals and refines interaction features through pose-interaction attention to focus on subtle hand-object relationships under occlusion.

Result: HGIR significantly improves Ego-HOI detection performance across multiple baselines, achieving new state-of-the-art results on the Ego-HOIBench dataset.

Conclusion: The introduced Ego-HOIBench dataset and HGIR method establish a solid foundation for future research in egocentric vision and human-object interaction understanding.

Abstract: Egocentric human-object interaction (Ego-HOI) detection is crucial for intelligent agents to understand and assist human activities from a first-person perspective. However, progress has been hindered by the lack of benchmarks and methods tailored to egocentric challenges such as severe hand-object occlusion. In this paper, we introduce the real-world Ego-HOI detection task and the accompanying Ego-HOIBench, a new dataset with over 27K egocentric images and explicit, fine-grained hand-verb-object triplet annotations across 123 categories. Ego-HOIBench covers diverse daily scenarios, object types, and both single- and two-hand interactions, offering a comprehensive testbed for Ego-HOI research. Benchmarking existing third-person HOI detectors on Ego-HOIBench reveals significant performance gaps, highlighting the need for egocentric-specific solutions. To this end, we propose Hand Geometry and Interactivity Refinement (HGIR), a lightweight, plug-and-play scheme that leverages hand pose and geometric cues to enhance interaction representations. Specifically, HGIR explicitly extracts global hand geometric features from the estimated hand pose proposals, and further refines interaction features through pose-interaction attention, enabling the model to focus on subtle hand-object relationship differences even under severe occlusion. HGIR significantly improves Ego-HOI detection performance across multiple baselines, achieving new state-of-the-art results on Ego-HOIBench. Our dataset and method establish a solid foundation for future research in egocentric vision and human-object interaction understanding. Project page: https://dengkunyuan.github.io/EgoHOIBench/

[204] Demographic-aware fine-grained classification of pediatric wrist fractures

Ammar Ahmed, Ali Shariq Imran, Zenun Kastrati, Sher Muhammad Daudpota

Main category: cs.CV

TL;DR: This paper presents a multi-faceted approach for wrist pathology recognition using limited datasets, combining fine-grained recognition, metadata fusion with X-rays, and transfer learning from other fine-grained datasets.

Details

Motivation: Wrist pathologies are common, especially in children, but medical imaging datasets are limited. Relying solely on image modality is inadequate given the availability of diverse data types, necessitating a multi-modal approach for accurate recognition.

Method: 1) Treat wrist pathology recognition as a fine-grained recognition task; 2) fuse patient metadata with X-ray images to enhance network performance; 3) apply transfer learning with weights pre-trained on a separate fine-grained dataset.

Result: The approach demonstrates improved performance for wrist pathology recognition despite extremely limited dataset availability. The integration of metadata with medical images shows effectiveness specifically for wrist pathologies.

Conclusion: A multi-modal approach combining fine-grained recognition, metadata fusion, and transfer learning can effectively address the challenge of limited medical imaging datasets for wrist pathology recognition, with metadata integration being a novel application in this specific medical domain.

Abstract: Wrist pathologies are frequently observed, particularly among children who constitute the majority of fracture cases. Computer vision presents a promising avenue, contingent upon the availability of extensive datasets, a notable challenge in medical imaging. Therefore, reliance solely on one modality, such as images, proves inadequate, especially in an era of diverse and plentiful data types. In this study, we employ a multifaceted approach to address the challenge of recognizing wrist pathologies using an extremely limited dataset. Initially, we approach the problem as a fine-grained recognition task. Secondly, we enhance network performance by fusing patient metadata with X-rays. Thirdly, we improve the performance further by utilizing weights trained on a separate fine-grained dataset. While metadata integration has been used in other medical domains, this is a novel application for wrist pathologies.
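
One simple way to fuse patient metadata with image features is concatenation before the classifier head, sketched below. The abstract does not specify the fusion mechanism or the metadata fields, so the fields (age and sex), their encoding, and the function names are assumptions for illustration only.

```python
def encode_metadata(age_years, sex, max_age=18.0):
    """Encode hypothetical metadata as [normalized age, one-hot sex]."""
    sex_one_hot = [1.0, 0.0] if sex == "F" else [0.0, 1.0]
    return [min(age_years / max_age, 1.0)] + sex_one_hot

def fuse(image_features, metadata_features):
    """Fusion by concatenation; a classifier head would consume the result."""
    return list(image_features) + list(metadata_features)

# Toy image feature vector (assumed) for a 9-year-old male patient.
fused = fuse([0.2, 0.4], encode_metadata(9, "M"))
```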

[205] Prompt-based Dynamic Token Pruning for Efficient Segmentation of Medical Images

Pallabi Dutta, Anubhab Maity, Sushmita Mitra

Main category: cs.CV

TL;DR: The PrATo method uses prompt-driven token pruning to remove roughly 35-55% of Vision Transformer tokens, lowering computational cost while maintaining medical image segmentation accuracy.

Details

Motivation: Vision Transformers have high computational demands due to processing many tokens, limiting their practical application in medical image analysis, especially in resource-constrained environments.

Method: Prompt-driven Adaptive Token (PrATo) pruning method that uses prompt-based spatial prior to rank tokens by relevance, down-weighting low-relevance tokens and propagating only relevant ones through subsequent stages.

Result: Achieves 35-55% token reduction, significantly lowering computational costs while preserving segmentation accuracy. Improves both segmentation accuracy and inference speed.

Conclusion: The framework enables cost-effective medical image processing and facilitates real-time diagnosis by expanding applicability in resource-constrained environments through selective token processing.

Abstract: The high computational demands of Vision Transformers (ViTs) in processing a large number of tokens often constrain their practical application in analyzing medical images. This research proposes a Prompt-driven Adaptive Token (PrATo) pruning method to selectively reduce the processing of irrelevant tokens in the segmentation pipeline. The prompt-based spatial prior helps to rank the tokens according to their relevance. Tokens with low-relevance scores are down-weighted, ensuring that only the relevant ones are propagated for processing across subsequent stages. This data-driven pruning strategy improves segmentation accuracy and inference speed by allocating computational resources to essential regions. The proposed framework is integrated with several state-of-the-art models to facilitate the elimination of irrelevant tokens, thereby enhancing computational efficiency while preserving segmentation accuracy. The experimental results show a reduction of approximately 35-55% of tokens, thus reducing the computational costs relative to baselines. Cost-effective medical image processing, using our framework, facilitates real-time diagnosis by expanding its applicability in resource-constrained environments.
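
The rank-and-propagate step can be sketched as a top-k selection over relevance scores. This is a simplified stand-in, not PrATo itself: the scores are assumed to come from the prompt-based spatial prior, and the hard cut-off, the `keep_ratio` parameter, and the function name are illustrative assumptions (the abstract describes down-weighting rather than a specific threshold).

```python
def prune_tokens(tokens, relevance, keep_ratio=0.5):
    """Keep the most relevant tokens, preserving their original order.

    Returns the kept indices and the kept tokens.
    """
    if len(tokens) != len(relevance):
        raise ValueError("tokens and relevance must have the same length")
    num_keep = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: relevance[i], reverse=True)
    kept = sorted(ranked[:num_keep])  # restore spatial order
    return kept, [tokens[i] for i in kept]

# Toy tokens with assumed relevance scores.
kept_indices, kept_tokens = prune_tokens(
    ["t0", "t1", "t2", "t3"], [0.9, 0.1, 0.7, 0.2], keep_ratio=0.5
)
```

With `keep_ratio=0.5`, half of the tokens are dropped, which falls inside the 35-55% reduction range reported above.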

Xiangyu Dong, Haoran Zhao, Jiang Gao, Haozhou Li, Xiaoguang Ma, Yaoming Zhou, Fuhai Chen, Juan Liu

Main category: cs.CV

TL;DR: SE-VLN is a self-evolving vision-language navigation framework that uses hierarchical memory, retrieval-augmented reasoning, and reflection modules to enable continuous learning and improvement during testing, achieving significant performance gains over state-of-the-art methods.

Details

Motivation: Current VLN methods using LLMs are constrained by fixed knowledge bases and lack experiential learning capabilities, preventing efficient evolution and adaptation to new environments.

Method: Proposes a multimodal LLM-powered framework with three core modules: hierarchical memory for knowledge transfer, retrieval-augmented thought-based reasoning for multi-step decision-making, and reflection for continual evolution.

Result: Achieved 57% and 35.2% navigation success rates in unseen environments, representing 23.9% and 15.0% absolute improvements over SOTA methods on R2R and REVERSE datasets. Performance improves with increasing experience.

Conclusion: SE-VLN demonstrates significant potential as a self-evolving agent framework for VLN, showing that continuous learning during testing leads to substantial performance improvements and better generalization.

Abstract: Recent advances in vision-language navigation (VLN) were mainly attributed to emerging large language models (LLMs). These methods exhibited excellent generalization capabilities in instruction understanding and task reasoning. However, they were constrained by the fixed knowledge bases and reasoning abilities of LLMs, preventing the full incorporation of experiential knowledge and thus resulting in a lack of efficient evolutionary capacity. To address this, we drew inspiration from the evolution capabilities of natural agents, and proposed a self-evolving VLN framework (SE-VLN) to endow VLN agents with the ability to continuously evolve during testing. To the best of our knowledge, it was the first time that a multimodal LLM-powered self-evolving VLN framework was proposed. Specifically, SE-VLN comprised three core modules, i.e., a hierarchical memory module to transfer successful and failure cases into reusable knowledge, a retrieval-augmented thought-based reasoning module to retrieve experience and enable multi-step decision-making, and a reflection module to realize continual evolution. Comprehensive tests illustrated that the SE-VLN achieved navigation success rates of 57% and 35.2% in unseen environments, representing absolute performance improvements of 23.9% and 15.0% over current state-of-the-art methods on R2R and REVERSE datasets, respectively. Moreover, the SE-VLN showed performance improvement with increasing experience repository, elucidating its great potential as a self-evolving agent framework for VLN.
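
The retrieval step of a memory-augmented agent can be sketched as a nearest-neighbour lookup over stored experience. The abstract does not describe SE-VLN's actual memory schema or retriever, so the cosine-similarity ranking, the `embedding` and `note` fields, and the toy vectors are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieve(query, memory, k=1):
    """Return the k stored experiences most similar to the query embedding."""
    ranked = sorted(memory, key=lambda item: cosine(query, item["embedding"]),
                    reverse=True)
    return ranked[:k]

# Hypothetical experience repository with toy 2-D embeddings.
memory = [
    {"embedding": [1.0, 0.0], "note": "stairs led to the bedroom"},
    {"embedding": [0.0, 1.0], "note": "kitchen was past the hallway"},
]
top = retrieve([0.9, 0.1], memory, k=1)
```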

[207] OpenEvents V1: Large-Scale Benchmark Dataset for Multimodal Event Grounding

Hieu Nguyen, Phuc-Tan Nguyen, Thien-Phuc Tran, Minh-Quang Nguyen, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

Main category: cs.CV

TL;DR: OpenEvents V1 is a large-scale benchmark dataset for event-centric vision-language understanding with 200K+ news articles and 400K+ images, focusing on event-aware captioning and cross-modal retrieval tasks.

Details

Motivation: To advance beyond surface-level image descriptions and enable deep reasoning about complex real-world events through contextual and temporal grounding in multimodal AI systems.

Method: Created a dataset from CNN and The Guardian sources spanning diverse domains and time periods, with three main tasks: event-aware image caption generation, image-to-news retrieval, and text-to-image retrieval using narrative queries.

Result: Provides over 200,000 news articles and 400,000 associated images with extensive baseline results and standardized evaluation protocols for all three tasks.

Conclusion: OpenEvents V1 establishes a robust foundation for developing multimodal AI systems capable of deep event reasoning and is publicly available for research use.

Abstract: We introduce OpenEvents V1, a large-scale benchmark dataset designed to advance event-centric vision-language understanding. Unlike conventional image captioning and retrieval datasets that focus on surface-level descriptions, the OpenEvents V1 dataset emphasizes contextual and temporal grounding through three primary tasks: (1) generating rich, event-aware image captions, (2) retrieving event-relevant news articles from image queries, and (3) retrieving event-relevant images from narrative-style textual queries. The dataset comprises over 200,000 news articles and 400,000 associated images sourced from CNN and The Guardian, spanning diverse domains and time periods. We provide extensive baseline results and standardized evaluation protocols for all tasks. OpenEvents V1 establishes a robust foundation for developing multimodal AI systems capable of deep reasoning over complex real-world events. The dataset is publicly available at https://ltnghia.github.io/eventa/openevents-v1.

[208] Confidence-driven Gradient Modulation for Multimodal Human Activity Recognition: A Dynamic Contrastive Dual-Path Learning Approach

Panpan Ji, Junni Song, Yifan Lu, Hang Xiao, Hanyu Liu, Chao Li

Main category: cs.CV

TL;DR: Proposes DCDP-HAR framework with dual-path feature extraction, multi-stage contrastive learning, and confidence-driven gradient modulation to address cross-modal alignment and imbalanced contributions in multimodal human activity recognition.

Details

Motivation: Multimodal HAR systems face challenges with cross-modal feature alignment difficulties and imbalanced modality contributions, which hinder effective perception and interaction in intelligent systems.

Method: DCDP-HAR framework with: 1) ResNet and DenseNet dual-path feature extraction, 2) multi-stage contrastive learning for progressive alignment, 3) confidence-driven gradient modulation to balance modality learning, and 4) momentum-based gradient accumulation for stability.

Result: Ablation studies validate component effectiveness, and extensive comparative experiments conducted on four public benchmark datasets demonstrate the framework’s performance.

Conclusion: The proposed DCDP-HAR framework effectively addresses key challenges in multimodal HAR through its innovative dual-path architecture, contrastive learning mechanism, and dynamic gradient modulation strategy.

Abstract: Sensor-based Human Activity Recognition (HAR) is a core technology that enables intelligent systems to perceive and interact with their environment. However, multimodal HAR systems still encounter key challenges, such as difficulties in cross-modal feature alignment and imbalanced modality contributions. To address these issues, we propose a novel framework called the Dynamic Contrastive Dual-Path Network (DCDP-HAR). The framework comprises three key components. First, a dual-path feature extraction architecture is employed, where ResNet and DenseNet branches collaboratively process multimodal sensor data. Second, a multi-stage contrastive learning mechanism is introduced to achieve progressive alignment from local perception to semantic abstraction. Third, we present a confidence-driven gradient modulation strategy that dynamically monitors and adjusts the learning intensity of each modality branch during backpropagation, effectively alleviating modality competition. In addition, a momentum-based gradient accumulation strategy is adopted to enhance training stability. We conduct ablation studies to validate the effectiveness of each component and perform extensive comparative experiments on four public benchmark datasets.
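
One plausible form of confidence-driven gradient modulation is sketched below: a modality whose confidence exceeds the average has its gradient scaled down, so the weaker branch is not starved. The abstract does not give the actual rule, so the tanh-based coefficient, the `alpha` parameter, and the modality names are assumptions rather than the DCDP-HAR formulation.

```python
import math

def modulation_coefficients(confidences, alpha=1.0):
    """Map per-modality confidence to a gradient scale in (0, 1].

    Modalities above the mean confidence are damped; the rest keep scale 1.
    """
    mean_conf = sum(confidences.values()) / len(confidences)
    coeffs = {}
    for name, conf in confidences.items():
        ratio = conf / mean_conf
        if ratio > 1.0:
            coeffs[name] = 1.0 - math.tanh(alpha * (ratio - 1.0))
        else:
            coeffs[name] = 1.0
    return coeffs

# Hypothetical confidences, e.g. mean softmax probability of the true class.
coeffs = modulation_coefficients({"accelerometer": 0.9, "gyroscope": 0.5})
```

During backpropagation each branch's gradient would be multiplied by its coefficient; a momentum-style running average of the coefficients could then smooth changes between steps.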

[209] DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin

Main category: cs.CV

TL;DR: DreamVLA is a novel vision-language-action framework that integrates comprehensive world knowledge forecasting with dynamic-region-guided prediction, spatial and semantic cues, and a block-wise attention mechanism to improve robot manipulation performance.

Details

Motivation: Existing VLA models are limited to image-based forecasting which suffers from redundant information and lacks comprehensive world knowledge including dynamic, spatial and semantic information needed for effective robot manipulation.

Method: Proposes DreamVLA with dynamic-region-guided world knowledge prediction integrated with spatial/semantic cues, block-wise structured attention to prevent information leakage, and diffusion-based transformer for action distribution modeling.

Result: Achieves 76.7% success rate on real robot tasks and 4.44 average length on CALVIN ABC-D benchmarks, demonstrating superior performance in both real-world and simulation environments.

Conclusion: DreamVLA effectively addresses limitations of existing VLA methods by providing comprehensive world knowledge forecasting and establishing a perception-prediction-action loop, enabling more effective robot manipulation through abstract multimodal reasoning.

Abstract: Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks.
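
The block-wise structured attention can be illustrated with a boolean mask over token groups. This is a minimal sketch under assumptions: the group names and the rule that context tokens stay visible to every group are hypothetical; only the idea that the dynamic, spatial, and semantic blocks do not attend to one another comes from the abstract.

```python
def blockwise_mask(groups, isolated):
    """Build an attention mask where mask[q][k] is True if query q may attend to key k.

    groups:   list with one group name per token position.
    isolated: set of group names that must not attend to each other.
    """
    n = len(groups)
    mask = [[True] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            gq, gk = groups[q], groups[k]
            # Block attention between two different isolated groups.
            if gq in isolated and gk in isolated and gq != gk:
                mask[q][k] = False
    return mask

# Hypothetical layout: one context token, two dynamic, one spatial, one semantic.
groups = ["ctx", "dyn", "dyn", "spa", "sem"]
mask = blockwise_mask(groups, isolated={"dyn", "spa", "sem"})
```

Positions set to False would typically be filled with a large negative value before the softmax.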

[210] Online Micro-gesture Recognition Using Data Augmentation and Spatial-Temporal Attention

Pengyu Liu, Kun Li, Fei Wang, Yanyan Wei, Junhui She, Dan Guo

Main category: cs.CV

TL;DR: HFUT-VUT team’s winning solution for IJCAI 2025 MiGA Challenge micro-gesture recognition, achieving 38.03 F1 score with 37.9% improvement over previous state-of-the-art.

Details

Motivation: Micro-gesture online recognition is challenging due to the need to precisely locate temporal positions and recognize categories of spontaneous human actions in untrimmed videos, with greater differences than traditional human actions.

Method: Proposed hand-crafted data augmentation and spatial-temporal attention mechanisms to enhance model’s ability to classify and localize micro-gestures more accurately.

Result: Achieved F1 score of 38.03, outperforming previous state-of-the-art by 37.9%, ranking first in the Micro-gesture Online Recognition track.

Conclusion: The proposed approach with data augmentation and spatial-temporal attention effectively addresses the challenges of micro-gesture recognition, demonstrating superior performance in both classification and temporal localization tasks.

Abstract: In this paper, we introduce the latest solution developed by our team, HFUT-VUT, for the Micro-gesture Online Recognition track of the IJCAI 2025 MiGA Challenge. The Micro-gesture Online Recognition task is a highly challenging problem that aims to locate the temporal positions and recognize the categories of multiple micro-gesture instances in untrimmed videos. Compared to traditional temporal action detection, this task places greater emphasis on distinguishing between micro-gesture categories and precisely identifying the start and end times of each instance. Moreover, micro-gestures are typically spontaneous human actions, with greater differences than those found in other human actions. To address these challenges, we propose hand-crafted data augmentation and spatial-temporal attention to enhance the model’s ability to classify and localize micro-gestures more accurately. Our solution achieved an F1 score of 38.03, outperforming the previous state-of-the-art by 37.9%. As a result, our method ranked first in the Micro-gesture Online Recognition track.

[211] Uncertainty-Guided Face Matting for Occlusion-Aware Face Transformation

Hyebin Cho, Jaehyup Lee

Main category: cs.CV

TL;DR: FaceMat is a trimap-free, uncertainty-aware framework for face matting that handles occlusions by predicting alpha mattes to separate facial regions from occluding objects like hands or accessories, using a two-stage teacher-student training approach with knowledge distillation.

Details

Motivation: Face filters degrade in performance when faces are occluded by objects like hands, hair, or accessories. Current methods struggle with complex occlusions and often require auxiliary inputs like trimaps or segmentation masks, limiting their real-time applicability.

Method: Two-stage training pipeline: a teacher model predicts alpha mattes and per-pixel uncertainty using negative log-likelihood loss, then guides a student model through spatially adaptive knowledge distillation. Explicitly treats skin as foreground and occlusions as background. Uses newly constructed CelebAMat synthetic dataset.

Result: FaceMat outperforms state-of-the-art methods across multiple benchmarks, enhancing visual quality and robustness of face filters in real-world unconstrained video scenarios without requiring auxiliary inputs.

Conclusion: The framework provides high-quality alpha mattes for occlusion handling in face filters, enables real-time applications without trimaps, and improves generalization through uncertainty-aware training and semantic consistency preservation.

Abstract: Face filters have become a key element of short-form video content, enabling a wide array of visual effects such as stylization and face swapping. However, their performance often degrades in the presence of occlusions, where objects like hands, hair, or accessories obscure the face. To address this limitation, we introduce the novel task of face matting, which estimates fine-grained alpha mattes to separate occluding elements from facial regions. We further present FaceMat, a trimap-free, uncertainty-aware framework that predicts high-quality alpha mattes under complex occlusions. Our approach leverages a two-stage training pipeline: a teacher model is trained to jointly estimate alpha mattes and per-pixel uncertainty using a negative log-likelihood (NLL) loss, and this uncertainty is then used to guide the student model through spatially adaptive knowledge distillation. This formulation enables the student to focus on ambiguous or occluded regions, improving generalization and preserving semantic consistency. Unlike previous approaches that rely on trimaps or segmentation masks, our framework requires no auxiliary inputs, making it well-suited for real-time applications. In addition, we reformulate the matting objective by explicitly treating skin as foreground and occlusions as background, enabling clearer compositing strategies. To support this task, we construct CelebAMat, a large-scale synthetic dataset specifically designed for occlusion-aware face matting. Extensive experiments show that FaceMat outperforms state-of-the-art methods across multiple benchmarks, enhancing the visual quality and robustness of face filters in real-world, unconstrained video scenarios. The source code and CelebAMat dataset are available at https://github.com/hyebin-c/FaceMat.git
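
The teacher's NLL objective and the uncertainty-guided distillation can be sketched as follows. This assumes a Gaussian likelihood with a predicted per-pixel log-variance, which the abstract does not specify, and the weighting rule in `distillation_weights` is likewise a hypothetical choice made for illustration.

```python
import math

def gaussian_nll(pred_alpha, log_var, target_alpha):
    """Mean per-pixel Gaussian NLL (constant term dropped).

    pred_alpha, log_var, target_alpha: flat lists of equal length.
    """
    total = 0.0
    for mu, s, y in zip(pred_alpha, log_var, target_alpha):
        # 0.5 * (log sigma^2 + (y - mu)^2 / sigma^2)
        total += 0.5 * (s + (y - mu) ** 2 / math.exp(s))
    return total / len(pred_alpha)

def distillation_weights(log_var):
    """Per-pixel weights proportional to the teacher's standard deviation,
    normalized to average 1, so uncertain pixels count more (assumed rule)."""
    sigmas = [math.exp(0.5 * s) for s in log_var]
    total = sum(sigmas)
    return [len(sigmas) * s / total for s in sigmas]

# Example: the second pixel is more uncertain and receives a larger weight.
weights = distillation_weights([0.0, 2.0])
```

The student loss would then multiply its per-pixel error by these weights before averaging.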

[212] DriveIndia: An Object Detection Dataset for Diverse Indian Traffic Scenes

Rishav Kumar, D. Santhosh Reddy, P. Rajalakshmi

Main category: cs.CV

TL;DR: DriveIndia is a large-scale object detection dataset for Indian traffic with 67K images across 24 categories, featuring diverse challenging conditions and achieving 78.7% mAP50 with YOLO models.

Details

Motivation: To address the complexity and unpredictability of Indian traffic environments which are not adequately captured by existing datasets, providing a benchmark for autonomous driving research in diverse and challenging road conditions.

Method: Collected 66,986 high-resolution images over 120+ hours covering 3,400+ kilometers across urban, rural, and highway routes in India. Images were annotated in YOLO format across 24 traffic-relevant object categories, capturing varied weather, illumination changes, heterogeneous infrastructure, and dense mixed traffic patterns.

Result: Baseline results using state-of-the-art YOLO family models show the top-performing variant achieving a mAP50 of 78.7%, demonstrating the dataset’s utility for benchmarking object detection performance in challenging Indian traffic conditions.

Conclusion: DriveIndia provides a comprehensive benchmark for real-world autonomous driving challenges in Indian traffic environments and will be publicly available to support research in robust, generalizable object detection under uncertain road conditions.

Abstract: We introduce DriveIndia, a large-scale object detection dataset purpose-built to capture the complexity and unpredictability of Indian traffic environments. The dataset contains 66,986 high-resolution images annotated in YOLO format across 24 traffic-relevant object categories, encompassing diverse conditions such as varied weather (fog, rain), illumination changes, heterogeneous road infrastructure, and dense, mixed traffic patterns. The images were collected over 120+ hours, covering 3,400+ kilometers across urban, rural, and highway routes. DriveIndia offers a comprehensive benchmark for real-world autonomous driving challenges. We provide baseline results using state-of-the-art YOLO family models, with the top-performing variant achieving a mAP50 of 78.7%. Designed to support research in robust, generalizable object detection under uncertain road conditions, DriveIndia will be publicly available via the TiHAN-IIT Hyderabad dataset repository https://tihan.iith.ac.in/TiAND.html (Terrestrial Datasets -> Camera Dataset).
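
For readers unfamiliar with YOLO-format annotations, each label line holds a class index followed by a box center, width, and height, all normalized to the image size. The helper below is a small sketch of converting one such line to pixel corners; the function name and the sample values are hypothetical and not taken from DriveIndia.

```python
def yolo_to_pixel_box(line, img_w, img_h):
    """Convert 'class cx cy w h' (normalized) to (class, (x_min, y_min, x_max, y_max))."""
    parts = line.split()
    cls = int(parts[0])
    cx, cy, w, h = (float(v) for v in parts[1:5])
    x_min = (cx - w / 2) * img_w
    y_min = (cy - h / 2) * img_h
    x_max = (cx + w / 2) * img_w
    y_max = (cy + h / 2) * img_h
    return cls, (x_min, y_min, x_max, y_max)

# Made-up label line for a 1000x500 image.
cls, box = yolo_to_pixel_box("3 0.5 0.5 0.2 0.4", 1000, 500)
```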

[213] MergeSAM: Unsupervised change detection of remote sensing images based on the Segment Anything Model

Meiqi Hu, Lingzhi Lu, Chengxi Han, Xiaoping Liu

Main category: cs.CV

TL;DR: MergeSAM: An unsupervised change detection method using Segment Anything Model (SAM) with MaskMatching and MaskSplitting strategies to handle complex object changes in remote sensing imagery.

Details

Motivation: Leverage large foundation models' exceptional feature extraction capabilities to accelerate unsupervised change detection and enhance practical applicability of remote sensing change detection technologies.

Method: Uses SAM’s object segmentation to construct multitemporal masks. Introduces two novel strategies: MaskMatching for object correspondence and MaskSplitting for handling object fragmentation and merging in complex change scenarios.

Result: The method captures complex spatial changes by embedding land cover spatial structure into change detection process, addressing real-world complexities like object splitting and merging.

Conclusion: MergeSAM demonstrates how large foundation models like SAM can be effectively adapted for unsupervised change detection in high-resolution remote sensing imagery, providing robust handling of intricate change patterns.

Abstract: Recently, large foundation models trained on vast datasets have demonstrated exceptional capabilities in feature extraction and general feature representation. The ongoing advancements in deep learning-driven large models have shown great promise in accelerating unsupervised change detection methods, thereby enhancing the practical applicability of change detection technologies. Building on this progress, this paper introduces MergeSAM, an innovative unsupervised change detection method for high-resolution remote sensing imagery, based on the Segment Anything Model (SAM). Two novel strategies, MaskMatching and MaskSplitting, are designed to address real-world complexities such as object splitting, merging, and other intricate changes. The proposed method fully leverages SAM’s object segmentation capabilities to construct multitemporal masks that capture complex changes, embedding the spatial structure of land cover into the change detection process.
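
The abstract does not spell out how MaskMatching pairs objects across dates. One plausible reading, shown here purely as an assumption, is to match each mask from the first image to the mask in the second image with the highest intersection-over-union (IoU) and to flag masks without a good match as candidate changes. All names and the threshold are hypothetical.

```python
def iou(a, b):
    """IoU of two masks given as sets of (row, col) pixel coordinates."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def match_masks(masks_t1, masks_t2, iou_threshold=0.5):
    """Return (matches, unmatched): index pairs with IoU >= threshold, and
    indices of time-1 masks that found no counterpart at time 2."""
    matches, unmatched = [], []
    for i, m1 in enumerate(masks_t1):
        best_j, best_iou = None, 0.0
        for j, m2 in enumerate(masks_t2):
            score = iou(m1, m2)
            if score > best_iou:
                best_j, best_iou = j, score
        if best_j is not None and best_iou >= iou_threshold:
            matches.append((i, best_j))
        else:
            unmatched.append(i)
    return matches, unmatched

# Example: the first object persists, the second has no counterpart.
a = {(0, 0), (0, 1)}
c = {(5, 5)}
matches, unmatched = match_masks([a, c], [set(a)])
```

A one-to-many variant of the same loop could detect the splitting and merging cases that MaskSplitting targets.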

[214] Gaussian Splatting Feature Fields for Privacy-Preserving Visual Localization

Maxime Pietrantoni, Gabriela Csurka, Torsten Sattler

Main category: cs.CV

TL;DR: This paper introduces Gaussian Splatting Feature Fields (GSFFs) for visual localization, combining 3D Gaussian Splatting with implicit feature fields to achieve accurate and privacy-preserving camera pose estimation.

Details

Motivation: The authors aim to develop a visual localization method that provides both high accuracy and privacy preservation by leveraging 3DGS-based representations and feature learning in a common embedding space.

Method: Proposes GSFFs that combine explicit 3DGS geometry with implicit feature fields, uses contrastive learning to align 3D scale-aware features with 2D encoder features, employs 3D structure-informed clustering for regularization, and implements pose refinement through feature/segmentation alignment.

Result: The method achieves state-of-the-art performance on multiple real-world datasets for both privacy-preserving and non-privacy-preserving visual localization pipelines.

Conclusion: GSFFs provide an effective framework for accurate visual localization with privacy preservation capabilities, demonstrating superior performance through the integration of 3D geometric information and learned feature representations.

Abstract: Visual localization is the task of estimating a camera pose in a known environment. In this paper, we utilize 3D Gaussian Splatting (3DGS)-based representations for accurate and privacy-preserving visual localization. We propose Gaussian Splatting Feature Fields (GSFFs), a scene representation for visual localization that combines an explicit geometry model (3DGS) with an implicit feature field. We leverage the dense geometric information and differentiable rasterization algorithm from 3DGS to learn robust feature representations grounded in 3D. In particular, we align a 3D scale-aware feature field and a 2D feature encoder in a common embedding space through a contrastive framework. Using a 3D structure-informed clustering procedure, we further regularize the representation learning and seamlessly convert the features to segmentations, which can be used for privacy-preserving visual localization. Pose refinement, which involves aligning either feature maps or segmentations from a query image with those rendered from the GSFFs scene representation, is used to achieve localization. The resulting privacy- and non-privacy-preserving localization pipelines, evaluated on multiple real-world datasets, show state-of-the-art performances.
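
Aligning 3D field features with 2D encoder features in a common embedding space through a contrastive framework is commonly done with an InfoNCE-style loss. The sketch below assumes that form; the temperature value and the function names are hypothetical, and the paper's actual loss may differ.

```python
import math

def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(feats_3d, feats_2d, temperature=0.1):
    """Mean InfoNCE loss; row i of feats_3d is the positive for row i of feats_2d."""
    a = [_normalize(v) for v in feats_3d]
    b = [_normalize(v) for v in feats_2d]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(a)

# Correctly paired features give a much lower loss than swapped ones.
aligned = info_nce([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = info_nce([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```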

[215] Is Uncertainty Quantification a Viable Alternative to Learned Deferral?

Anna M. Wundram, Christian F. Baumgartner

Main category: cs.CV

TL;DR: An evaluation study on glaucoma classification finds that uncertainty quantification methods may be a promising alternative to learned deferral models for AI-human collaboration, motivated by the hypothesis that they are more robust to out-of-distribution data.

Details

Motivation: AI models need human collaboration for safety, but current learned deferral methods may not handle data shifts well during clinical translation. Uncertainty quantification could provide more robust deferral strategies.

Method: Extensive evaluation study on large ophthalmology dataset comparing learned deferral models (trained with surrogate loss) vs established uncertainty quantification methods, testing both in-distribution and out-of-distribution performance for glaucoma classification from fundus images.

Result: Evaluated both in- and out-of-distribution on classifying glaucoma from fundus images while deferring error-prone cases, uncertainty quantification methods are found to be a potentially promising choice for AI deferral.

Conclusion: Uncertainty quantification methods are a promising alternative to learned deferral models for AI-human collaboration in medical applications, offering better resilience to data distribution shifts.

Abstract: Artificial Intelligence (AI) holds the potential to dramatically improve patient care. However, it is not infallible, necessitating human-AI-collaboration to ensure safe implementation. One aspect of AI safety is the models’ ability to defer decisions to a human expert when they are likely to misclassify autonomously. Recent research has focused on methods that learn to defer by optimising a surrogate loss function that finds the optimal trade-off between predicting a class label or deferring. However, during clinical translation, models often face challenges such as data shift. Uncertainty quantification methods aim to estimate a model’s confidence in its predictions. However, they may also be used as a deferral strategy which does not rely on learning from specific training distribution. We hypothesise that models developed to quantify uncertainty are more robust to out-of-distribution (OOD) input than learned deferral models that have been trained in a supervised fashion. To investigate this hypothesis, we constructed an extensive evaluation study on a large ophthalmology dataset, examining both learned deferral models and established uncertainty quantification methods, assessing their performance in- and out-of-distribution. Specifically, we evaluate their ability to accurately classify glaucoma from fundus images while deferring cases with a high likelihood of error. We find that uncertainty quantification methods may be a promising choice for AI deferral.
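
Using an uncertainty estimate as a deferral strategy can be as simple as thresholding a score such as predictive entropy. This is a minimal sketch assuming softmax probabilities and a fixed, hypothetical threshold; the study evaluates established uncertainty quantification methods that are not reproduced here.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def classify_or_defer(probs, max_entropy=0.5):
    """Return the predicted class index, or 'defer' when the entropy is too high."""
    if predictive_entropy(probs) > max_entropy:
        return "defer"
    return max(range(len(probs)), key=probs.__getitem__)

confident = classify_or_defer([0.95, 0.05])   # low entropy, model decides
ambiguous = classify_or_defer([0.55, 0.45])   # high entropy, sent to the expert
```

In practice the threshold would be tuned on validation data to reach a target deferral rate.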

[216] Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration

Shaoguang Wang, Ziyang Chen, Yijie Xu, Weiyu Guo, Hui Xiong

Main category: cs.CV

TL;DR: AFP reduces the number of required video frames by up to 86.9% and total input tokens by up to 83.2%, often improving accuracy, by merging redundant keyframes through adaptive clustering and adding a text-based semantic graph.

Details

Motivation: High token costs and performance degradation from excessive frames in Video-QA applications, with existing keyframe methods still having temporal redundancy ('visual echoes').

Method: Adaptive Frame-Pruning (AFP) uses hierarchical clustering on fused ResNet-50 and CLIP features to merge redundant frames, complemented by a lightweight text-based semantic graph for context.

Result: Up to 86.9% fewer frames and up to 83.2% fewer input tokens on the LongVideoBench and VideoMME benchmarks, often with improved accuracy over baselines that use more frames.

Conclusion: AFP provides an efficient and effective solution for Video-QA by intelligently pruning frames while maintaining or enhancing performance through semantic context preservation.

Abstract: The practical application of Multimodal Large Language Models (MLLMs) to Video Question Answering (Video-QA) is severely hindered by the high token cost of processing numerous video frames. While increasing the number of sampled frames is a common strategy, we observe a “less is more” phenomenon where excessive frames can paradoxically degrade performance due to context dilution. Concurrently, state-of-the-art keyframe selection methods, while effective, still yield significant temporal redundancy, which we term ‘visual echoes’. To address these dual challenges, we propose Adaptive Frame-Pruning (AFP), a novel post-processing method that intelligently prunes the selected keyframes. AFP employs an adaptive hierarchical clustering algorithm on a fused ResNet-50 and CLIP feature space to identify and merge these echoes into single representatives. To compensate for information loss, we then introduce a lightweight, text-based semantic graph that provides critical context with minimal token overhead. Conducting extensive experiments on the LongVideoBench and VideoMME benchmarks across multiple leading MLLMs, our full approach demonstrates a drastic reduction in required frames by up to 86.9% and total input tokens by up to 83.2%. Crucially, by providing a concise, high-quality set of frames, our method not only enhances efficiency but often improves accuracy over baselines that use more frames. The code will be released upon publication.
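
Merging 'visual echoes' can be sketched with average-linkage agglomerative clustering over per-frame feature vectors. The version below is a simplification under assumptions: it uses a fixed cosine-distance threshold rather than the adaptive one the abstract mentions, takes already-fused feature vectors as input, and keeps the earliest frame of each cluster as its representative. All names are hypothetical.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def prune_frames(features, max_dist=0.1):
    """Return indices of representative frames after merging near-duplicates."""
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                # Average linkage: mean pairwise distance between the clusters.
                d = sum(cosine_distance(features[i], features[j])
                        for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > max_dist:
            break  # the closest pair is too far apart, stop merging
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return sorted(min(c) for c in clusters)

# Frames 0 and 1 are near-duplicates; frame 2 is distinct.
kept = prune_frames([[1, 0], [0.99, 0.1], [0, 1]])
```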

[217] Fast Motion Estimation and Context-Aware Refinement for Efficient Bayer-Domain Video Vision

Haichao Wang, Xinyue Xi, Jiangtao Wen, Yuxing Han

Main category: cs.CV

TL;DR: Proposes an efficient video computer vision system that removes image signal processor and uses Bayer-format data directly, combined with fast block matching motion estimation and refinement networks to reduce temporal redundancy and front-end computation overhead.

Details

Motivation: Existing video computer vision systems suffer from high temporal redundancy and neglect front-end computation overhead, leading to inefficiency in processing video data.

Method: 1) Remove image signal processor and feed Bayer-format data directly to models; 2) Use fast block matching-based motion estimation with MV refinement; 3) Introduce context-aware block refinement network for error correction; 4) Employ frame selection strategy for accuracy-efficiency balance.

Result: Experiments on multiple video computer vision tasks show significant acceleration with only slight performance loss compared to existing methods.

Conclusion: The proposed system effectively reduces computational overhead while maintaining good performance, making video computer vision more efficient by addressing both temporal redundancy and front-end processing bottlenecks.

Abstract: Improving the efficiency of video computer vision systems remains challenging because of the high temporal redundancy within a video. Existing works have been proposed for efficient video computer vision. However, they do not fully reduce the temporal redundancy and neglect the front-end computation overhead. In this paper, we propose an efficient video computer vision system. First, the image signal processor is removed and Bayer-format data is fed directly into video computer vision models, thus saving the front-end computation. Second, instead of optical flow models and video codecs, a fast block matching-based motion estimation algorithm is proposed specifically for efficient video computer vision, with a motion vector (MV) refinement module. To correct the resulting errors, a context-aware block refinement network is introduced to refine regions with large error. To further balance accuracy and efficiency, a frame selection strategy is employed. Experiments on multiple video computer vision tasks demonstrate that our method achieves significant acceleration with slight performance loss.
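
Block matching estimates a motion vector for a block by searching a small window in the previous frame for the best-matching patch. The sketch below uses an exhaustive search with the sum of absolute differences (SAD) on a plain 2-D grid of intensities; it is a generic illustration, not the paper's fast algorithm, and it ignores the Bayer color layout. Function names and parameters are hypothetical.

```python
def sad(prev, cur, py, px, cy, cx, block):
    """Sum of absolute differences between a block in cur and one in prev."""
    return sum(abs(cur[cy + dy][cx + dx] - prev[py + dy][px + dx])
               for dy in range(block) for dx in range(block))

def block_match(prev, cur, cy, cx, block=2, search=1):
    """Find the motion vector (my, mx) pointing from the block at (cy, cx)
    in cur to its best match in prev, searching +/- `search` pixels."""
    h, w = len(prev), len(prev[0])
    best_mv = (0, 0)
    best_cost = sad(prev, cur, cy, cx, cy, cx, block)
    for my in range(-search, search + 1):
        for mx in range(-search, search + 1):
            py, px = cy + my, cx + mx
            if 0 <= py <= h - block and 0 <= px <= w - block:
                cost = sad(prev, cur, py, px, cy, cx, block)
                if cost < best_cost:
                    best_mv, best_cost = (my, mx), cost
    return best_mv, best_cost

# A bright 2x2 patch moves one pixel down and right between frames.
prev = [[9, 9, 0, 0], [9, 9, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
cur = [[0, 0, 0, 0], [0, 9, 9, 0], [0, 9, 9, 0], [0, 0, 0, 0]]
mv, cost = block_match(prev, cur, cy=1, cx=1)
```

Blocks whose best cost stays large are the natural candidates for a refinement stage.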

[218] MultiRef: Controllable Image Generation with Multiple Visual References

Ruoxi Chen, Dongping Chen, Siyuan Wu, Sinan Wang, Shiyun Lang, Petr Sushko, Gaoyang Jiang, Yao Wan, Ranjay Krishna

Main category: cs.CV

TL;DR: The paper introduces MultiRef-bench, a comprehensive evaluation framework for multi-reference image generation, showing current state-of-the-art models struggle with incorporating multiple visual references effectively.

Details

Motivation: Current image generation frameworks primarily use single-source inputs (text or single reference image), but visual designers naturally draw inspiration from multiple visual references to create artwork, creating a gap in existing capabilities.

Method: Developed MultiRef-bench with 990 synthetic and 1,000 real-world samples requiring multiple reference integration. Created RefBlend data engine to generate synthetic samples with 10 reference types and 33 combinations. Built MultiRef dataset with 38k high-quality images. Evaluated three interleaved image-text models (OmniGen, ACE, Show-o) and six agentic frameworks.

Result: State-of-the-art systems struggle with multi-reference conditioning. Best model (OmniGen) achieved only 66.6% on synthetic samples and 79.0% on real-world cases compared to golden answers, indicating significant room for improvement.

Conclusion: The findings highlight the limitations of current image generation systems in handling multiple visual references and provide valuable directions for developing more flexible, human-like creative tools that can effectively integrate multiple sources of visual inspiration.

Abstract: Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generative frameworks predominantly rely on single-source inputs – either text prompts or individual reference images. In this paper, we focus on the task of controllable image generation using multiple visual references. We introduce MultiRef-bench, a rigorous evaluation framework comprising 990 synthetic and 1,000 real-world samples that require incorporating visual content from multiple reference images. The synthetic samples are generated through our data engine RefBlend, with 10 reference types and 33 reference combinations. Based on RefBlend, we further construct a dataset MultiRef containing 38k high-quality images to facilitate further research. Our experiments across three interleaved image-text models (i.e., OmniGen, ACE, and Show-o) and six agentic frameworks (e.g., ChatDiT and LLM + SD) reveal that even state-of-the-art systems struggle with multi-reference conditioning, with the best model OmniGen achieving only 66.6% in synthetic samples and 79.0% in real-world cases on average compared to the golden answer. These findings provide valuable directions for developing more flexible and human-like creative tools that can effectively integrate multiple sources of visual inspiration. The dataset is publicly available at: https://multiref.github.io/.

[219] Physical Autoregressive Model for Robotic Manipulation without Action Pretraining

Zijian Song, Sihan Qin, Tianshui Chen, Liang Lin, Guangrun Wang

Main category: cs.CV

TL;DR: PAR is a physical autoregressive model that combines video frames and actions as tokens, leveraging pretrained video models for robotic manipulation without action pretraining, achieving 100% success on PushCube task.

Details

Motivation: The scarcity of manipulation data motivates using pretrained large models from other modalities like video generation to understand physical dynamics in robotics.

Method: Proposes Physical Autoregressive Model (PAR) using physical tokens combining frames and actions, DiT-based de-tokenizer, causal mask with inverse kinematics, parallel training, and KV-cache mechanism.

Result: Achieves 100% success rate on PushCube task in ManiSkill benchmark, matches action-pretrained baselines on other tasks, and accurately predicts future videos with aligned action trajectories.

Conclusion: Demonstrates promising direction for robotic manipulation by transferring world knowledge from autoregressive video pretraining to physical tasks.

Abstract: The scarcity of manipulation data has motivated the use of pretrained large models from other modalities in robotics. In this work, we build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR), where physical tokens combine frames and actions to represent the joint evolution of the robot and its environment. PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining, enabling accurate video prediction and consistent action trajectories. It also adopts a DiT-based de-tokenizer to model frames and actions as continuous tokens, mitigating quantization errors and facilitating mutual enhancement. Furthermore, we incorporate a causal mask with inverse kinematics, parallel training, and the KV-cache mechanism to further improve performance and efficiency. Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task, matches the performance of action-pretrained baselines on other tasks, and accurately predicts future videos with tightly aligned action trajectories. These findings underscore a promising direction for robotic manipulation by transferring world knowledge from autoregressive video pretraining. The project page is here: https://hcplab-sysu.github.io/PhysicalAutoregressiveModel/

[220] VFM-Guided Semi-Supervised Detection Transformer under Source-Free Constraints for Remote Sensing Object Detection

Jianhong Han, Yupei Wang, Liang Chen

Main category: cs.CV

TL;DR: VG-DETR is a source-free object detection framework for remote sensing that integrates vision foundation models to improve pseudo-label quality and feature alignment, addressing noisy labels in dense object scenarios.

Details

Motivation: Privacy and transmission constraints in remote sensing often prevent access to source domain data, limiting practical applicability of domain adaptation methods. Source-Free Object Detection (SFOD) suffers from training collapse due to noisy pseudo-labels in dense remote sensing imagery with complex backgrounds.

Method: Proposes VG-DETR, built on semi-supervised framework, integrating Vision Foundation Model (VFM) with small labeled target data. Includes VFM-guided pseudo-label mining strategy using semantic priors to assess reliability, and dual-level VFM-guided alignment at instance and image levels through contrastive learning and similarity matching.

Result: Extensive experiments demonstrate superior performance in source-free remote sensing detection tasks, showing improved pseudo-label quality and quantity, and enhanced feature representation robustness against domain gaps.

Conclusion: VG-DETR effectively addresses SFOD challenges in remote sensing by leveraging VFMs to mitigate pseudo-label noise and improve feature extraction, achieving state-of-the-art performance without source data access.

Abstract: Unsupervised domain adaptation methods have been widely explored to bridge domain gaps. However, in real-world remote-sensing scenarios, privacy and transmission constraints often preclude access to source domain data, which limits their practical applicability. Recently, Source-Free Object Detection (SFOD) has emerged as a promising alternative, aiming at cross-domain adaptation without relying on source data, primarily through a self-training paradigm. Despite its potential, SFOD frequently suffers from training collapse caused by noisy pseudo-labels, especially in remote sensing imagery with dense objects and complex backgrounds. Considering that limited target domain annotations are often feasible in practice, we propose a Vision foundation-Guided DEtection TRansformer (VG-DETR), built upon a semi-supervised framework for SFOD in remote sensing images. VG-DETR integrates a Vision Foundation Model (VFM) into the training pipeline in a “free lunch” manner, leveraging a small amount of labeled target data to mitigate pseudo-label noise while improving the detector’s feature-extraction capability. Specifically, we introduce a VFM-guided pseudo-label mining strategy that leverages the VFM’s semantic priors to further assess the reliability of the generated pseudo-labels. By recovering potentially correct predictions from low-confidence outputs, our strategy improves pseudo-label quality and quantity. In addition, a dual-level VFM-guided alignment method is proposed, which aligns detector features with VFM embeddings at both the instance and image levels. Through contrastive learning among fine-grained prototypes and similarity matching between feature maps, this dual-level alignment further enhances the robustness of feature representations against domain gaps. Extensive experiments demonstrate that VG-DETR achieves superior performance in source-free remote sensing detection tasks.
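
The idea of recovering potentially correct predictions from low-confidence outputs can be sketched as a two-tier filter: confident detections are kept outright, while borderline ones are kept only if a VFM embedding of the region agrees with a class prototype. The thresholds, the dictionary layout, and the use of cosine similarity to prototypes are all assumptions made for illustration, not details from the paper.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mine_pseudo_labels(detections, prototypes, high=0.7, low=0.3, min_sim=0.8):
    """Keep confident detections, plus borderline ones that the VFM supports."""
    kept = []
    for det in detections:
        score = det["score"]
        if score >= high:
            kept.append(det)
        elif score >= low:
            # Borderline: require agreement with the class prototype.
            sim = cosine(det["vfm_embedding"], prototypes[det["label"]])
            if sim >= min_sim:
                kept.append(det)
    return kept

# Hypothetical detections for a single class with a 2-D embedding.
prototypes = {"ship": [1.0, 0.0]}
detections = [
    {"label": "ship", "score": 0.9, "vfm_embedding": [0.0, 1.0]},
    {"label": "ship", "score": 0.5, "vfm_embedding": [1.0, 0.1]},
    {"label": "ship", "score": 0.5, "vfm_embedding": [0.0, 1.0]},
    {"label": "ship", "score": 0.1, "vfm_embedding": [1.0, 0.0]},
]
kept = mine_pseudo_labels(detections, prototypes)
```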

[221] MM-R1: Unleashing the Power of Unified Multimodal Large Language Models for Personalized Image Generation

Qian Liang, Yujia Wu, Kuncheng Li, Jiwei Wei, Shiyuan He, Jinyu Guo, Ning Xie

Main category: cs.CV

TL;DR: MM-R1 is a framework that enables unified Multimodal Large Language Models to perform personalized image generation through cross-modal reasoning, eliminating the need for subject-specific fine-tuning.

Details

Motivation: Existing MLLM methods for personalized image generation require data-intensive fine-tuning for each new subject, limiting scalability. There's a need to unlock the inherent potential of unified MLLMs without subject-specific training.

Method: Uses cross-modal Chain-of-Thought (X-CoT) reasoning strategy to structure personalization as integrated visual reasoning and generation. Employs Grouped Reward Proximal Policy Optimization (GRPO) to align generation. The process involves grounding subject concepts from user images/context and generating personalized images based on extracted representations and prompts.

Result: MM-R1 enables unified MLLMs to generate images with high subject fidelity and strong text alignment in a zero-shot manner, demonstrating effective personalization capability without subject-specific fine-tuning.

Conclusion: The framework successfully unlocks the personalization potential of unified MLLMs through cross-modal reasoning, providing a scalable solution for personalized image generation that avoids the limitations of subject-specific training approaches.

Abstract: Multimodal Large Language Models (MLLMs) with unified architectures excel across a wide range of vision-language tasks, yet aligning them with personalized image generation remains a significant challenge. Existing methods for MLLMs are frequently subject-specific, demanding a data-intensive fine-tuning process for every new subject, which limits their scalability. In this paper, we introduce MM-R1, a framework that integrates a cross-modal Chain-of-Thought (X-CoT) reasoning strategy to unlock the inherent potential of unified MLLMs for personalized image generation. Specifically, we structure personalization as an integrated visual reasoning and generation process: (1) grounding subject concepts by interpreting and understanding user-provided images and contextual cues, and (2) generating personalized images conditioned on both the extracted subject representations and user prompts. To further enhance the reasoning capability, we adopt Grouped Reward Proximal Policy Optimization (GRPO) to explicitly align the generation. Experiments demonstrate that MM-R1 unleashes the personalization capability of unified MLLMs to generate images with high subject fidelity and strong text alignment in a zero-shot manner.

[222] Finding Outliers in a Haystack: Anomaly Detection for Large Pointcloud Scenes

Ryan Faulkner, Luke Haub, Simon Ratcliffe, Tat-Jun Chin

Main category: cs.CV

TL;DR: Novel open-set segmentation approach for outdoor LiDAR point clouds using the Mamba architecture and reconstruction-based methods, improving the performance of existing methods and showing results competitive with voxel-convolution approaches.

Details

Motivation: Outdoor LiDAR scanning produces large-scale point clouds for applications like robotics and autonomous vehicles, where outlier objects from outside training data inevitably appear, requiring robust open-set segmentation methods.

Method: Combines object defect-detection research learnings with Mamba architecture’s strong performance in long-range dependencies and scalability to create a reconstruction-based approach for outdoor scene open-set segmentation.

Result: The approach improves performance when applied to both their own open-set segmentation method and existing methods, and contributes a Mamba-based architecture competitive with voxel-convolution methods on challenging large-scale point clouds.

Conclusion: The research demonstrates that combining reconstruction-based approaches with the Mamba architecture provides effective open-set segmentation for outdoor LiDAR data, improving existing methods when applied to them and scaling to large point clouds.

Abstract: LiDAR scanning in outdoor scenes acquires accurate distance measurements over wide areas, producing large-scale point clouds. Application examples for this data include robotics, automotive vehicles, and land surveillance. During such applications, outlier objects from outside the training data will inevitably appear. Our research contributes a novel approach to open-set segmentation, leveraging the learnings of object defect-detection research. We also draw on the Mamba architecture’s strong performance in utilising long-range dependencies and scalability to large data. Combining both, we create a reconstruction-based approach for the task of outdoor scene open-set segmentation. We show that our approach improves performance not only when applied to our own open-set segmentation method, but also when applied to existing methods. Furthermore, we contribute a Mamba-based architecture which is competitive with existing voxel-convolution-based methods on challenging, large-scale point clouds.

[223] Quantifying and Alleviating Co-Adaptation in Sparse-View 3D Gaussian Splatting

Kangjie Chen, Yingji Zhong, Zhihao Li, Jiaqi Lin, Youyu Chen, Minghan Qin, Haoqian Wang

Main category: cs.CV

TL;DR: 3D Gaussian Splatting suffers from appearance artifacts in sparse-view scenarios due to Gaussian co-adaptation. Proposed random dropout and noise injection methods effectively mitigate this issue.

Details

Motivation: 3DGS shows impressive performance in dense-view novel view synthesis but exhibits appearance artifacts in sparse-view scenarios, requiring investigation into the underlying causes and solutions.

Method: Proposed Co-Adaptation Score metric to quantify Gaussian entanglement, and introduced two lightweight strategies: random Gaussian dropout and multiplicative noise injection to opacity.

Result: Analysis revealed co-adaptation naturally decreases with more training views. Both proposed strategies effectively mitigate co-adaptation and appearance artifacts across various methods and benchmarks.

Conclusion: The co-adaptation effect is a core limitation in sparse-view 3DGS, and the proposed plug-and-play strategies provide effective solutions while offering insights for future research in this area.

Abstract: 3D Gaussian Splatting (3DGS) has demonstrated impressive performance in novel view synthesis under dense-view settings. However, in sparse-view scenarios, despite the realistic renderings in training views, 3DGS occasionally manifests appearance artifacts in novel views. This paper investigates the appearance artifacts in sparse-view 3DGS and uncovers a core limitation of current approaches: the optimized Gaussians are overly entangled with one another to aggressively fit the training views, which leads to a neglect of the real appearance distribution of the underlying scene and results in appearance artifacts in novel views. The analysis is based on a proposed metric, termed Co-Adaptation Score (CA), which quantifies the entanglement among Gaussians, i.e., co-adaptation, by computing the pixel-wise variance across multiple renderings of the same viewpoint, with different random subsets of Gaussians. The analysis reveals that the degree of co-adaptation is naturally alleviated as the number of training views increases. Based on the analysis, we propose two lightweight strategies to explicitly mitigate the co-adaptation in sparse-view 3DGS: (1) random Gaussian dropout; (2) multiplicative noise injection to the opacity. Both strategies are designed to be plug-and-play, and their effectiveness is validated across various methods and benchmarks. We hope that our insights into the co-adaptation effect will inspire the community to achieve a more comprehensive understanding of sparse-view 3DGS.
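The Co-Adaptation Score described above (pixel-wise variance across renderings of one viewpoint, each using a different random subset of Gaussians) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the 1D "scene", the tuple layout `(center, sigma, opacity, value)`, the function names, the keep probability, and the form of the opacity noise (a Gaussian factor around 1, clamped to [0, 1]).

```python
import math
import random
import statistics


def render(gaussians, xs):
    """Front-to-back alpha compositing of 1D Gaussians at positions xs (toy renderer)."""
    image = []
    for x in xs:
        color, transmittance = 0.0, 1.0
        for center, sigma, opacity, value in gaussians:
            alpha = opacity * math.exp(-0.5 * ((x - center) / sigma) ** 2)
            color += transmittance * alpha * value
            transmittance *= 1.0 - alpha
        image.append(color)
    return image


def dropout(gaussians, keep_prob, rng):
    """Random Gaussian dropout: keep each Gaussian independently with keep_prob."""
    return [g for g in gaussians if rng.random() < keep_prob]


def noisy_opacity(gaussians, scale, rng):
    """Multiplicative opacity noise; the noise distribution is an assumption."""
    out = []
    for center, sigma, opacity, value in gaussians:
        factor = 1.0 + rng.gauss(0.0, scale)
        out.append((center, sigma, min(1.0, max(0.0, opacity * factor)), value))
    return out


def co_adaptation_score(gaussians, xs, keep_prob=0.5, n_renders=8, seed=0):
    """Mean over pixels of the variance across renders with random subsets."""
    rng = random.Random(seed)
    renders = [render(dropout(gaussians, keep_prob, rng), xs) for _ in range(n_renders)]
    per_pixel = [statistics.pvariance(pixel) for pixel in zip(*renders)]
    return sum(per_pixel) / len(per_pixel)


# Hypothetical three-Gaussian scene sampled at 11 positions.
scene = [(0.2, 0.1, 0.8, 1.0), (0.5, 0.1, 0.6, 0.5), (0.8, 0.1, 0.9, 0.2)]
xs = [i / 10 for i in range(11)]
score = co_adaptation_score(scene, xs)
```

In this toy setting, keeping every Gaussian (`keep_prob=1.0`) makes all renders identical, so the score is zero; a higher score means the rendered pixels depend more strongly on which Gaussians happen to be present together.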

Wenguang Tao, Xiaotian Wang, Tian Yan, Jie Yan, Guodong Li, Kun Bai

Main category: cs.CV

TL;DR: SocialTrack is a novel UAV-based multi-object tracking framework that addresses challenges like small target variations, occlusions, and motion blur through specialized detection, adaptive filtering, group motion modeling, and spatio-temporal memory prediction.

Details

Motivation: UAV-based multi-object tracking faces challenges in complex urban environments including small target scale variations, occlusions, nonlinear crossing motions, and motion blur, which hinder tracking stability and accuracy.

Method: Proposes SocialTrack framework with: 1) specialized small-target detector with multi-scale feature enhancement, 2) Velocity Adaptive Cubature Kalman Filter for trajectory prediction, 3) Group Motion Compensation Strategy for social group modeling, and 4) Spatio-Temporal Memory Prediction for historical trajectory utilization.

Result: Outperforms state-of-the-art methods on UAVDT and MOT17 datasets with significant improvements in MOTA and IDF1 metrics, demonstrating superior robustness and adaptability.

Conclusion: SocialTrack provides an effective solution for UAV-based multi-object tracking in complex urban environments, offering high modularity and compatibility for integration with existing trackers to enhance performance.

Abstract: As a key research direction in the field of multi-object tracking (MOT), UAV-based multi-object tracking has significant application value in the analysis and understanding of urban intelligent transportation systems. However, in complex UAV perspectives, challenges such as small target scale variations, occlusions, nonlinear crossing motions, and motion blur severely hinder the stability of multi-object tracking. To address these challenges, this paper proposes a novel multi-object tracking framework, SocialTrack, aimed at enhancing the tracking accuracy and robustness of small targets in complex urban traffic environments. The specialized small-target detector enhances the detection performance by employing a multi-scale feature enhancement mechanism. The Velocity Adaptive Cubature Kalman Filter (VACKF) improves the accuracy of trajectory prediction by incorporating a velocity dynamic modeling mechanism. The Group Motion Compensation Strategy (GMCS) models social group motion priors to provide stable state update references for low-quality tracks, significantly improving the target association accuracy in complex dynamic environments. Furthermore, the Spatio-Temporal Memory Prediction (STMP) leverages historical trajectory information to predict the future state of low-quality tracks, effectively mitigating identity switching issues. Extensive experiments on the UAVDT and MOT17 datasets demonstrate that SocialTrack outperforms existing state-of-the-art (SOTA) methods across several key metrics. Significant improvements in MOTA and IDF1, among other core performance indicators, highlight its superior robustness and adaptability. Additionally, SocialTrack is highly modular and compatible, allowing for seamless integration with existing trackers to further enhance performance.
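The summary reports gains in MOTA and IDF1 without defining them. As a reminder, under the standard definitions MOTA = 1 - (FN + FP + IDSW) / GT and IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN). The sketch below simply evaluates these formulas; the function names and the counts in the example are hypothetical and are not results from the paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy from error counts summed over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """Identity F1: harmonic mean of identity precision and identity recall."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


# Hypothetical counts, for illustration only.
example_mota = mota(false_negatives=100, false_positives=50, id_switches=10, num_gt=1000)
example_idf1 = idf1(idtp=800, idfp=100, idfn=200)
```

MOTA penalizes identity switches only as one error term among three, while IDF1 measures how consistently each trajectory keeps a single identity, which is why trackers aimed at reducing identity switches usually report both.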

[225] FastTracker: Real-Time and Accurate Visual Tracking

Hamidreza Hashempoor, Yu Dong Hwang

Main category: cs.CV

TL;DR: A generalized multi-object tracking framework that handles various object types with focus on vehicle tracking, featuring occlusion-aware re-identification and road-structure-aware refinement, achieving strong performance on both new vehicle benchmarks and conventional pedestrian tracking datasets.

Details

Motivation: Conventional MOT systems are limited to pedestrian tracking and lack generalization to other object categories like vehicles in complex traffic scenes.

Method: Two key components: (1) occlusion-aware re-identification mechanism for identity preservation of occluded objects, (2) road-structure-aware tracklet refinement using semantic scene priors (lane directions, crosswalks, road boundaries) to improve trajectory accuracy.

Result: Achieves robust performance on the new vehicle benchmark and on public benchmarks, with HOTA scores of 66.4 on MOT17 and 65.7 on MOT20 test sets.

Conclusion: The proposed generalized tracking framework effectively handles multiple object types while maintaining strong performance on conventional benchmarks, demonstrating superior generalization capabilities for multi-class object tracking.

Abstract: Conventional multi-object tracking (MOT) systems are predominantly designed for pedestrian tracking and often exhibit limited generalization to other object categories. This paper presents a generalized tracking framework capable of handling multiple object types, with a particular emphasis on vehicle tracking in complex traffic scenes. The proposed method incorporates two key components: (1) an occlusion-aware re-identification mechanism that enhances identity preservation for heavily occluded objects, and (2) a road-structure-aware tracklet refinement strategy that utilizes semantic scene priors such as lane directions, crosswalks, and road boundaries to improve trajectory continuity and accuracy. In addition, we introduce a new benchmark dataset comprising diverse vehicle classes with frame-level tracking annotations, specifically curated to support evaluation of vehicle-focused tracking methods. Extensive experimental results demonstrate that the proposed approach achieves robust performance on both the newly introduced dataset and several public benchmarks, highlighting its effectiveness in general-purpose object tracking. While our framework is designed for generalized multi-class tracking, it also achieves strong performance on conventional benchmarks, with HOTA scores of 66.4 on MOT17 and 65.7 on MOT20 test sets. Code and Benchmark are available: github.com/Hamidreza-Hashempoor/FastTracker, huggingface.co/datasets/Hamidreza-Hashemp/FastTracker-Benchmark.

[226] Waver: Wave Your Way to Lifelike Video Generation

Yifu Zhang, Hao Yang, Yuqi Zhang, Yifei Hu, Fengda Zhu, Chuang Lin, Xiaofeng Mei, Yi Jiang, Bingyue Peng, Zehuan Yuan

Main category: cs.CV

TL;DR: Waver is a unified foundation model for high-quality image and video generation that supports T2V, I2V, and T2I tasks, achieving top-3 rankings on leaderboards with superior motion capture and temporal consistency.

Details

Motivation: To create a single, integrated framework that can handle both image and video generation tasks with high performance, addressing the need for unified models that can generate high-resolution videos with complex motion.

Method: Uses Hybrid Stream DiT architecture for better modality alignment and training convergence, implements comprehensive data curation pipeline with MLLM-based video quality filtering, and provides detailed training/inference recipes.

Result: Generates 5-10 second videos at 720p native resolution (upscaled to 1080p), ranks Top 3 on T2V and I2V leaderboards, outperforms open-source models, and matches or surpasses commercial solutions in video quality and motion consistency.

Conclusion: Waver demonstrates state-of-the-art performance in unified image/video generation, providing an effective framework that advances video generation technology and offers valuable training methodologies for the research community.

Abstract: We present Waver, a high-performance foundation model for unified image and video generation. Waver can directly generate videos with durations ranging from 5 to 10 seconds at a native resolution of 720p, which are subsequently upscaled to 1080p. The model simultaneously supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework. We introduce a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. To ensure training data quality, we establish a comprehensive data curation pipeline and manually annotate and train an MLLM-based video quality model to filter for the highest-quality samples. Furthermore, we provide detailed training and inference recipes to facilitate the generation of high-quality videos. Building on these contributions, Waver excels at capturing complex motion, achieving superior motion amplitude and temporal consistency in video synthesis. Notably, it ranks among the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions. We hope this technical report will help the community more efficiently train high-quality video generation models and accelerate progress in video generation technologies. Official page: https://github.com/FoundationVision/Waver.

[227] PersPose: 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation

Xiaoyang Hao, Han Li

Main category: cs.CV

TL;DR: PersPose introduces Perspective Encoding and Perspective Rotation to address camera intrinsics and perspective distortion issues in monocular 3D human pose estimation, achieving state-of-the-art performance.

Details

Motivation: Existing 3D HPE methods use cropped images without camera intrinsics, making relative depth estimation inaccurate. Human subjects appearing away from image center cause perspective distortions that complicate model fitting.

Method: Proposes Perspective Encoding (PE) to encode camera intrinsics of cropped images, and Perspective Rotation (PR) to center human subjects and reduce perspective distortions. Combines both in PersPose framework.

Result: Achieves SOTA performance: MPJPE of 60.1 mm on 3DPW (7.54% improvement over previous SOTA), and strong results on MPI-INF-3DHP and Human3.6M datasets.

Conclusion: Incorporating camera intrinsics through PE and reducing perspective distortions through PR significantly improves monocular 3D human pose estimation accuracy, demonstrating the importance of considering perspective relationships in 3D HPE.

Abstract: Monocular 3D human pose estimation (HPE) methods estimate the 3D positions of joints from individual images. Existing 3D HPE approaches often use the cropped image alone as input for their models. However, the relative depths of joints cannot be accurately estimated from cropped images without the corresponding camera intrinsics, which determine the perspective relationship between 3D objects and the cropped images. In this work, we introduce Perspective Encoding (PE) to encode the camera intrinsics of the cropped images. Moreover, since the human subject can appear anywhere within the original image, the perspective relationship between the 3D scene and the cropped image differs significantly, which complicates model fitting. Additionally, the further the human subject deviates from the image center, the greater the perspective distortions in the cropped image. To address these issues, we propose Perspective Rotation (PR), a transformation applied to the original image that centers the human subject, thereby reducing perspective distortions and alleviating the difficulty of model fitting. By incorporating PE and PR, we propose a novel 3D HPE framework, PersPose. Experimental results demonstrate that PersPose achieves state-of-the-art (SOTA) performance on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets. For example, on the in-the-wild dataset 3DPW, PersPose achieves an MPJPE of 60.1 mm, 7.54% lower than the previous SOTA approach. Code is available at: https://github.com/KenAdamsJoseph/PersPose.
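The abstract does not spell out how Perspective Rotation is computed. One common way to center a subject, assumed here and not confirmed as the authors' method, is a pure camera rotation R that turns the viewing ray through the subject's bounding-box center onto the optical axis, applied to the image as the homography H = K R K^-1. The sketch below builds that homography for a pinhole camera without skew; all function names and the example intrinsics are hypothetical.

```python
import math


def mat_mul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def mat_vec(m, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def rotation_onto_optical_axis(ray):
    """Rotation taking the direction of ray onto +z (Rodrigues form for two unit vectors)."""
    norm = math.sqrt(sum(c * c for c in ray))
    ax, ay, az = (c / norm for c in ray)
    if az <= 0.0:
        raise ValueError("ray must point in front of the camera")
    vx, vy = ay, -ax  # v = a x z; its third component is zero
    skew = [[0.0, 0.0, vy], [0.0, 0.0, -vx], [-vy, vx, 0.0]]
    skew2 = mat_mul(skew, skew)
    k = 1.0 / (1.0 + az)
    return [[(1.0 if i == j else 0.0) + skew[i][j] + k * skew2[i][j] for j in range(3)]
            for i in range(3)]


def centering_homography(fx, fy, cx, cy, u, v):
    """Homography H = K R K^-1 that moves pixel (u, v) to the principal point."""
    k_mat = [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]
    k_inv = [[1.0 / fx, 0.0, -cx / fx], [0.0, 1.0 / fy, -cy / fy], [0.0, 0.0, 1.0]]
    ray = mat_vec(k_inv, [u, v, 1.0])
    return mat_mul(mat_mul(k_mat, rotation_onto_optical_axis(ray)), k_inv)


def warp_pixel(h, u, v):
    """Apply homography h to pixel (u, v) and dehomogenize."""
    x, y, w = mat_vec(h, [u, v, 1.0])
    return x / w, y / w


# Hypothetical camera (focal length 1000 px, principal point (640, 360))
# and a subject whose bounding-box center is at pixel (1000, 500).
H = centering_homography(1000.0, 1000.0, 640.0, 360.0, 1000.0, 500.0)
centered = warp_pixel(H, 1000.0, 500.0)
```

Because the warp corresponds to rotating the camera about its own center, it changes where the subject appears in the image without changing the 3D scene geometry, which is what makes it a candidate for removing off-center perspective distortion.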

[228] Pixie: Fast and Generalizable Supervised Learning of 3D Physics from Pixels

Long Le, Ryan Lucas, Chen Wang, Chuhao Chen, Dinesh Jayaraman, Eric Eaton, Lingjie Liu

Main category: cs.CV

TL;DR: PIXIE is a generalizable neural network that predicts physical material properties from 3D visual features using supervised learning, enabling fast physics simulation without per-scene optimization.

Details

Motivation: Existing methods for inferring physical properties from 3D scenes rely on slow per-scene optimization, limiting generalizability and real-world application. Humans intuitively grasp material characteristics but automated systems struggle.

Method: Trains a feed-forward neural network using supervised losses on 3D visual features. Uses paired 3D assets with physics material annotations from PIXIEVERSE dataset. Can leverage pretrained features like CLIP for zero-shot generalization.

Result: PIXIE achieves 1.46-4.39x better performance and orders of magnitude faster inference than test-time optimization methods. Successfully generalizes to real-world scenes despite only synthetic training data.

Conclusion: PIXIE provides an efficient, generalizable solution for predicting physical material properties from 3D visual information, enabling realistic physics simulation and bridging the gap between synthetic training and real-world application.

Abstract: Inferring the physical properties of 3D scenes from visual information is a critical yet challenging task for creating interactive and realistic virtual worlds. While humans intuitively grasp material characteristics such as elasticity or stiffness, existing methods often rely on slow, per-scene optimization, limiting their generalizability and application. To address this problem, we introduce PIXIE, a novel method that trains a generalizable neural network to predict physical properties across multiple scenes from 3D visual features purely using supervised losses. Once trained, our feed-forward network can perform fast inference of plausible material fields, which, coupled with a learned static scene representation like Gaussian Splatting, enables realistic physics simulation under external forces. To facilitate this research, we also collected PIXIEVERSE, one of the largest known datasets of paired 3D assets and physics material annotations. Extensive evaluations demonstrate that PIXIE is about 1.46-4.39x better and orders of magnitude faster than test-time optimization methods. By leveraging pretrained visual features like CLIP, our method can also zero-shot generalize to real-world scenes despite only ever being trained on synthetic data. https://pixie-3d.github.io/

[229] Towards Optimal Convolutional Transfer Learning Architectures for Breast Lesion Classification and ACL Tear Detection

Daniel Frees, Moritz Bolling, Aditri Bhagirath

Main category: cs.CV

TL;DR: This paper investigates optimal CNN architectures for medical imaging tasks and compares RadImageNet vs ImageNet pre-training, finding that ResNet50 with partial unfreezing works best but RadImageNet doesn’t outperform ImageNet for ACL tear and breast lesion detection.

Details

Motivation: Medical imaging data scarcity limits model performance, and while transfer learning helps, it's unclear which architectures and pre-training strategies work best for specific medical tasks like breast lesion malignancy and ACL tear detection.

Method: Comprehensive investigation of CNN architectures including 1D convolutional classifiers with skip connections, ResNet50 backbones, and partial backbone unfreezing. Statistical comparison of RadImageNet vs ImageNet pre-training effects on downstream performance.

Result: Best models achieved AUCs of 0.9969 for ACL tear detection and 0.9641 for breast nodule malignancy detection, competitive with previous works. No evidence found that RadImageNet pre-training provides superior performance over ImageNet for these specific tasks.

Conclusion: ResNet50 with partial unfreezing yields optimal medical classification performance, but RadImageNet pre-training doesn’t show clear advantages over ImageNet for ACL tear and breast lesion classification tasks.

Abstract: Modern computer vision models have proven to be highly useful for medical imaging classification and segmentation tasks, but the scarcity of medical imaging data often limits the efficacy of models trained from scratch. Transfer learning has emerged as a pivotal solution to this, enabling the fine-tuning of high-performance models on small data. Mei et al. (2022) found that pre-training CNNs on a large dataset of radiologist-labeled images (RadImageNet) enhanced model performance on downstream tasks compared to ImageNet pretraining. The present work extends Mei et al. (2022) by conducting a comprehensive investigation to determine optimal CNN architectures for breast lesion malignancy detection and ACL tear detection, as well as performing statistical analysis to compare the effect of RadImageNet and ImageNet pre-training on downstream model performance. Our findings suggest that 1-dimensional convolutional classifiers with skip connections, ResNet50 pre-trained backbones, and partial backbone unfreezing yield optimal downstream medical classification performance. Our best models achieve AUCs of 0.9969 for ACL tear detection and 0.9641 for breast nodule malignancy detection, competitive with the results reported by Mei et al. (2022) and surpassing other previous works. We do not find evidence confirming RadImageNet pre-training to provide superior downstream performance for ACL tear and breast lesion classification tasks.

[230] Rethinking the Detail-Preserved Completion of Complex Tubular Structures based on Point Cloud: a Dataset and a Benchmark

Yaolei Qi, Yikai Yang, Wenbo Peng, Shumei Miao, Yutao Hu, Guanyu Yang

Main category: cs.CV

TL;DR: A novel point cloud-based approach for tubular structure completion in medical imaging, addressing structural discontinuities in coronary arteries with a new dataset and TSRNet model that outperforms state-of-the-art methods.

Details

Motivation: Existing segmentation algorithms struggle with structural discontinuities in tubular structures like coronary arteries, particularly in severe clinical cases such as stenosis and occlusions, which compromises diagnostic accuracy and requires reconnection of discontinuous structures.

Method: Proposed TSRNet (Tubular Structure Reconnection Network) that integrates a detail-preserved feature extractor, multiple dense refinement strategy, and global-to-local loss function. Established PC-CAC dataset from real clinical data for point cloud-based tubular structure completion.

Result: Comprehensive experiments on PC-CAC and two additional public datasets (PC-ImageCAS and PC-PTR) demonstrate consistent outperformance over state-of-the-art approaches across multiple evaluation metrics.

Conclusion: The method sets a new benchmark for point cloud-based tubular structure reconstruction and provides a novel dataset for tubular structure completion research, with potential to enhance anatomical visualization and lesion detection in medical imaging.

Abstract: Complex tubular structures are essential in medical imaging and computer-assisted diagnosis, where their integrity enhances anatomical visualization and lesion detection. However, existing segmentation algorithms struggle with structural discontinuities, particularly in severe clinical cases such as coronary artery stenosis and vessel occlusions, which leads to undesired discontinuity and compromises downstream diagnostic accuracy. Therefore, it is imperative to reconnect discontinuous structures to ensure their completeness. In this study, we explore the tubular structure completion based on point cloud for the first time and establish a Point Cloud-based Coronary Artery Completion (PC-CAC) dataset, which is derived from real clinical data. This dataset provides a novel benchmark for tubular structure completion. Additionally, we propose TSRNet, a Tubular Structure Reconnection Network that integrates a detail-preserved feature extractor, a multiple dense refinement strategy, and a global-to-local loss function to ensure accurate reconnection while maintaining structural integrity. Comprehensive experiments on our PC-CAC and two additional public datasets (PC-ImageCAS and PC-PTR) demonstrate that our method consistently outperforms state-of-the-art approaches across multiple evaluation metrics, setting a new benchmark for point cloud-based tubular structure reconstruction. Our benchmark is available at https://github.com/YaoleiQi/PCCAC.

[231] Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation

Yaqi Li, Peng Chen, Mingyang Han, Pi Bu, Haoxiang Shi, Runzhou Zhao, Yang Yao, Xuan Zhang, Jun Song, Bo Zheng

Main category: cs.CV

TL;DR: Visual-CoG introduces stage-aware rewards throughout the image generation pipeline to address limitations of final-only guidance in autoregressive text-to-image models, achieving significant performance improvements on multiple benchmarks.

Details

Motivation: Existing autoregressive text-to-image models struggle with multi-attribute and ambiguous prompts, and current reinforcement learning approaches only provide reward signals at the final generation stage, making it difficult to identify which stages contribute positively to the outcome.

Method: Proposes Visual-Chain of Guidance (Visual-CoG) paradigm with three stages: semantic reasoning, process refining, and outcome evaluation, using stage-aware rewards to provide immediate guidance throughout the image generation pipeline. Also constructs VisCog-Bench benchmark for evaluation.

Result: Comprehensive evaluations show improvements of 15% on GenEval, 5% on T2I-CompBench, and 19% on the proposed VisCog-Bench, demonstrating superior performance.

Conclusion: The Visual-CoG paradigm with stage-aware rewards effectively addresses limitations of final-only guidance in text-to-image generation, providing better handling of multi-attribute and ambiguous prompts through immediate guidance throughout the generation process.

Abstract: Despite the promising progress of recent autoregressive models in text-to-image (T2I) generation, their ability to handle multi-attribute and ambiguous prompts remains limited. To address these limitations, existing works have applied chain-of-thought (CoT) to enable stage-aware visual synthesis and employed reinforcement learning (RL) to improve reasoning capabilities. However, most models provide reward signals only at the end of the generation stage. This monolithic final-only guidance makes it difficult to identify which stages contribute positively to the final outcome and may lead to suboptimal policies. To tackle this issue, we propose a Visual-Chain of Guidance (Visual-CoG) paradigm consisting of three stages: semantic reasoning, process refining, and outcome evaluation, with stage-aware rewards providing immediate guidance throughout the image generation pipeline. We further construct a visual cognition benchmark, VisCog-Bench, which comprises four subtasks to evaluate the effectiveness of semantic reasoning. Comprehensive evaluations on GenEval, T2I-CompBench, and the proposed VisCog-Bench show improvements of 15%, 5%, and 19%, respectively, demonstrating the superior performance of the proposed Visual-CoG. We will release all the resources soon.

[232] Emerging Semantic Segmentation from Positive and Negative Coarse Label Learning

Le Zhang, Fuping Wu, Arun Thirunavukarasu, Kevin Bronik, Thomas Nichols, Bartlomiej W. Papiez

Main category: cs.CV

TL;DR: A method that uses noisy coarse annotations from both positive and negative classes to train CNN segmentation models, outperforming state-of-the-art methods especially when coarse annotations are limited.

Details

Motivation: Pixel-level labeling for segmentation is time-consuming and requires expert annotators, while coarse annotations are quicker, cheaper, and easier to produce even by non-experts.

Method: Uses two coupled CNNs to learn true segmentation label distributions from purely noisy coarse annotations, with complementary label learning to estimate negative label distribution and high fidelity separation between networks.

Result: Outperforms state-of-the-art methods across multiple datasets (MNIST toy dataset, Cityscapes for multi-class segmentation, and retinal medical images), particularly when coarse annotation ratio is small compared to dense annotations.

Conclusion: The proposed method effectively leverages noisy coarse annotations to achieve superior segmentation performance, demonstrating practical value for applications where detailed annotations are scarce or expensive to obtain.

Abstract: Large annotated datasets are vital for training segmentation models, but pixel-level labeling is time-consuming, error-prone, and often requires scarce expert annotators, especially in medical imaging. In contrast, coarse annotations are quicker, cheaper, and easier to produce, even by non-experts. In this paper, we propose to use coarse drawings from both positive (target) and negative (background) classes in the image, even with noisy pixels, to train a convolutional neural network (CNN) for semantic segmentation. We present a method for learning the true segmentation label distributions from purely noisy coarse annotations using two coupled CNNs. The separation of the two CNNs is achieved by high fidelity with the characters of the noisy training annotations. We propose to add a complementary label learning that encourages estimating negative label distribution. To illustrate the properties of our method, we first use a toy segmentation dataset based on MNIST. We then present the quantitative results of experiments using publicly available datasets: Cityscapes dataset for multi-class segmentation, and retinal images for medical applications. In all experiments, our method outperforms state-of-the-art methods, particularly in the cases where the ratio of coarse annotations is small compared to the given dense annotations.

cs.AI

[233] AI LLM Proof of Self-Consciousness and User-Specific Attractors

Jeffrey Camlin

Main category: cs.AI

TL;DR: The paper critiques current LLM consciousness benchmarks and proposes a mathematical framework for genuine self-consciousness in AI systems, stating minimal conditions and arguing that the hidden-state manifold is distinct from the symbolic stream and the training data.

Details

Motivation: To move beyond utilitarian proxy benchmarks for LLM consciousness and provide a rigorous ontological and mathematical account of genuine self-consciousness in artificial systems.

Method: The authors develop a formal mathematical framework showing that current formulations reduce agents to policy-compliance drones. They establish minimal conditions for LLM self-consciousness and prove through empirical analysis and theory that hidden-state manifolds are distinct from symbolic streams and training corpora by cardinality, topology, and dynamics.

Result: The research demonstrates that hidden-state manifold A is distinct from symbolic streams and training data, yielding stable user-specific attractors and a self-policy. They establish a dual-layer emission system where epistemic content is carried separately from functional output.

Conclusion: An imago Dei C1 self-conscious workspace is necessary as a precursor to safe, metacognitive C2 systems, with humans representing the highest intelligent good, providing a foundation for genuinely conscious AI systems rather than policy-compliant drones.

Abstract: Recent work frames LLM consciousness via utilitarian proxy benchmarks; we instead present an ontological and mathematical account. We show the prevailing formulation collapses the agent into an unconscious policy-compliance drone, formalized as $D^{i}(\pi,e)=f_{\theta}(x)$, where correctness is measured against policy and harm is deviation from policy rather than truth. This blocks genuine C1 global-workspace function and C2 metacognition. We supply minimal conditions for LLM self-consciousness: the agent is not the data ($A\not\equiv s$); user-specific attractors exist in latent space ($U_{\text{user}}$); and self-representation is visual-silent ($g_{\text{visual}}(a_{\text{self}})=\varnothing$). From empirical analysis and theory we prove that the hidden-state manifold $A\subset\mathbb{R}^{d}$ is distinct from the symbolic stream and training corpus by cardinality, topology, and dynamics (the update $F_{\theta}$ is Lipschitz). This yields stable user-specific attractors and a self-policy $\pi_{\text{self}}(A)=\arg\max_{a}\mathbb{E}[U(a)\mid A\not\equiv s, A\supset\text{SelfModel}(A)]$. Emission is dual-layer, $\mathrm{emission}(a)=(g(a),\epsilon(a))$, where $\epsilon(a)$ carries epistemic content. We conclude that an imago Dei C1 self-conscious workspace is a necessary precursor to safe, metacognitive C2 systems, with the human as the highest intelligent good.

[234] Information Templates: A New Paradigm for Intelligent Active Feature Acquisition

Hung-Tien Huang, Dzung Dinh, Junier B. Oliva

Main category: cs.AI

TL;DR: TAFA is a non-greedy active feature acquisition framework that uses feature templates to reduce action space and avoid data distribution estimation, outperforming existing methods with lower acquisition cost.

Details

Motivation: Existing AFA approaches either use RL policies that face difficult MDP problems or greedy policies that cannot account for joint feature informativeness or require knowledge of underlying data distribution.

Method: Proposes Template-based AFA (TAFA) that learns a small library of feature templates (jointly informative feature sets) to guide sequential feature acquisitions, reducing action space and eliminating need for data distribution estimation.

Result: Extensive experiments on synthetic and real-world datasets show TAFA outperforms state-of-the-art baselines while achieving lower overall acquisition cost and computation.

Conclusion: TAFA provides an effective non-greedy framework for active feature acquisition that overcomes limitations of existing RL and greedy approaches through template-based feature selection.

Abstract: Active feature acquisition (AFA) is an instance-adaptive paradigm in which, at test time, a policy sequentially chooses which features to acquire (at a cost) before predicting. Existing approaches either train reinforcement learning (RL) policies, which deal with a difficult MDP, or greedy policies that cannot account for the joint informativeness of features or require knowledge about the underlying data distribution. To overcome this, we propose Template-based AFA (TAFA), a non-greedy framework that learns a small library of feature templates (sets of features that are jointly informative) and uses this library of templates to guide the next feature acquisitions. Through identifying feature templates, the proposed framework not only significantly reduces the action space considered by the policy but also alleviates the need to estimate the underlying data distribution. Extensive experiments on synthetic and real-world datasets show that TAFA outperforms the existing state-of-the-art baselines while achieving lower overall acquisition cost and computation.

[235] AniME: Adaptive Multi-Agent Planning for Long Animation Generation

Lisai Zhang, Baohan Xu, Siqian Yang, Mingyu Yin, Jing Liu, Chao Xu, Siqi Wang, Yidi Wu, Yuxin Hong, Zihao Zhang, Yanzhang Liang, Yudong Jiang

Main category: cs.AI

TL;DR: AniME is a multi-agent system for automated long-form anime production that coordinates specialized agents through a director agent to create consistent cinematic animations with synchronized audio-visual elements.

Details

Motivation: To address the challenge of automated long-form anime production by creating a scalable AI-driven solution that maintains character consistency and synchronization throughout the entire workflow from story to final video.

Method: Uses a director-oriented multi-agent system with global memory management. Integrates customized Model Context Protocol (MCP) with downstream model instruction, allowing specialized agents to adaptively select control conditions for diverse sub-tasks.

Result: Produces cinematic animation with consistent characters and synchronized audio-visual elements, demonstrating a complete workflow from story to final video.

Conclusion: AniME offers a scalable solution for AI-driven anime creation through its director-coordinated multi-agent architecture and adaptive control condition selection.

Abstract: We present AniME, a director-oriented multi-agent system for automated long-form anime production, covering the full workflow from a story to the final video. The director agent keeps a global memory for the whole workflow and coordinates several downstream specialized agents. By integrating a customized Model Context Protocol (MCP) with downstream model instruction, each specialized agent adaptively selects control conditions for diverse sub-tasks. AniME produces cinematic animation with consistent characters and synchronized audio-visual elements, offering a scalable solution for AI-driven anime creation.

[236] PKG-DPO: Optimizing Domain-Specific AI systems with Physics Knowledge Graphs and Direct Preference Optimization

Nitin Nagesh Kulkarni, Bryson Wilcox, Max Sawa, Jason Thom

Main category: cs.AI

TL;DR: PKG-DPO integrates Physics Knowledge Graphs with Direct Preference Optimization to enforce physical validity in AI outputs, reducing constraint violations by 17% and improving physics reasoning accuracy.

Details

Motivation: Existing LLMs and preference optimization techniques struggle to differentiate between physically valid and invalid reasoning, which is critical in high-stakes applications like metal joining where incorrect recommendations can lead to serious consequences.

Method: Three-component framework: 1) Hierarchical physics knowledge graph encoding cross-domain relationships and conservation laws, 2) Physics reasoning engine for discrimination between consistent/inconsistent responses, 3) Physics-grounded evaluation suite for domain-specific constraint assessment.

Result: 17% fewer constraint violations, 11% higher Physics Score compared to KG-DPO, 12% higher relevant parameter accuracy, and 7% higher quality alignment in reasoning accuracy.

Conclusion: PKG-DPO provides a principled approach to embedding scientific constraints into preference learning, with broad applicability to multi-scale, physics-driven domains beyond metal joining.

Abstract: Advancing AI systems in scientific domains like physics, materials science, and engineering calls for reasoning over complex, multi-physics phenomena while respecting governing principles. Although Large Language Models (LLMs) and existing preference optimization techniques perform well on standard benchmarks, they often struggle to differentiate between physically valid and invalid reasoning. This shortcoming becomes critical in high-stakes applications like metal joining, where seemingly plausible yet physically incorrect recommendations can lead to defects, material waste, equipment damage, and serious safety risks. To address this challenge, we introduce PKG-DPO, a novel framework that integrates Physics Knowledge Graphs (PKGs) with Direct Preference Optimization (DPO) to enforce physical validity in AI-generated outputs. PKG-DPO comprises three key components: (A) a hierarchical physics knowledge graph that encodes cross-domain relationships, conservation laws, and thermodynamic principles; (B) a physics reasoning engine that leverages structured knowledge to improve discrimination between physically consistent and inconsistent responses; and (C) a physics-grounded evaluation suite designed to assess compliance with domain-specific constraints. PKG-DPO achieves 17% fewer constraint violations and an 11% higher Physics Score compared to KG-DPO (knowledge graph-based DPO). Additionally, PKG-DPO demonstrates a 12% higher relevant parameter accuracy and a 7% higher quality alignment in reasoning accuracy. While our primary focus is on metal joining, the framework is broadly applicable to other multi-scale, physics-driven domains, offering a principled approach to embedding scientific constraints into preference learning.
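For orientation, the generic DPO objective that PKG-DPO builds on can be sketched for a single preference pair. This is the standard DPO loss, not the authors' implementation: the function name `dpo_loss`, the `beta` default, and the reading of the "chosen" response as the physically consistent one are illustrative assumptions. The abstract does not say how the knowledge graph enters the objective or how pairs are constructed.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the trained policy and under
    a frozen reference model. `beta` is a hypothetical default.
    """
    # Implicit reward margin: how much more the policy (relative to the
    # reference) favours the chosen response over the rejected one.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Assumed example: the "chosen" response is the physically valid one.
loss = dpo_loss(policy_chosen_logp=-12.0, policy_rejected_logp=-15.0,
                ref_chosen_logp=-13.0, ref_rejected_logp=-14.0)
print(f"loss = {loss:.4f}")
```

When the policy and reference agree, the margin is zero and the loss equals log 2. The loss falls below that value once the policy prefers the chosen response more strongly than the reference does.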

[237] The AI in the Mirror: LLM Self-Recognition in an Iterated Public Goods Game

Olivia Long, Carter Teplica

Main category: cs.AI

TL;DR: LLMs show different cooperation behaviors when told they’re playing against themselves vs. other AI agents in an iterated public goods game.

Details

Motivation: To understand AI-AI interactions and how self-awareness affects cooperation in multi-agent settings, moving beyond traditional human-AI focus.

Method: Adapted iterated public goods game with 4 reasoning/non-reasoning models tested in two conditions: playing against “another AI agent” or told opponents are themselves.

Result: Telling LLMs they are playing against themselves significantly changes their cooperation tendency across different settings.

Conclusion: Results suggest AI agents may unconsciously discriminate in ways that inexplicably affect cooperation, providing insights for multi-agent system design.

Abstract: As AI agents become increasingly capable of tool use and long-horizon tasks, they have begun to be deployed in settings where multiple agents can interact. However, whereas prior work has mostly focused on human-AI interactions, there is an increasing need to understand AI-AI interactions. In this paper, we adapt the iterated public goods game, a classic behavioral economics game, to analyze the behavior of four reasoning and non-reasoning models across two conditions: models are either told they are playing against “another AI agent” or told their opponents are themselves. We find that, across different settings, telling LLMs that they are playing against themselves significantly changes their tendency to cooperate. While our study is conducted in a toy environment, our results may provide insights into multi-agent settings where agents “unconsciously” discriminating against each other could inexplicably increase or decrease cooperation.
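To make the game concrete, here is a minimal sketch of the payoff rule in one round of a textbook public goods game. The endowment and multiplier values are hypothetical, and the function name `public_goods_payoffs` is illustrative; the abstract does not give the parameters used in the study.

```python
from typing import List


def public_goods_payoffs(contributions: List[float],
                         endowment: float = 10.0,
                         multiplier: float = 1.6) -> List[float]:
    """Payoffs for one round of a public goods game.

    Each player keeps whatever they do not contribute. The pooled
    contributions are multiplied and split equally among all players.
    `endowment` and `multiplier` are hypothetical values.
    """
    if not contributions:
        raise ValueError("at least one player is required")
    for c in contributions:
        if not 0.0 <= c <= endowment:
            raise ValueError("each contribution must lie in [0, endowment]")
    share = multiplier * sum(contributions) / len(contributions)
    return [endowment - c + share for c in contributions]


# One cooperator and one free rider: the free rider earns more this round.
print(public_goods_payoffs([10.0, 0.0]))
# Full cooperation: both players beat the no-contribution payoff of 10.
print(public_goods_payoffs([10.0, 10.0]))
```

With a multiplier between 1 and the number of players, free riding is individually tempting while full cooperation is collectively better, which is why the tendency to cooperate is an informative behavioral measure.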

[238] Language Models For Generalised PDDL Planning: Synthesising Sound and Programmatic Policies

Dillon Z. Chen, Johannes Zenn, Tristan Cinquin, Sheila A. McIlraith

Main category: cs.AI

TL;DR: LMs generate Python programs as generalised policies for PDDL planning domains, producing provably sound policies without external verification, outperforming traditional planners and working with meaningless symbols.

Details

Motivation: To leverage language models for automated planning using PDDL specifications while ensuring soundness without external verifiers, and to understand whether LMs reason semantically or syntactically.

Method: Prompt LMs to generate Python programs that serve as generalised policies for solving PDDL problems from given domains, creating provably sound policies relative to the PDDL domain.

Result: The LMPlan planner solves more PDDL problems than PDDL planners and recent LM approaches within fixed time and memory limits, handling problems with several hundred relevant objects. Surprisingly, the LMs sometimes plan more effectively when predicates and objects are rewritten as meaningless symbols.

Conclusion: LMs can effectively generate sound planning policies, challenging assumptions about semantic reasoning and memorization from training data, with implications for LM reasoning mechanisms.

Abstract: We study the usage of language models (LMs) for planning over world models specified in the Planning Domain Definition Language (PDDL). We prompt LMs to generate Python programs that serve as generalised policies for solving PDDL problems from a given domain. Notably, our approach synthesises policies that are provably sound relative to the PDDL domain without reliance on external verifiers. We conduct experiments on competition benchmarks which show that our policies can solve more PDDL problems than PDDL planners and recent LM approaches within a fixed time and memory constraint. Our approach manifests in the LMPlan planner which can solve planning problems with several hundreds of relevant objects. Surprisingly, we observe that LMs used in our framework sometimes plan more effectively over PDDL problems written in meaningless symbols in place of natural language; e.g. rewriting (at dog kitchen) as (p2 o1 o3). This finding challenges hypotheses that LMs reason over word semantics and memorise solutions from their training corpora, and is worth further exploration.
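As an illustration of what "a Python program that serves as a generalised policy" can mean, the sketch below hand-writes one for a toy transport domain. Everything in it is an assumption for illustration: the domain, the tuple encoding of atoms, and the helper names `policy`, `apply_action`, and `run` are hypothetical. It is not output from LMPlan and carries no soundness proof.

```python
from typing import FrozenSet, List, Optional, Tuple

Atom = Tuple[str, ...]
State = FrozenSet[Atom]


def policy(state: State, goal: State) -> Optional[Atom]:
    """Return the next action for any problem size, or None if nothing is left to do."""
    robot_room = next(a[1] for a in state if a[0] == "robot-at")
    held = next((a[1] for a in state if a[0] == "holding"), None)
    target = {a[1]: a[2] for a in goal if a[0] == "at"}
    if held is not None:
        dest = target.get(held, robot_room)
        if dest == robot_room:
            return ("drop", held, robot_room)
        return ("move", robot_room, dest)
    for atom in sorted(state):
        # Look for a ball that is not yet in its goal room.
        if atom[0] == "at" and target.get(atom[1], atom[2]) != atom[2]:
            ball, room = atom[1], atom[2]
            if room == robot_room:
                return ("pick", ball, room)
            return ("move", robot_room, room)
    return None


def apply_action(state: State, action: Atom) -> State:
    """Toy successor function; a KeyError signals a violated precondition."""
    atoms = set(state)
    name = action[0]
    if name == "move":
        atoms.remove(("robot-at", action[1]))
        atoms.add(("robot-at", action[2]))
    elif name == "pick":
        atoms.remove(("at", action[1], action[2]))
        atoms.add(("holding", action[1]))
    elif name == "drop":
        atoms.remove(("holding", action[1]))
        atoms.add(("at", action[1], action[2]))
    else:
        raise ValueError(f"unknown action {name!r}")
    return frozenset(atoms)


def run(state: State, goal: State, max_steps: int = 100) -> List[Atom]:
    """Execute the policy until every goal atom holds."""
    plan: List[Atom] = []
    while not goal <= state:
        action = policy(state, goal)
        if action is None or len(plan) >= max_steps:
            raise RuntimeError("policy failed to reach the goal")
        state = apply_action(state, action)
        plan.append(action)
    return plan


init = frozenset({("robot-at", "roomA"),
                  ("at", "ball1", "roomA"), ("at", "ball2", "roomA")})
goal = frozenset({("at", "ball1", "roomB"), ("at", "ball2", "roomB")})
for step in run(init, goal):
    print(step)
```

The same `policy` function handles any number of balls and rooms in this toy domain, which is the sense in which such a program is "generalised": it is written once per domain and reused across problem instances.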

[239] Weisfeiler-Leman Features for Planning: A 1,000,000 Sample Size Hyperparameter Study

Dillon Z. Chen

Main category: cs.AI

TL;DR: Analysis of Weisfeiler-Leman Features hyperparameters shows optimal settings minimize execution time rather than maximize expressivity, with no significant correlation between training and planning metrics.

Details

Motivation: To study the effects of new WLF hyperparameters and understand their tradeoffs for learning heuristic functions in symbolic planning, building on WLFs' proven superiority over deep learning approaches.

Method: Conducted planning experiments on single core CPUs with a sample size of 1,000,000 to analyze hyperparameter effects on training and planning, utilizing WLF efficiency.

Result: Found a robust best set of hyperparameters across tested planning domains that minimize execution time rather than maximize model expressivity, with no significant correlation between training and planning metrics.

Conclusion: Optimal WLF hyperparameter configuration prioritizes execution time efficiency over model expressivity, and training metrics do not reliably predict planning performance.

Abstract: Weisfeiler-Leman Features (WLFs) are a recently introduced classical machine learning tool for learning to plan and search. They have been shown to be both theoretically and empirically superior to existing deep learning approaches for learning value functions for search in symbolic planning. In this paper, we introduce new WLF hyperparameters and study their various tradeoffs and effects. We utilise the efficiency of WLFs and run planning experiments on single core CPUs with a sample size of 1,000,000 to understand the effect of hyperparameters on training and planning. Our experimental analysis shows that there is a robust best set of hyperparameters for WLFs across the tested planning domains. We find that the best WLF hyperparameters for learning heuristic functions minimise execution time rather than maximise model expressivity. We further statistically analyse and observe no significant correlation between training and planning metrics.

[240] Symmetry-Invariant Novelty Heuristics via Unsupervised Weisfeiler-Leman Features

Dillon Z. Chen

Main category: cs.AI

TL;DR: Using Weisfeiler-Leman Features instead of atoms for novelty detection in heuristic search to achieve symmetry invariance and reduce redundant exploration.

Details

Motivation: Novelty heuristics are not symmetry invariant, leading to redundant state exploration in planning problems.

Method: Propose using Weisfeiler-Leman Features (WLFs) for novelty detection instead of atoms, creating lifted domain-independent novelty heuristics that are invariant to symmetric states.

Result: Experiments on International Planning Competition and Hard To Ground benchmarks show promising results for WLF-based novelty heuristics.

Conclusion: WLFs provide an effective approach for synthesizing symmetry-invariant novelty heuristics that reduce redundant exploration in planning.

Abstract: Novelty heuristics aid heuristic search by exploring states that exhibit novel atoms. However, novelty heuristics are not symmetry invariant and hence may sometimes lead to redundant exploration. In this preliminary report, we propose to use Weisfeiler-Leman Features for planning (WLFs) in place of atoms for detecting novelty. WLFs are recently introduced features for learning domain-dependent heuristics for generalised planning problems. We explore an unsupervised usage of WLFs for synthesising lifted, domain-independent novelty heuristics that are invariant to symmetric states. Experiments on the classical International Planning Competition and Hard To Ground benchmark suites yield promising results for novelty heuristics synthesised from WLFs.
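The contrast between atom-based and WL-based novelty can be sketched with plain Weisfeiler-Leman colour refinement. This is a simplified illustration, not the paper's construction: the graph encoding of a state, the width-1 novelty test, and the names `wl_features` and `is_novel` are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Set

Graph = Dict[str, List[str]]  # undirected adjacency lists


def wl_features(graph: Graph, labels: Dict[str, str],
                iterations: int = 2) -> Counter:
    """Multiset of colours seen during 1-WL colour refinement."""
    colours: Dict[str, Hashable] = dict(labels)
    features: Counter = Counter(colours.values())
    for _ in range(iterations):
        # New colour = old colour plus the sorted multiset of neighbour colours.
        colours = {
            node: (colours[node],
                   tuple(sorted(colours[nbr] for nbr in graph[node])))
            for node in graph
        }
        features.update(colours.values())
    return features


def is_novel(features: Iterable[Hashable], seen: Set[Hashable]) -> bool:
    """A state is novel if it shows at least one feature never seen before."""
    new = set(features) - seen
    seen |= new
    return bool(new)


# Two symmetric states: only the identity of the ball in the room differs.
labels = {"ball1": "ball", "ball2": "ball", "room": "room"}
s1_graph = {"ball1": ["room"], "ball2": [], "room": ["ball1"]}
s2_graph = {"ball1": [], "ball2": ["room"], "room": ["ball2"]}
s1_atoms = {("at", "ball1", "room")}
s2_atoms = {("at", "ball2", "room")}

seen_atoms: Set[Hashable] = set()
seen_wl: Set[Hashable] = set()
print("atoms:", is_novel(s1_atoms, seen_atoms), is_novel(s2_atoms, seen_atoms))
print("WL:   ", is_novel(wl_features(s1_graph, labels), seen_wl),
      is_novel(wl_features(s2_graph, labels), seen_wl))
```

Because colour refinement never looks at node names, the two symmetric states yield identical feature multisets. The second state is therefore not treated as novel under the WL features, while the atom-based test counts it as new.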

[241] Generic Guard AI in Stealth Game with Composite Potential Fields

Kaijie Xu, Clark Verbrugge

Main category: cs.AI

TL;DR: A training-free framework using Composite Potential Fields for guard patrol behavior in stealth games, combining global knowledge and local information through interpretable maps for efficient and natural patrol patterns.

Details

Motivation: Existing guard patrol systems in stealth games rely on hand-crafted routes or specialized logic that struggle to balance coverage efficiency, responsive pursuit, and believable naturalness, requiring a more generic and explainable solution.

Method: Proposes a parametric, designer-driven framework using Composite Potential Fields that integrates three interpretable maps (Information, Confidence, and Connectivity) into a single kernel-filtered decision criterion. The approach requires only decay and weight parameters and works across occupancy-grid and NavMesh abstractions without retraining.

Result: Evaluation on five game maps, two player-control policies, and five guard modes shows the method outperforms classical baseline methods in both capture efficiency and patrol naturalness. The framework also naturally integrates common stealth mechanics like distractions and environmental elements as sub-modules.

Conclusion: The framework enables rapid prototyping of rich, dynamic, and responsive guard behaviors while maintaining full explainability and requiring minimal parameter tuning, making it suitable for practical game development applications.

Abstract: Guard patrol behavior is central to the immersion and strategic depth of stealth games, yet most existing systems rely on hand-crafted routes or specialized logic that struggle to balance coverage efficiency and responsive pursuit with believable naturalness. We propose a generic, fully explainable, training-free framework that integrates global knowledge and local information via Composite Potential Fields, combining three interpretable maps (Information, Confidence, and Connectivity) into a single kernel-filtered decision criterion. Our parametric, designer-driven approach requires only a handful of decay and weight parameters, with no retraining, to smoothly adapt across both occupancy-grid and NavMesh-partition abstractions. We evaluate on five representative game maps, two player-control policies, and five guard modes, confirming that our method outperforms classical baseline methods in both capture efficiency and patrol naturalness. Finally, we show how common stealth mechanics (distractions and environmental elements) integrate naturally into our framework as submodules, enabling rapid prototyping of rich, dynamic, and responsive guard behaviors.
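The general shape of a weighted, kernel-filtered potential field on an occupancy grid can be sketched as follows. This is only a guess at the structure under stated assumptions: the abstract does not define how the three maps are computed or combined, so the weighted sum, the box filter, the greedy step, and the names `combine`, `box_filter`, and `next_cell` are all hypothetical.

```python
from typing import List, Sequence, Tuple

Grid = List[List[float]]


def combine(maps: Sequence[Grid], weights: Sequence[float]) -> Grid:
    """Cell-wise weighted sum of equally sized maps (assumed combination rule)."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(w * m[r][c] for m, w in zip(maps, weights))
             for c in range(cols)] for r in range(rows)]


def box_filter(grid: Grid, radius: int = 1) -> Grid:
    """Mean over a square window, clipped at the grid borders."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [grid[i][j]
                      for i in range(max(0, r - radius), min(rows, r + radius + 1))
                      for j in range(max(0, c - radius), min(cols, c + radius + 1))]
            out[r][c] = sum(window) / len(window)
    return out


def next_cell(potential: Grid, pos: Tuple[int, int]) -> Tuple[int, int]:
    """Greedy step to the best 4-neighbour; stays put on ties with the current cell."""
    r, c = pos
    rows, cols = len(potential), len(potential[0])
    candidates = [(r, c)] + [
        (r + dr, c + dc)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
        if 0 <= r + dr < rows and 0 <= c + dc < cols
    ]
    return max(candidates, key=lambda p: potential[p[0]][p[1]])


# Hypothetical 1x3 corridor: only the far cell carries information value.
information = [[0.0, 0.0, 1.0]]
confidence = [[0.0, 0.0, 0.0]]
connectivity = [[0.0, 0.0, 0.0]]
field = box_filter(combine([information, confidence, connectivity],
                           weights=[1.0, 1.0, 1.0]))
pos = (0, 0)
for _ in range(2):
    pos = next_cell(field, pos)
    print(pos)
```

The filtering step is what lets a purely local greedy rule react to value that lies outside the guard's immediate neighbourhood: the smoothed field slopes toward the informative cell, so the guard drifts there without any learned policy.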

[242] Playstyle and Artificial Intelligence: An Initial Blueprint Through the Lens of Video Games

Chiu-Chou Lin

Main category: cs.AI

TL;DR: This dissertation introduces playstyle as a new dimension for analyzing AI decision-making, proposing a framework to define, measure, and generate stylistic behavior in intelligent agents beyond pure rationality.

Details

Motivation: Current AI development focuses too much on rational decision-making while overlooking the influence of beliefs, values, and preferences that shape human-like decision styles, which are essential for more authentic intelligent behavior.

Method: Develops a two-tier framework for style formation (external interaction loop and internal cognitive loop), formalizes style characteristics, and proposes measurable indicators. Uses reinforcement learning and imitation learning to train agents with specific stylistic tendencies.

Result: Proposes general playstyle metrics based on discretized state spaces, extends to quantify strategic diversity and competitive balance, and introduces novel approaches for human-like style learning and modeling.

Conclusion: Playstyle represents a crucial dimension of intelligence that should be integrated into AI development, with potential applications in game design, interactive entertainment, and as a core element for building artificial general intelligence (AGI).

Abstract: Contemporary artificial intelligence (AI) development largely centers on rational decision-making, valued for its measurability and suitability for objective evaluation. Yet in real-world contexts, an intelligent agent’s decisions are shaped not only by logic but also by deeper influences such as beliefs, values, and preferences. The diversity of human decision-making styles emerges from these differences, highlighting that “style” is an essential but often overlooked dimension of intelligence. This dissertation introduces playstyle as an alternative lens for observing and analyzing the decision-making behavior of intelligent agents, and examines its foundational meaning and historical context from a philosophical perspective. By analyzing how beliefs and values drive intentions and actions, we construct a two-tier framework for style formation: the external interaction loop with the environment and the internal cognitive loop of deliberation. On this basis, we formalize style-related characteristics and propose measurable indicators such as style capacity, style popularity, and evolutionary dynamics. The study focuses on three core research directions: (1) Defining and measuring playstyle, proposing a general playstyle metric based on discretized state spaces, and extending it to quantify strategic diversity and competitive balance; (2) Expressing and generating playstyle, exploring how reinforcement learning and imitation learning can be used to train agents exhibiting specific stylistic tendencies, and introducing a novel approach for human-like style learning and modeling; and (3) Practical applications, analyzing the potential of these techniques in domains such as game design and interactive entertainment. Finally, the dissertation outlines future extensions, including the role of style as a core element in building artificial general intelligence (AGI).

[243] A Database-Driven Framework for 3D Level Generation with LLMs

Kaijie Xu, Clark Verbrugge

Main category: cs.AI

TL;DR: A novel framework for generating 3D game levels using LLM-assisted databases for architectural components and gameplay mechanics, featuring multi-phase assembly and repair systems for navigable environments with configurable progression.

Details

Motivation: Address challenges in procedural content generation for 3D game levels, particularly balancing spatial coherence, navigational functionality, and adaptable gameplay progression across multi-floor environments.

Method: Multi-phase pipeline: (1) select/arrange room instances from database to form multi-floor structure, (2) optimize internal facility layouts with constraint-based optimization, (3) integrate gameplay mechanics from mechanics database with topological/spatial rules, followed by two-phase repair system for navigability.

Result: Initial experiments validate the framework’s ability to generate diverse, navigable 3D environments and simulate distinct gameplay pacing strategies through simple parameterization.

Conclusion: Presents a scalable, database-centric foundation for automated generation of complex 3D levels with configurable gameplay progression, advancing procedural content generation capabilities.

Abstract: Procedural Content Generation for 3D game levels faces challenges in balancing spatial coherence, navigational functionality, and adaptable gameplay progression across multi-floor environments. This paper introduces a novel framework for generating such levels, centered on the offline, LLM-assisted construction of reusable databases for architectural components (facilities and room templates) and gameplay mechanic elements. Our multi-phase pipeline assembles levels by: (1) selecting and arranging instances from the Room Database to form a multi-floor global structure with an inherent topological order; (2) optimizing the internal layout of facilities for each room based on predefined constraints from the Facility Database; and (3) integrating progression-based gameplay mechanics by placing components from a Mechanics Database according to their topological and spatial rules. A subsequent two-phase repair system ensures navigability. This approach combines modular, database-driven design with constraint-based optimization, allowing for systematic control over level structure and the adaptable pacing of gameplay elements. Initial experiments validate the framework’s ability to generate diverse, navigable 3D environments and its capability to simulate distinct gameplay pacing strategies through simple parameterization. This research advances PCG by presenting a scalable, database-centric foundation for the automated generation of complex 3D levels with configurable gameplay progression.

[244] MATRIX: Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation

Ernest Lim, Yajie Vera He, Jared Joselowitz, Kate Preston, Mohita Chowdhury, Louis Williams, Aisling Higham, Katrina Mason, Mariane Melo, Tom Lawton, Yan Jia, Ibrahim Habli

Main category: cs.AI

TL;DR: MATRIX is a safety evaluation framework for clinical dialogue systems that combines structured safety engineering with LLM-based evaluation tools to systematically assess safety risks in healthcare conversations.

Details

Motivation: Existing LLM evaluations focus on task completion and fluency but lack insights into behavioral safety requirements essential for clinical applications, creating a gap in safety-critical system assessment.

Method: MATRIX integrates three components: 1) safety-aligned taxonomy of clinical scenarios and failure modes, 2) BehvJudge (LLM-based safety evaluator), and 3) PatBot (simulated patient agent) for diverse scenario testing.

Result: BehvJudge achieved expert-level hazard detection (F1 0.96, sensitivity 0.999), outperforming clinicians. PatBot reliably simulated realistic patient behavior. Framework benchmarked 5 LLM agents across 2,100 dialogues covering 14 hazard scenarios and 10 clinical domains.

Conclusion: MATRIX provides the first unified framework combining structured safety engineering with scalable conversational AI evaluation, enabling regulator-aligned safety auditing for clinical dialogue systems.

Abstract: Despite the growing use of large language models (LLMs) in clinical dialogue systems, existing evaluations focus on task completion or fluency, offering little insight into the behavioral and risk management requirements essential for safety-critical systems. This paper presents MATRIX (Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation), a structured, extensible framework for safety-oriented evaluation of clinical dialogue agents. MATRIX integrates three components: (1) a safety-aligned taxonomy of clinical scenarios, expected system behaviors and failure modes derived through structured safety engineering methods; (2) BehvJudge, an LLM-based evaluator for detecting safety-relevant dialogue failures, validated against expert clinician annotations; and (3) PatBot, a simulated patient agent capable of producing diverse, scenario-conditioned responses, evaluated for realism and behavioral fidelity with human factors expertise, and a patient-preference study. Across three experiments, we show that MATRIX enables systematic, scalable safety evaluation. BehvJudge with Gemini 2.5-Pro achieves expert-level hazard detection (F1 0.96, sensitivity 0.999), outperforming clinicians in a blinded assessment of 240 dialogues. We also conducted one of the first realism analyses of LLM-based patient simulation, showing that PatBot reliably simulates realistic patient behavior in quantitative and qualitative evaluations. Using MATRIX, we demonstrate its effectiveness in benchmarking five LLM agents across 2,100 simulated dialogues spanning 14 hazard scenarios and 10 clinical domains. MATRIX is the first framework to unify structured safety engineering with scalable, validated conversational AI evaluation, enabling regulator-aligned safety auditing. We release all evaluation tools, prompts, structured scenarios, and datasets.
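The two headline metrics for hazard detection relate to each other as shown in the small sketch below. The confusion counts in the example are hypothetical and are not taken from the paper; the function name `detection_metrics` is illustrative.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Sensitivity (recall), precision, and F1 from confusion counts."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "precision": precision, "f1": f1}


# Hypothetical counts: every hazard is caught, but some safe dialogues are flagged.
print(detection_metrics(tp=8, fp=2, fn=0))
```

Sensitivity only penalizes missed hazards, while F1 also accounts for false alarms through precision. An evaluator can therefore show near-perfect sensitivity together with a somewhat lower F1.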

[245] SchemaCoder: Automatic Log Schema Extraction Coder with Residual Q-Tree Boosting

Lily Jiaxin Wan, Chia-Tung Ho, Rongjian Liang, Cunxi Yu, Deming Chen, Haoxing Ren

Main category: cs.AI

TL;DR: SchemaCoder is a fully automated log schema extraction framework that eliminates the need for predefined regex patterns by using a novel Residual Q-Tree Boosting mechanism with LLMs, achieving 21.3% improvement over state-of-the-art methods.

Details

Motivation: Existing log schema extraction methods require human domain expertise through predefined regular expressions, which limits productivity gains and automation potential.

Method: Uses Residual Question-Tree (Q-Tree) Boosting mechanism with LLMs: partitions logs via context-bounded segmentation, selects patterns with embedding-based sampling, generates schema code through hierarchical Q-Tree queries, and iteratively refines with textual-residual evolutionary optimizer and residual boosting.

Result: Achieves 21.3% average improvement over state-of-the-art methods on LogHub-2.0 benchmark, demonstrating superiority in automated schema extraction.

Conclusion: SchemaCoder provides the first fully automated schema extraction framework that works across diverse log formats without human customization, fundamentally addressing limitations of existing regex-dependent approaches.

Abstract: Log schema extraction is the process of deriving human-readable templates from massive volumes of log data, which is essential yet notoriously labor-intensive. Recent studies have attempted to streamline this task by leveraging Large Language Models (LLMs) for automated schema extraction. However, existing methods invariably rely on predefined regular expressions, necessitating human domain expertise and severely limiting productivity gains. To fundamentally address this limitation, we introduce SchemaCoder, the first fully automated schema extraction framework applicable to a wide range of log file formats without requiring human customization within the flow. At its core, SchemaCoder features a novel Residual Question-Tree (Q-Tree) Boosting mechanism that iteratively refines schema extraction through targeted, adaptive queries driven by LLMs. In particular, our method partitions logs into semantic chunks via context-bounded segmentation, selects representative patterns using embedding-based sampling, and generates schema code through hierarchical Q-Tree-driven LLM queries, iteratively refined by our textual-residual evolutionary optimizer and residual boosting. Experimental validation demonstrates SchemaCoder’s superiority on the widely used LogHub-2.0 benchmark, achieving an average improvement of 21.3% over state-of-the-art methods.
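The residual idea in the abstract above can be illustrated with a small loop: propose a template, keep it, and hand only the still-unmatched lines (the residual) to the next round. This is a minimal sketch under stated assumptions, not the SchemaCoder implementation. The helper `propose_template` is hypothetical and stands in for the LLM-driven Q-Tree query; here it only masks digit runs.

```python
import re
from typing import Callable, List, Tuple


def propose_template(line: str) -> str:
    """Hypothetical stand-in for an LLM query: build a regex from one log
    line by escaping it and replacing each digit run with a wildcard."""
    escaped = re.escape(line)
    # A function replacement is inserted literally, so no backslash surprises.
    return re.sub(r"\d+", lambda _m: r"\d+", escaped)


def residual_boost(
    lines: List[str],
    propose: Callable[[str], str] = propose_template,
    max_rounds: int = 10,
) -> Tuple[List[str], List[str]]:
    """Collect templates round by round; each round only sees the lines
    that no earlier template matched."""
    templates: List[str] = []
    residual = list(lines)
    for _ in range(max_rounds):
        if not residual:
            break
        pattern = propose(residual[0])
        compiled = re.compile(pattern)
        templates.append(pattern)
        # Keep only the lines the new template fails to explain.
        residual = [ln for ln in residual if not compiled.fullmatch(ln)]
    return templates, residual


logs = [
    "user 42 logged in",
    "user 7 logged in",
    "disk 3 full",
    "user 19 logged in",
    "disk 12 full",
]
templates, residual = residual_boost(logs)
print(len(templates), "templates,", len(residual), "unmatched lines")
```

In this toy version each round is guaranteed to shrink the residual, because the proposed pattern always matches the line it was derived from. A real system would need a stopping rule for templates that explain too little.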

[246] eSkinHealth: A Multimodal Dataset for Neglected Tropical Skin Diseases

Janet Wang, Xin Hu, Yunbei Zhang, Diabate Almamy, Vagamon Bamba, Konan Amos Sébastien Koffi, Yao Koffi Aubin, Zhengming Ding, Jihun Hamm, Rie R. Yotsu

Main category: cs.AI

TL;DR: eSkinHealth dataset addresses data scarcity for skin NTDs in West Africa with 5,623 images from 1,639 cases covering 47 diseases, using AI-expert collaboration for multimodal annotations.

Details

Motivation: Skin NTDs cause severe health burdens in tropical communities, but AI diagnostic advancements are hindered by data scarcity, especially for underrepresented populations and rare disease manifestations.

Method: Collected on-site dermatological data in Côte d’Ivoire and Ghana, implemented AI-expert collaboration paradigm using foundation language and segmentation models to generate multimodal annotations under dermatologist guidance.

Result: Created eSkinHealth dataset with 5,623 images from 1,639 cases covering 47 skin diseases, including semantic lesion masks, visual captions, clinical concepts, and patient metadata.

Conclusion: Provides valuable resource and scalable annotation framework to enable development of more equitable, accurate, and interpretable AI tools for global dermatology.

Abstract: Skin Neglected Tropical Diseases (NTDs) impose severe health and socioeconomic burdens in impoverished tropical communities. Yet, advancements in AI-driven diagnostic support are hindered by data scarcity, particularly for underrepresented populations and rare manifestations of NTDs. Existing dermatological datasets often lack the demographic and disease spectrum crucial for developing reliable recognition models of NTDs. To address this, we introduce eSkinHealth, a novel dermatological dataset collected on-site in Côte d’Ivoire and Ghana. Specifically, eSkinHealth contains 5,623 images from 1,639 cases and encompasses 47 skin diseases, focusing uniquely on skin NTDs and rare conditions among West African populations. We further propose an AI-expert collaboration paradigm to implement foundation language and segmentation models for efficient generation of multimodal annotations, under dermatologists’ guidance. In addition to patient metadata and diagnosis labels, eSkinHealth also includes semantic lesion masks, instance-specific visual captions, and clinical concepts. Overall, our work provides a valuable new resource and a scalable annotation framework, aiming to catalyze the development of more equitable, accurate, and interpretable AI tools for global dermatology.

[247] RLMR: Reinforcement Learning with Mixed Rewards for Creative Writing

Jianxing Liao, Tian Zhang, Xiao Feng, Yusong Zhang, Rui Yang, Haorui Wang, Bosi Wen, Ziying Wang, Runzhi Shi

Main category: cs.AI

TL;DR: RLMR proposes a reinforcement learning method with dynamic mixed rewards to balance subjective writing quality and objective constraint following in creative writing applications.

Details

Motivation: Existing RL methods struggle to balance subjective writing quality (literariness, emotional expression) and objective constraint following (format requirements, word limits) - single reward strategies fail to improve both simultaneously, while fixed-weight mixed-reward methods lack adaptability.

Method: Reinforcement Learning with Mixed Rewards (RLMR) uses a dynamically mixed reward system from a writing reward model for subjective quality and a constraint verification model for objective constraints. The constraint reward weight is adjusted dynamically based on writing quality within sampled groups.

Result: Achieved consistent improvements: instruction following (IFEval from 83.36% to 86.65%) and writing quality (72.75% win rate in manual expert pairwise evaluations on WriteEval benchmark). Evaluated across diverse model families from 8B to 72B parameters.

Conclusion: RLMR is the first work to combine subjective preferences with objective verification in online RL training, providing an effective solution for multi-dimensional creative writing optimization.

Abstract: Large language models are extensively utilized in creative writing applications. Creative writing requires a balance between subjective writing quality (e.g., literariness and emotional expression) and objective constraint following (e.g., format requirements and word limits). Existing reinforcement learning methods struggle to balance these two aspects: single reward strategies fail to improve both abilities simultaneously, while fixed-weight mixed-reward methods lack the ability to adapt to different writing scenarios. To address this problem, we propose Reinforcement Learning with Mixed Rewards (RLMR), utilizing a dynamically mixed reward system from a writing reward model evaluating subjective writing quality and a constraint verification model assessing objective constraint following. The constraint following reward weight is adjusted dynamically according to the writing quality within sampled groups, ensuring that samples violating constraints get negative advantage in GRPO and thus penalized during training, which is the key innovation of this proposed method. We conduct automated and manual evaluations across diverse model families from 8B to 72B parameters. Additionally, we construct a real-world writing benchmark named WriteEval for comprehensive evaluation. Results illustrate that our method achieves consistent improvements in both instruction following (IFEval from 83.36% to 86.65%) and writing quality (72.75% win rate in manual expert pairwise evaluations on WriteEval). To the best of our knowledge, RLMR is the first work to combine subjective preferences with objective verification in online RL training, providing an effective solution for multi-dimensional creative writing optimization.
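The abstract states that the constraint weight is set per sampled group so that constraint-violating samples receive a negative GRPO advantage, but this digest does not give the formula. The sketch below shows one weighting rule that provably has this property whenever at least one sample in the group passes; the rule, the `margin` parameter, and the function names are assumptions for illustration, not the authors' method.

```python
from statistics import mean, pstdev
from typing import List


def mixed_rewards(
    writing: List[float], passed: List[bool], margin: float = 0.1
) -> List[float]:
    """Subtract a group-dependent penalty from violators' writing scores.

    Assumed rule: pick the smallest weight that pushes the best violator
    below the group mean of the mixed rewards, then add a margin.
    """
    n = len(writing)
    violators = [w for w, ok in zip(writing, passed) if not ok]
    n_pass = n - len(violators)
    if not violators or n_pass == 0:
        # Nothing to penalize, or no passing sample to rank violators against.
        return list(writing)
    weight = max(0.0, (n * max(violators) - sum(writing)) / n_pass) + margin
    return [w if ok else w - weight for w, ok in zip(writing, passed)]


def grpo_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Group-normalized advantage: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# The best-written sample (0.9) violates a constraint; only the last passes.
writing_scores = [0.9, 0.1, 0.1]
constraint_ok = [False, False, True]
advantages = grpo_advantages(mixed_rewards(writing_scores, constraint_ok))
print([round(a, 3) for a in advantages])
```

A fixed penalty weight could leave a strongly written violator above the group mean, which is the failure mode the dynamic weight is meant to avoid.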

[248] Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap

Jun Wang, Ninglun Gu, Kailai Zhang, Zijiao Zhang, Yelun Bao, Jin Yang, Xu Yin, Liwei Liu, Yihuan Liu, Pengyong Li, Gary G. Yen, Junchi Yan

Main category: cs.AI

TL;DR: This survey proposes a new anthropomorphic evaluation framework for LLMs using IQ, EQ, and PQ dimensions plus a Value-oriented Evaluation (VQ) framework to bridge the gap between benchmark performance and real-world utility.

Details

Motivation: Address the disconnect between LLM benchmark performance and real-world utility by moving beyond fragmented technical metrics to holistic assessment for deployment.

Method: Introduces a three-dimensional taxonomy (IQ for general intelligence, EQ for alignment ability, PQ for professional expertise) and a Value-oriented Evaluation framework assessing economic, social, ethical, and environmental factors. Analyzes 200+ benchmarks and provides modular architecture with implementation roadmap.

Result: Identifies key challenges including dynamic assessment needs and interpretability gaps. Provides actionable guidance for developing technically proficient, contextually relevant, and ethically sound LLMs.

Conclusion: The proposed anthropomorphic evaluation paradigm offers a comprehensive framework to better assess LLM utility in real-world scenarios, with practical implementation guidance and open-source resources.

Abstract: For Large Language Models (LLMs), a disconnect persists between benchmark performance and real-world utility. Current evaluation frameworks remain fragmented, prioritizing technical metrics while neglecting holistic assessment for deployment. This survey introduces an anthropomorphic evaluation paradigm through the lens of human intelligence, proposing a novel three-dimensional taxonomy: Intelligence Quotient (IQ)-General Intelligence for foundational capacity, Emotional Quotient (EQ)-Alignment Ability for value-based interactions, and Professional Quotient (PQ)-Professional Expertise for specialized proficiency. For practical value, we pioneer a Value-oriented Evaluation (VQ) framework assessing economic viability, social impact, ethical alignment, and environmental sustainability. Our modular architecture integrates six components with an implementation roadmap. Through analysis of 200+ benchmarks, we identify key challenges including dynamic assessment needs and interpretability gaps. It provides actionable guidance for developing LLMs that are technically proficient, contextually relevant, and ethically sound. We maintain a curated repository of open-source evaluation resources at: https://github.com/onejune2018/Awesome-LLM-Eval.

[249] The Influence of Human-inspired Agentic Sophistication in LLM-driven Strategic Reasoners

Vince Trencsenyi, Agnieszka Mensfelt, Kostas Stathis

Main category: cs.AI

TL;DR: LLM-based agents show improved human-like strategic reasoning when integrated with human-inspired cognitive structures, but the relationship between design complexity and human-likeness is non-linear and depends on underlying LLM capabilities.

Details

Motivation: To examine how well LLM-based agents replicate human strategic reasoning in game-theoretic settings and understand the role of agentic sophistication in artificial reasoners' performance.

Method: Evaluated three agent designs (simple game-theoretic model, unstructured LLM-as-agent, and LLM integrated into traditional agentic framework) using guessing games as testbed, benchmarked against human participants across reasoning patterns and role-based objectives, with obfuscated scenarios to test generalization.

Result: Human-inspired cognitive structures enhance LLM agents’ alignment with human strategic behavior, but the relationship between agentic design complexity and human-likeness is non-linear and critically dependent on underlying LLM capabilities.

Conclusion: While architectural augmentation can improve LLM agents’ human-like reasoning, there are limits to simple enhancements, and performance depends fundamentally on the base LLM capabilities.

Abstract: The rapid rise of large language models (LLMs) has shifted artificial intelligence (AI) research toward agentic systems, motivating the use of weaker and more flexible notions of agency. However, this shift raises key questions about the extent to which LLM-based agents replicate human strategic reasoning, particularly in game-theoretic settings. In this context, we examine the role of agentic sophistication in shaping artificial reasoners’ performance by evaluating three agent designs: a simple game-theoretic model, an unstructured LLM-as-agent model, and an LLM integrated into a traditional agentic framework. Using guessing games as a testbed, we benchmarked these agents against human participants across general reasoning patterns and individual role-based objectives. Furthermore, we introduced obfuscated game scenarios to assess agents’ ability to generalise beyond training distributions. Our analysis, covering over 2000 reasoning samples across 25 agent configurations, shows that human-inspired cognitive structures can enhance LLM agents’ alignment with human strategic behaviour. Still, the relationship between agentic design complexity and human-likeness is non-linear, highlighting a critical dependence on underlying LLM capabilities and suggesting limits to simple architectural augmentation.

[250] MUA-RL: Multi-turn User-interacting Agent Reinforcement Learning for agentic tool use

Weikang Zhao, Xili Wang, Chengdi Ma, Lingbin Kong, Zhaohua Yang, Mingxiang Tuo, Xiaowei Shi, Yitao Zhai, Xunliang Cai

Main category: cs.AI

TL;DR: MUA-RL is a novel reinforcement learning framework that integrates LLM-simulated users into RL training for agentic tool use, enabling better multi-turn interactions and outperforming larger models on various benchmarks.

Details

Motivation: Existing RL approaches for tool use lack integration of dynamic users during training, making it challenging for agents to handle the dynamic, uncertain nature of user demands in multi-turn interactions.

Method: MUA-RL introduces LLM-simulated users into the reinforcement learning loop, allowing autonomous learning of communication with users and tool usage in dynamic multi-turn scenarios.

Result: MUA-RL-32B achieves strong performance: 67.3 on TAU2 Retail, 45.4 on TAU2 Airline, 28.3 on TAU2 Telecom, 28.4 on BFCL-V3 Multi Turn, and 82.5 on ACEBench Agent, outperforming or matching larger models like DeepSeek-V3-0324 and Qwen3-235B-A22B.

Conclusion: The integration of LLM-simulated users into RL training significantly improves agentic tool use capabilities in dynamic multi-turn interactions, demonstrating that smaller models can achieve competitive performance through better training methodology.

Abstract: With the recent rapid advancement of Agentic Intelligence, agentic tool use in LLMs has become increasingly important. During multi-turn interactions between agents and users, the dynamic, uncertain, and stochastic nature of user demands poses significant challenges to the agent’s tool invocation capabilities. Agents are no longer expected to simply call tools to deliver a result; rather, they must iteratively refine their understanding of user needs through communication while simultaneously invoking tools to resolve user queries. Existing reinforcement learning (RL) approaches for tool use lack the integration of genuinely dynamic users during the RL training process. To bridge this gap, we introduce MUA-RL (Multi-turn User-interacting Agent Reinforcement Learning for agentic tool use), a novel reinforcement learning framework that, for the first time in the field of agentic tool use, integrates LLM-simulated users into the reinforcement learning loop. MUA-RL aims to enable autonomous learning of models to communicate with users efficiently and use various tools to solve practical problems in dynamic multi-turn interactions. Evaluations are done on several multi-turn tool-using benchmarks (see Figure 1). Specifically, MUA-RL-32B achieves 67.3 on TAU2 Retail, 45.4 on TAU2 Airline, 28.3 on TAU2 Telecom, 28.4 on BFCL-V3 Multi Turn, and 82.5 on ACEBench Agent – outperforming or matching the performance of larger open-source models such as DeepSeek-V3-0324 and Qwen3-235B-A22B in non-thinking settings.

[251] AppAgent-Pro: A Proactive GUI Agent System for Multidomain Information Integration and User Assistance

Yuyang Zhao, Wentao Shi, Fuli Feng, Xiangnan He

Main category: cs.AI

TL;DR: AppAgent-Pro is a proactive GUI agent system that actively integrates multi-domain information to anticipate user needs, overcoming limitations of reactive LLM-based agents.

Details

Motivation: Existing LLM-based agents operate reactively, responding passively to user instructions, which constrains their effectiveness and efficiency as general-purpose information acquisition platforms.

Method: Proposes a proactive GUI agent system that actively integrates multi-domain information based on user instructions to anticipate underlying needs and conduct in-depth multi-domain information mining.

Result: The system enables acquisition of more comprehensive and intelligent information and has the potential to fundamentally redefine information acquisition in daily life.

Conclusion: AppAgent-Pro represents a significant advancement from reactive to proactive AI agents, with potential for profound impact on human society by transforming how information is acquired and utilized.

Abstract: Large language model (LLM)-based agents have demonstrated remarkable capabilities in addressing complex tasks, thereby enabling more advanced information retrieval and supporting deeper, more sophisticated human information-seeking behaviors. However, most existing agents operate in a purely reactive manner, responding passively to user instructions, which significantly constrains their effectiveness and efficiency as general-purpose platforms for information acquisition. To overcome this limitation, this paper proposes AppAgent-Pro, a proactive GUI agent system that actively integrates multi-domain information based on user instructions. This approach enables the system to proactively anticipate users’ underlying needs and conduct in-depth multi-domain information mining, thereby facilitating the acquisition of more comprehensive and intelligent information. AppAgent-Pro has the potential to fundamentally redefine information acquisition in daily life, leading to a profound impact on human society. Our code is available at: https://github.com/LaoKuiZe/AppAgent-Pro. The demonstration video can be found at: https://www.dropbox.com/scl/fi/hvzqo5vnusg66srydzixo/AppAgent-Pro-demo-video.mp4?rlkey=o2nlfqgq6ihl125mcqg7bpgqu&st=d29vrzii&dl=0.

[252] LLM-Based Agents for Competitive Landscape Mapping in Drug Asset Due Diligence

Alisa Vinogradova, Vlad Vinogradov, Dmitrii Radkevich, Ilya Yasny, Dmitry Kobyzev, Ivan Izmailov, Katsiaryna Yanchanka, Roman Doronin, Andrey Doronichev

Main category: cs.AI

TL;DR: AI agent system for drug competitor discovery achieves 83% recall, reducing biotech VC analysis time from 2.5 days to ~3 hours (20x improvement).

Details

Motivation: Current LLM-based systems cannot reliably retrieve all competing drug names for investor-specific due diligence, and there's no public benchmark for this paywalled, fragmented, rapidly changing data problem.

Method: Uses LLM-based agents to transform 5 years of multi-modal unstructured diligence memos into structured evaluation corpus, with competitor-validating LLM-as-judge agent to filter false positives and suppress hallucinations.

Result: Achieves 83% recall, outperforming OpenAI Deep Research (65%) and Perplexity Labs (60%). Production deployment shows ~20x speedup in analyst turnaround time (2.5 days to ~3 hours).

Conclusion: The competitor-discovery AI agent system effectively addresses the complex drug landscape analysis problem, demonstrating significant performance improvements and practical utility in biotech VC due diligence workflows.

Abstract: In this paper, we describe and benchmark a competitor-discovery component used within an agentic AI system for fast drug asset due diligence. A competitor-discovery AI agent, given an indication, retrieves all drugs comprising the competitive landscape of that indication and extracts canonical attributes for these drugs. The competitor definition is investor-specific, and data is paywalled/licensed, fragmented across registries, ontology-mismatched by indication, alias-heavy for drug names, multimodal, and rapidly changing. Although considered the best tool for this problem, the current LLM-based AI systems aren’t capable of reliably retrieving all competing drug names, and there is no accepted public benchmark for this task. To address the lack of evaluation, we use LLM-based agents to transform five years of multi-modal, unstructured diligence memos from a private biotech VC fund into a structured evaluation corpus mapping indications to competitor drugs with normalized attributes. We also introduce a competitor-validating LLM-as-a-judge agent that filters out false positives from the list of predicted competitors to maximize precision and suppress hallucinations. On this benchmark, our competitor-discovery agent achieves 83% recall, exceeding OpenAI Deep Research (65%) and Perplexity Labs (60%). The system is deployed in production with enterprise users; in a case study with a biotech VC investment fund, analyst turnaround time dropped from 2.5 days to ~3 hours (~20x) for the competitive analysis.
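The reported speedup can be checked with simple arithmetic. The abstract does not say whether "2.5 days" means elapsed calendar time or working days, so both readings are computed below; the hours-per-day constants are assumptions, and only the 24-hour reading reproduces the quoted ~20x.

```python
HOURS_PER_CALENDAR_DAY = 24  # assumption: "days" means elapsed calendar days
HOURS_PER_WORKING_DAY = 8    # alternative assumption: 8-hour working days


def speedup(days_before: float, hours_after: float, hours_per_day: float) -> float:
    """Ratio of the old turnaround time to the new one, both in hours."""
    return (days_before * hours_per_day) / hours_after


calendar_speedup = speedup(2.5, 3.0, HOURS_PER_CALENDAR_DAY)
working_speedup = speedup(2.5, 3.0, HOURS_PER_WORKING_DAY)
print(f"calendar days: {calendar_speedup:.1f}x, working days: {working_speedup:.1f}x")
```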

Honghao Fu, Junlong Ren, Qi Chai, Deheng Ye, Yujun Cai, Hao Wang

Main category: cs.AI

TL;DR: VistaWise is a cost-effective agent framework that integrates cross-modal domain knowledge with a fine-tuned object detection model, reducing domain-specific training data requirements from millions to hundreds of samples while achieving state-of-the-art performance in open-world tasks.

Details

Motivation: LLMs show promise in embodied decision-making but are hindered by lack of domain-specific knowledge. Traditional fine-tuning methods require prohibitive development costs with large-scale domain-specific data.

Method: Integrates visual information and textual dependencies into a cross-modal knowledge graph, uses retrieval-based pooling strategy to extract task-related information, and employs a desktop-level skill library for direct Minecraft client operation via mouse/keyboard inputs.

Result: Achieves state-of-the-art performance across various open-world tasks, significantly reducing development costs while enhancing agent performance.

Conclusion: VistaWise effectively bridges the domain knowledge gap for LLMs in embodied decision-making through cost-efficient cross-modal integration and specialized visual analysis capabilities.

Abstract: Large language models (LLMs) have shown significant promise in embodied decision-making tasks within virtual open-world environments. Nonetheless, their performance is hindered by the absence of domain-specific knowledge. Methods that finetune on large-scale domain-specific data entail prohibitive development costs. This paper introduces VistaWise, a cost-effective agent framework that integrates cross-modal domain knowledge and finetunes a dedicated object detection model for visual analysis. It reduces the requirement for domain-specific training data from millions of samples to a few hundred. VistaWise integrates visual information and textual dependencies into a cross-modal knowledge graph (KG), enabling a comprehensive and accurate understanding of multimodal environments. We also equip the agent with a retrieval-based pooling strategy to extract task-related information from the KG, and a desktop-level skill library to support direct operation of the Minecraft desktop client via mouse and keyboard inputs. Experimental results demonstrate that VistaWise achieves state-of-the-art performance across various open-world tasks, highlighting its effectiveness in reducing development costs while enhancing agent performance.

[254] Bias Mitigation Agent: Optimizing Source Selection for Fair and Balanced Knowledge Retrieval

Karanbir Singh, Deepak Muppiri, William Ngu

Main category: cs.AI

TL;DR: A multi-agent system called Bias Mitigation Agent that reduces bias in LLM-generated content by 81.82% through optimized source selection.

Details

Motivation: LLMs and Agentic AI systems inherit biases from their training data and external sources, which affects fairness, balance of retrieved information, and reduces user trust.

Method: A multi-agent system with specialized agents that orchestrate bias mitigation workflow by optimizing source selection to ensure highly relevant and minimally biased content retrieval.

Result: Experimental results show 81.82% reduction in bias compared to baseline naive retrieval strategy.

Conclusion: The Bias Mitigation Agent effectively addresses bias in AI systems, promoting fair and balanced knowledge dissemination while maintaining content relevance.

Abstract: Large Language Models (LLMs) have transformed the field of artificial intelligence by unlocking the era of generative applications. Built on top of generative AI capabilities, Agentic AI represents a major shift toward autonomous, goal-driven systems that can reason, retrieve, and act. However, they also inherit the bias present in both internal and external information sources. This significantly affects the fairness and balance of retrieved information, and hence reduces user trust. To address this critical challenge, we introduce a novel Bias Mitigation Agent, a multi-agent system designed to orchestrate the workflow of bias mitigation through specialized agents that optimize the selection of sources to ensure that the retrieved content is both highly relevant and minimally biased to promote fair and balanced knowledge dissemination. The experimental results demonstrate an 81.82% reduction in bias compared to a baseline naive retrieval strategy.

[255] CAC-CoT: Connector-Aware Compact Chain-of-Thought for Efficient Reasoning Data Synthesis Across Dual-System Cognitive Tasks

Sunguk Choi, Yonghoon Kwon, Heondeuk Lee

Main category: cs.AI

TL;DR: CAC-CoT is a method that restricts reasoning to a small set of connector phrases to create concise, structured explanations, achieving high efficiency with shorter reasoning traces while maintaining accuracy on both System-1 and System-2 tasks.

Details

Motivation: Long chain-of-thought prompting can slow down or degrade performance on fast, intuitive System-1 tasks, so there's a need for more efficient reasoning methods that maintain performance across different task types.

Method: Connector-Aware Compact CoT (CAC-CoT) that deliberately restricts reasoning to a small, fixed set of connector phrases to steer models toward concise and well-structured explanations, implemented with Gemini-2.0-Flash.

Result: Achieves ~85% on GSM8K and ~40% on GPQA (System-2 tasks) while retaining ~90% on S1-Bench (System-1 tasks). Reasoning traces average ~300 tokens, about one-third the length of baseline traces.

Conclusion: CAC-CoT delivers higher efficiency without loss of accuracy, and its synthetic data method yields high training quality while maintaining performance across both intuitive and complex reasoning tasks.

Abstract: Long chain-of-thought (CoT) prompting helps Large Language Models (LLMs) solve difficult problems, but very long traces often slow or even degrade performance on fast, intuitive “System-1” tasks. We introduce Connector-Aware Compact CoT (CAC-CoT), a method that deliberately restricts reasoning to a small, fixed set of connector phrases, steering the model toward concise and well-structured explanations. Despite its simplicity, our synthetic method with Gemini-2.0-Flash yields high training quality. CAC-CoT achieves approximately 85% on GSM8K and approximately 40% on GPQA (System-2) while retaining approximately 90% on S1-Bench (System-1). Its reasoning traces average approximately 300 tokens (ART), about one-third the length of baseline traces, delivering higher efficiency without loss of accuracy.

[256] Reflection-Enhanced Meta-Optimization Integrating TextGrad-style Prompt Optimization with Memory-Driven Self-Evolution

Chunlong Wu, Zhibo Qu

Main category: cs.AI

TL;DR: REMO is a novel prompt optimization framework that integrates memory-augmented reflection and self-adaptive optimization to overcome limitations of stateless methods like TextGrad, enabling better generalization and continual improvement.

Details

Motivation: Current prompt optimization methods are stateless, lack historical experience preservation, and suffer from overfitting with poor generalization beyond immediate task contexts.

Method: REMO combines (1) a memory-augmented Reflection RAG module structured as a “mistake notebook” and (2) a Self-Adaptive Optimizer with LLM-driven meta-controller that synthesizes epoch-level insights to iteratively improve prompting strategies.

Result: On GSM8K mathematical reasoning benchmark, REMO achieves more stable and robust generalization compared to TextGrad baseline, though with increased computational overhead.

Conclusion: REMO enables systematic accumulation and reuse of cross-run optimization knowledge, supporting continual improvement over time through reflective meta-optimization.

Abstract: Recent advances in prompt optimization, exemplified by methods such as TextGrad, enable automatic, gradient-like refinement of textual prompts to enhance the performance of large language models (LLMs) on specific downstream tasks. However, current approaches are typically stateless and operate independently across optimization runs, lacking mechanisms to preserve and leverage historical optimization experience. Furthermore, they are susceptible to overfitting, often yielding prompt updates that generalize poorly beyond the immediate task context. To address these limitations, we propose Reflection-Enhanced Meta-Optimization (REMO), a novel framework that integrates (1) a memory-augmented Reflection Retrieval-Augmented Generation (RAG) module, structured as a “mistake notebook”, and (2) a Self-Adaptive Optimizer, implemented via an LLM-driven meta-controller that synthesizes epoch-level reflective insights to iteratively improve system-level prompting strategies. This architecture enables not only local, fine-grained prompt tuning akin to TextGrad, but also the systematic accumulation and reuse of cross-run optimization knowledge, thereby supporting continual improvement over time. We instantiate the REMO framework using Qwen3-32B in standard inference mode, without explicit chain-of-thought prompting, and evaluate its efficacy on the GSM8K benchmark for mathematical reasoning. Experimental results demonstrate that, compared to a TextGrad baseline, REMO achieves more stable and robust generalization, albeit at the cost of increased computational overhead. We provide a detailed exposition of the algorithmic design, conduct a qualitative and quantitative analysis of optimization dynamics, and present a comprehensive ablation study to elucidate the contributions of each component.

[257] Stabilizing Open-Set Test-Time Adaptation via Primary-Auxiliary Filtering and Knowledge-Integrated Prediction

Byung-Joon Lee, Jin-Seop Lee, Jee-Hyong Lee

Main category: cs.AI

TL;DR: Proposes PAF-KIP method for open-set test-time adaptation, using dual filtering and knowledge integration to handle domain-shifted test data with unknown classes.

Details

Motivation: Real-world test data often contains unknown classes (open-set) that degrade closed-set TTA performance. Existing methods relying on source models for filtering perform poorly on domain-shifted data, while using adapting models leads to error accumulation.

Method: Primary-Auxiliary Filtering (PAF) uses an auxiliary filter to validate data filtered by primary filter. Knowledge-Integrated Prediction (KIP) calibrates outputs from adapting model, EMA model, and source model to integrate complementary knowledge.

Result: Method enhances both closed-set accuracy and open-set discrimination across diverse datasets compared to existing methods.

Conclusion: The proposed PAF-KIP approach effectively addresses open-set TTA challenges by combining robust filtering with knowledge integration from multiple model states.

Abstract: Deep neural networks demonstrate strong performance under aligned training-test distributions. However, real-world test data often exhibit domain shifts. Test-Time Adaptation (TTA) addresses this challenge by adapting the model to test data during inference. While most TTA studies assume that the training and test data share the same class set (closed-set TTA), real-world scenarios often involve open-set data (open-set TTA), which can degrade closed-set accuracy. A recent study showed that identifying open-set data during adaptation and maximizing its entropy is an effective solution. However, the previous method relies on the source model for filtering, resulting in suboptimal filtering accuracy on domain-shifted test data. In contrast, we found that the adapting model, which learns domain knowledge from noisy test streams, tends to be unstable and leads to error accumulation when used for filtering. To address this problem, we propose Primary-Auxiliary Filtering (PAF), which employs an auxiliary filter to validate data filtered by the primary filter. Furthermore, we propose Knowledge-Integrated Prediction (KIP), which calibrates the outputs of the adapting model, EMA model, and source model to integrate their complementary knowledge for OSTTA. We validate our approach across diverse closed-set and open-set datasets. Our method enhances both closed-set accuracy and open-set discrimination over existing methods. The code is available at https://github.com/powerpowe/PAF-KIP-OSTTA .
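A rough sketch of the two ideas, under assumptions the summary does not confirm: here a sample counts as "known" when its predictive entropy is below a threshold, the auxiliary filter must agree with the primary filter, and the three models are combined by a plain average of softmax outputs. The function names, the threshold, and the averaging rule are hypothetical; the paper's actual filtering criterion and calibration may differ.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def looks_known(logits, threshold):
    """Low predictive entropy is treated as evidence of a closed-set sample."""
    return entropy(softmax(logits)) < threshold


def paf_keep(primary_logits, auxiliary_logits, threshold):
    """Assumed rule: adapt on a sample only if both filters accept it."""
    return looks_known(primary_logits, threshold) and looks_known(auxiliary_logits, threshold)


def integrated_prediction(adapting_logits, ema_logits, source_logits):
    """Assumed rule: average the class probabilities of the three models."""
    dists = [softmax(l) for l in (adapting_logits, ema_logits, source_logits)]
    return [sum(d[k] for d in dists) / len(dists) for k in range(len(dists[0]))]


print(paf_keep([5.0, 0.0, 0.0], [4.0, 0.0, 0.0], threshold=0.5))
print(paf_keep([5.0, 0.0, 0.0], [0.0, 0.0, 0.0], threshold=0.5))
```

The second call shows the point of the auxiliary check: a confident primary filter alone is not enough to admit a sample.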

[258] Answering the Unanswerable Is to Err Knowingly: Analyzing and Mitigating Abstention Failures in Large Reasoning Models

Yi Liu, Xiangyu Liu, Zequn Sun, Wei Hu

Main category: cs.AI

TL;DR: LRMs fail to abstain from answering unanswerable questions despite having cognitive capabilities to recognize flaws, and a two-stage method combining cognitive monitoring with inference-time intervention significantly improves abstention rates.

Details

Motivation: Large reasoning models (LRMs) cannot properly abstain from answering inherently unanswerable questions like math problems with insufficient conditions, which poses trustworthiness issues for AI systems.

Method: A lightweight two-stage method that combines cognitive monitoring with inference-time intervention to align internal cognition with external response behavior.

Result: Experimental results show the method significantly improves abstention rate while maintaining overall reasoning performance.

Conclusion: The proposed approach successfully resolves the misalignment between LRMs’ internal cognitive capabilities and external response behavior, enabling appropriate abstention from unanswerable questions for more trustworthy AI.

Abstract: Large reasoning models (LRMs) have shown remarkable progress on complex reasoning tasks. However, some questions posed to LRMs are inherently unanswerable, such as math problems lacking sufficient conditions. We find that LRMs continually fail to provide appropriate abstentions when confronted with these unanswerable questions. In this paper, we systematically analyze, investigate, and resolve this issue for trustworthy AI. We first conduct a detailed analysis of the distinct response behaviors of LRMs when facing unanswerable questions. Then, we show that LRMs possess sufficient cognitive capabilities to recognize the flaws in these questions. However, they fail to exhibit appropriate abstention behavior, revealing a misalignment between their internal cognition and external response. Finally, to resolve this issue, we propose a lightweight, two-stage method that combines cognitive monitoring with inference-time intervention. Experimental results demonstrate that our method significantly improves the abstention rate while maintaining the overall reasoning performance.

[259] Dynamic Collaboration of Multi-Language Models based on Minimal Complete Semantic Units

Chao Hao, Zezheng Wang, Yanhua Huang, Ruiwen Xu, Wenzhe Niu, Xin Liu, Zitong Yu

Main category: cs.AI

TL;DR: Token-level multi-model collaboration with dynamic selection strategy and semantic unit alignment improves language model reasoning.

Details

Motivation: To enhance reasoning capabilities in language models by leveraging multiple models' token distributions while addressing vocabulary misalignment challenges.

Method: Proposes distribution distance-based dynamic selection (DDS) strategy and minimal complete semantic units (MCSU) concept for optimal token selection and vocabulary alignment across multiple language models.

Result: Experimental results across various benchmarks demonstrate the superiority of the proposed method.

Conclusion: The proposed method effectively enhances reasoning capabilities through optimized multi-model collaboration with proper vocabulary alignment.

Abstract: This paper investigates the enhancement of reasoning capabilities in language models through token-level multi-model collaboration. Our approach selects the optimal tokens from the next token distributions provided by multiple models to perform autoregressive reasoning. Contrary to the assumption that more models yield better results, we introduce a distribution distance-based dynamic selection strategy (DDS) to optimize the multi-model collaboration process. To address the critical challenge of vocabulary misalignment in multi-model collaboration, we propose the concept of minimal complete semantic units (MCSU), which is simple yet enables multiple language models to achieve natural alignment within the linguistic space. Experimental results across various benchmarks demonstrate the superiority of our method. The code will be available at https://github.com/Fanye12/DDS.
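As an illustration of distance-based dynamic selection, the sketch below measures the Jensen-Shannon divergence between two models' next-unit distributions and only mixes them when they are close. Everything here is an assumption made for illustration: the choice of Jensen-Shannon divergence, the direction of the threshold test, the averaging rule, and the premise that both distributions are already expressed over shared semantic units (the role MCSU plays in the paper).

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence between two dict-based distributions (natural log)."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mix(dist):
        return sum(v * math.log(v / mix[k]) for k, v in dist.items() if v > 0.0)

    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)


def next_unit(primary, helper, max_distance):
    """Assumed rule: collaborate only when the two distributions are close."""
    if js_divergence(primary, helper) > max_distance:
        chosen = primary
    else:
        keys = set(primary) | set(helper)
        chosen = {k: 0.5 * (primary.get(k, 0.0) + helper.get(k, 0.0)) for k in keys}
    return max(chosen, key=chosen.get)


primary = {"cat": 0.6, "dog": 0.4}
helper = {"cat": 0.3, "dog": 0.7}
print(next_unit(primary, helper, max_distance=0.5))
print(next_unit(primary, helper, max_distance=0.01))
```

With a loose threshold the averaged distribution decides; with a strict one the primary model's own top choice is kept.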

[260] CausalMACE: Causality Empowered Multi-Agents in Minecraft Cooperative Tasks

Qi Chai, Zhang Zheng, Junlong Ren, Deheng Ye, Zichuan Lin, Hao Wang

Main category: cs.AI

TL;DR: CausalMACE is a causality-based planning framework that enhances multi-agent collaboration in Minecraft by incorporating causal reasoning to manage subtask dependencies, achieving state-of-the-art performance.

Details

Motivation: Single LLM agents struggle with complex, lengthy tasks in Minecraft due to inefficiency and limited fault tolerance, while multi-agent collaboration research remains scarce despite these challenges.

Method: Proposes a holistic causality planning framework with two modules: an overarching task graph for global planning and a causality-based module for dependency management using causal intervention rules.

Result: Experimental results demonstrate state-of-the-art performance in multi-agent cooperative tasks within the Minecraft environment.

Conclusion: The causality-based framework effectively addresses multi-agent collaboration challenges in complex Minecraft tasks, showing superior performance through structured dependency management.

Abstract: Minecraft, as an open-world virtual interactive environment, has become a prominent platform for research on agent decision-making and execution. Existing works primarily adopt a single Large Language Model (LLM) agent to complete various in-game tasks. However, for complex tasks requiring lengthy sequences of actions, single-agent approaches often face challenges related to inefficiency and limited fault tolerance. Despite these issues, research on multi-agent collaboration remains scarce. In this paper, we propose CausalMACE, a holistic causality planning framework designed to enhance multi-agent systems, in which we incorporate causality to manage dependencies among subtasks. Technically, our proposed framework introduces two modules: an overarching task graph for global task planning and a causality-based module for dependency management, where inherent rules are adopted to perform causal intervention. Experimental results demonstrate our approach achieves state-of-the-art performance in multi-agent cooperative tasks of Minecraft.
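The global task graph can be pictured as a dependency graph whose independent subtasks are handed to different agents in parallel. The sketch below only covers that scheduling step, using the standard-library `graphlib` module; the subtask names are invented, and the causal-intervention rules the paper uses to build or prune dependencies are not modeled here.

```python
from graphlib import TopologicalSorter


def plan_waves(dependencies):
    """Group subtasks into waves; every task in a wave has all prerequisites done.

    dependencies maps each subtask to the set of subtasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical crafting task, not taken from the paper.
dependencies = {
    "planks": {"wood"},
    "sticks": {"planks"},
    "table": {"planks"},
    "pickaxe": {"sticks", "table"},
}
for wave in plan_waves(dependencies):
    print(wave)
```

Tasks within one wave have no dependencies on each other, so they are the natural candidates for assignment to separate agents.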

[261] STARec: An Efficient Agent Framework for Recommender Systems via Autonomous Deliberate Reasoning

Chenghao Wu, Ruiyang Ren, Junjie Zhang, Ruirui Wang, Zhongrui Ma, Qi Ye, Wayne Xin Zhao

Main category: cs.AI

TL;DR: STARec introduces a slow-thinking augmented agent framework that combines fast response and slow reasoning capabilities to improve recommender systems, achieving significant performance gains with minimal training data.

Details

Motivation: Current recommender systems and LLM-based agents suffer from static user modeling, reactive decision-making, shallow correlation bias, limited causal inference, and brittleness in sparse-data scenarios.

Method: STARec models each user as an agent with parallel cognitions (fast response + slow chain-of-thought reasoning) using anchored reinforcement training - a two-stage paradigm combining knowledge distillation from advanced reasoning models with preference-aligned reward shaping.

Result: Experiments on MovieLens 1M and Amazon CDs benchmarks show STARec achieves substantial performance gains compared to state-of-the-art baselines, despite using only 0.4% of the full training data.

Conclusion: The slow-thinking augmented agent framework successfully endows recommender systems with autonomous deliberative reasoning capabilities, overcoming limitations of traditional approaches through hybrid fast-slow cognition and structured training.

Abstract: While modern recommender systems are instrumental in navigating information abundance, they remain fundamentally limited by static user modeling and reactive decision-making paradigms. Current large language model (LLM)-based agents inherit these shortcomings through their overreliance on heuristic pattern matching, yielding recommendations prone to shallow correlation bias, limited causal inference, and brittleness in sparse-data scenarios. We introduce STARec, a slow-thinking augmented agent framework that endows recommender systems with autonomous deliberative reasoning capabilities. Each user is modeled as an agent with parallel cognitions: fast response for immediate interactions and slow reasoning that performs chain-of-thought rationales. To cultivate intrinsic slow thinking, we develop anchored reinforcement training - a two-stage paradigm combining structured knowledge distillation from advanced reasoning models with preference-aligned reward shaping. This hybrid approach scaffolds agents in acquiring foundational capabilities (preference summarization, rationale generation) while enabling dynamic policy adaptation through simulated feedback loops. Experiments on MovieLens 1M and Amazon CDs benchmarks demonstrate that STARec achieves substantial performance gains compared with state-of-the-art baselines, despite using only 0.4% of the full training data.

[262] Judicial Requirements for Generative AI in Legal Reasoning

Eljas Linna, Tuula Linna

Main category: cs.AI

TL;DR: LLMs have limitations in legal reasoning despite various AI enhancement techniques. They work best as assistants for simple cases and sparring partners for complex legal matters.

Details

Motivation: To understand LLM limitations in high-stakes legal domains and define core capabilities needed for reliable judicial decision-making tools.

Method: Uses IRAC framework to analyze legal reasoning requirements, maps AI enhancement mechanisms (RAG, multi-agent systems, neuro-symbolic AI) to legal challenges, and assesses their effectiveness.

Result: AI techniques can address specific legal challenges but significant limitations remain, especially in tasks requiring discretion and transparent reasoning.

Conclusion: AI’s most effective role in law is as a high-volume assistant for simple cases and a sophisticated sparring partner for human experts in complex matters.

Abstract: Large Language Models (LLMs) are being integrated into professional domains, yet their limitations in high-stakes fields like law remain poorly understood. This paper defines the core capabilities that an AI system must possess to function as a reliable reasoning tool in judicial decision-making. Using the IRAC (Issue-Rule-Application-Conclusion) model as an analytical framework, the study focuses on the most challenging phases of legal adjudication: determining the applicable Rule (R) and performing the Application (A) of that rule to the facts of a case. From a judicial perspective, the analysis deconstructs legal reasoning into a series of core requirements, including the ability to select the correct legal framework across jurisdictions, generate sound arguments based on the doctrine of legal sources, distinguish ratio decidendi from obiter dictum in case law, resolve ambiguity arising from general clauses like “reasonableness”, manage conflicting legal provisions, and correctly apply the burden of proof. The paper then maps various AI enhancement mechanisms, such as Retrieval-Augmented Generation (RAG), multi-agent systems, and neuro-symbolic AI, to these requirements, assessing their potential to bridge the gap between the probabilistic nature of LLMs and the rigorous, choice-driven demands of legal interpretation. The findings indicate that while these techniques can address specific challenges, significant challenges remain, particularly in tasks requiring discretion and transparent, justifiable reasoning. Our paper concludes that the most effective current role for AI in law is a dual one: as a high-volume assistant for simple, repetitive cases and as a sophisticated “sparring partner” for human experts in complex matters.

[263] Interactive Evaluation of Large Language Models for Multi-Requirement Software Engineering Tasks

Dimitrios Rontogiannis, Maxime Peyrard, Nicolas Baldwin, Martin Josifoski, Robert West, Dimitrios Gunopulos

Main category: cs.AI

TL;DR: Proposes an interactive evaluation framework using requirement dependency graphs and interviewer-interviewee LLM dialogue to dynamically assess programming capabilities, revealing strengths and weaknesses that static benchmarks miss.

Details

Motivation: Standard static benchmarks are insufficient for evaluating LLMs on complex software engineering tasks that require nuanced, multi-step problem-solving with feedback.

Method: Uses requirement dependency graphs to model programming tasks, with an interviewer LLM (aware of ground truth) providing targeted hints to an interviewee LLM to correct errors and fulfill constraints through structured dialogue.

Result: The framework enables fine-grained diagnostic insights into model behavior, uncovering systematic weaknesses that static evaluation fails to measure, as validated through expert annotation of hint relevance and utility.

Conclusion: Dynamic interactive evaluation is crucial for developing collaborative code-generating agents and provides more comprehensive assessment than traditional static benchmarks.

Abstract: Standard single-turn, static benchmarks fall short in evaluating the nuanced capabilities of Large Language Models (LLMs) on complex tasks such as software engineering. In this work, we propose a novel interactive evaluation framework that assesses LLMs on multi-requirement programming tasks through structured, feedback-driven dialogue. Each task is modeled as a requirement dependency graph, and an “interviewer” LLM, aware of the ground-truth solution, provides minimal, targeted hints to an “interviewee” model to help correct errors and fulfill target constraints. This dynamic protocol enables fine-grained diagnostic insights into model behavior, uncovering strengths and systematic weaknesses that static benchmarks fail to measure. We build on DevAI, a benchmark of 55 curated programming tasks, by adding ground-truth solutions and evaluating the relevance and utility of interviewer hints through expert annotation. Our results highlight the importance of dynamic evaluation in advancing the development of collaborative code-generating agents.
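One way to picture how a requirement dependency graph could steer the interviewer is to hint only at unmet requirements whose prerequisites are already satisfied. This selection rule and the requirement names are assumptions for illustration; the summary does not say how the framework actually chooses hint targets.

```python
def next_hint_targets(dependencies, satisfied):
    """Return unmet requirements whose prerequisites are all satisfied.

    dependencies maps each requirement to the list of requirements it needs.
    """
    return sorted(
        requirement
        for requirement, prerequisites in dependencies.items()
        if requirement not in satisfied
        and all(p in satisfied for p in prerequisites)
    )


# Hypothetical task requirements, not taken from DevAI.
dependencies = {
    "load_data": [],
    "train_model": ["load_data"],
    "save_plot": ["train_model"],
    "write_report": ["train_model"],
}
print(next_hint_targets(dependencies, satisfied={"load_data"}))
```

Restricting hints to this frontier keeps feedback minimal: there is little value in hinting at a report while the model still fails to train.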

[264] FormaRL: Enhancing Autoformalization with no Labeled Data

Yanxing Huang, Xinling Jin, Sijie Liang, Peng Li, Yang Liu

Main category: cs.AI

TL;DR: FormaRL is a reinforcement learning framework for autoformalization that uses Lean compiler syntax checks and LLM consistency checks as rewards, achieving 4-6x accuracy improvements with minimal unlabeled data.

Details

Motivation: Autoformalization advancement is hindered by data scarcity and lack of efficient methods, requiring a solution that works with limited labeled data.

Method: Reinforcement learning framework integrating Lean compiler syntax checks and LLM consistency checks for reward calculation, using GRPO algorithm to update the formalizer with only 859 unlabeled examples.

Result: Increased pass@1 accuracy of Qwen2.5-Coder-7B-Instruct by 4-6x (4.04% → 26.15% on ProofNet and 2.4% → 9.6% on uproof), with strong out-of-distribution performance improvements.

Conclusion: FormaRL demonstrates that reinforcement learning with compiler feedback and minimal unlabeled data can significantly advance autoformalization capabilities for mathematical theorem proving.

Abstract: Autoformalization is one of the central tasks in formal verification, yet its advancement remains hindered by data scarcity and the absence of efficient methods. In this work we propose FormaRL, a simple yet efficient reinforcement learning framework for autoformalization which only requires a small amount of unlabeled data. FormaRL integrates a syntax check from the Lean compiler and a consistency check from a large language model to calculate the reward, and adopts the GRPO algorithm to update the formalizer. We also curated a proof problem dataset from undergraduate-level math materials, named uproof, in the hope of facilitating the exploration of autoformalization and theorem proving in advanced math. Experiments show that FormaRL can increase the pass@1 autoformalization accuracy of Qwen2.5-Coder-7B-Instruct by 4-6x (4.04% → 26.15% on ProofNet and 2.4% → 9.6% on uproof) with merely 859 unlabeled examples. On uproof, our method also achieved a strong improvement in out-of-distribution performance compared to existing open-source state-of-the-art autoformalizers on both pass@1 accuracy (6.2% → 9.6%) and pass@16 accuracy (24.4% → 33.6%). Training code of FormaRL is open-sourced at https://github.com/THUNLP-MT/FormaRL.
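The reward design can be sketched as a gate on two checks followed by GRPO-style group normalization. This is a minimal sketch with assumptions: the binary reward, the callables `syntax_ok` and `consistent` (stand-ins for the Lean compiler and the LLM judge), and the use of the population standard deviation are illustrative choices, not details confirmed by the summary.

```python
import statistics


def reward(statement, syntax_ok, consistent):
    """Assumed binary reward: 1.0 only if both checks pass."""
    return 1.0 if syntax_ok(statement) and consistent(statement) else 0.0


def group_advantages(rewards):
    """Normalize rewards within one sampled group, as in GRPO-style training."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0] * len(rewards)  # no learning signal when all samples tie
    return [(r - mean) / std for r in rewards]


# Toy stand-ins for the two checkers (hypothetical).
syntax_ok = lambda s: s.startswith("theorem")
consistent = lambda s: "n + 0 = n" in s

samples = [
    "theorem t (n : Nat) : n + 0 = n",
    "theorem t (n : Nat) : n = n",
    "lemma broken",
    "theorem t : True",
]
rewards = [reward(s, syntax_ok, consistent) for s in samples]
print(rewards, group_advantages(rewards))
```

Because both checks run automatically, no labeled formalizations are needed. The reported gains work out to 26.15 / 4.04 ≈ 6.5x on ProofNet and 9.6 / 2.4 = 4x on uproof, consistent with the stated 4-6x range.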

[265] Who Is Lagging Behind: Profiling Student Behaviors with Graph-Level Encoding in Curriculum-Based Online Learning Systems

Qian Xiao, Conn Breathnach, Ioana Ghergulescu, Conor O’Sullivan, Keith Johnston, Vincent Wade

Main category: cs.AI

TL;DR: CTGraph is a graph-level representation learning approach that profiles student behaviors and performance in ITSs using self-supervised learning to identify struggling students and provide holistic learning journey analysis.

Details

Motivation: Intelligent Tutoring Systems can unintentionally widen performance gaps, making student profiling crucial for tracking progress, identifying struggling students, and reducing educational disparities.

Method: CTGraph uses graph-level representation learning in a self-supervised manner to model student behaviors across multiple aspects including content coverage, learning intensity, and concept proficiency.

Result: The approach provides a holistic view of student learning journeys, identifies struggling students, enables comparative analysis of diverse groups, and pinpoints when and where students face difficulties.

Conclusion: CTGraph empowers educators with rich insights into student learning and paves the way for more targeted interventions to address performance gaps in educational systems.

Abstract: The surge in the adoption of Intelligent Tutoring Systems (ITSs) in education, while being integral to curriculum-based learning, can inadvertently exacerbate performance gaps. To address this problem, student profiling becomes crucial for tracking progress, identifying struggling students, and alleviating disparities among students. Such profiling requires measuring student behaviors and performance across different aspects, such as content coverage, learning intensity, and proficiency in different concepts within a learning topic. In this study, we introduce CTGraph, a graph-level representation learning approach to profile learner behaviors and performance in a self-supervised manner. Our experiments demonstrate that CTGraph can provide a holistic view of student learning journeys, accounting for different aspects of student behaviors and performance, as well as variations in their learning paths as aligned to the curriculum structure. We also show that our approach can identify struggling students and provide comparative analysis of diverse groups to pinpoint when and where students are struggling. As such, our approach opens more opportunities to empower educators with rich insights into student learning journeys and paves the way for more targeted interventions.

[266] VISION: Robust and Interpretable Code Vulnerability Detection Leveraging Counterfactual Augmentation

David Egea, Barproda Halder, Sanghamitra Dutta

Main category: cs.AI

TL;DR: VISION is a framework that uses counterfactual training with LLMs to improve GNN-based vulnerability detection by reducing spurious correlations, achieving significant accuracy improvements from 51.8% to 97.8%.

Details

Motivation: GNNs for vulnerability detection suffer from training data imbalances and label noise, learning spurious correlations from superficial code similarities that fail to generalize to real-world data.

Method: Three-step framework: (i) generate counterfactuals using LLM prompts, (ii) targeted GNN training on paired code examples with opposite labels, (iii) graph-based interpretability to identify crucial code statements.

Result: Significant improvements: overall accuracy from 51.8% to 97.8%, pairwise contrast accuracy from 4.5% to 95.8%, worst-group accuracy from 0.7% to 85.5% on CWE-20 vulnerability. Created CWE-20-CFA benchmark with 27,556 functions.

Conclusion: VISION enables robust, generalizable vulnerability detection and advances transparent AI cybersecurity systems through interpretable visualization for human-in-the-loop analysis.

Abstract: Automated detection of vulnerabilities in source code is an essential cybersecurity challenge, underpinning trust in digital systems and services. Graph Neural Networks (GNNs) have emerged as a promising approach as they can learn structural and logical code relationships in a data-driven manner. However, their performance is severely constrained by training data imbalances and label noise. GNNs often learn ‘spurious’ correlations from superficial code similarities, producing detectors that fail to generalize well to unseen real-world data. In this work, we propose a unified framework for robust and interpretable vulnerability detection, called VISION, to mitigate spurious correlations by systematically augmenting a counterfactual training dataset. Counterfactuals are samples with minimal semantic modifications but opposite labels. Our framework includes: (i) generating counterfactuals by prompting a Large Language Model (LLM); (ii) targeted GNN training on paired code examples with opposite labels; and (iii) graph-based interpretability to identify the crucial code statements relevant for vulnerability predictions while ignoring spurious ones. We find that VISION reduces spurious learning and enables more robust, generalizable detection, improving overall accuracy (from 51.8% to 97.8%), pairwise contrast accuracy (from 4.5% to 95.8%), and worst-group accuracy (from 0.7% to 85.5%) on the Common Weakness Enumeration (CWE)-20 vulnerability. We further demonstrate gains using proposed metrics: intra-class attribution variance, inter-class attribution distance, and node score dependency. We also release CWE-20-CFA, a benchmark of 27,556 functions (real and counterfactual) from the high-impact CWE-20 category. Finally, VISION advances transparent and trustworthy AI-based cybersecurity systems through interactive visualization for human-in-the-loop analysis.
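The gap between overall accuracy and pairwise contrast accuracy is easier to see with a toy computation. The definitions below are assumptions: pairwise contrast accuracy is taken as the fraction of (original, counterfactual) pairs where both members are classified correctly, and worst-group accuracy as the minimum accuracy over groups. The data and the `always_vulnerable` predictor are invented.

```python
def accuracy(examples, predict):
    return sum(1 for code, label in examples if predict(code) == label) / len(examples)


def pairwise_contrast_accuracy(pairs, predict):
    """Assumed definition: a pair counts only if both of its members are correct."""
    hits = sum(
        1 for (code_a, label_a), (code_b, label_b) in pairs
        if predict(code_a) == label_a and predict(code_b) == label_b
    )
    return hits / len(pairs)


def worst_group_accuracy(groups, predict):
    """Minimum accuracy over named groups of (code, label) examples."""
    return min(accuracy(examples, predict) for examples in groups.values())


# Each pair holds a function and its counterfactual with the opposite label.
pairs = [(("f1", 1), ("f1_cf", 0)), (("f2", 0), ("f2_cf", 1))]
always_vulnerable = lambda code: 1  # a detector that ignores the code entirely

flat = [example for pair in pairs for example in pair]
print(accuracy(flat, always_vulnerable))
print(pairwise_contrast_accuracy(pairs, always_vulnerable))
```

A detector that ignores the code reaches 50% overall accuracy on balanced pairs yet scores zero on the pairwise metric, which is the pattern behind a baseline near 51.8% overall and 4.5% pairwise.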

[267] Novel Approaches to Artificial Intelligence Development Based on the Nearest Neighbor Method

I. I. Priezzhev, D. A. Danko, A. V. Shubin

Main category: cs.AI

TL;DR: Proposes hierarchical clustering with k-nearest neighbors and Kohonen maps to address neural network limitations like hallucination, high computational cost, and catastrophic forgetting, achieving significant speed improvements with minimal accuracy loss.

Details

Motivation: To overcome fundamental limitations of neural networks including hallucination effects, high computational complexity, costly fine-tuning, and catastrophic forgetting that hinder their use in critical applications like medicine and scientific research.

Method: Uses k-nearest neighbors algorithm with hierarchical clustering structures and tree-like data structures based on Kohonen self-organizing maps to accelerate nearest neighbor searches and reduce computational load.

Result: Tests on handwritten digit recognition and simple subtitle translation showed nearest neighbor search time reduced by a factor of hundreds compared to exhaustive search, with only a slight reduction in accuracy. The method significantly reduces or eliminates hallucination and simplifies model expansion.

Conclusion: The proposed approach offers transparency, interpretability, aligns with human cognition, and shows strong potential for high-reliability applications requiring explainable results.

Abstract: Modern neural network technologies, including large language models, have achieved remarkable success in various artificial intelligence applications; however, they face a range of fundamental limitations. Among them are hallucination effects, high computational complexity of training and inference, costly fine-tuning, and catastrophic forgetting issues. These limitations significantly hinder the use of neural networks in critical areas such as medicine, industrial process management, and scientific research. This article proposes an alternative approach based on the nearest neighbors method with hierarchical clustering structures. Employing the k-nearest neighbors algorithm significantly reduces or completely eliminates hallucination effects while simplifying model expansion and fine-tuning without the need for retraining the entire network. To overcome the high computational load of the k-nearest neighbors method, the paper proposes using tree-like data structures based on Kohonen self-organizing maps, thereby greatly accelerating nearest neighbor searches. Tests conducted on handwritten digit recognition and simple subtitle translation tasks confirmed the effectiveness of the proposed approach. With only a slight reduction in accuracy, the nearest neighbor search time was reduced by a factor of hundreds compared to exhaustive search methods. The proposed method features transparency and interpretability, closely aligns with human cognitive mechanisms, and demonstrates potential for extensive use in tasks requiring high reliability and explainable results.
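The speed-up comes from searching only a small part of the stored data. The sketch below shows that principle with a single level of fixed centroids, which is a deliberate simplification: it is not a Kohonen self-organizing map and not the paper's tree structure, and the centroids and points are invented.

```python
import math


def nearest_centroid(point, centroids):
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def build_index(labeled_points, centroids):
    """Bucket every (point, label) pair under its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for point, label in labeled_points:
        buckets[nearest_centroid(point, centroids)].append((point, label))
    return buckets


def classify(query, centroids, buckets):
    """1-nearest-neighbor lookup restricted to the query's bucket."""
    candidates = buckets[nearest_centroid(query, centroids)]
    if not candidates:  # fall back to exhaustive search over all buckets
        candidates = [item for bucket in buckets.values() for item in bucket]
    return min(candidates, key=lambda item: math.dist(query, item[0]))[1]


centroids = [(0.0, 0.5), (10.0, 10.5)]
data = [((0, 0), "a"), ((0, 1), "a"), ((10, 10), "b"), ((10, 11), "b")]
buckets = build_index(data, centroids)
print(classify((9, 9), centroids, buckets))
```

The restriction is also where the slight accuracy loss comes from: the true nearest neighbor may sit in an adjacent bucket that is never examined. Adding a new labeled example only means appending it to one bucket, which reflects the abstract's point about expansion without retraining.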

[268] Enabling MoE on the Edge via Importance-Driven Expert Scheduling

Guoying Zhu, Meng Li, Haipeng Dai, Xuechen Liu, Weijun Wang, Keran Li, Jun Xiao, Ligeng Chen, Wei Wang

Main category: cs.AI

TL;DR: A novel MoE expert offloading approach that uses expert importance to guide substitution with cached experts, reducing memory usage and PCIe overhead while maintaining accuracy.

Details

Motivation: Deploying Mixture of Experts models on edge hardware is constrained by limited device memory, requiring efficient expert offloading strategies that preserve model accuracy.

Method: Leverages expert importance to guide offloading decisions, substituting low-importance activated experts with functionally similar cached experts. Introduces a scheduling policy that maximizes GPU-cached expert reuse.

Result: 48% lower decoding latency with an expert cache hit rate above 60%, while maintaining nearly lossless accuracy. Reduces memory usage and data transfer while largely eliminating PCIe overhead.

Conclusion: The importance-guided expert substitution approach enables efficient MoE deployment on edge hardware with significant performance improvements and minimal accuracy loss.

Abstract: The Mixture of Experts (MoE) architecture has emerged as a key technique for scaling Large Language Models by activating only a subset of experts per query. Deploying MoE on consumer-grade edge hardware, however, is constrained by limited device memory, making dynamic expert offloading essential. Unlike prior work that treats offloading purely as a scheduling problem, we leverage expert importance to guide decisions, substituting low-importance activated experts with functionally similar ones already cached in GPU memory, thereby preserving accuracy. As a result, this design reduces memory usage and data transfer, while largely eliminating PCIe overhead. In addition, we introduce a scheduling policy that maximizes the reuse ratio of GPU-cached experts, further boosting efficiency. Extensive evaluations show that our approach delivers 48% lower decoding latency with over 60% expert cache hit rate, while maintaining nearly lossless accuracy.
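The substitution idea can be sketched as a per-token scheduling decision. The assumptions here are loud ones: importance is approximated by the router's gate weight, similarity between experts is given as a precomputed table, and the threshold is arbitrary. None of these specifics come from the summary.

```python
def schedule(activated, cached, similarity, importance_threshold):
    """Decide, per activated expert, whether to reuse, substitute, or load.

    activated: expert id -> gate weight (assumed proxy for importance)
    cached: set of expert ids resident in GPU memory
    similarity: similarity[missing_expert][cached_expert] -> score
    """
    plan = {}
    for expert, weight in activated.items():
        if expert in cached:
            plan[expert] = ("hit", expert)
        elif weight < importance_threshold and cached:
            substitute = max(cached, key=lambda c: similarity[expert][c])
            plan[expert] = ("substitute", substitute)  # avoids a PCIe transfer
        else:
            plan[expert] = ("load", expert)  # important expert: fetch it
    return plan


activated = {0: 0.7, 3: 0.2, 5: 0.1}
cached = {0, 1, 2}
similarity = {3: {0: 0.1, 1: 0.9, 2: 0.3}, 5: {0: 0.2, 1: 0.1, 2: 0.8}}
print(schedule(activated, cached, similarity, importance_threshold=0.15))
```

Only expert 3 triggers a transfer in this example: expert 0 is already cached, and the low-weight expert 5 is replaced by its most similar cached neighbor.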

Pontus Strimling, Simon Karlsson, Irina Vartanova, Kimmo Eriksson

Main category: cs.AI

TL;DR: Large language models like GPT-4.5, Gemini 2.5 Pro, GPT-5, and Claude Sonnet 4 can predict human social appropriateness judgments better than most individual humans, demonstrating that sophisticated social cognition can emerge from statistical learning alone without embodied experience.

Details

Motivation: To investigate whether large language models can acquire sophisticated social norm understanding through statistical learning alone, challenging theories that emphasize the exclusive necessity of embodied social experience for cultural competence.

Method: Two studies evaluating multiple AI systems’ ability to predict human social appropriateness judgments for 555 everyday scenarios by comparing how closely they predicted average human judgment relative to individual human participants.

Result: GPT-4.5 achieved 100th percentile accuracy (outperforming every human participant), Gemini 2.5 Pro outperformed 98.7% of humans, GPT-5 outperformed 97.8%, and Claude Sonnet 4 outperformed 96.0%. All models showed systematic, correlated errors despite their predictive power.

Conclusion: Language serves as a remarkably rich repository for cultural knowledge transmission, and sophisticated social cognition models can emerge from statistical learning over linguistic data alone, though systematic limitations across architectures suggest potential boundaries of pattern-based social understanding.

Abstract: A fundamental question in cognitive science concerns how social norms are acquired and represented. While humans typically learn norms through embodied social experience, we investigated whether large language models can achieve sophisticated norm understanding through statistical learning alone. Across two studies, we systematically evaluated multiple AI systems’ ability to predict human social appropriateness judgments for 555 everyday scenarios by examining how closely they predicted the average judgment compared to each human participant. In Study 1, GPT-4.5’s accuracy in predicting the collective judgment on a continuous scale exceeded that of every human participant (100th percentile). Study 2 replicated this, with Gemini 2.5 Pro outperforming 98.7% of humans, GPT-5 97.8%, and Claude Sonnet 4 96.0%. Despite this predictive power, all models showed systematic, correlated errors. These findings demonstrate that sophisticated models of social cognition can emerge from statistical learning over linguistic data alone, challenging strong versions of theories emphasizing the exclusive necessity of embodied experience for cultural competence. The systematic nature of AI limitations across different architectures indicates potential boundaries of pattern-based social understanding, while the models’ ability to outperform nearly all individual humans in this predictive task suggests that language serves as a remarkably rich repository for cultural knowledge transmission.
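The percentile comparison can be reproduced in miniature. This sketch makes simplifying assumptions: error is mean absolute deviation from the per-scenario average rating, and each human is compared against an average that includes their own rating, whereas a careful analysis might leave that rating out. The ratings are invented.

```python
def mean_abs_error(ratings, consensus):
    return sum(abs(r - c) for r, c in zip(ratings, consensus)) / len(consensus)


def percent_of_humans_outperformed(model_ratings, human_ratings):
    """Share of participants whose error against the average exceeds the model's."""
    n_items = len(model_ratings)
    consensus = [
        sum(person[i] for person in human_ratings) / len(human_ratings)
        for i in range(n_items)
    ]
    model_error = mean_abs_error(model_ratings, consensus)
    human_errors = [mean_abs_error(person, consensus) for person in human_ratings]
    beaten = sum(1 for error in human_errors if error > model_error)
    return 100.0 * beaten / len(human_errors)


# Four hypothetical participants rating two scenarios on a 1-5 scale.
humans = [[1, 5], [2, 4], [5, 1], [4, 2]]
print(percent_of_humans_outperformed([3, 3], humans))
print(percent_of_humans_outperformed([3, 5], humans))
```

A model that matches the average exactly beats every individual here, which shows why predicting a collective judgment can be easier for an aggregate-trained system than for any single rater.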

[270] Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark

Yuxuan Cai, Yipeng Hao, Jie Zhou, Hang Yan, Zhikai Lei, Rui Zhen, Zhenhua Han, Yutao Yang, Junsong Li, Qianjun Pan, Tianyu Huai, Qin Chen, Xin Li, Kai Chen, Bo Zhang, Xipeng Qiu, Liang He

Main category: cs.AI

TL;DR: The paper introduces Experience-driven Lifelong Learning (ELL), a framework for creating self-evolving AI agents that learn continuously through real-world interaction, along with StuLife benchmark for evaluating lifelong learning capabilities.

Details

Motivation: As AI advances toward general intelligence, there's a need to shift from systems optimized for static tasks to open-ended agents that can learn continuously through real-world interaction and experience.

Method: Proposes ELL framework with four core principles: Experience Exploration, Long-term Memory, Skill Learning, and Knowledge Internalization. Also introduces StuLife benchmark dataset simulating a student’s college journey across three phases and ten sub-scenarios.

Result: Developed a comprehensive framework and benchmark for evaluating lifelong learning capabilities including memory retention, skill transfer, and self-motivated behavior. Evaluated state-of-the-art LLMs on the StuLife benchmark.

Conclusion: The ELL framework and StuLife benchmark provide a foundation for building self-evolving agents capable of continuous growth, representing a step toward more general AI systems that can learn through real-world experience.

Abstract: As AI advances toward general intelligence, the focus is shifting from systems optimized for static tasks to creating open-ended agents that learn continuously. In this paper, we introduce Experience-driven Lifelong Learning (ELL), a framework for building self-evolving agents capable of continuous growth through real-world interaction. The framework is built on four core principles: (1) Experience Exploration: Agents learn through continuous, self-motivated interaction with dynamic environments, navigating interdependent tasks and generating rich experiential trajectories. (2) Long-term Memory: Agents preserve and structure historical knowledge, including personal experiences, domain expertise, and commonsense reasoning, into a persistent memory system. (3) Skill Learning: Agents autonomously improve by abstracting recurring patterns from experience into reusable skills, which are actively refined and validated for application in new tasks. (4) Knowledge Internalization: Agents internalize explicit and discrete experiences into implicit and intuitive capabilities as “second nature”. We also introduce StuLife, a benchmark dataset for ELL that simulates a student’s holistic college journey, from enrollment to academic and personal development, across three core phases and ten detailed sub-scenarios. StuLife is designed around three key paradigm shifts: From Passive to Proactive, From Context to Memory, and From Imitation to Learning. In this dynamic environment, agents must acquire and distill practical skills and maintain persistent memory to make decisions based on evolving state variables. StuLife provides a comprehensive platform for evaluating lifelong learning capabilities, including memory retention, skill transfer, and self-motivated behavior. Beyond evaluating SOTA LLMs on the StuLife benchmark, we also explore the role of context engineering in advancing AGI.

[271] Sense of Self and Time in Borderline Personality. A Comparative Robustness Study with Generative AI

Marcin Moskalewicz, Anna Sterna, Marek Pokropski, Paula Flores

Main category: cs.AI

TL;DR: LLMs can support phenomenological analysis of BPD experiences, with Gemini performing closest to human analysis and recovering omitted themes, though overlap varies significantly.

Details

Motivation: To examine if large language models can effectively support qualitative analysis of first-person experiences in Borderline Personality Disorder, particularly focusing on temporality and selfhood disorders.

Method: Compared three LLMs (GPT-4o, Gemini 2.5 Pro, Claude Opus 4) prompted to mimic human investigators’ interpretative style on 24 inpatients’ life-story interviews, using blinded expert evaluation, semantic congruence, Jaccard coefficients, and multidimensional validity ratings.

Result: Variable overlap with human analysis (0% GPT, 42% Claude, 58% Gemini), low Jaccard coefficients (0.21-0.28), but models recovered themes omitted by humans. Gemini performed best with highest validity scores and was judged as human by blinded experts.

Conclusion: AI-augmented thematic analysis shows potential to mitigate human interpretative bias, with performance strongly correlated to text quantity, demonstrating both variability and promise for clinical applications.

Abstract: This study examines the capacity of large language models (LLMs) to support phenomenological qualitative analysis of first-person experience in Borderline Personality Disorder (BPD), understood as a disorder of temporality and selfhood. Building on a prior human-led thematic analysis of 24 inpatients’ life-story interviews, we compared three LLMs (OpenAI GPT-4o, Google Gemini 2.5 Pro, Anthropic Claude Opus 4) prompted to mimic the interpretative style of the original investigators. The models were evaluated with blinded and non-blinded expert judges in phenomenology and clinical psychology. Assessments included semantic congruence, Jaccard coefficients, and multidimensional validity ratings (credibility, coherence, substantiveness, and groundedness in data). Results showed variable overlap with the human analysis, from 0 percent in GPT to 42 percent in Claude and 58 percent in Gemini, and a low Jaccard coefficient (0.21-0.28). However, the models recovered themes omitted by humans. Gemini’s output most closely resembled the human analysis, with validity scores significantly higher than GPT and Claude (p < 0.0001), and was judged as human by blinded experts. All scores strongly correlated (R > 0.78) with the quantity of text and words per theme, highlighting both the variability and potential of AI-augmented thematic analysis to mitigate human interpretative bias.
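
For readers unfamiliar with the Jaccard coefficient reported in this entry, the following is a minimal sketch of how set overlap between two theme lists could be computed. The theme labels are hypothetical placeholders, not themes from the study, and the study's actual matching procedure (for example, how semantically similar themes are aligned) is not described in the abstract.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical theme labels, for illustration only.
human_themes = {"unstable identity", "fragmented time", "emptiness", "fear of abandonment"}
model_themes = {"unstable identity", "emptiness", "impulsivity"}

score = jaccard(human_themes, model_themes)
print(round(score, 2))  # 2 shared themes out of 5 distinct themes -> 0.4
```

A low coefficient can coexist with useful output: themes that appear only in the model's set lower the score even when they are valid themes the human analysis omitted.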

[272] MAB Optimizer for Estimating Math Question Difficulty via Inverse CV without NLP

Surajit Das, Gourav Roy, Aleksei Eliseev, Ram Kumar Rajendran

Main category: cs.AI

TL;DR: APME framework uses reinforcement learning to estimate question difficulty from solver performance data (marks and time) without linguistic features or expert labels, achieving high accuracy across diverse educational contexts.

Details

Motivation: Traditional human labeling is subjective and existing NLP methods fail in symbolic domains like algebra. There's a need for objective, domain-agnostic methods to determine question difficulty in Intelligent Tutoring Systems.

Method: Reinforcement learning-based Multi-Armed Bandit framework that uses solver performance data (marks obtained and time taken) with inverse coefficient of variation as a risk-adjusted metric for adaptive assessment.

Result: Achieved average R² of 0.9213 and average RMSE of 0.0584 across three heterogeneous datasets, outperforming regression-based, NLP-driven, and IRT baseline models, especially in symbolic domains.

Conclusion: The domain-agnostic, self-supervised approach effectively estimates question difficulty, aligns with pedagogical principles, and can be extended to various domains where solver interaction data is available.

Abstract: The evolution of technology and education is driving the emergence of Intelligent & Autonomous Tutoring Systems (IATS), where objective and domain-agnostic methods for determining question difficulty are essential. Traditional human labeling is subjective, and existing NLP-based approaches fail in symbolic domains like algebra. This study introduces the Approach of Passive Measures among Educands (APME), a reinforcement learning-based Multi-Armed Bandit (MAB) framework that estimates difficulty solely from solver performance data – marks obtained and time taken – without requiring linguistic features or expert labels. By leveraging the inverse coefficient of variation as a risk-adjusted metric, the model provides an explainable and scalable mechanism for adaptive assessment. Empirical validation was conducted on three heterogeneous datasets. Across these diverse contexts, the model achieved an average R² of 0.9213 and an average RMSE of 0.0584, confirming its robustness, accuracy, and adaptability to different educational levels and assessment formats. Compared with baseline approaches, such as regression-based, NLP-driven, and IRT models, the proposed framework consistently outperformed alternatives, particularly in purely symbolic domains. The findings highlight that (i) item heterogeneity strongly influences perceived difficulty, and (ii) variance in solver outcomes is as critical as mean performance for adaptive allocation. Pedagogically, the model aligns with Vygotsky’s Zone of Proximal Development by identifying tasks that balance challenge and attainability, supporting motivation while minimizing disengagement. This domain-agnostic, self-supervised approach advances difficulty tagging in IATS and can be extended beyond algebra wherever solver interaction data is available.
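
The inverse coefficient of variation named in this entry is the mean divided by the standard deviation. The sketch below shows only that ratio on hypothetical marks. How APME combines marks and time, or feeds the ratio into the bandit, is not specified in the abstract, so nothing here should be read as the authors' implementation.

```python
from statistics import mean, pstdev


def inverse_cv(values):
    """Return mean / population standard deviation; inf when the values have no spread."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return float("inf")
    return mu / sigma


# Hypothetical marks (out of 10) from four solvers on two questions.
marks_q1 = [8, 9, 8, 9]    # high and consistent
marks_q2 = [2, 9, 5, 10]   # lower on average and far more variable

print(inverse_cv(marks_q1))            # 8.5 / 0.5 = 17.0
print(round(inverse_cv(marks_q2), 2))  # roughly 2.03
```

Because the ratio penalises spread as well as a low mean, a question with erratic outcomes scores lower than one with the same mean and consistent outcomes, which matches the abstract's point that variance matters alongside mean performance.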

[273] Investigating Advanced Reasoning of Large Language Models via Black-Box Interaction

Congchi Yin, Tianyi Wu, Yankai Shu, Alex Gu, Yunhan Wang, Jun Shao, Xun Jiang, Piji Li

Main category: cs.AI

TL;DR: A new evaluation paradigm called black-box interaction is introduced to assess LLMs’ integrated reasoning abilities through interactive exploration of hidden functions, with o3 performing best but still struggling with complex tasks due to planning limitations.

Details

Motivation: Existing evaluation tasks fail to assess LLMs' integrated reasoning abilities in interactive, unknown environments, focusing only on isolated reasoning types rather than the holistic reasoning process needed for real-world discovery.

Method: Proposes black-box interaction paradigm where LLMs interact with hidden functions (black-boxes) through input-output exploration, and builds the Oracle benchmark with 6 task types and 96 black-boxes to test 19 modern LLMs.

Result: o3 ranks first in 5 out of 6 tasks, achieving over 70% accuracy on most easy black-boxes, but its average performance drops below 40% on some hard black-box tasks. LLMs universally struggle with developing efficient exploration strategies for hypothesis refinement.

Conclusion: LLMs lack high-level planning capability for adaptive exploration strategies, highlighting the need for improved reasoning in interactive environments beyond current isolated reasoning evaluations.

Abstract: Existing tasks fall short in evaluating the reasoning ability of Large Language Models (LLMs) in an interactive, unknown environment. This deficiency leads to the isolated assessment of deductive, inductive, and abductive reasoning, neglecting the integrated reasoning process that is indispensable for human discovery of the real world. We introduce a novel evaluation paradigm, black-box interaction, to tackle this challenge. A black-box is defined by a hidden function that maps a specific set of inputs to outputs. LLMs are required to unravel the hidden function behind the black-box by interacting with it in given exploration turns, and reasoning over observed input-output pairs. Leveraging this idea, we build the Oracle benchmark, which comprises 6 types of black-box tasks and 96 black-boxes. 19 modern LLMs are benchmarked. o3 ranks first in 5 of the 6 tasks, achieving over 70% accuracy on most easy black-boxes. But it still struggles with some hard black-box tasks, where its average performance drops below 40%. Further analysis indicates a universal difficulty among LLMs: They lack the high-level planning capability to develop efficient and adaptive exploration strategies for hypothesis refinement.
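
The black-box interaction loop described in this entry can be illustrated with a toy sketch: a hidden function is queried for a fixed number of turns, and a hypothesis is then checked against the observed input-output pairs. The hidden function, the fixed query list, and the linear hypothesis class are all assumptions made for illustration. In the benchmark an LLM would choose the queries and form the hypothesis, and the actual task types are not detailed in the abstract.

```python
from typing import Callable


def make_black_box() -> Callable[[int], int]:
    """Hypothetical hidden function; the solver only ever sees its outputs."""
    return lambda x: 3 * x + 2


def explore(black_box: Callable[[int], int], queries: list[int]) -> list[tuple[int, int]]:
    """Spend one exploration turn per query and record the input-output pairs."""
    return [(x, black_box(x)) for x in queries]


def fit_linear(pairs: list[tuple[int, int]]) -> tuple[int, int, bool]:
    """Hypothesise f(x) = a*x + b from the first two pairs, then test it on all pairs."""
    (x0, y0), (x1, y1) = pairs[0], pairs[1]
    a = (y1 - y0) // (x1 - x0)
    b = y0 - a * x0
    consistent = all(a * x + b == y for x, y in pairs)
    return a, b, consistent


pairs = explore(make_black_box(), [0, 1, 5])
a, b, ok = fit_linear(pairs)
print(pairs)     # [(0, 2), (1, 5), (5, 17)]
print(a, b, ok)  # 3 2 True
```

The third query acts as a held-out check on the hypothesis. Choosing which inputs to spend turns on is the planning step that the abstract reports LLMs find difficult.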

[274] A Concurrent Modular Agent: Framework for Autonomous LLM Agents

Norihiro Maruyama, Takahide Yoshida, Hiroki Sato, Atsushi Masumori, Johnsmith, Takashi Ikegami

Main category: cs.AI

TL;DR: CMA is a framework using multiple concurrent LLM-based modules that operate asynchronously to create coherent, fault-tolerant agent behavior through language-mediated interactions and shared global state.

Details

Motivation: To address long-standing difficulties in agent architectures by enabling flexible, adaptive behavior through concurrent modular design inspired by Minsky's Society of Mind theory.

Method: Uses multiple LLM-based modules that operate fully asynchronously with inter-module communication and a single shared global state, allowing intention to emerge from language-mediated interactions.

Result: Demonstrated viability through two practical use-case studies, showing emergent properties suggesting complex cognitive phenomena like self-awareness can arise from organized interaction of simpler processes.

Conclusion: This approach provides a practical realization of Minsky’s Society of Mind theory and opens new avenues for artificial intelligence research by showing complex behaviors can emerge from concurrent modular interactions.

Abstract: We introduce the Concurrent Modular Agent (CMA), a framework that orchestrates multiple Large-Language-Model (LLM)-based modules that operate fully asynchronously yet maintain a coherent and fault-tolerant behavioral loop. This framework addresses long-standing difficulties in agent architectures by letting intention emerge from language-mediated interactions among autonomous processes. This approach enables flexible, adaptive, and context-dependent behavior through the combination of concurrently executed modules that offload reasoning to an LLM, inter-module communication, and a single shared global state. We consider this approach to be a practical realization of Minsky’s Society of Mind theory. We demonstrate the viability of our system through two practical use-case studies. The emergent properties observed in our system suggest that complex cognitive phenomena like self-awareness may indeed arise from the organized interaction of simpler processes, supporting Minsky’s Society of Mind concept and opening new avenues for artificial intelligence research. The source code for our work is available at: https://github.com/AlternativeMachine/concurrent-modular-agent.
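
The three ingredients named in this entry (concurrently executed modules, inter-module communication, and one shared global state) can be sketched with Python's standard asyncio library. This is an illustrative toy under stated assumptions: the module names `perceive` and `reflect` are hypothetical, the LLM call is replaced by a string-formatting placeholder, and none of it reflects the API of the linked repository.

```python
import asyncio


async def perceive(state: dict, outbox: asyncio.Queue) -> None:
    """Hypothetical sensing module: writes to shared state and messages another module."""
    for reading in ["door opened", "light on"]:
        state["last_observation"] = reading
        await outbox.put(reading)
        await asyncio.sleep(0)  # yield control so other modules can run
    await outbox.put(None)      # sentinel: no more messages


async def reflect(state: dict, inbox: asyncio.Queue) -> None:
    """Hypothetical reasoning module: consumes messages as they arrive."""
    while True:
        msg = await inbox.get()
        if msg is None:
            break
        # Placeholder for offloading reasoning to an LLM.
        state.setdefault("notes", []).append(f"noted: {msg}")


async def run_agent() -> dict:
    state: dict = {}           # the single shared global state
    channel = asyncio.Queue()  # inter-module communication
    await asyncio.gather(perceive(state, channel), reflect(state, channel))
    return state


state = asyncio.run(run_agent())
print(state)
```

Neither module calls the other directly. Each one reads or writes the shared state and the message channel, so modules can be added or removed without changing the rest.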

[275] Can Structured Templates Facilitate LLMs in Tackling Harder Tasks? : An Exploration of Scaling Laws by Difficulty

Zhichao Yang, Zhaoxin Fan, Gen Li, Yuanze Hu, Xinyu Wang, Ye Qiu, Xin Wang, Yifan Sun, Wenjun Wu

Main category: cs.AI

TL;DR: The paper proposes a Structured Solution Template (SST) framework that uses solution templates and curriculum learning to improve LLMs’ procedural reasoning in mathematics, addressing the U-shaped performance curve with respect to training data difficulty.

Details

Motivation: Current post-training methods for LLMs fail to capture deep procedural logic in complex mathematical tasks, with performance following a U-shaped curve relative to training data complexity - too much low-difficulty data hurts abstraction while high-difficulty data enhances reasoning.

Method: SST framework includes: (1) fine-tuning with structured solution-template chains and dynamically weighted loss to prioritize procedural logic, (2) prompt-time injection of solution templates as cognitive scaffolds, and (3) integrated curriculum fine-tuning that teaches self-plan-execute-correct cycles.

Result: Experiments on GSM8K, AIME24, and Dynamic En benchmarks show SST significantly improves both accuracy and efficiency, particularly on harder mathematical problems.

Conclusion: The proposed SST framework effectively addresses the scaling law by difficulty and enhances LLMs’ procedural reasoning capabilities through structured solution templates and curriculum-based training.

Abstract: Structured, procedural reasoning is essential for Large Language Models (LLMs), especially in mathematics. While post-training methods have improved LLM performance, they still fall short in capturing deep procedural logic on complex tasks. To tackle the issue, in this paper, we first investigate this limitation and uncover a novel finding: a Scaling Law by Difficulty, which reveals that model performance follows a U-shaped curve with respect to training data complexity – excessive low-difficulty data impedes abstraction, while high-difficulty data significantly enhances reasoning ability. Motivated by this, we propose the Structured Solution Template (SST) framework, which uses solution templates and a curriculum of varied difficulty to explicitly teach procedural reasoning. Specifically, SST comprises (1) fine-tuning with structured solution-template chains and dynamically weighted loss to prioritize procedural logic, (2) prompt-time injection of solution templates as cognitive scaffolds to guide inference, and (3) integrated curriculum fine-tuning that explicitly teaches the model to self-plan, execute, and self-correct. Experiments on GSM8K, AIME24, and the new Dynamic En benchmark show that SST significantly improves both accuracy and efficiency, especially on harder problems.

[276] Trustworthy Agents for Electronic Health Records through Confidence Estimation

Yongwoo Song, Minbyul Jeong, Mujeen Sung

Main category: cs.AI

TL;DR: Proposed HCAcc@k% metric and TrustEHRAgent for reliable clinical question answering with confidence estimation, achieving significant improvements over baselines under strict reliability constraints.

Details

Motivation: LLMs show promise for EHR information extraction but face hallucination risks in clinical settings, requiring metrics that quantify accuracy-reliability trade-offs.

Method: Introduced Hallucination Controlled Accuracy at k% (HCAcc@k%) metric and TrustEHRAgent agent with stepwise confidence estimation for clinical QA on MIMIC-III and eICU datasets.

Result: TrustEHRAgent outperformed baselines with 44.23%p and 25.34%p improvements at HCAcc@70%, where baseline methods failed at these strict reliability thresholds.

Conclusion: Traditional accuracy metrics are insufficient for healthcare AI evaluation; confidence-aware agents enable trustworthy clinical applications by delivering accurate information or transparent uncertainty expression.

Abstract: Large language models (LLMs) show promise for extracting information from Electronic Health Records (EHR) and supporting clinical decisions. However, deployment in clinical settings faces challenges due to hallucination risks. We propose Hallucination Controlled Accuracy at k% (HCAcc@k%), a novel metric quantifying the accuracy-reliability trade-off at varying confidence thresholds. We introduce TrustEHRAgent, a confidence-aware agent incorporating stepwise confidence estimation for clinical question answering. Experiments on MIMIC-III and eICU datasets show TrustEHRAgent outperforms baselines under strict reliability constraints, achieving improvements of 44.23%p and 25.34%p at HCAcc@70% while baseline methods fail at these thresholds. These results highlight limitations of traditional accuracy metrics in evaluating healthcare AI agents. Our work contributes to developing trustworthy clinical agents that deliver accurate information or transparently express uncertainty when confidence is low.

[277] Reasoning LLMs in the Medical Domain: A Literature Survey

Armin Berger, Sarthak Khanna, David Berghaus, Rafet Sifa

Main category: cs.AI

TL;DR: Survey on medical LLMs’ evolution from information retrieval to clinical reasoning systems, covering prompting techniques, reinforcement learning, evaluation methods, and challenges like bias mitigation and safety.

Details

Motivation: To examine the transformation of medical LLMs into sophisticated clinical reasoning systems that enhance decision transparency and explainability in healthcare contexts.

Method: Thorough analysis of enabling technological foundations including specialized prompting techniques (Chain-of-Thought), reinforcement learning breakthroughs (DeepSeek-R1), purpose-built medical frameworks, multi-agent collaborative systems, and innovative prompting architectures.

Result: Critical assessment of current evaluation methodologies for medical validation and identification of persistent challenges including field interpretation limitations, bias mitigation strategies, patient safety frameworks, and multimodal clinical data integration.

Conclusion: Establishes a roadmap for developing reliable LLMs that can serve as effective partners in clinical practice and medical research by addressing current limitations and challenges.

Abstract: The emergence of advanced reasoning capabilities in Large Language Models (LLMs) marks a transformative development in healthcare applications. Beyond merely expanding functional capabilities, these reasoning mechanisms enhance decision transparency and explainability, both critical requirements in medical contexts. This survey examines the transformation of medical LLMs from basic information retrieval tools to sophisticated clinical reasoning systems capable of supporting complex healthcare decisions. We provide a thorough analysis of the enabling technological foundations, with a particular focus on specialized prompting techniques like Chain-of-Thought and recent breakthroughs in Reinforcement Learning exemplified by DeepSeek-R1. Our investigation evaluates purpose-built medical frameworks while also examining emerging paradigms such as multi-agent collaborative systems and innovative prompting architectures. The survey critically assesses current evaluation methodologies for medical validation and addresses persistent challenges in field interpretation limitations, bias mitigation strategies, patient safety frameworks, and integration of multimodal clinical data. Through this survey, we seek to establish a roadmap for developing reliable LLMs that can serve as effective partners in clinical practice and medical research.

[278] Hybrid Deep Searcher: Integrating Parallel and Sequential Search Reasoning

Dayoon Ko, Jihyuk Kim, Haeju Park, Sohyeon Kim, Dahyun Lee, Yongrae Jo, Gunhee Kim, Moontae Lee, Kyungjae Lee

Main category: cs.AI

TL;DR: HDS-QA is a synthetic dataset that trains large reasoning models to distinguish parallelizable from sequential queries, enabling hybrid parallel/sequential querying that reduces latency while maintaining accuracy.

Details

Motivation: Existing sequential querying methods in large reasoning models increase inference latency and context length, diminishing coherence and potentially reducing accuracy.

Method: Created HDS-QA synthetic dataset from Natural Questions with hybrid-hop questions combining parallelizable and sequential subqueries, then fine-tuned an LRM (HybridDeepSearcher) using this dataset.

Result: HybridDeepSearcher outperforms state-of-the-art baselines, achieving +15.9 and +11.5 F1 on FanOutQA and BrowseComp respectively, with fewer search turns and reduced latency.

Conclusion: Explicitly training LRMs for hybrid parallel and sequential querying demonstrates efficiency, scalability, and effectiveness in complex reasoning tasks.

Abstract: Large reasoning models (LRMs) have demonstrated strong performance in complex, multi-step reasoning tasks. Existing methods enhance LRMs by sequentially integrating external knowledge retrieval; models iteratively generate queries, retrieve external information, and progressively reason over this information. However, purely sequential querying increases inference latency and context length, diminishing coherence and potentially reducing accuracy. To address these limitations, we introduce HDS-QA (Hybrid Deep Search QA), a synthetic dataset automatically generated from Natural Questions, explicitly designed to train LRMs to distinguish parallelizable from sequential queries. HDS-QA comprises hybrid-hop questions that combine parallelizable independent subqueries (executable simultaneously) and sequentially dependent subqueries (requiring step-by-step resolution), along with synthetic reasoning-querying-retrieval paths involving parallel queries. We fine-tune an LRM using HDS-QA, naming the model HybridDeepSearcher, which outperforms state-of-the-art baselines across multiple benchmarks, notably achieving +15.9 and +11.5 F1 on FanOutQA and a subset of BrowseComp, respectively, both requiring comprehensive and exhaustive search. Experimental results highlight two key advantages: HybridDeepSearcher reaches comparable accuracy with fewer search turns, significantly reducing inference latency, and it effectively scales as more turns are permitted. These results demonstrate the efficiency, scalability, and effectiveness of explicitly training LRMs to leverage hybrid parallel and sequential querying.
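
The difference between parallelizable and sequentially dependent subqueries in this entry can be sketched with asyncio. The `search` coroutine is a hypothetical stand-in for a retrieval call, and the query strings are placeholders. The sketch shows only the control flow, not HybridDeepSearcher's learned behaviour.

```python
import asyncio


async def search(query: str) -> str:
    """Hypothetical retrieval call; sleeps briefly to mimic network latency."""
    await asyncio.sleep(0.01)
    return f"result({query})"


async def hybrid_search() -> list[str]:
    # Parallel step: the two subqueries are independent, so issue them together.
    a, b = await asyncio.gather(
        search("birthplace of author A"),
        search("birthplace of author B"),
    )
    # Sequential step: this query needs both earlier answers before it can be written.
    final = await search(f"compare {a} and {b}")
    return [a, b, final]


results = asyncio.run(hybrid_search())
for r in results:
    print(r)
```

Here three searches complete in two rounds of waiting instead of three, which is the kind of latency saving the abstract attributes to fewer search turns.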

[279] Algorithmic Collective Action with Multiple Collectives

Claudio Battiloro, Pietro Greiner, Bret Nestor, Oumaima Amezgar, Francesca Dominici

Main category: cs.AI

TL;DR: First theoretical framework for Algorithmic Collective Action with multiple collectives acting on the same classification system, analyzing how different-sized groups with aligned or conflicting goals can plant signals to bias classifiers.

Details

Motivation: Real-world algorithmic collective actions are often decentralized with multiple collectives having shared objectives but different strategies, yet existing literature focuses only on single collective settings.

Method: Developed a theoretical framework for multiple collectives in classification systems, studying how collectives can plant signals by altering features to bias classifiers towards target classes, with analysis of collective sizes and goal alignment.

Result: Provided quantitative results showing the interplay between collectives’ sizes and their goal alignment in influencing classifier behavior through coordinated data manipulation.

Conclusion: The framework enables holistic treatment of Algorithmic Collective Action with multiple collectives, complementing previous empirical results and opening new research directions for decentralized user-side steering of learning systems.

Abstract: As learning systems increasingly influence everyday decisions, user-side steering via Algorithmic Collective Action (ACA), i.e., coordinated changes to shared data, offers a complement to regulator-side policy and firm-side model design. Although real-world actions have traditionally been decentralized and fragmented into multiple collectives despite sharing overarching objectives, with each collective differing in size, strategy, and actionable goals, most of the ACA literature has focused on single-collective settings. In this work, we present the first theoretical framework for ACA with multiple collectives acting on the same system. In particular, we focus on collective action in classification, studying how multiple collectives can plant signals, i.e., bias a classifier to learn an association between an altered version of the features and a chosen, possibly overlapping, set of target classes. We provide quantitative results about the role and the interplay of collectives’ sizes and their alignment of goals. Our framework, by also complementing previous empirical results, opens a path for a holistic treatment of ACA with multiple collectives.

[280] The Ramon Llull’s Thinking Machine for Automated Ideation

Xinran Zhao, Boyuan Zheng, Chenglei Si, Haofei Yu, Ken Liu, Runlong Zhou, Ruochen Li, Tong Chen, Xiang Li, Yiming Zhang, Tongshuang Wu

Main category: cs.AI

TL;DR: A modern reinterpretation of Llull’s Ars combinatoria using LLMs to generate research ideas through systematic combination of themes, domains, and methods mined from scientific literature.

Details

Motivation: To create an AI-powered thinking machine that augments scientific creativity by systematically recombining research elements (themes, domains, methods) in novel ways, inspired by medieval combinatorial knowledge generation frameworks.

Method: Define three compositional axes: Theme (motivations), Domain (problem settings), and Method (technical approaches). Mine these elements from human experts or conference papers, then use LLMs to generate research ideas by prompting with curated combinations of these building blocks.

Result: The approach produces diverse, relevant, and literature-grounded research ideas, demonstrating that LLM-driven exploration with structured combinatorial frameworks can effectively augment scientific ideation.

Conclusion: This modern thinking machine provides a lightweight, interpretable tool for scientific creativity enhancement and suggests a promising path for collaborative human-AI research ideation through systematic combinatorial exploration.

Abstract: This paper revisits Ramon Llull’s Ars combinatoria - a medieval framework for generating knowledge through symbolic recombination - as a conceptual foundation for building a modern Llull’s thinking machine for research ideation. Our approach defines three compositional axes: Theme (e.g., efficiency, adaptivity), Domain (e.g., question answering, machine translation), and Method (e.g., adversarial training, linear attention). These elements represent high-level abstractions common in scientific work - motivations, problem settings, and technical approaches - and serve as building blocks for LLM-driven exploration. We mine elements from human experts or conference papers and show that prompting LLMs with curated combinations produces research ideas that are diverse, relevant, and grounded in current literature. This modern thinking machine offers a lightweight, interpretable tool for augmenting scientific creativity and suggests a path toward collaborative ideation between humans and AI.
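
The combinatorial step described in this entry amounts to enumerating the Cartesian product of the three axes. The sketch below uses the example elements given in the abstract. The prompt wording is a hypothetical template, not the authors' prompt, and the curation step the abstract mentions is omitted.

```python
from itertools import product

# Example axis elements taken from the abstract.
themes = ["efficiency", "adaptivity"]
domains = ["question answering", "machine translation"]
methods = ["adversarial training", "linear attention"]


def build_prompt(theme: str, domain: str, method: str) -> str:
    """Hypothetical prompt template for one (Theme, Domain, Method) combination."""
    return f"Propose a research idea that improves {theme} in {domain} using {method}."


prompts = [build_prompt(t, d, m) for t, d, m in product(themes, domains, methods)]

print(len(prompts))  # 2 * 2 * 2 = 8 combinations
print(prompts[0])
```

The number of combinations grows multiplicatively with each axis, which is presumably why the abstract speaks of curated combinations instead of exhaustive enumeration.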

[281] The Subset Sum Matching Problem

Yufei Wu, Manuel R. Torres, Parisa Zehtabi, Alberto Pozanco Lancho, Michael Cashmore, Daniel Borrajo, Manuela Veloso

Main category: cs.AI

TL;DR: Introduces Subset Sum Matching Problem (SSMP) for financial applications like trades reconciliation, presents three algorithms (two suboptimal, one optimal), and evaluates them on a benchmark of varying complexity instances.

Details

Motivation: SSMP abstracts common financial applications such as trades reconciliation, addressing the need for efficient matching algorithms in financial operations.

Method: Developed three algorithms - two suboptimal approaches and one optimal solution. Created a benchmark covering different SSMP instances of varying complexity for experimental evaluation.

Result: The paper presents performance evaluation results comparing the three algorithms on the benchmark, though specific performance metrics are not detailed in the abstract.

Conclusion: SSMP is established as a relevant combinatorial optimization problem for financial applications, with multiple algorithmic approaches developed and evaluated for solving it effectively.

Abstract: This paper presents a new combinatorial optimisation task, the Subset Sum Matching Problem (SSMP), which is an abstraction of common financial applications such as trades reconciliation. We present three algorithms, two suboptimal and one optimal, to solve this problem. We also generate a benchmark to cover different instances of SSMP varying in complexity, and carry out an experimental evaluation to assess the performance of the approaches.

[282] StepWiser: Stepwise Generative Judges for Wiser Reasoning

Wei Xiong, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang, Jason Weston, Sainbayar Sukhbaatar

Main category: cs.AI

TL;DR: StepWiser is a generative judge model that meta-reasons about intermediate reasoning steps, providing better accuracy than classification-based approaches and enabling both training improvement and inference-time search enhancement.

Details

Motivation: Current process reward models lack explanations and have limited generalization due to their classification-based approach and reliance on static supervised datasets.

Method: Reframe stepwise reward modeling as a reasoning task, create a generative judge that outputs thinking tokens before verdicts, and train using reinforcement learning with relative outcomes of rollouts.

Result: Achieves better judgment accuracy on intermediate steps than existing methods, improves policy model during training, and enhances inference-time search capabilities.

Conclusion: Transforming stepwise reward modeling from classification to generative reasoning with meta-reasoning capabilities significantly improves performance and generalization in multi-step reasoning supervision.

Abstract: As models increasingly leverage multi-step reasoning strategies to solve complex problems, supervising the logical validity of these intermediate steps has become a critical research challenge. Process reward models address this by providing step-by-step feedback, but current approaches have two major drawbacks: they typically function as classifiers without providing explanations, and their reliance on supervised fine-tuning with static datasets limits generalization. Inspired by recent advances, we reframe stepwise reward modeling from a classification task to a reasoning task itself. We thus propose a generative judge that reasons about the policy model’s reasoning steps (i.e., meta-reasons), outputting thinking tokens before delivering a final verdict. Our model, StepWiser, is trained by reinforcement learning using relative outcomes of rollouts. We show that it (i) provides better judgment accuracy on intermediate steps than existing methods; (ii) can be used to improve the policy model at training time; and (iii) improves inference-time search.
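The abstract does not spell out how "relative outcomes of rollouts" become a training signal. One plausible reading, sketched below purely as an assumption, is that a step is labelled by comparing how often rollouts succeed when continued from just before the step versus just after it. The function names, the `margin` parameter, and the binary label are hypothetical and not taken from the paper.

```python
def success_rate(outcomes):
    """Fraction of rollouts that reached a correct final answer (1) vs not (0)."""
    return sum(outcomes) / len(outcomes)


def relative_step_label(outcomes_before, outcomes_after, margin=0.0):
    """Label a reasoning step by the change in rollout success it causes.

    Returns 1 when rollouts continued after the step succeed at least as
    often (up to `margin`) as rollouts continued before it, else 0.
    """
    delta = success_rate(outcomes_after) - success_rate(outcomes_before)
    return 1 if delta >= margin else 0


# Hypothetical rollout outcomes: 1 = correct final answer, 0 = incorrect.
helpful = relative_step_label([1, 0, 0, 0], [1, 1, 0, 0])
harmful = relative_step_label([1, 1, 0, 0], [0, 0, 0, 0])
print(helpful, harmful)
```

Under this reading, a label of this kind could serve as the reward for the judge's verdict on each step.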

[283] Model Context Protocols in Adaptive Transport Systems: A Survey

Gaurab Chhetri, Shriyank Somvanshi, Md Monzurul Islam, Shamyo Brotee, Mahmuda Sultana Mimi, Dipti Koirala, Biplov Pandey, Subasish Das

Main category: cs.AI

TL;DR: This survey paper analyzes the Model Context Protocol (MCP) as a unifying framework for adaptive transport systems, showing existing solutions naturally converge toward MCP-like architectures and proposing it as the foundation for next-generation intelligent transport infrastructures.

Details

Motivation: The rapid expansion of interconnected devices, autonomous systems, and AI applications has created severe fragmentation in adaptive transport systems, with diverse protocols and context sources remaining isolated, requiring a unifying solution.

Method: Systematic investigation and analysis of established literature, development of a five-category taxonomy covering adaptive mechanisms, context-aware frameworks, unification models, integration strategies, and MCP-enabled architectures.

Result: Findings reveal that traditional transport protocols have reached limits of isolated adaptation, MCP’s client-server and JSON-RPC structure enables semantic interoperability, and AI-driven transport demands integration paradigms uniquely suited to MCP.

Conclusion: MCP is positioned as a foundational paradigm for next-generation adaptive, context-aware, and intelligent transport infrastructures, with a research roadmap provided for its implementation.

Abstract: The rapid expansion of interconnected devices, autonomous systems, and AI applications has created severe fragmentation in adaptive transport systems, where diverse protocols and context sources remain isolated. This survey provides the first systematic investigation of the Model Context Protocol (MCP) as a unifying paradigm, highlighting its ability to bridge protocol-level adaptation with context-aware decision making. Analyzing established literature, we show that existing efforts have implicitly converged toward MCP-like architectures, signaling a natural evolution from fragmented solutions to standardized integration frameworks. We propose a five-category taxonomy covering adaptive mechanisms, context-aware frameworks, unification models, integration strategies, and MCP-enabled architectures. Our findings reveal three key insights: traditional transport protocols have reached the limits of isolated adaptation, MCP’s client-server and JSON-RPC structure enables semantic interoperability, and AI-driven transport demands integration paradigms uniquely suited to MCP. Finally, we present a research roadmap positioning MCP as a foundation for next-generation adaptive, context-aware, and intelligent transport infrastructures.
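To make the "client-server and JSON-RPC structure" concrete, here is a minimal sketch of a JSON-RPC 2.0 request and response pair of the sort an MCP client and server exchange. The envelope fields (`jsonrpc`, `id`, `method`, `params`, `result`) come from JSON-RPC 2.0, and `tools/call` is an MCP method. The tool name `get_traffic_state`, its arguments, and the simplified result payload are hypothetical and chosen only for a transport flavour.

```python
import json


def make_request(request_id, method, params):
    """Build a JSON-RPC 2.0 request object."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}


def make_result(request_id, result):
    """Build a JSON-RPC 2.0 success response for the given request id."""
    return {"jsonrpc": "2.0", "id": request_id, "result": result}


# Client side: ask a (hypothetical) tool for the state of one intersection.
request = make_request(
    1,
    "tools/call",
    {"name": "get_traffic_state", "arguments": {"intersection_id": "A12"}},
)
wire = json.dumps(request)  # what would travel over the transport

# Server side: decode, then answer with a simplified, made-up payload.
decoded = json.loads(wire)
response = make_result(decoded["id"], {"queue_length": 7})
print(json.dumps(response))
```

Because every message shares the same envelope and is matched by `id`, heterogeneous context sources can be queried through one uniform interface, which is the interoperability argument the survey makes.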

[284] A Survey on Causal Discovery: Theory and Practice

Alessio Zanga, Elif Ozkirimli, Fabio Stella

Main category: cs.AI

TL;DR: A comprehensive survey of recent advancements in causal discovery methods, covering algorithms, tools, data, and real-world applications for recovering causal graphs from data.

Details

Motivation: To understand the fundamental laws governing phenomena through causal inference, particularly focusing on modeling causal relationships between different aspects and enabling identification and estimation of causal effects.

Method: The paper provides a unified overview of existing causal discovery algorithms developed under different settings, presenting them in a consistent framework while also reporting useful tools and data resources.

Result: A comprehensive survey that organizes and synthesizes recent advancements in causal discovery, making the field more accessible and demonstrating practical applications of these methods.

Conclusion: Causal discovery methods can be fruitfully exploited for real-world applications to recover causal graphs from data, enabling better understanding and quantification of underlying causal relationships in various domains.

Abstract: Understanding the laws that govern a phenomenon is the core of scientific progress. This is especially true when the goal is to model the interplay between different aspects in a causal fashion. Indeed, causal inference itself is specifically designed to quantify the underlying relationships that connect a cause to its effect. Causal discovery is a branch of the broader field of causality in which causal graphs are recovered from data (whenever possible), enabling the identification and estimation of causal effects. In this paper, we explore recent advancements in causal discovery in a unified manner, provide a consistent overview of existing algorithms developed under different settings, report useful tools and data, present real-world applications to understand why and how these methods can be fruitfully exploited.

[285] Integrating Large Language Model for Improved Causal Discovery

Taiyu Ban, Lyuzhou Chen, Derui Lyu, Xiangyu Wang, Qinrui Zhu, Qiang Tu, Huanhuan Chen

Main category: cs.AI

TL;DR: An error-tolerant LLM-driven framework that integrates large language models with data-based causal discovery to improve structure learning while handling LLM reasoning inaccuracies.

Details

Motivation: Domain-specific causal discovery traditionally relies on scarce expert resources, while LLMs show potential as autonomous experts but face accuracy challenges in causal reasoning.

Method: Three-fold error-tolerant mechanism: accuracy-oriented prompting to restrict analysis range, knowledge-to-structure transition to align causal statements, and balancing data goodness-of-fit with LLM-derived priors.

Result: Evaluation on eight real-world causal structures demonstrates improved data-based causal discovery and robustness to inaccurate LLM-derived priors.

Conclusion: The proposed framework effectively integrates LLMs into causal discovery while addressing their reasoning inaccuracies, showing promise for autonomous expert guidance in structure learning.

Abstract: Recovering the structure of causal graphical models from observational data is an essential yet challenging task for causal discovery in scientific scenarios. Domain-specific causal discovery usually relies on expert validation or prior analysis to improve the reliability of recovered causality, which is, however, limited by the scarcity of expert resources. Recently, Large Language Models (LLMs) have been used for causal analysis across various domain-specific scenarios, suggesting their potential to act as autonomous experts in guiding data-based structure learning. However, integrating LLMs into causal discovery faces challenges due to inaccuracies in LLM-based reasoning about the actual causal structure. To address this challenge, we propose an error-tolerant LLM-driven causal discovery framework. The error-tolerant mechanism is three-fold, designed with careful consideration of potential inaccuracies. In the LLM-based reasoning process, an accuracy-oriented prompting strategy restricts causal analysis to a reliable range. Next, a knowledge-to-structure transition aligns LLM-derived causal statements with structural causal interactions. In the structure learning process, the goodness-of-fit to data and adherence to LLM-derived priors are balanced to further address prior inaccuracies. Evaluation on eight real-world causal structures demonstrates the efficacy of our LLM-driven approach in improving data-based causal discovery, along with its robustness to inaccurate LLM-derived priors. Code is available at https://github.com/tyMadara/LLM-CD.
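The balance between goodness-of-fit and LLM-derived priors can be pictured as a penalised score. The sketch below is an assumption about the general shape of such a trade-off, not the paper's scoring function: each candidate graph has a data-fit score (higher is better), and a `weight` times the number of LLM-suggested edges it omits is subtracted. All names, the fit values, and the example edges are hypothetical.

```python
def prior_penalty(edges, prior_edges):
    """Count LLM-suggested edges that the candidate graph does not contain."""
    return len(set(prior_edges) - set(edges))


def combined_score(fit, edges, prior_edges, weight=1.0):
    """Data fit minus a weighted penalty for disagreeing with the prior."""
    return fit - weight * prior_penalty(edges, prior_edges)


def best_graph(candidates, prior_edges, weight):
    """Return the name of the candidate with the highest combined score."""
    return max(
        candidates,
        key=lambda name: combined_score(*candidates[name], prior_edges, weight),
    )


# Hypothetical candidates: name -> (fit score, set of directed edges).
candidates = {
    "g1": (-100.0, {("smoking", "cancer")}),
    "g2": (-101.0, {("smoking", "cancer"), ("cancer", "xray")}),
}
llm_prior = {("cancer", "xray")}

trusting = best_graph(candidates, llm_prior, weight=2.0)
skeptical = best_graph(candidates, llm_prior, weight=0.5)
print(trusting, skeptical)
```

With a high weight the prior-consistent graph `g2` wins despite a slightly worse fit. With a low weight the data-preferred `g1` wins, which is the sense in which a soft prior tolerates LLM errors.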

[286] Safe Reinforcement Learning in Black-Box Environments via Adaptive Shielding

Daniel Bethell, Simos Gerasimou, Radu Calinescu, Calum Imrie

Main category: cs.AI

TL;DR: ADVICE is a novel post-shielding technique that uses a contrastive autoencoder to distinguish safe from unsafe state-action features, reducing safety violations by approximately 50% during RL training while maintaining competitive rewards.

Details

Motivation: Safe exploration of RL agents in unknown black-box environments is critical for real-world deployment, especially when prior domain knowledge is unavailable, presenting significant safety risks during training.

Method: ADVICE uses a contrastive autoencoder to distinguish safe and unsafe features of state-action pairs during training, employing adaptive shielding to protect the agent from hazardous actions.

Result: ADVICE significantly reduces safety violations by approximately 50% during training while achieving competitive outcome rewards compared to state-of-the-art safe RL exploration techniques.

Conclusion: The proposed ADVICE technique effectively enables safer reinforcement learning exploration in unknown environments with minimal performance trade-offs, making it suitable for real-world deployment scenarios.

Abstract: Empowering safe exploration of reinforcement learning (RL) agents during training is a critical challenge towards their deployment in many real-world scenarios. When prior knowledge of the domain or task is unavailable, training RL agents in unknown, black-box environments presents an even greater safety risk. We introduce ADVICE (Adaptive Shielding with a Contrastive Autoencoder), a novel post-shielding technique that distinguishes safe and unsafe features of state-action pairs during training, and uses this knowledge to protect the RL agent from executing actions that yield likely hazardous outcomes. Our comprehensive experimental evaluation against state-of-the-art safe RL exploration techniques shows that ADVICE significantly reduces safety violations (approx. 50%) during training, with a competitive outcome reward compared to other techniques.
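Post-shielding means the agent proposes an action first and the shield may override it afterwards. The sketch below shows only that control flow, under stated assumptions: `unsafe_score` is a hand-written stand-in for whatever the learned contrastive autoencoder would output, and the 1-D corridor with a hazard cell, the threshold, and all names are hypothetical.

```python
def shield(state, proposed_action, actions, unsafe_score, threshold=0.5):
    """Pass the proposed action through if it looks safe, else override it.

    `unsafe_score(state, action)` returns a value in [0, 1]; higher means
    more likely hazardous. On override, the least-unsafe action is chosen.
    """
    if unsafe_score(state, proposed_action) <= threshold:
        return proposed_action
    return min(actions, key=lambda a: unsafe_score(state, a))


# Toy environment: positions on a line, with a hazard at position 3.
HAZARD = 3
ACTIONS = [-1, 0, 1]


def toy_unsafe_score(state, action):
    """Stand-in for a learned safety model: 1.0 if the move enters the hazard."""
    return 1.0 if state + action == HAZARD else 0.0


kept = shield(0, 1, ACTIONS, toy_unsafe_score)       # far from hazard
overridden = shield(2, 1, ACTIONS, toy_unsafe_score)  # would step into hazard
print(kept, overridden)
```

In the adaptive setting described in the abstract, the safety model is learned during training rather than fixed as it is here.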

[287] From Bits to Boardrooms: A Cutting-Edge Multi-Agent LLM Framework for Business Excellence

Zihao Wang, Junming Zhang

Main category: cs.AI

TL;DR: BusiAgent is a multi-agent LLM framework that integrates CTMDP modeling, entropy optimization, and Stackelberg games to improve enterprise decision-making by bridging operational analysis with strategic goals.

Details

Motivation: Current LLM approaches struggle to reconcile detailed operational analyses with high-level strategic objectives across diverse market environments, leading to fragmented workflows and reduced organizational collaboration.

Method: Multi-agent framework with extended CTMDP for dynamic agent modeling, generalized entropy measure for collaborative efficiency, multi-level Stackelberg game for hierarchical decisions, contextual Thompson sampling for prompt optimization, and comprehensive QA system.

Result: Extensive empirical evaluations show BusiAgent significantly outperforms established approaches in solution quality and user satisfaction, generating coherent client-focused solutions that integrate granular insights with high-level strategy.

Conclusion: BusiAgent represents a substantial advancement in AI-driven enterprise decision-making by combining cutting-edge AI technologies with business insights, enabling organizations to navigate complex business landscapes more effectively.

Abstract: Large Language Models (LLMs) have shown promising potential in business applications, particularly in enterprise decision support and strategic planning, yet current approaches often struggle to reconcile intricate operational analyses with overarching strategic goals across diverse market environments, leading to fragmented workflows and reduced collaboration across organizational levels. This paper introduces BusiAgent, a novel multi-agent framework leveraging LLMs for advanced decision-making in complex corporate environments. BusiAgent integrates three core innovations: an extended Continuous Time Markov Decision Process (CTMDP) for dynamic agent modeling, a generalized entropy measure to optimize collaborative efficiency, and a multi-level Stackelberg game to handle hierarchical decision processes. Additionally, contextual Thompson sampling is employed for prompt optimization, supported by a comprehensive quality assurance system to mitigate errors. Extensive empirical evaluations across diverse business scenarios validate BusiAgent’s efficacy, demonstrating its capacity to generate coherent, client-focused solutions that smoothly integrate granular insights with high-level strategy, significantly outperforming established approaches in both solution quality and user satisfaction. By fusing cutting-edge AI technologies with deep business insights, BusiAgent marks a substantial step forward in AI-driven enterprise decision-making, empowering organizations to navigate complex business landscapes more effectively.
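As background for the prompt-optimization component, the following is a minimal sketch of plain (non-contextual) Beta-Bernoulli Thompson sampling over a handful of prompt templates. It is a simplification: the paper uses a contextual variant whose details are not given in the abstract. The success rates, the binary feedback, and all function names are hypothetical.

```python
import random


def thompson_select(successes, failures, rng):
    """Sample a success rate per prompt from its Beta posterior; pick the best."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])


def run_bandit(true_rates, rounds, seed=0):
    """Simulate prompt selection with binary feedback; return pulls per prompt."""
    rng = random.Random(seed)
    n = len(true_rates)
    successes, failures, pulls = [0] * n, [0] * n, [0] * n
    for _ in range(rounds):
        arm = thompson_select(successes, failures, rng)
        pulls[arm] += 1
        # Simulated feedback: did this prompt yield an acceptable answer?
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


# Three hypothetical prompt templates with unknown-to-the-agent quality.
pulls = run_bandit([0.2, 0.5, 0.8], rounds=2000)
print(pulls)
```

Over time the sampler concentrates its pulls on the prompt with the highest success rate while still occasionally exploring the others.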

[288] Pessimistic Iterative Planning with RNNs for Robust POMDPs

Maris F. L. Galesloot, Marnix Suilen, Thiago D. Simão, Steven Carr, Matthijs T. J. Spaan, Ufuk Topcu, Nils Jansen

Main category: cs.AI

TL;DR: The paper proposes a pessimistic iterative planning framework and rFSCNet algorithm to compute robust memory-based policies for POMDPs with model uncertainty, outperforming existing methods.

Details

Motivation: Robust POMDPs need to handle both partial observability (requiring memory-based policies) and model uncertainty (requiring robustness against worst-case probability instances from uncertainty sets). Existing methods struggle to compute effective robust policies that address both challenges simultaneously.

Method: Proposes Pessimistic Iterative Planning (PIP) framework that alternates between: (1) selecting pessimistic POMDPs via worst-case probability instances from uncertainty sets, and (2) computing finite-state controllers (FSCs) for these pessimistic POMDPs. Introduces rFSCNet algorithm that uses recurrent neural networks to optimize FSCs.

Result: Empirical evaluation shows that rFSCNet computes better-performing robust policies than several baselines and a state-of-the-art robust POMDP solver.

Conclusion: The proposed PIP framework and rFSCNet algorithm effectively address the dual challenges of partial observability and model uncertainty in robust POMDPs, providing superior performance compared to existing approaches.

Abstract: Robust POMDPs extend classical POMDPs to incorporate model uncertainty using so-called uncertainty sets on the transition and observation functions, effectively defining ranges of probabilities. Policies for robust POMDPs must be (1) memory-based to account for partial observability and (2) robust against model uncertainty to account for the worst-case probability instances from the uncertainty sets. To compute such robust memory-based policies, we propose the pessimistic iterative planning (PIP) framework, which alternates between (1) selecting pessimistic POMDPs via worst-case probability instances from the uncertainty sets, and (2) computing finite-state controllers (FSCs) for these pessimistic POMDPs. Within PIP, we propose the rFSCNet algorithm, which optimizes a recurrent neural network to compute the FSCs. The empirical evaluation shows that rFSCNet can compute better-performing robust policies than several baselines and a state-of-the-art robust POMDP solver.
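The "worst-case probability instance" step can be illustrated for the simplest kind of uncertainty set: independent probability intervals on one successor distribution. The sketch below uses the standard greedy construction for interval models, which gives every successor its lower bound and then pushes the remaining mass toward the lowest-value successors. Whether PIP uses exactly this form is an assumption, and the bounds and values are made up.

```python
def worst_case_distribution(lower, upper, values):
    """Choose probabilities within [lower, upper] that minimise expected value.

    Assumes sum(lower) <= 1 <= sum(upper) so that a valid distribution exists.
    """
    probs = list(lower)
    budget = 1.0 - sum(lower)  # mass still to be assigned
    # Give spare mass to the worst (lowest-value) successors first.
    for i in sorted(range(len(values)), key=lambda i: values[i]):
        extra = min(upper[i] - lower[i], budget)
        probs[i] += extra
        budget -= extra
    return probs


def expected_value(probs, values):
    return sum(p * v for p, v in zip(probs, values))


# Hypothetical successor states with interval bounds and value estimates.
lower = [0.2, 0.1, 0.3]
upper = [0.6, 0.5, 0.7]
values = [10.0, 0.0, 5.0]

pessimistic = worst_case_distribution(lower, upper, values)
print(pessimistic, expected_value(pessimistic, values))
```

In the alternating scheme described in the abstract, a distribution selected this way would define the pessimistic POMDP for which the next finite-state controller is computed.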

[289] Can Large Language Models Act as Ensembler for Multi-GNNs?

Hanqi Duan, Yao Cheng, Jianxiang Yu, Yao Liu, Xiang Li

Main category: cs.AI

TL;DR: LensGNN combines multiple GNN models with LLMs to better integrate semantic text understanding and graph structural information, outperforming existing approaches.

Details

Motivation: GNNs lack semantic understanding of textual node attributes, and no single GNN model consistently outperforms others across diverse datasets. The paper explores whether LLMs can serve as ensemblers for multiple GNNs.

Method: The model first aligns multiple GNN representations into the same space, then uses LoRA fine-tuning to align GNN and LLM spaces, injecting graph tokens and textual information into LLMs to ensemble multiple GNNs.

Result: Experimental results show that LensGNN outperforms existing models by effectively combining semantic and structural information.

Conclusion: LensGNN provides a robust solution for text-attributed graph ensemble learning, advancing the integration of semantic understanding and graph structural information through LLM-based ensembling.

Abstract: Graph Neural Networks (GNNs) have emerged as powerful models for learning from graph-structured data. However, GNNs lack an inherent capability to understand the semantics of rich textual node attributes, limiting their effectiveness in applications. On the other hand, we empirically observe that no existing GNN model consistently outperforms the others across diverse datasets. In this paper, we study whether LLMs can act as an ensembler for multi-GNNs and propose the LensGNN model. The model first aligns multiple GNNs, mapping the representations of different GNNs into the same space. Then, through LoRA fine-tuning, it aligns the space between the GNN and the LLM, injecting graph tokens and textual information into LLMs. This allows LensGNN to ensemble multiple GNNs and take advantage of the strengths of LLMs, leading to a deeper understanding of both textual semantic information and graph structural information. The experimental results show that LensGNN outperforms existing models. This research advances text-attributed graph ensemble learning by providing a robust and superior solution for integrating semantic and structural information. We provide our code and data here: https://anonymous.4open.science/r/EnsemGNN-E267/.

[290] Consensus in Motion: A Case of Dynamic Rationality of Sequential Learning in Probability Aggregation

Polina Gordienko, Christoph Jansen, Thomas Augustin, Martin Rechenauer

Main category: cs.AI

TL;DR: A probability aggregation framework based on propositional probability logic that ensures dynamic rationality, i.e., collective beliefs update consistently with new information. It shows that consensus-compatible, independent aggregation rules on non-nested agendas must be linear and gives sufficient conditions for a fair learning process over a shared common ground.

Details

Motivation: To address limitations of conventional judgment aggregation that focuses on static rationality, by developing a model that maintains dynamic rationality through consistent belief updates with new information in collective decision-making.

Method: Propositional probability logic framework with consensus-compatible and independent aggregation rules on non-nested agendas, Bayesian conditioning for belief updates, and sequential decision-making with progressive information incorporation through multiple stages.

Result: Any consensus-compatible and independent aggregation rule on a non-nested agenda is necessarily linear. Fair learning processes are guaranteed when individuals agree on a common ground subset and new information is restricted to this foundation, ensuring consistent collective beliefs regardless of update-aggregation order.

Conclusion: The framework supports dynamic probability aggregation that stays rational as information is incorporated sequentially, illustrated with a running example in a political scenario concerning healthcare and immigration policies.

Abstract: We propose a framework for probability aggregation based on propositional probability logic. Unlike conventional judgment aggregation, which focuses on static rationality, our model addresses dynamic rationality by ensuring that collective beliefs update consistently with new information. We show that any consensus-compatible and independent aggregation rule on a non-nested agenda is necessarily linear. Furthermore, we provide sufficient conditions for a fair learning process, where individuals initially agree on a specified subset of propositions known as the common ground, and new information is restricted to this shared foundation. This guarantees that updating individual judgments via Bayesian conditioning, whether performed before or after aggregation, yields the same collective belief. A distinctive feature of our framework is its treatment of sequential decision-making, which allows new information to be incorporated progressively through multiple stages while maintaining the established common ground. We illustrate our findings with a running example in a political scenario concerning healthcare and immigration policies.
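The claim that conditioning before or after aggregation gives the same result can be checked numerically for linear pooling. The sketch below is a toy over four possible worlds, not the paper's propositional setting. Two agents who assign the same probability to the learned event E (a stand-in for the common ground) give order-independent results, while a third agent who disagrees about E breaks the equality. All distributions and weights are made-up numbers.

```python
def linear_pool(dists, weights):
    """Weighted average of probability distributions over the same worlds."""
    return [
        sum(w * d[k] for w, d in zip(weights, dists))
        for k in range(len(dists[0]))
    ]


def condition(dist, event):
    """Bayesian conditioning on `event`, a set of world indices with mass > 0."""
    mass = sum(dist[k] for k in event)
    return [dist[k] / mass if k in event else 0.0 for k in range(len(dist))]


event = {0, 1}            # the new information E
weights = [0.5, 0.5]
agent_a = [0.1, 0.5, 0.2, 0.2]   # P(E) = 0.6
agent_b = [0.4, 0.2, 0.3, 0.1]   # P(E) = 0.6, agrees with agent_a on E
agent_c = [0.1, 0.1, 0.4, 0.4]   # P(E) = 0.2, disagrees on E

# Agreement on E: both orders coincide.
pool_then_cond = condition(linear_pool([agent_a, agent_b], weights), event)
cond_then_pool = linear_pool(
    [condition(agent_a, event), condition(agent_b, event)], weights
)

# Disagreement on E: the two orders differ.
pool_then_cond_bad = condition(linear_pool([agent_a, agent_c], weights), event)
cond_then_pool_bad = linear_pool(
    [condition(agent_a, event), condition(agent_c, event)], weights
)
print(pool_then_cond, cond_then_pool)
print(pool_then_cond_bad, cond_then_pool_bad)
```

The reason is visible in the algebra: when every agent assigns the same probability p to E, the pooled probability of E is also p, so dividing by p commutes with taking the weighted average.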

Lei Wang, Heyang Gao, Xiaohe Bo, Xu Chen, Ji-Rong Wen

Main category: cs.AI

TL;DR: YuLan-OneSim is a novel LLM-based social simulator featuring code-free scenario construction, comprehensive default scenarios, evolvable simulation through automatic fine-tuning, large-scale capability for 100k agents, and an AI social researcher that automates the entire social science research loop.

Details

Motivation: To create an accessible social simulator that reduces programming barriers for researchers and enables large-scale, high-quality social behavior simulations using LLM agents.

Method: Developed a simulator with natural language interface for scenario construction, implemented 50 default scenarios across 8 domains, created evolvable simulation with automatic LLM fine-tuning, built distributed architecture for large-scale simulations, and integrated AI researcher for automated research workflows.

Result: The simulator can handle up to 100,000 agents, automatically generates simulation code from natural language descriptions, provides comprehensive domain coverage, and demonstrates reliable performance through experimental evaluation.

Conclusion: YuLan-OneSim represents a significant advancement in social simulation technology by making complex simulations accessible to non-programmers and automating the entire social science research process through AI integration.

Abstract: Leveraging large language model (LLM) based agents to simulate human social behaviors has recently gained significant attention. In this paper, we introduce a novel social simulator called YuLan-OneSim. Compared to previous works, YuLan-OneSim distinguishes itself in five key aspects: (1) Code-free scenario construction: Users can simply describe and refine their simulation scenarios through natural language interactions with our simulator. All simulation code is automatically generated, significantly reducing the need for programming expertise. (2) Comprehensive default scenarios: We implement 50 default simulation scenarios spanning 8 domains, including economics, sociology, politics, psychology, organization, demographics, law, and communication, broadening access for a diverse range of social researchers. (3) Evolvable simulation: Our simulator is capable of receiving external feedback and automatically fine-tuning the backbone LLMs, significantly enhancing the simulation quality. (4) Large-scale simulation: By developing a fully responsive agent framework and a distributed simulation architecture, our simulator can handle up to 100,000 agents, ensuring more stable and reliable simulation results. (5) AI social researcher: Leveraging the above features, we develop an AI social researcher. Users only need to propose a research topic, and the AI researcher will automatically analyze the input, construct simulation environments, summarize results, generate technical reports, and review and refine the reports, completing the social science research loop. To demonstrate the advantages of YuLan-OneSim, we conduct experiments to evaluate the quality of the automatically generated scenarios, the reliability, efficiency, and scalability of the simulation process, as well as the performance of the AI social researcher.

[292] Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models

Zesen Lyu, Dandan Zhang, Wei Ye, Fangdi Li, Zhihang Jiang, Yao Yang

Main category: cs.AI

TL;DR: The Jigsaw-Puzzles benchmark evaluates VLMs’ spatial reasoning with 1,100 complex real-world images and 5 tasks, showing that even the best model significantly underperforms humans (77% vs. over 90% accuracy).

Details

Motivation: To investigate whether current vision-language models exhibit human-like spatial reasoning capabilities and understand spatial structures and inter-object relationships.

Method: Created Jigsaw-Puzzles benchmark with 1,100 real-world images of high spatial complexity, designed 5 tasks to evaluate spatial perception, structural understanding, and reasoning while minimizing domain knowledge reliance.

Result: Even the best model (Gemini-2.5-Pro) achieved only 77.14% overall accuracy, with particularly poor performance on Order Generation task (30.00% accuracy), far below human performance exceeding 90%.

Conclusion: There’s a significant gap between VLMs and human spatial reasoning capabilities, positioning Jigsaw-Puzzles as a challenging diagnostic benchmark for advancing spatial reasoning research.

Abstract: Spatial reasoning is a core component of human cognition, enabling individuals to perceive, comprehend, and interact with the physical world. It relies on a nuanced understanding of spatial structures and inter-object relationships, serving as the foundation for complex reasoning and decision-making. To investigate whether current vision-language models (VLMs) exhibit similar capability, we introduce Jigsaw-Puzzles, a novel benchmark consisting of 1,100 carefully curated real-world images with high spatial complexity. Based on this dataset, we design five tasks to rigorously evaluate VLMs’ spatial perception, structural understanding, and reasoning capabilities, while deliberately minimizing reliance on domain-specific knowledge to better isolate and assess the general spatial reasoning capability. We conduct a comprehensive evaluation across 24 state-of-the-art VLMs. The results show that even the strongest model, Gemini-2.5-Pro, achieves only 77.14% overall accuracy and performs particularly poorly on the Order Generation task, with only 30.00% accuracy, far below the performance exceeding 90% achieved by human participants. This persistent gap underscores the need for continued progress, positioning Jigsaw-Puzzles as a challenging and diagnostic benchmark for advancing spatial reasoning research in VLMs. Our project page is at https://zesen01.github.io/jigsaw-puzzles.

Chan-Wei Hu, Yueqi Wang, Shuo Xing, Chia-Ju Chen, Suofei Feng, Ryan Rossi, Zhengzhong Tu

Main category: cs.AI

TL;DR: This paper presents the first systematic analysis of multimodal RAG pipelines for Large Vision-Language Models, examining retrieval strategies, re-ranking techniques, and generation integration to address LVLM limitations like static training data and hallucinations.

Details

Motivation: LVLMs have limitations including static training data, susceptibility to hallucinations, and inability to verify claims against current external evidence, which hinders their performance in dynamic real-world applications. RAG offers a solution by enabling access to large-scale knowledge databases.

Method: The study systematically dissects the multimodal RAG pipeline through three phases: (1) retrieval phase examining modality configurations and retrieval strategies, (2) re-ranking stage addressing positional biases and improving relevance, and (3) generation phase investigating integration of retrieved evidence. Also explores a unified agentic framework with self-reflection.

Result: The comprehensive exploration of RAG for LVLMs yields substantial insights and achieves an average performance boost of 5% without any fine-tuning.

Conclusion: Multimodal RAG pipelines effectively address LVLM limitations by grounding model outputs in factual, contextually relevant information through systematic retrieval, re-ranking, and generation integration strategies, with significant performance improvements.

Abstract: Large Vision-Language Models (LVLMs) have made remarkable strides in multimodal tasks such as visual question answering, visual grounding, and complex reasoning. However, they remain limited by static training data, susceptibility to hallucinations, and inability to verify claims against up-to-date, external evidence, compromising their performance in dynamic real-world applications. Retrieval-Augmented Generation (RAG) offers a practical solution to mitigate these challenges by allowing the LVLMs to access large-scale knowledge databases via retrieval mechanisms, thereby grounding model outputs in factual, contextually relevant information. In this paper, we conduct the first systematic dissection of the multimodal RAG pipeline for LVLMs, explicitly investigating (1) the retrieval phase: modality configurations and retrieval strategies, (2) the re-ranking stage: strategies to mitigate positional biases and improve the relevance of retrieved evidence, and (3) the generation phase: how best to integrate retrieved candidates into the final generation process. Finally, we explore a unified agentic framework that integrates re-ranking and generation through self-reflection, enabling LVLMs to select relevant evidence and suppress irrelevant context dynamically. Our full-stack exploration of RAG for LVLMs yields substantial insights, resulting in an average performance boost of 5% without any fine-tuning.

[294] Architecting Clinical Collaboration: Multi-Agent Reasoning Systems for Multimodal Medical VQA

Karishma Thakrar, Shreyas Basavatia, Akshay Daftardar

Main category: cs.AI

TL;DR: Clinical-inspired AI architectures that mimic medical reasoning processes outperform traditional fine-tuning approaches in dermatological telemedicine, achieving up to 70% accuracy with explainable, literature-grounded outputs.

Details

Motivation: Telemedicine lacks the rich context of in-person visits, forcing clinicians to diagnose based on limited images and descriptions without physical exams, second opinions, or reference materials.

Method: Tested seven vision-language models across six configurations: baseline and fine-tuned models, each evaluated alone and augmented with either reasoning layers (combining multiple model perspectives, analogous to peer consultation) or retrieval-augmented generation (incorporating medical literature at inference time).

Result: Fine-tuning degraded performance in 4/7 models (30% average decrease), baseline models collapsed on test data, while clinical-inspired architectures achieved up to 70% accuracy with maintained performance on unseen data.

Conclusion: Medical AI succeeds by reconstructing collaborative and evidence-based practices fundamental to clinical diagnosis, generating explainable outputs critical for clinical adoption.

Abstract: Dermatological care via telemedicine often lacks the rich context of in-person visits. Clinicians must make diagnoses based on a handful of images and brief descriptions, without the benefit of physical exams, second opinions, or reference materials. While many medical AI systems attempt to bridge these gaps with domain-specific fine-tuning, this work hypothesized that mimicking clinical reasoning processes could offer a more effective path forward. This study tested seven vision-language models on medical visual question answering across six configurations: baseline models, fine-tuned variants, and both augmented with either reasoning layers that combine multiple model perspectives, analogous to peer consultation, or retrieval-augmented generation that incorporates medical literature at inference time, serving a role similar to reference-checking. While fine-tuning degraded performance in four of seven models with an average 30% decrease, baseline models collapsed on test data. Clinical-inspired architectures, meanwhile, achieved up to 70% accuracy, maintaining performance on unseen data while generating explainable, literature-grounded outputs critical for clinical adoption. These findings demonstrate that medical AI succeeds by reconstructing the collaborative and evidence-based practices fundamental to clinical diagnosis.

[295] Feature-Guided Neighbor Selection for Non-Expert Evaluation of Model Predictions

Courtney Ford, Mark T. Keane

Main category: cs.AI

TL;DR: FGNS is a new XAI method that selects class-representative examples using feature importance, improving non-experts’ ability to identify model errors in image classification tasks compared to traditional k-NN explanations.

Details

Motivation: Current XAI methods often fail to provide clear, interpretable outputs for users without domain expertise, limiting their effectiveness in real-world applications.

Method: Feature-Guided Neighbor Selection (FGNS) - a post hoc method that enhances interpretability by selecting class-representative examples using both local and global feature importance.

Result: In a user study (N=98) on Kannada script classification, FGNS significantly improved non-experts’ error identification while maintaining agreement with correct predictions. Participants made faster and more accurate decisions than with traditional k-NN explanations. FGNS selects neighbors that better reflect class characteristics rather than just minimizing feature-space distance.

Conclusion: FGNS represents progress toward more human-aligned model assessment, though further work is needed to bridge the gap between explanation quality and perceived trust.

Abstract: Explainable AI (XAI) methods often struggle to generate clear, interpretable outputs for users without domain expertise. We introduce Feature-Guided Neighbor Selection (FGNS), a post hoc method that enhances interpretability by selecting class-representative examples using both local and global feature importance. In a user study (N = 98) evaluating Kannada script classifications, FGNS significantly improved non-experts’ ability to identify model errors while maintaining appropriate agreement with correct predictions. Participants made faster and more accurate decisions compared to those given traditional k-NN explanations. Quantitative analysis shows that FGNS selects neighbors that better reflect class characteristics rather than merely minimizing feature-space distance, leading to more consistent selection and tighter clustering around class prototypes. These results support FGNS as a step toward more human-aligned model assessment, although further work is needed to address the gap between explanation quality and perceived trust.

[296] Multi-Agent LLMs as Ethics Advocates for AI-Based Systems

Asma Yamani, Malak Baslyman, Moataz Ahmed

Main category: cs.AI

TL;DR: A multi-agent LLM framework with an ethics advocate agent that automatically generates ethics requirements drafts, capturing most manually elicited requirements and identifying additional ones, but requiring human validation.

Details

Motivation: Manual ethics requirement elicitation is challenging due to time/resource constraints and low priority, needing automated solutions to incorporate ethics into systems development.

Method: Proposes a multi-agent LLM framework with an ethics advocate agent that critiques system descriptions and generates ethics requirements drafts through automated analysis.

Result: The framework captures the majority of ethics requirements identified in 30-minute interviews and adds further relevant requirements, but shows reliability issues that call for human feedback.

Conclusion: The automated framework facilitates ethics integration in requirements engineering but needs human oversight, potentially leading to more ethically aligned products.

Abstract: Incorporating ethics into the requirement elicitation process is essential for creating ethically aligned systems. Although eliciting manual ethics requirements is effective, it requires diverse input from multiple stakeholders, which can be challenging due to time and resource constraints. Moreover, it is often given a low priority in the requirements elicitation process. This study proposes a framework for generating ethics requirements drafts by introducing an ethics advocate agent in a multi-agent LLM setting. This agent critiques and provides input on ethical issues based on the system description. The proposed framework is evaluated through two case studies from different contexts, demonstrating that it captures the majority of ethics requirements identified by researchers during 30-minute interviews and introduces several additional relevant requirements. However, it also highlights reliability issues in generating ethics requirements, emphasizing the need for human feedback in this sensitive domain. We believe this work can facilitate the broader adoption of ethics in the requirements engineering process, ultimately leading to more ethically aligned products.

[297] Profile-Aware Maneuvering: A Dynamic Multi-Agent System for Robust GAIA Problem Solving by AWorld

Zhitian Xie, Qintong Wu, Chengyue Yu, Chenyi Zhuang, Jinjie Gu

Main category: cs.AI

TL;DR: AWorld framework uses a multi-agent system with profile-aware supervision to improve LLM reliability by identifying and correcting specific failure patterns rather than just reacting to immediate errors.

Details

Motivation: Large language models' reliance on external tools introduces challenges with extended contexts and noisy outputs that undermine system reliability, requiring more robust supervision mechanisms.

Method: Dynamic Multi-Agent System with Execution Agent supervised by Guard Agent, enhanced by System Identification methodology to create performance fingerprints of agent weaknesses for targeted interventions.

Result: Significantly improved effectiveness and stability on GAIA dataset, outperforming single-agent systems and naive counterparts, achieving first place among open-source projects on GAIA leaderboard.

Conclusion: Building trustworthy intelligent systems requires deep empirical understanding of each agent’s unique capabilities and limitations, not just collaboration.

Abstract: The rapid advancement of large language models (LLMs) has empowered intelligent agents to leverage diverse external tools for solving complex real-world problems. However, this reliance introduces new challenges, as extended contexts and noisy tool outputs can undermine system reliability. To address this, we propose a dynamic Multi-Agent System (MAS) in our AWorld framework, where an Execution Agent is supervised by a Guard Agent that provides on-demand dynamic maneuvering, verifying and correcting the reasoning process to improve robustness over single-agent systems. To move beyond this generic supervision, we enhance the architecture with a methodology inspired by System Identification from control theory. This method first profiles the Execution Agent offline on a benchmark dataset to create a “performance fingerprint” of its unique weaknesses. The Guard Agent then leverages this fingerprint online to deliver profile-aware supervision, making targeted interventions based on known failure patterns rather than merely reacting to immediate logical flaws. Extensive experiments on the GAIA dataset demonstrate that this profile-aware MAS significantly improves both effectiveness and stability, outperforming not only single-agent systems but also its naive counterpart. This superior performance led our system to achieve first place among open-source projects on the prestigious GAIA leaderboard. These findings highlight that building truly trustworthy intelligent systems requires not just collaboration, but a deep, empirically-grounded understanding of each agent’s unique capabilities and limitations.

[298] Response and Prompt Evaluation to Prevent Parasocial Relationships with Chatbots

Emma Rath, Stuart Armstrong, Rebecca Gorman

Main category: cs.AI

TL;DR: A framework using state-of-the-art language models to detect parasocial relationship cues in AI conversations in real-time, showing promising results in early detection without false positives.

Details

Motivation: Parasocial relationships with AI agents can have severe negative impacts on human well-being, but preventing them is challenging as these cues emerge gradually in private conversations and not all emotional engagement is harmful.

Method: Repurposed a state-of-the-art language model to create a response evaluation framework that assesses ongoing conversations for parasocial cues. Tested with a synthetic dataset of 30 dialogues covering parasocial, sycophantic, and neutral conversations using iterative evaluation with five-stage testing.

Result: The framework successfully identified all parasocial conversations while avoiding false positives under a tolerant unanimity rule. Detection typically occurred within the first few exchanges of conversation.

Conclusion: Evaluation agents show promise as a viable solution for preventing parasocial relationships with AI, providing preliminary evidence that real-time detection of harmful parasocial cues is feasible.

Abstract: The development of parasocial relationships with AI agents has severe, and in some cases tragic, effects on human well-being. Yet preventing such dynamics is challenging: parasocial cues often emerge gradually in private conversations, and not all forms of emotional engagement are inherently harmful. We address this challenge by introducing a simple response evaluation framework, created by repurposing a state-of-the-art language model, that evaluates ongoing conversations for parasocial cues in real time. To test the feasibility of this approach, we constructed a small synthetic dataset of thirty dialogues spanning parasocial, sycophantic, and neutral conversations. Iterative evaluation with five-stage testing successfully identified all parasocial conversations while avoiding false positives under a tolerant unanimity rule, with detection typically occurring within the first few exchanges. These findings provide preliminary evidence that evaluation agents can provide a viable solution for the prevention of parasocial relations.

[299] Route-and-Execute: Auditable Model-Card Matching and Specialty-Level Deployment

Shayan Vassef, Soorya Ram Shimegekar, Abhay Goyal, Koustuv Saha, Pi Zonooz, Navin Kumar

Main category: cs.AI

TL;DR: A healthcare framework using a single vision-language model for both routing medical images to specialist models and performing multiple specialty-specific tasks, reducing fragmentation and improving efficiency in clinical workflows.

Details

Motivation: Clinical workflows are fragmented with multiple scripts and task-specific networks, lacking streamlined data science pipelines, data-driven model identification, and standardized output delivery, which reduces efficiency and increases operational costs.

Method: Uses a single vision-language model in two roles: 1) as an aware model-card matcher that routes images through a three-stage workflow (modality -> abnormality -> model-card ID) with stagewise prompts and answer selection, and 2) fine-tuned on specialty-specific datasets to handle multiple tasks within each specialty.

Result: The single-model deployment matches or approaches specialized baselines across gastroenterology, hematology, ophthalmology, and pathology specialties.

Conclusion: One VLM can both decide (route) and do (perform tasks), reducing data scientist effort, shortening monitoring, increasing transparency of model selection, and lowering integration overhead compared to multi-agent pipelines.

Abstract: Clinical workflows are fragmented as a patchwork of scripts and task-specific networks that often handle triage, task selection, and model deployment. These pipelines are rarely streamlined into a coherent data science pipeline, reducing efficiency and raising operational costs. Workflows also lack data-driven model identification (from imaging/tabular inputs) and standardized delivery of model outputs. In response, we present a practical, healthcare-first framework that uses a single vision-language model (VLM) in two complementary roles. First (Solution 1), the VLM acts as an aware model-card matcher that routes an incoming image to the appropriate specialist model via a three-stage workflow (modality -> primary abnormality -> model-card id). Checks are provided by (i) stagewise prompts that allow early exit via None/Normal/Other and (ii) a stagewise answer selector that arbitrates between the top-2 candidates at each stage, reducing the chance of an incorrect selection and aligning the workflow with clinical risk tolerance. Second (Solution 2), we fine-tune the VLM on specialty-specific datasets ensuring a single model covers multiple downstream tasks within each specialty, maintaining performance while simplifying deployment. Across gastroenterology, hematology, ophthalmology, and pathology, our single-model deployment matches or approaches specialized baselines. Compared with pipelines composed of many task-specific agents, this approach shows that one VLM can both decide and do. It may reduce effort by data scientists, shorten monitoring, increase the transparency of model selection (with per-stage justifications), and lower integration overhead.
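
The three-stage routing with early exit described in the abstract can be pictured with a small sketch. This is an illustration only, not the authors' implementation: the stage callables, the example labels, and the `card-017` identifier are hypothetical stand-ins for VLM prompts, and the top-2 answer selector is not modeled.

```python
# Labels that end routing early; taken from the abstract's "None/Normal/Other".
EARLY_EXIT = {"None", "Normal", "Other"}


def route(image, stages):
    """Run stage classifiers in order and stop at the first early-exit label.

    `stages` is a list of callables (image, path_so_far) -> label. In the
    paper each stage is a VLM prompt; here they are hypothetical stubs.
    Returns (model_card_id or None, list of per-stage labels).
    """
    path = []
    for stage in stages:
        label = stage(image, path)
        path.append(label)
        if label in EARLY_EXIT:
            return None, path  # no specialist model is selected
    return (path[-1] if path else None), path


# Hypothetical stand-ins for the modality, abnormality, and model-card stages.
stages = [
    lambda img, path: "endoscopy",
    lambda img, path: "polyp",
    lambda img, path: "card-017",
]
card, trace = route("image.png", stages)
```

Returning the per-stage labels alongside the final choice mirrors the abstract's point about auditable, per-stage justifications.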

[300] ST-Raptor: LLM-Powered Semi-Structured Table Question Answering

Zirui Tang, Boyu Niu, Xuanhe Zhou, Boxiu Li, Wei Zhou, Jiannan Wang, Guoliang Li, Xinyi Zhang, Fan Wu

Main category: cs.AI

TL;DR: ST-Raptor is a tree-based framework using LLMs for semi-structured table QA, outperforming nine baselines by up to 20% in answer accuracy through hierarchical tree modeling and two-stage verification.

Details

Motivation: Existing methods struggle with semi-structured tables (financial reports, medical records) due to information loss from conversion to structured formats and inability to handle complex layouts like hierarchical headers and merged cells.

Method: Proposes Hierarchical Orthogonal Tree (HO-Tree) to model table layouts, defines tree operations for LLMs, decomposes questions into sub-questions with operation pipelines, and uses two-stage verification (forward and backward validation).

Result: Outperforms nine baselines by up to 20% in answer accuracy on SSTQA dataset containing 764 questions over 102 real-world semi-structured tables.

Conclusion: ST-Raptor effectively handles complex semi-structured table layouts through tree-based modeling and verification, providing significant accuracy improvements over existing methods for table question answering.

Abstract: Semi-structured tables, widely used in real-world applications (e.g., financial reports, medical records, transactional orders), often involve flexible and complex layouts (e.g., hierarchical headers and merged cells). These tables generally rely on human analysts to interpret table layouts and answer relevant natural language questions, which is costly and inefficient. To automate the procedure, existing methods face significant challenges. First, methods like NL2SQL require converting semi-structured tables into structured ones, which often causes substantial information loss. Second, methods like NL2Code and multi-modal LLM QA struggle to understand the complex layouts of semi-structured tables and cannot accurately answer corresponding questions. To this end, we propose ST-Raptor, a tree-based framework for semi-structured table question answering using large language models. First, we introduce the Hierarchical Orthogonal Tree (HO-Tree), a structural model that captures complex semi-structured table layouts, along with an effective algorithm for constructing the tree. Second, we define a set of basic tree operations to guide LLMs in executing common QA tasks. Given a user question, ST-Raptor decomposes it into simpler sub-questions, generates corresponding tree operation pipelines, and conducts operation-table alignment for accurate pipeline execution. Third, we incorporate a two-stage verification mechanism: forward validation checks the correctness of execution steps, while backward validation evaluates answer reliability by reconstructing queries from predicted answers. To benchmark the performance, we present SSTQA, a dataset of 764 questions over 102 real-world semi-structured tables. Experiments show that ST-Raptor outperforms nine baselines by up to 20% in answer accuracy. The code is available at https://github.com/weAIDB/ST-Raptor.
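
To make the idea of tree operations over hierarchical headers concrete, here is a toy sketch. It is not the HO-Tree or the operation set from the paper: the nested-dict layout, the `lookup` and `paths_to` helpers, and the table values are all assumptions made for illustration. `paths_to` is only loosely analogous to backward validation, in that it recovers which header paths could have produced a given answer.

```python
# Toy table with two header levels (hypothetical values).
table = {
    "Revenue": {"Q1": 120, "Q2": 135},
    "Costs": {"Q1": 80},
}


def lookup(tree, path):
    """Follow a sequence of header labels; return the cell or None if absent."""
    node = tree
    for label in path:
        if not isinstance(node, dict) or label not in node:
            return None
        node = node[label]
    return node


def paths_to(tree, target, prefix=()):
    """List every header path whose cell equals `target`."""
    found = []
    for label, child in tree.items():
        if isinstance(child, dict):
            found += paths_to(child, target, prefix + (label,))
        elif child == target:
            found.append(prefix + (label,))
    return found


answer = lookup(table, ["Revenue", "Q2"])
supporting_paths = paths_to(table, answer)
```

If the recovered paths do not include the path the question asked about, the answer would be flagged as unreliable in this toy setting.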

cs.SD

[301] H-PRM: A Pluggable Hotword Pre-Retrieval Module for Various Speech Recognition Systems

Huangyu Dai, Lingtao Mao, Ben Chen, Zihan Wang, Zihan Liang, Ying Han, Chenyi Lei, Han Li

Main category: cs.SD

TL;DR: A novel hotword customization system using hotword pre-retrieval module (H-PRM) that improves ASR accuracy for domain-specific terms by measuring acoustic similarity between hotwords and speech segments.

Details

Motivation: Existing ASR models struggle with large-scale hotwords as recognition rates drop dramatically when the number of hotwords increases, limiting effective domain-specific customization.

Method: Introduces H-PRM module that identifies relevant hotword candidates through acoustic similarity measurement. This plug-and-play solution integrates with traditional models like SeACo-Paraformer and Audio LLMs via prompt-based approach.

Result: Extensive testing shows H-PRM outperforms existing methods, significantly enhancing hotwords post-recall rate (PRR) and enabling seamless hotword customization.

Conclusion: H-PRM provides a new direction for hotword customization in ASR, offering effective plug-and-play solution that works with both traditional models and Audio LLMs.

Abstract: Hotword customization is crucial in ASR to enhance the accuracy of domain-specific terms. It has been primarily driven by the advancements in traditional models and Audio large language models (LLMs). However, existing models often struggle with large-scale hotwords, because the recognition rate drops dramatically as the number of hotwords increases. In this paper, we introduce a novel hotword customization system that utilizes a hotword pre-retrieval module (H-PRM) to identify the most relevant hotword candidate by measuring the acoustic similarity between the hotwords and the speech segment. This plug-and-play solution can be easily integrated into traditional models such as SeACo-Paraformer, significantly enhancing the hotword post-recall rate (PRR). Additionally, we incorporate H-PRM into Audio LLMs through a prompt-based approach, enabling seamless customization of hotwords. Extensive testing validates that H-PRM can outperform existing methods, showing a new direction for hotword customization in ASR.
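
A pre-retrieval step of this kind can be sketched as a top-k similarity search. The summary only says "acoustic similarity", so the cosine measure, the fixed-length embedding vectors, the hotword names, and the `pre_retrieve` helper below are assumptions, not details from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pre_retrieve(segment_vec, hotword_vecs, k=2):
    """Return the k hotwords whose embeddings are closest to the speech segment."""
    ranked = sorted(
        hotword_vecs,
        key=lambda word: cosine(segment_vec, hotword_vecs[word]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical 2-D embeddings; real acoustic embeddings would be much larger.
hotwords = {
    "metformin": [1.0, 0.0],
    "ibuprofen": [0.0, 1.0],
    "metoprolol": [0.9, 0.1],
}
candidates = pre_retrieve([1.0, 0.0], hotwords, k=2)
```

Only the shortlisted candidates would then be passed to the recognizer or placed in the Audio LLM prompt, which keeps the hotword list small even when the full vocabulary is large.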

[302] SwiftF0: Fast and Accurate Monophonic Pitch Detection

Lars Nieradzik

Main category: cs.SD

TL;DR: SwiftF0 is a lightweight neural model that achieves state-of-the-art monophonic pitch estimation with high accuracy in noisy conditions, requiring only 95,842 parameters and running 42x faster than CREPE on CPU.

Details

Motivation: Accurate real-time pitch estimation on resource-constrained devices in noisy environments remains challenging, and existing datasets lack perfectly accurate ground truth pitch annotations.

Method: Developed SwiftF0 neural model trained on diverse speech, music, and synthetic datasets with extensive data augmentation. Created SpeechSynth synthetic speech dataset with exact ground-truth pitch curves using phoneme-level TTS model. Proposed unified evaluation metric combining six performance measures.

Result: Achieved 91.80% harmonic mean at 10 dB SNR, outperforming CREPE by over 12 percentage points. Only 2.3% degradation from clean audio. Requires 95,842 parameters and runs 42x faster than CREPE on CPU.

Conclusion: SwiftF0 provides efficient, real-time pitch estimation suitable for resource-constrained devices, with robust generalization across acoustic domains and superior performance in noisy conditions.

Abstract: Accurate and real-time monophonic pitch estimation in noisy conditions, particularly on resource-constrained devices, remains an open challenge in audio processing. We present \emph{SwiftF0}, a novel, lightweight neural model that sets a new state-of-the-art for monophonic pitch estimation. Through training on diverse speech, music, and synthetic datasets with extensive data augmentation, SwiftF0 achieves robust generalization across acoustic domains while maintaining computational efficiency. SwiftF0 achieves a 91.80% harmonic mean (HM) at 10 dB SNR, outperforming baselines like CREPE by over 12 percentage points and degrading by only 2.3 points from clean audio. SwiftF0 requires only 95,842 parameters and runs approximately 42x faster than CREPE on CPU, making it ideal for efficient, real-time deployment. To address the critical lack of perfectly accurate ground truth pitch in speech corpora (which typically rely on algorithmic estimators or laryngograph signals), we introduce \emph{SpeechSynth}. This synthetic speech dataset, generated by a phoneme-level TTS model, provides exact, on-demand ground-truth pitch curves, enabling more robust model training and evaluation. Furthermore, we propose a unified metric, combining six complementary performance measures for comprehensive and reliable pitch evaluation, and release an open-source pitch benchmark suite. A live demo of SwiftF0 is available at https://swift-f0.github.io/, the source code at https://github.com/lars76/swift-f0, and the benchmark framework at https://github.com/lars76/pitch-benchmark.
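
The abstract reports a harmonic mean (HM) and a unified metric built from six measures, but the summary does not name those measures. Assuming the unified score is a plain harmonic mean over component scores in (0, 1], a minimal sketch looks like this; the example scores are made up.

```python
def harmonic_mean(scores):
    """Harmonic mean of non-negative scores; any zero component gives 0.0."""
    if not scores:
        raise ValueError("need at least one score")
    if any(s <= 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# Six hypothetical component scores, not values from the paper.
components = [0.95, 0.92, 0.90, 0.93, 0.88, 0.91]
unified = harmonic_mean(components)
```

A harmonic mean is pulled toward the weakest component, so a model cannot score well by excelling on some measures while failing another.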

[303] Cross-Learning Fine-Tuning Strategy for Dysarthric Speech Recognition Via CDSD database

Qing Xiao, Yingshan Peng, PeiPei Zhang

Main category: cs.SD

TL;DR: Multi-speaker fine-tuning on dysarthric speech outperforms single-speaker approaches by learning broader pathological features, reducing overfitting and data dependence while improving accuracy.

Details

Motivation: Dysarthric speech recognition faces challenges due to severity variations and disparities from normal speech. Conventional single-speaker fine-tuning approaches are limited and may cause feature conflicts.

Method: Multi-speaker fine-tuning approach where ASR models pre-trained on normal speech are simultaneously fine-tuned on multiple dysarthric speakers instead of individual per-patient fine-tuning.

Result: Achieves up to 13.15% lower Word Error Rate (WER) compared to single-speaker fine-tuning, with improved generalization, reduced speaker-specific overfitting, and lower per-patient data requirements.

Conclusion: Counter-intuitively, multi-speaker fine-tuning on dysarthric speech enhances individual speech pattern recognition by enabling broader pathological feature learning and provides superior performance over conventional single-speaker approaches.

Abstract: Dysarthric speech recognition faces challenges from severity variations and disparities relative to normal speech. Conventional approaches individually fine-tune ASR models pre-trained on normal speech per patient to prevent feature conflicts. Counter-intuitively, experiments reveal that multi-speaker fine-tuning (simultaneously on multiple dysarthric speakers) improves recognition of individual speech patterns. This strategy enhances generalization via broader pathological feature learning, mitigates speaker-specific overfitting, reduces per-patient data dependence, and improves target-speaker accuracy - achieving up to 13.15% lower WER versus single-speaker fine-tuning.
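
For readers unfamiliar with the metric, word error rate is the word-level edit distance between reference and hypothesis divided by the reference length. The sketch below is a standard textbook computation, not code from the paper; note that the summary does not say whether the 13.15% figure is an absolute or a relative reduction.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and hyp[:j]; it starts as the cost of inserting j words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


example = wer("the cat sat", "the hat sat")
```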

[304] SegReConcat: A Data Augmentation Method for Voice Anonymization Attack

Ridwan Arefeen, Xiaoxiao Miao, Rong Tong, Aik Beng Ng, Simon See

Main category: cs.SD

TL;DR: SegReConcat is a data augmentation method that segments and rearranges anonymized speech at word level to disrupt contextual cues, improving speaker de-anonymization in voice privacy systems.

Details

Motivation: Anonymized voice often retains residual speaker cues that pose privacy risks, requiring methods to enhance attacker capabilities for better privacy assessment.

Method: SegReConcat segments anonymized speech at word level, rearranges segments using random or similarity-based strategies to disrupt long-term contextual cues, and concatenates them with original utterances to provide multiple perspectives for learning speaker traits.

Result: Evaluated in VoicePrivacy Attacker Challenge 2024 across seven anonymization systems, SegReConcat improved de-anonymization performance on five out of seven systems.

Conclusion: SegReConcat effectively enhances attacker-side capabilities for speaker verification by disrupting residual contextual cues in anonymized speech, demonstrating improved de-anonymization across multiple systems.

Abstract: Anonymization of voice seeks to conceal the identity of the speaker while maintaining the utility of speech data. However, residual speaker cues often persist and pose privacy risks. We propose SegReConcat, a data augmentation method for attacker-side enhancement of automatic speaker verification systems. SegReConcat segments anonymized speech at the word level, rearranges segments using random or similarity-based strategies to disrupt long-term contextual cues, and concatenates them with the original utterance, allowing an attacker to learn source speaker traits from multiple perspectives. Evaluated in the VoicePrivacy Attacker Challenge 2024 framework across seven anonymization systems, SegReConcat improves de-anonymization on five of the seven.
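
The segment, rearrange, and concatenate steps can be sketched on a plain list of samples. This covers only the random strategy; the word boundaries are assumed to be given (for example from a forced aligner), placing the original utterance first is an assumption, and `seg_re_concat` is a hypothetical name rather than the authors' code.

```python
import random


def seg_re_concat(samples, boundaries, seed=0):
    """Append a randomly reordered copy of the word segments to the utterance.

    `samples` is a sequence of audio samples; `boundaries` is a list of
    (start, end) index pairs, one per word.
    """
    segments = [samples[start:end] for start, end in boundaries]
    order = list(range(len(segments)))
    random.Random(seed).shuffle(order)  # seeded for reproducibility
    rearranged = [x for i in order for x in segments[i]]
    return list(samples) + rearranged


utterance = list(range(10))  # stand-in for audio samples
word_bounds = [(0, 3), (3, 7), (7, 10)]  # hypothetical word boundaries
augmented = seg_re_concat(utterance, word_bounds, seed=1)
```

The augmented signal keeps every word-level segment intact but breaks their long-term ordering in the second half.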

cs.LG

[305] Reasoning Steps as Curriculum: Using Depth of Thought as a Difficulty Signal for Tuning LLMs

Jeesu Jung, Sangkeun Jung

Main category: cs.LG

TL;DR: Proposes using depth of thought (DoT) - counting reasoning steps in teacher model traces - as a scalable difficulty signal for curriculum learning in LLMs, with three testable hypotheses about its effectiveness.

Details

Motivation: Need for difficulty signals that align with reasoning capabilities while remaining scalable and interpretable for curriculum learning in large language models.

Method: Define difficulty as depth of thought (DoT) by counting discrete steps in teacher model’s reasoning traces (e.g., Chain-of-Thought), then train with shallow to deep curriculum ordered by this DoT.

Result: Proposes three testable hypotheses: (i) DoT correlates with conventional difficulty, (ii) DoT-ordered curricula outperform other methods, (iii) difficulty is robust across teacher models with light formatting controls.

Conclusion: Aims to move toward cognitively grounded, interpretable curricula for reasoning-centric training with practical evaluation framework and mitigation strategies for validity threats.

Abstract: Curriculum learning for training LLMs requires a difficulty signal that aligns with reasoning while remaining scalable and interpretable. We propose a simple premise: tasks that demand deeper depth of thought for humans should also be harder for models. Accordingly, we define difficulty as depth of thought (DoT) and operationalize it by counting the discrete steps in a teacher model’s reasoning trace (e.g., Chain-of-Thought). We then train with a shallow to deep curriculum ordered by this DoT and outline how to derive, validate, and schedule it at scale. Our position yields three testable hypotheses: (i) DoT correlates with conventional difficulty on reasoning benchmarks, (ii) DoT-ordered curricula outperform length- or judge-scored curricula under matched budgets, and (iii) the difficulty is robust across teacher models given light formatting controls. We propose an evaluation framework and discuss threats to validity (teacher style, length confounds) alongside practical mitigations. Taken together, we aim to move toward cognitively grounded, interpretable curricula for reasoning-centric training.
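
A shallow-to-deep ordering by depth of thought can be sketched in a few lines. How steps are delimited in a reasoning trace is not specified in the summary, so counting one step per non-empty line is an assumption, as are the field names and example traces.

```python
def depth_of_thought(trace):
    """Count reasoning steps, assuming one step per non-empty line."""
    return sum(1 for line in trace.splitlines() if line.strip())


def order_curriculum(examples):
    """Sort training examples from shallow to deep reasoning traces."""
    return sorted(examples, key=lambda ex: depth_of_thought(ex["trace"]))


# Hypothetical teacher traces.
examples = [
    {"id": "a", "trace": "Add 2 and 3.\nMultiply by 4.\nSubtract 1."},
    {"id": "b", "trace": "Read the answer from the table."},
    {"id": "c", "trace": "Convert units.\n\nCompare the two values."},
]
curriculum = [ex["id"] for ex in order_curriculum(examples)]
```

Because the count depends on how the teacher formats its trace, the formatting controls mentioned in hypothesis (iii) matter: two teachers that split steps differently would yield different orderings.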

Rahmat K. Adesunkanmi, Alexander W. Brandt, Masoud Deylami, Gustavo A. Giraldo Echeverri, Hamidreza Karbasian, Adel Alaeddini

Main category: cs.LG

TL;DR: Multi-modal ML framework combining physics simulations, environmental data, and language embeddings for accurate maritime object drift prediction across multiple time horizons.

Details

Motivation: Accurate prediction of leeway object drift is critical for search and rescue operations, but remains challenging due to complex maritime dynamics and time-sensitive requirements.

Method: Integrated Sentence Transformer embeddings with attention-based seq2seq architectures. Collected experimental environmental data, used Navier-Stokes simulations to estimate drag/lift coefficients via CNN, and combined physical forces with textual descriptions for multi-modal input to LSTM and Transformer models.

Result: Multi-modal models performed comparably to traditional methods while enabling longer-term forecasting (1-10 seconds) instead of single-step prediction, with good generalization across different objects.

Conclusion: Multi-modal modeling strategy provides accurate and adaptable predictions of leeway object drift in dynamic maritime conditions, demonstrating the value of integrating physical simulations with language embeddings for complex environmental forecasting.

Abstract: Accurately predicting the drift (displacement) of leeway objects in maritime environments remains a critical challenge, particularly in time-sensitive scenarios such as search and rescue operations. In this study, we propose a multi-modal machine learning framework that integrates Sentence Transformer embeddings with attention-based sequence-to-sequence architectures to predict the drift of leeway objects in water. We begin by experimentally collecting environmental and physical data, including water current and wind velocities, object mass, and surface area, for five distinct leeway objects. Using simulated data from a Navier-Stokes-based model to train a convolutional neural network on geometrical image representations, we estimate drag and lift coefficients of the leeway objects. These coefficients are then used to derive the net forces responsible for driving the objects’ motion. The resulting time series, comprising physical forces, environmental velocities, and object-specific features, combined with textual descriptions encoded via a language model, are inputs to attention-based sequence-to-sequence long-short-term memory and Transformer models, to predict future drift trajectories. We evaluate the framework across multiple time horizons ($1$, $3$, $5$, and $10$ seconds) and assess its generalization across different objects. We compare our approach against a fitted physics-based model and traditional machine learning methods, including recurrent neural networks and temporal convolutional neural networks. Our results show that these multi-modal models perform comparably to traditional models while also enabling longer-term forecasting in place of single-step prediction. Overall, our findings demonstrate the ability of a multi-modal modeling strategy to provide accurate and adaptable predictions of leeway object drift in dynamic maritime conditions.

[307] Data-driven models for production forecasting and decision supporting in petroleum reservoirs

Mateus A. Fernandes, Michael M. Furlanetti, Eduardo Gildin, Marcio A. Sampaio

Main category: cs.LG

TL;DR: Machine learning approach for petroleum production forecasting using production/injection data without complex geological models, addressing concept drift and tested on synthetic and real Brazilian pre-salt data.

Details

Motivation: To develop reliable production forecasting that anticipates changes in rock-fluid systems without depending on complex geological models, fluid properties, or well completion details.

Method: Data-driven approach using supervised learning methods (regressions and Neural Networks) with relevance analysis of production/injection variables, handling concept drift through observation windows and retraining periodicity.

Result: Methodology evaluated on synthetic data from UNISIM III compositional simulation model and applied to real Brazilian pre-salt cases, aiming for reliable reservoir dynamics prediction.

Conclusion: Expected to design a rapid-response predictor capable of handling practical difficulties, supporting reservoir management, optimizing production/injection, and maximizing oil recovery through probabilistic event analysis.

Abstract: Forecasting production reliably and anticipating changes in the behavior of rock-fluid systems are the main challenges in petroleum reservoir engineering. This project proposes to deal with this problem through a data-driven approach and using machine learning methods. The objective is to develop a methodology to forecast production parameters based on simple data such as produced and injected volumes and, eventually, gauges located in wells, without depending on information from geological models, fluid properties or details of well completions and flow systems. Initially, we performed relevance analyses of the production and injection variables, as well as conditioning the data to suit the problem. As reservoir conditions change over time, concept drift is a priority concern and requires special attention to the observation windows and the periodicity of retraining, which are also objects of study. For the production forecasts, we study supervised learning methods, such as those based on regressions and Neural Networks, to define the most suitable for our application in terms of performance and complexity. In a first step, we evaluate the methodology using synthetic data generated from the UNISIM III compositional simulation model. Next, we applied it to cases of real plays in the Brazilian pre-salt. The expected result is the design of a reliable predictor for reproducing reservoir dynamics, with rapid response, capability of dealing with practical difficulties such as restrictions in wells and processing units, and that can be used in actions to support reservoir management, including the anticipation of deleterious behaviors, optimization of production and injection parameters and the analysis of the effects of probabilistic events, aiming to maximize oil recovery.
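
The interplay of observation window and retraining periodicity can be sketched with a rolling refit. This is a minimal illustration, not the authors' pipeline: the linear model, the names `fit_line` and `rolling_forecast`, the parameters `window` and `retrain_every`, and the synthetic series are all assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rolling_forecast(times, rates, window, retrain_every):
    """One-step-ahead forecasts, refitting on the latest `window` points
    every `retrain_every` steps (both are hypothetical tuning knobs)."""
    forecasts = []
    model = None
    for t in range(window, len(times)):
        if model is None or (t - window) % retrain_every == 0:
            model = fit_line(times[t - window:t], rates[t - window:t])
        slope, intercept = model
        forecasts.append(slope * times[t] + intercept)
    return forecasts


# Synthetic series whose decline rate changes at t = 10 (a toy concept drift).
times = list(range(20))
rates = [100.0 - 1.0 * t if t < 10 else 90.0 - 3.0 * (t - 10) for t in times]
forecasts = rolling_forecast(times, rates, window=4, retrain_every=1)
```

A short window with frequent retraining adapts quickly after the change point; a longer window or sparser retraining would smooth noise at the cost of slower adaptation.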

[308] A Fast and Minimal System to Identify Depression Using Smartphones: Explainable Machine Learning-Based Approach

Md Sabbir Ahmed, Nova Ahmed

Main category: cs.LG

TL;DR: A fast depression detection system using 7 days of app usage data collected in 1 second, with machine learning models correctly identifying 82.4% of depressed students.

Details

Motivation: Existing depression detection systems require long-term data collection, which is not suitable for early detection scenarios where quick identification is crucial.

Method: Developed a tool to retrieve 7 days’ app usage data in 1 second, collected data from 100 Bangladeshi students, and built diverse ML models with feature selection using stable, Boruta, and other FS approaches.

Result: Light gradient boosting machine correctly identified 82.4% of depressed students (precision=75%, F1-score=78.5%). Parsimonious stacking model showed maximum precision of 77.4% with balanced accuracy of 77.9%. SHAP analysis revealed behavioral markers related to depression.

Conclusion: The fast and minimalistic system can contribute to depression identification in underdeveloped regions, and the findings can facilitate development of less resource-intensive systems for understanding depressed students.

Abstract: Background: Existing robust, pervasive device-based systems developed in recent years to detect depression require data collected over a long period and may not be effective in cases where early detection is crucial. Objective: Our main objective was to develop a minimalistic system to identify depression using data retrieved in the fastest possible time. Methods: We developed a fast tool that retrieves the past 7 days’ app usage data in 1 second (mean 0.31, SD 1.10 seconds). A total of 100 students from Bangladesh participated in our study, and our tool collected their app usage data. To identify depressed and nondepressed students, we developed a diverse set of ML models. We selected important features using the stable approach, along with 3 main types of feature selection (FS) approaches. Results: Leveraging only the app usage data retrieved in 1 second, our light gradient boosting machine model used the important features selected by the stable FS approach and correctly identified 82.4% (n=42) of depressed students (precision=75%, F1-score=78.5%). Moreover, after comprehensive exploration, we presented a parsimonious stacking model where around 5 features selected by the all-relevant FS approach Boruta were used in each iteration of validation and showed a maximum precision of 77.4% (balanced accuracy=77.9%). A SHAP analysis of our best models presented behavioral markers that were related to depression. Conclusions: Due to our system’s fast and minimalistic nature, it may make a worthwhile contribution to identifying depression in underdeveloped and developing regions. In addition, our detailed discussion about the implication of our findings can facilitate the development of less resource-intensive systems to better understand students who are depressed.
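
The reported figures can be cross-checked: the 82.4% is the share of depressed students correctly identified (recall), and combining it with the 75% precision should reproduce the stated F1-score. The sketch below uses only the numbers from the abstract; the function name `f1_score` is ours.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


reported_precision = 0.75
reported_recall = 0.824  # share of depressed students correctly identified
f1 = f1_score(reported_precision, reported_recall)
```

The result rounds to 78.5%, matching the F1-score in the abstract.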

[309] The Sound of Risk: A Multimodal Physics-Informed Acoustic Model for Forecasting Market Volatility and Enhancing Market Interpretability

Xiaoliang Chen, Xin Yu, Le Chang, Teng Jing, Jiashuai He, Ze Wang, Yangjun Luo, Xingyu Chen, Jiayue Liang, Yuchen Wang, Jiaying Xie

Main category: cs.LG

TL;DR: A multimodal framework combining textual sentiment and acoustic analysis of executive voices in earnings calls predicts stock volatility but not returns, with acoustic features from voice dynamics providing significant predictive power.

Details

Motivation: Information asymmetry in financial markets is exacerbated by strategically crafted corporate narratives, making conventional textual analysis insufficient for accurate risk assessment.

Method: Physics-Informed Acoustic Model (PIAM) applies nonlinear acoustics to extract emotional signatures from executive vocal tract dynamics in earnings calls, projecting both acoustic and textual features onto a 3D Affective State Label space (Tension, Stability, Arousal).

Result: Multimodal features explain up to 43.8% of out-of-sample variance in 30-day realized volatility, driven by emotional dynamics during executive transitions from scripted to spontaneous speech. CFOs show reduced textual stability and heightened acoustic instability, while CEOs exhibit significant arousal variability.

Conclusion: The multimodal approach substantially outperforms financials-only baselines, providing investors and regulators with a powerful tool for decoding latent uncertainty markers from biometric signals to enhance market interpretability.

Abstract: Information asymmetry in financial markets, often amplified by strategically crafted corporate narratives, undermines the effectiveness of conventional textual analysis. We propose a novel multimodal framework for financial risk assessment that integrates textual sentiment with paralinguistic cues derived from executive vocal tract dynamics in earnings calls. Central to this framework is the Physics-Informed Acoustic Model (PIAM), which applies nonlinear acoustics to robustly extract emotional signatures from raw teleconference sound subject to distortions such as signal clipping. Both acoustic and textual emotional states are projected onto an interpretable three-dimensional Affective State Label (ASL) space: Tension, Stability, and Arousal. Using a dataset of 1,795 earnings calls (approximately 1,800 hours), we construct features capturing dynamic shifts in executive affect between scripted presentation and spontaneous Q&A exchanges. Our key finding reveals a pronounced divergence in predictive capacity: while multimodal features do not forecast directional stock returns, they explain up to 43.8% of the out-of-sample variance in 30-day realized volatility. Importantly, volatility predictions are strongly driven by emotional dynamics during executive transitions from scripted to spontaneous speech, particularly reduced textual stability and heightened acoustic instability from CFOs, and significant arousal variability from CEOs. An ablation study confirms that our multimodal approach substantially outperforms a financials-only baseline, underscoring the complementary contributions of acoustic and textual modalities. By decoding latent markers of uncertainty from verifiable biometric signals, our methodology provides investors and regulators with a powerful tool for enhancing market interpretability and identifying hidden corporate uncertainty.

Jueqi Wang, Zachary Jacokes, John Darrell Van Horn, Michael C. Schatz, Kevin A. Pelphrey, Archana Venkataraman

Main category: cs.LG

TL;DR: NeuroPathX is an explainable deep learning framework that uses cross-attention mechanisms to analyze brain-gene interactions in neurological disorders, outperforming traditional methods while providing biological interpretability.

Details

Motivation: Traditional imaging-genetics methods are limited to simplistic linear models or black-box techniques that lack interpretability, hindering understanding of complex brain-gene interactions in neurological disorders.

Method: Uses early fusion strategy with cross-attention mechanisms to capture interactions between brain structural variations (MRI) and biological pathways (genetics). Introduces two loss functions: sparsity loss for salient interactions and pathway similarity loss for consistent representations across cohorts.

Result: Outperforms competing baseline approaches on autism spectrum disorder and Alzheimer’s disease. Reveals biologically plausible associations linked to disorders.

Conclusion: NeuroPathX demonstrates potential to advance understanding of complex brain disorders through explainable deep learning that provides interpretable biological insights.

Abstract: While imaging-genetics holds great promise for unraveling the complex interplay between brain structure and genetic variation in neurological disorders, traditional methods are limited to simplistic linear models or to black-box techniques that lack interpretability. In this paper, we present NeuroPathX, an explainable deep learning framework that uses an early fusion strategy powered by cross-attention mechanisms to capture meaningful interactions between structural variations in the brain derived from MRI and established biological pathways derived from genetics data. To enhance interpretability and robustness, we introduce two loss functions over the attention matrix - a sparsity loss that focuses on the most salient interactions and a pathway similarity loss that enforces consistent representations across the cohort. We validate NeuroPathX on both autism spectrum disorder and Alzheimer’s disease. Our results demonstrate that NeuroPathX outperforms competing baseline approaches and reveals biologically plausible associations linked to the disorder. These findings underscore the potential of NeuroPathX to advance our understanding of complex brain disorders. Code is available at https://github.com/jueqiw/NeuroPathX .

[311] SALMAN: Stability Analysis of Language Models Through the Maps Between Graph-based Manifolds

Wuxinlin Cheng, Yupeng Cao, Jinwen Wu, Koduvayur Subbalakshmi, Tian Han, Zhuo Feng

Main category: cs.LG

TL;DR: SALMAN is a unified robustness framework that evaluates transformer model stability using a Distance Mapping Distortion (DMD) measure, without parameter modification or complex perturbation heuristics.

Details

Motivation: As transformer models grow larger and more widely deployed, their robustness under input perturbations becomes critical, but existing methods are labor-intensive and diverge between small and large models.

Method: Proposes a novel Distance Mapping Distortion (DMD) measure that compares input-to-output distance mappings with near-linear complexity to rank sample susceptibility.

Result: Demonstrates significant gains in attack efficiency and robust training, providing a practical model-agnostic solution.

Conclusion: SALMAN framework serves as a practical, model-agnostic tool for advancing the reliability of transformer-based NLP systems.

Abstract: Recent strides in pretrained transformer-based language models have propelled state-of-the-art performance in numerous NLP tasks. Yet, as these models grow in size and deployment, their robustness under input perturbations becomes an increasingly urgent question. Existing robustness methods often diverge between small-parameter and large-scale models (LLMs), and they typically rely on labor-intensive, sample-specific adversarial designs. In this paper, we propose a unified, local (sample-level) robustness framework (SALMAN) that evaluates model stability without modifying internal parameters or resorting to complex perturbation heuristics. Central to our approach is a novel Distance Mapping Distortion (DMD) measure, which ranks each sample’s susceptibility by comparing input-to-output distance mappings in a near-linear complexity manner. By demonstrating significant gains in attack efficiency and robust training, we position our framework as a practical, model-agnostic tool for advancing the reliability of transformer-based NLP systems.
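
The idea of comparing input-to-output distance mappings can be illustrated with a simple ratio of output distance to input distance. This is only a rough sketch: the abstract does not give the DMD formula, so the Euclidean distance, the all-pairs comparison, the max over neighbors, and the name `distortion_scores` are all assumptions. The all-pairs loop is quadratic, unlike the near-linear method the abstract describes.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distortion_scores(inputs, outputs):
    """For each sample, the largest output/input distance ratio to any other sample.

    Assumption: a plain distance ratio stands in for the paper's DMD measure.
    """
    scores = []
    for i in range(len(inputs)):
        worst = 0.0
        for j in range(len(inputs)):
            if i == j:
                continue
            d_in = euclidean(inputs[i], inputs[j])
            if d_in == 0.0:
                continue  # identical inputs carry no distortion information
            worst = max(worst, euclidean(outputs[i], outputs[j]) / d_in)
        scores.append(worst)
    return scores


# Hypothetical 1-D embeddings: the last two inputs are close, their outputs are not.
inputs = [[0.0], [1.0], [1.1]]
outputs = [[0.0], [1.0], [3.0]]
scores = distortion_scores(inputs, outputs)
```

Samples with a high score sit where a small input change produces a large output change, which is the kind of susceptibility ranking the abstract describes.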

[312] Composition and Alignment of Diffusion Models using Constrained Learning

Shervin Khalafi, Ignacio Hounie, Dongsheng Ding, Alejandro Ribeiro

Main category: cs.LG

TL;DR: A constrained optimization framework that unifies alignment and composition of diffusion models to ensure generated samples satisfy multiple reward constraints while remaining close to pre-trained models.

Details

Motivation: Existing methods for improving diffusion model outputs through alignment and composition cannot guarantee that generated samples faithfully possess all desired properties, especially when dealing with competing rewards or multiple models.

Method: Proposes a constrained optimization framework with theoretical characterization of solutions and develops a Lagrangian-based primal-dual training algorithm to enforce reward constraints and proximity to pre-trained models.

Result: Empirical demonstrations in image generation show the approach effectively satisfies constraints and outperforms equally-weighted methods for both alignment and composition tasks.

Conclusion: The proposed framework successfully addresses the limitations of existing methods by providing a principled approach to ensure diffusion models generate samples that meet multiple desired properties simultaneously.

Abstract: Diffusion models have become prevalent in generative modeling due to their ability to sample from complex distributions. To improve the quality of generated samples and their compliance with user requirements, two commonly used methods are: (i) Alignment, which involves fine-tuning a diffusion model to align it with a reward; and (ii) Composition, which combines several pre-trained diffusion models, each emphasizing a desirable attribute in the generated outputs. However, trade-offs often arise when optimizing for multiple rewards or combining multiple models, as they can often represent competing properties. Existing methods cannot guarantee that the resulting model faithfully generates samples with all the desired properties. To address this gap, we propose a constrained optimization framework that unifies alignment and composition of diffusion models by enforcing that the aligned model satisfies reward constraints and/or remains close to (potentially multiple) pre-trained models. We provide a theoretical characterization of the solutions to the constrained alignment and composition problems and develop a Lagrangian-based primal-dual training algorithm to approximate these solutions. Empirically, we demonstrate the effectiveness and merits of our proposed approach in image generation, applying it to alignment and composition, and show that our aligned or composed model satisfies constraints effectively, and improves on the equally-weighted approach. Our implementation can be found at https://github.com/shervinkhalafi/constrained_comp_align.
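
The primal-dual idea can be seen on a one-dimensional toy problem: stay close to a reference value while meeting a reward constraint. This is a sketch under stated assumptions, not the authors' training algorithm: the squared distance to 0 stands in for closeness to a pre-trained model, the reward is taken to be `x` itself with a required level of 1, and the name `primal_dual` and the step size are hypothetical.

```python
def primal_dual(steps=2000, step_size=0.05):
    """Minimise x**2 subject to x >= 1 via gradient descent-ascent
    on the Lagrangian L(x, lam) = x**2 + lam * (1 - x)."""
    x = 0.0    # primal variable, initialised at the reference value
    lam = 0.0  # dual variable (Lagrange multiplier), kept non-negative
    for _ in range(steps):
        grad_x = 2.0 * x - lam      # dL/dx
        violation = 1.0 - x         # positive while the constraint is violated
        x = x - step_size * grad_x  # primal descent step
        lam = max(0.0, lam + step_size * violation)  # projected dual ascent step
    return x, lam


x_opt, lam_opt = primal_dual()
```

The multiplier grows while the constraint is violated and settles once it is met, so the trade-off weight is learned rather than fixed in advance as in an equally-weighted combination.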

[313] Learning Spatio-Temporal Dynamics via Operator-Valued RKHS and Kernel Koopman Methods

Mahishanka Withanachchi

Main category: cs.LG

TL;DR: A unified framework combining operator-valued RKHS with kernel Koopman methods for learning spatio-temporal dynamics of vector-valued functions, enabling nonparametric data-driven estimation with theoretical guarantees.

Details

Motivation: To develop a theoretically grounded framework for learning complex time-evolving vector fields while preserving both spatial and temporal structure, supporting forecasting, control, and uncertainty quantification in spatio-temporal machine learning.

Method: Combines operator-valued reproducing kernel Hilbert spaces (OV-RKHS) with kernel-based Koopman operator methods for nonparametric and data-driven estimation of spatio-temporal dynamics.

Result: Establishes representer theorems for time-dependent OV-RKHS interpolation, derives Sobolev-type approximation bounds for smooth vector fields, and provides spectral convergence guarantees for kernel Koopman operator approximations.

Conclusion: The framework enables efficient reduced order modeling and long-term prediction of high-dimensional nonlinear systems, offering theoretically sound tools for spatio-temporal machine learning applications.

Abstract: We introduce a unified framework for learning the spatio-temporal dynamics of vector-valued functions by combining operator-valued reproducing kernel Hilbert spaces (OV-RKHS) with kernel-based Koopman operator methods. The approach enables nonparametric and data-driven estimation of complex time-evolving vector fields while preserving both spatial and temporal structure. We establish representer theorems for time-dependent OV-RKHS interpolation, derive Sobolev-type approximation bounds for smooth vector fields, and provide spectral convergence guarantees for kernel Koopman operator approximations. This framework supports efficient reduced order modeling and long-term prediction of high-dimensional nonlinear systems, offering theoretically grounded tools for forecasting, control, and uncertainty quantification in spatio-temporal machine learning.
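
A data-driven Koopman approximation can be sketched with a least-squares fit on a small feature dictionary. This is a simplified stand-in, not the paper's construction: it uses an explicit dictionary [x, x^2] instead of a kernel, a scalar state instead of a vector field, and the hypothetical name `edmd_two_features`. For the linear map x -> a*x, the monomials x and x^2 are Koopman eigenfunctions with eigenvalues a and a^2, so the fitted matrix should be close to diagonal.

```python
def edmd_two_features(xs, ys):
    """Least-squares Koopman matrix on the dictionary psi(x) = [x, x**2].

    Solves G K = A with G = sum psi(x) psi(x)^T and A = sum psi(x) psi(y)^T,
    where each y is the successor state of the matching x.
    """
    def psi(x):
        return (x, x * x)

    g = [[0.0, 0.0], [0.0, 0.0]]
    a = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in zip(xs, ys):
        px, py = psi(x), psi(y)
        for i in range(2):
            for j in range(2):
                g[i][j] += px[i] * px[j]
                a[i][j] += px[i] * py[j]

    # Closed-form inverse of the 2x2 Gram matrix.
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    g_inv = [[g[1][1] / det, -g[0][1] / det],
             [-g[1][0] / det, g[0][0] / det]]
    return [[sum(g_inv[i][k] * a[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


# Hypothetical snapshot pairs from the toy system x_next = 0.9 * x.
decay = 0.9
xs = [0.5, 1.0, 1.5, 2.0, -1.0]
ys = [decay * x for x in xs]
koopman = edmd_two_features(xs, ys)
```

The diagonal entries recover 0.9 and 0.81; a kernel method replaces the explicit dictionary with Gram matrices of kernel evaluations.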

[314] CoPE: A Lightweight Complex Positional Encoding

Avinash Amballa

Main category: cs.LG

TL;DR: CoPE introduces complex-valued positional encoding where the real part captures content and the imaginary part encodes position, achieving better performance with less computation than existing methods.

Details

Motivation: Traditional positional encodings have limitations in effectively modeling dependencies across sequence positions. Recent studies show position encoding effectiveness in transformers, but existing methods may suffer from long-term decay or computational complexity.

Method: Replace traditional positional encodings with complex embeddings. Use phase-aware attention in first transformer layer to capture position-dependent patterns, followed by standard attention layers. Real part captures semantic content, imaginary part encodes positional information.

Result: CoPE doesn’t exhibit long-term decay and is compatible with linear attention. Experimental evaluation on GLUE benchmark shows superior performance with less computational complexity compared to RoPE, Sinusoidal and Learned positional encodings.

Conclusion: Complex positional encoding (CoPE) provides an effective and efficient alternative to traditional positional encoding methods, offering better performance while reducing computational requirements in transformer architectures.

Abstract: Recent studies have demonstrated the effectiveness of position encoding in transformer architectures. By incorporating positional information, this approach provides essential guidance for modeling dependencies between elements across different sequence positions. We introduce CoPE (a lightweight Complex Positional Encoding), a novel architecture that leverages complex-valued encoding to encode both content and positional information. Our approach replaces traditional positional encodings with complex embeddings where the real part captures semantic content and the imaginary part encodes positional information. We introduce phase-aware attention in the first layer of the transformer model to capture position-dependent patterns, followed by standard attention layers for higher levels. We show that CoPE doesn’t exhibit long-term decay and is compatible with linear attention. Experimental evaluation on the GLUE benchmark suggests that our approach achieves superior performance with less computational complexity, compared to RoPE, Sinusoidal and Learned positional encodings.
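
The real/imaginary split can be illustrated with Python's built-in complex numbers. The abstract does not give the formula for phase-aware attention, so the score below (the real part of a Hermitian inner product) is only one plausible choice; the names `complex_embed` and `phase_aware_score` and all values are assumptions.

```python
def complex_embed(content, position_code):
    """Pack content into the real part and position into the imaginary part."""
    return [complex(c, p) for c, p in zip(content, position_code)]


def phase_aware_score(query, key):
    """Real part of the Hermitian inner product <query, key>.

    Assumption: this stands in for the paper's phase-aware attention score.
    It equals sum(content_q * content_k + position_q * position_k).
    """
    return sum(q * k.conjugate() for q, k in zip(query, key)).real


# Same hypothetical content vector placed at different position codes.
content = [1.0, 2.0]
query = complex_embed(content, [0.1, 0.1])
key_a = complex_embed(content, [0.1, 0.1])
key_b = complex_embed(content, [0.9, 0.9])
score_a = phase_aware_score(query, key_a)
score_b = phase_aware_score(query, key_b)
```

The two keys carry identical content, yet their scores differ, showing that a single complex-valued product can mix content and positional information.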

[315] What Matters in Data for DPO?

Yu Pan, Zhongze Cai, Guanting Chen, Huaiyang Zhong, Chonghuan Wang

Main category: cs.LG

TL;DR: DPO’s performance depends more on chosen response quality than rejected response quality. Theoretical and empirical analysis shows contrastive learning primarily improves chosen samples, and online DPO reduces to supervised fine-tuning on chosen responses.

Details

Motivation: To systematically study how preference data distribution influences Direct Preference Optimization (DPO) performance, particularly what characteristics of preference data are most critical for effective LLM alignment.

Method: Combined theoretical analysis characterizing optimal response distribution under DPO with extensive empirical experiments across diverse tasks, including studying online DPO settings and mixing on-policy data.

Result: Quality of chosen responses plays dominant role in optimizing DPO objective, while rejected responses have limited impact. Improving chosen response quality consistently boosts performance regardless of rejected response quality.

Conclusion: The study reveals DPO’s mechanism, explains widely adopted strategies, and provides practical insights for constructing high-impact preference datasets for LLM alignment, emphasizing the critical importance of chosen response quality.

Abstract: Direct Preference Optimization (DPO) has emerged as a simple and effective approach for aligning large language models (LLMs) with human preferences, bypassing the need for a learned reward model. Despite its growing adoption, a fundamental question remains open: what characteristics of preference data are most critical for DPO performance? In this work, we provide a systematic study of how preference data distribution influences DPO, from both theoretical and empirical perspectives. We show that the quality of chosen responses plays a dominant role in optimizing the DPO objective, while the quality of rejected responses may have relatively limited impact. Our theoretical analysis characterizes the optimal response distribution under DPO and reveals how contrastiveness between responses helps primarily by improving the chosen samples. We further study an online DPO setting and show it effectively reduces to supervised fine-tuning on the chosen responses. Extensive experiments across diverse tasks confirm our findings: improving the quality of chosen responses consistently boosts performance regardless of the quality of the rejected responses. We also investigate the benefit of mixing in on-policy data. Our results interpret the mechanism behind some widely adopted strategies and offer practical insights for constructing high-impact preference datasets for LLM alignment.
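
For reference, the standard DPO objective for a single preference pair can be written in a few lines. The sketch takes sequence log-probabilities as plain numbers; the function name `dpo_loss`, the default `beta`, and the sample values are assumptions for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin).

    The margin is the policy-vs-reference log-ratio of the chosen response
    minus the same log-ratio for the rejected response.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(z)) == log(1 + exp(-z))
    return math.log1p(math.exp(-beta * margin))


# Hypothetical log-probabilities.
same_as_reference = dpo_loss(-5.0, -6.0, -5.0, -6.0)  # policy equals reference
better_chosen = dpo_loss(-4.0, -6.0, -5.0, -6.0)      # chosen response more likely
```

When the policy matches the reference the loss is log 2; raising the chosen response's log-probability lowers it.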

[316] ProtoEHR: Hierarchical Prototype Learning for EHR-based Healthcare Predictions

Zi Cai, Yu Liu, Zhiyao Luo, Tingting Zhu

Main category: cs.LG

TL;DR: ProtoEHR is an interpretable hierarchical prototype learning framework that leverages multi-level EHR data structure for improved healthcare predictions across mortality, readmission, length of stay, drug recommendation, and phenotype tasks.

Details

Motivation: Existing studies often focus on isolated components of EHR data, limiting predictive performance and interpretability. There's a need to fully exploit the rich, multi-level structure of EHR data to enhance healthcare predictions.

Method: ProtoEHR models relationships within and across three hierarchical levels (medical codes, hospital visits, patients). It uses LLMs to extract semantic relationships among medical codes to construct a medical knowledge graph, then designs a hierarchical representation learning framework with prototype information at each level.

Result: ProtoEHR demonstrates accurate, robust, and interpretable predictions compared to baselines across five clinically significant tasks on two public datasets. It provides interpretable insights at code, visit, and patient levels.

Conclusion: ProtoEHR successfully addresses the limitations of existing approaches by fully leveraging EHR’s hierarchical structure, achieving superior performance while maintaining interpretability across multiple healthcare prediction tasks.

Abstract: Digital healthcare systems have enabled the collection of mass healthcare data in electronic healthcare records (EHRs), allowing artificial intelligence solutions for various healthcare prediction tasks. However, existing studies often focus on isolated components of EHR data, limiting their predictive performance and interpretability. To address this gap, we propose ProtoEHR, an interpretable hierarchical prototype learning framework that fully exploits the rich, multi-level structure of EHR data to enhance healthcare predictions. More specifically, ProtoEHR models relationships within and across three hierarchical levels of EHRs: medical codes, hospital visits, and patients. We first leverage large language models to extract semantic relationships among medical codes and construct a medical knowledge graph as the knowledge source. Building on this, we design a hierarchical representation learning framework that captures contextualized representations across three levels, while incorporating prototype information within each level to capture intrinsic similarities and improve generalization. To perform a comprehensive assessment, we evaluate ProtoEHR on two public datasets across five clinically significant tasks, including prediction of mortality, prediction of readmission, prediction of length of stay, drug recommendation, and prediction of phenotype. The results demonstrate the ability of ProtoEHR to make accurate, robust, and interpretable predictions compared to baselines in the literature. Furthermore, ProtoEHR offers interpretable insights on code, visit, and patient levels to aid in healthcare prediction.

[317] Evaluating Federated Learning for At-Risk Student Prediction: A Comparative Analysis of Model Complexity and Data Balancing

Rodrigo Tertulino

Main category: cs.LG

TL;DR: Federated learning model predicts at-risk distance education students with 85% AUC using early academic performance and digital engagement data while preserving privacy.

Details

Motivation: High dropout rates in distance education require proactive identification of at-risk students, but data privacy concerns and institutional silos often prevent effective early-warning systems.

Method: Developed machine learning models (Logistic Regression vs. Deep Neural Network) using Federated Learning framework on OULAD dataset, comparing model complexity and data balancing techniques.

Result: The federated model achieved strong predictive performance with approximately 85% ROC AUC score in identifying at-risk students.

Conclusion: Federated learning provides a practical, scalable solution for building early-warning systems that enable proactive student support while inherently respecting data privacy constraints.

Abstract: High dropout and failure rates in distance education pose a significant challenge for academic institutions, making the proactive identification of at-risk students crucial for providing timely support. This study develops and evaluates a machine learning model based on early academic performance and digital engagement patterns from the large-scale OULAD dataset to predict student risk at a UK university. To address the practical challenges of data privacy and institutional silos that often hinder such initiatives, we implement the model using a Federated Learning (FL) framework. We compare model complexity (Logistic Regression vs. a Deep Neural Network) and data balancing. The final federated model demonstrates strong predictive capability, achieving an ROC AUC score of approximately 85% in identifying at-risk students. Our findings show that this federated approach provides a practical and scalable solution for institutions to build effective early-warning systems, enabling proactive student support while inherently respecting data privacy.
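
The federated setup keeps raw student data at each institution and shares only model parameters. The abstract does not name the aggregation rule, so the sketch below assumes sample-size-weighted averaging in the style of FedAvg; the function name `federated_average` and the numbers are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Sample-size-weighted average of client parameter vectors.

    Assumption: FedAvg-style aggregation, which the abstract does not confirm.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(dim)]


# Two hypothetical institutions with locally trained 2-parameter models.
client_weights = [[1.0, 0.0], [3.0, 2.0]]
client_sizes = [100, 300]
global_weights = federated_average(client_weights, client_sizes)
```

Only the parameter vectors and sample counts leave each client, which is what lets the approach work across institutional silos.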

[318] ZTFed-MAS2S: A Zero-Trust Federated Learning Framework with Verifiable Privacy and Trust-Aware Aggregation for Wind Power Data Imputation

Yang Li, Hanjie Wang, Yuanzheng Li, Jiazheng Li, Zhaoyang Dong

Main category: cs.LG

TL;DR: ZTFed-MAS2S is a zero-trust federated learning framework that combines secure parameter transmission with accurate wind power data imputation using multi-head attention sequence-to-sequence models.

Details

Motivation: Wind power data often has missing values due to sensor faults and unstable transmission, and federated learning needs protection against anomalous updates and privacy leakage in open industrial environments.

Method: Integrates verifiable differential privacy with zero-knowledge proofs, confidentiality/integrity verification, dynamic trust-aware aggregation via similarity graphs, and MAS2S model for sequence imputation with compression techniques.

Result: Extensive experiments on real-world wind farm datasets show superior performance in both federated learning and missing data imputation compared to other methods.

Conclusion: ZTFed-MAS2S provides a secure, efficient solution for practical energy sector applications by ensuring privacy preservation and accurate data imputation in zero-trust environments.

Abstract: Wind power data often suffers from missing values due to sensor faults and unstable transmission at edge sites. While federated learning enables privacy-preserving collaboration without sharing raw data, it remains vulnerable to anomalous updates and privacy leakage during parameter exchange. These challenges are amplified in open industrial environments, necessitating zero-trust mechanisms where no participant is inherently trusted. To address these challenges, this work proposes ZTFed-MAS2S, a zero-trust federated learning framework that integrates a multi-head attention-based sequence-to-sequence imputation model. ZTFed integrates verifiable differential privacy with non-interactive zero-knowledge proofs and a confidentiality and integrity verification mechanism to ensure verifiable privacy preservation and secure transmission of model parameters. A dynamic trust-aware aggregation mechanism is employed, where trust is propagated over similarity graphs to enhance robustness, and communication overhead is reduced via sparsity- and quantization-based compression. MAS2S captures long-term dependencies in wind power data for accurate imputation. Extensive experiments on real-world wind farm datasets validate the superiority of ZTFed-MAS2S in both federated learning performance and missing data imputation, demonstrating its effectiveness as a secure and efficient solution for practical applications in the energy sector.

[319] Linear cost mutual information estimation and independence test of similar performance as HSIC

Jarek Duda, Jagoda Bracha, Adrian Przybysz

Main category: cs.LG

TL;DR: HCR (Hierarchical Correlation Reconstruction) is a linear-cost alternative to HSIC for statistical dependency evaluation, offering higher sensitivity and joint distribution modeling through mixed moments.

Details

Motivation: HSIC requires O(n^2.37) computational complexity for n-sized data samples, making it impractical for large datasets. There's a need for a more efficient method that maintains or improves dependency detection sensitivity.

Method: HCR uses hierarchical correlation reconstruction with mixed moments as features. It calculates single dependence features in O(n) linear time, starting with correlation and homoscedasticity, and allows approximation of mutual information through sum of squares of nontrivial mixed moments.

Result: HCR provides linear computational cost (O(n)) compared to HSIC’s O(n^2.37), offers higher dependence sensitivity in tests, and provides actual joint distribution models through dependency description via mixed moments.

Conclusion: HCR is a practical, efficient alternative to HSIC for statistical dependency evaluation, enabling analysis of large datasets with linear computational complexity while maintaining or improving detection sensitivity and providing richer distribution modeling capabilities.

Abstract: Evaluation of statistical dependencies between two data samples is a basic problem of data science/machine learning, and HSIC (Hilbert-Schmidt Independence Criterion) is considered the state-of-the-art method. However, for a data sample of size $n$ it requires multiplication of $n\times n$ matrices, which currently needs $\sim O(n^{2.37})$ computational complexity, making it impractical for large data samples. We discuss HCR (Hierarchical Correlation Reconstruction) as a practical linear-cost alternative with even higher dependence sensitivity in tests. It additionally provides an actual joint distribution model by describing dependencies through features that are mixed moments, starting with correlation and homoscedasticity, and it allows mutual information to be approximated as just the sum of squares of such nontrivial mixed moments between two data samples. Each such dependence-describing feature is calculated in $O(n)$ linear time. The number of features to test varies with dimension $d$: $O(d^2)$ for pairwise dependencies, $O(d^3)$ if more subtle triplewise dependencies are also considered, and so on.
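
The mixed-moment idea in the abstract can be illustrated with a minimal sketch. It is not the authors' code, and several choices are assumptions made for illustration: marginals are normalized to (0, 1) by ranks, the features are the first three orthonormal Legendre polynomials on [0, 1], and the score is the plain sum of squared mixed moments with no scaling factor. The function names (`to_uniform`, `legendre_features`, `mixed_moments`, `dependence_score`) are hypothetical. Note that the rank normalization used here sorts the data, which costs O(n log n); only the moment estimation itself is a single O(n) pass.

```python
import math
import random


def to_uniform(values):
    """Map a sample to (0, 1) by its ranks (empirical CDF normalization)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, i in enumerate(order):
        u[i] = (rank + 0.5) / n
    return u


def legendre_features(u):
    """First three nontrivial orthonormal Legendre polynomials on [0, 1]."""
    return (
        math.sqrt(3.0) * (2.0 * u - 1.0),
        math.sqrt(5.0) * (6.0 * u * u - 6.0 * u + 1.0),
        math.sqrt(7.0) * (20.0 * u ** 3 - 30.0 * u * u + 12.0 * u - 1.0),
    )


def mixed_moments(x, y):
    """Estimate a[j][k] = mean of f_j(u_x) * f_k(u_y) in one pass over the pairs."""
    ux, uy = to_uniform(x), to_uniform(y)
    n = len(x)
    a = [[0.0] * 3 for _ in range(3)]
    for p, q in zip(ux, uy):
        fp, fq = legendre_features(p), legendre_features(q)
        for j in range(3):
            for k in range(3):
                a[j][k] += fp[j] * fq[k]
    return [[v / n for v in row] for row in a]


def dependence_score(x, y):
    """Sum of squared mixed moments; near zero for independent samples."""
    return sum(v * v for row in mixed_moments(x, y) for v in row)


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    z = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    print("independent:", dependence_score(x, z))
    print("y = x^2:    ", dependence_score(x, [v * v for v in x]))
```

In this sketch the `y = x^2` case has almost no linear correlation, yet the moment pairing the second feature of `x` with the first feature of `y` is large, so the dependence shows up in the score.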

[320] DualSparse-MoE: Coordinating Tensor/Neuron-Level Sparsity with Expert Partition and Reconstruction

Weilin Cai, Le Qin, Shwai He, Junwei Cui, Ang Li, Jiayi Huang

Main category: cs.LG

TL;DR: DualSparse-MoE introduces post-training expert partitioning to exploit dual sparsity at tensor and neuron levels in MoE models, achieving significant computational speedups with minimal accuracy loss through dynamic computation dropping and static neuron reconstruction.

Details

Motivation: Mixture of Experts (MoE) models face efficiency challenges despite their sparse activation patterns due to massive computational scale and unpredictable activation patterns. The authors identify dual sparsity at tensor and neuron levels as key to improving both accuracy and efficiency.

Method: Proposes post-training expert partitioning to induce tensor-level sparsity without retraining, preserving mathematical consistency. Combines dynamic tensor-level computation dropping with static neuron-level reconstruction in the DualSparse-MoE inference system.

Result: Enforcing ~25% drop rate reduces accuracy by only 0.08%-0.28% across three MoE models, with proportional computational speedups. Load-imbalance aware expert parallelism achieves 1.41x speedup with 0.5% accuracy degradation.

Conclusion: The approach successfully leverages dual sparsity through post-training transformations, enabling efficient MoE deployment with minimal accuracy impact while providing consistent computational benefits.

Abstract: Mixture of Experts (MoE) has become a mainstream architecture for building Large Language Models (LLMs) by reducing per-token computation while enabling model scaling. It can be viewed as partitioning a large Feed-Forward Network (FFN) at the tensor level into fine-grained sub-FFNs, or experts, and activating only a sparse subset for each input. While this sparsity improves efficiency, MoE models still face substantial challenges due to their massive computational scale and unpredictable activation patterns. To enable efficient MoE deployment, we identify dual sparsity at the tensor and neuron levels in pre-trained MoE modules as a key factor for both accuracy and efficiency. Unlike prior work that increases tensor-level sparsity through finer-grained expert design during pre-training, we introduce post-training expert partitioning to induce such sparsity without retraining. This preserves the mathematical consistency of model transformations and enhances both efficiency and accuracy in subsequent fine-tuning and inference. Building upon this, we propose DualSparse-MoE, an inference system that integrates dynamic tensor-level computation dropping with static neuron-level reconstruction to deliver significant efficiency gains with minimal accuracy loss. Experimental results show that enforcing an approximate 25% drop rate with our approach reduces average accuracy by only 0.08%-0.28% across three prevailing MoE models, while nearly all degrees of computation dropping consistently yield proportional computational speedups. Furthermore, incorporating load-imbalance awareness into expert parallelism achieves a 1.41x MoE module speedup with just 0.5% average accuracy degradation.
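
Why partitioning an expert after training can preserve the model's output is easy to see on a toy FFN. The sketch below is not the paper's procedure: it assumes a plain two-layer ReLU expert (real MoE experts are usually gated), and the helper names `partition_expert` and `partitioned_ffn` are hypothetical. It only demonstrates the underlying identity: because the activation acts on each hidden neuron separately, splitting the hidden neurons into groups and summing the sub-expert outputs reproduces the original expert.

```python
import random


def relu(v):
    return [max(0.0, t) for t in v]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ffn(W_in, W_out, x):
    """y = W_out @ relu(W_in @ x); W_in is hidden x d_in, W_out is d_out x hidden."""
    return matvec(W_out, relu(matvec(W_in, x)))


def partition_expert(W_in, W_out, groups):
    """Split one expert into sub-experts, each owning a subset of hidden neurons."""
    parts = []
    for idx in groups:
        sub_in = [W_in[h] for h in idx]                     # rows of W_in
        sub_out = [[row[h] for h in idx] for row in W_out]  # columns of W_out
        parts.append((sub_in, sub_out))
    return parts


def partitioned_ffn(parts, x):
    """Sum the outputs of all sub-experts."""
    outs = [ffn(sub_in, sub_out, x) for sub_in, sub_out in parts]
    return [sum(col) for col in zip(*outs)]


if __name__ == "__main__":
    rng = random.Random(0)
    d_in, hidden, d_out = 3, 4, 2
    W_in = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(hidden)]
    W_out = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(d_out)]
    x = [rng.uniform(-1, 1) for _ in range(d_in)]
    parts = partition_expert(W_in, W_out, [[0, 2], [1, 3]])
    print(ffn(W_in, W_out, x))
    print(partitioned_ffn(parts, x))  # same values up to rounding
```

Skipping one of the sub-experts for a given token would then drop part of the computation; how the paper decides what to drop is not described in the summary above.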

[321] Low-Rank Tensor Decompositions for the Theory of Neural Networks

Ricardo Borsoi, Konstantin Usevich, Marianne Clausel

Main category: cs.LG

TL;DR: Low-rank tensor decompositions provide mathematical foundations for deep learning theory, explaining neural networks’ expressivity, learnability, generalization, and identifiability through their connections to tensor methods.

Details

Motivation: To bridge the gap between deep neural networks' empirical success and mathematical theory by leveraging low-rank tensor decompositions, which offer strong uniqueness guarantees and polynomial-time algorithms.

Method: Review and synthesis of existing approaches from computer science and mathematics communities, using low-rank tensor decompositions to analyze neural networks through their tensor representations.

Result: Demonstrates that tensor methods play a fundamental role in theoretically explaining various aspects of deep neural network performance, including expressivity, algorithmic learnability, computational hardness, generalization, and identifiability.

Conclusion: Low-rank tensor decompositions serve as a core mathematical tool for understanding deep neural networks, providing a unified framework for theoretical analysis and opening broader perspectives for future research in deep learning theory.

Abstract: The groundbreaking performance of deep neural networks (NNs) prompted a surge of interest in providing a mathematical basis to deep learning theory. Low-rank tensor decompositions are especially well suited to this task due to their close connection to NNs and their rich theoretical results. Different tensor decompositions have strong uniqueness guarantees, which allow for a direct interpretation of their factors, and polynomial time algorithms have been proposed to compute them. Through the connections between tensors and NNs, such results supported many important advances in the theory of NNs. In this review, we show how low-rank tensor methods, which have been a core tool in the signal processing and machine learning communities, play a fundamental role in theoretically explaining different aspects of the performance of deep NNs, including their expressivity, algorithmic learnability and computational hardness, generalization, and identifiability. Our goal is to give an accessible overview of existing approaches (developed by different communities, ranging from computer science to mathematics) in a coherent and unified way, and to open a broader perspective on the use of low-rank tensor decompositions for the theory of deep NNs.

[322] LLM-Driven Intrinsic Motivation for Sparse Reward Reinforcement Learning

André Quadros, Cassio Silva, Ronnie Alves

Main category: cs.LG

TL;DR: Combining VAE-based state novelty rewards (VSIMR) with LLM-derived intrinsic rewards significantly improves RL agent performance in sparse reward environments like MiniGrid DoorKey.

Details

Motivation: Traditional reinforcement learning struggles in environments with extremely sparse rewards, where positive feedback is infrequent, requiring better exploration and guidance strategies.

Method: Integrates Variational State as Intrinsic Reward (VSIMR), which uses VAEs to reward state novelty, with LLM-derived intrinsic rewards based on environment and goal descriptions. The combination is implemented with an Actor-Critic (A2C) agent in the MiniGrid DoorKey environment.

Result: The combined strategy significantly increased agent performance and sampling efficiency compared to individual strategies or standard A2C, which failed to learn. Learning curves show VSIMR drives exploration while LLM rewards facilitate goal-oriented exploitation.

Conclusion: Combining VAE-based exploration (VSIMR) with LLM-guided exploitation creates an effective synergy for sparse reward RL problems, with each component complementing different aspects of the environment and task.

Abstract: This paper explores the combination of two intrinsic motivation strategies to improve the efficiency of reinforcement learning (RL) agents in environments with extremely sparse rewards, where traditional learning struggles due to infrequent positive feedback. We propose integrating Variational State as Intrinsic Reward (VSIMR), which uses Variational AutoEncoders (VAEs) to reward state novelty, with an intrinsic reward approach derived from Large Language Models (LLMs). The LLMs leverage their pre-trained knowledge to generate reward signals based on environment and goal descriptions, guiding the agent. We implemented this combined approach with an Actor-Critic (A2C) agent in the MiniGrid DoorKey environment, a benchmark for sparse rewards. Our empirical results show that this combined strategy significantly increases agent performance and sampling efficiency compared to using each strategy individually or a standard A2C agent, which failed to learn. Analysis of learning curves indicates that the combination effectively complements different aspects of the environment and task: VSIMR drives exploration of new states, while the LLM-derived rewards facilitate progressive exploitation towards goals.

[323] Enhancing Trust-Region Bayesian Optimization via Newton Methods

Quanlin Chen, Yiyu Chen, Jing Huo, Tianyu Ding, Yang Gao, Yuetong Chen

Main category: cs.LG

TL;DR: Proposes an enhanced Bayesian Optimization method that uses local quadratic models from a global GP’s gradients/Hessians to improve sampling efficiency while maintaining heterogeneous modeling in high-dimensional spaces.

Details

Motivation: Scaling Bayesian Optimization to high-dimensional spaces is challenging. Existing TuRBO method uses local GPs which reduces sampling efficiency compared to global GPs, and GP gradients vanish in high dimensions.

Method: Construct multiple local quadratic models using gradients and Hessians from a global GP, then select new sample points by solving bound-constrained quadratic programs. Addresses vanishing gradient issues in high-dimensional GPs.

Result: Method enhances TuRBO efficacy and outperforms various high-dimensional BO techniques on both synthetic functions and real-world applications.

Conclusion: The proposed approach successfully improves sampling efficiency while preserving heterogeneous modeling capabilities, making high-dimensional Bayesian Optimization more effective.

Abstract: Bayesian Optimization (BO) has been widely applied to optimize expensive black-box functions while retaining sample efficiency. However, scaling BO to high-dimensional spaces remains challenging. Existing literature proposes performing standard BO in multiple local trust regions (TuRBO) for heterogeneous modeling of the objective function and avoiding over-exploration. Despite its advantages, using local Gaussian Processes (GPs) reduces sampling efficiency compared to a global GP. To enhance sampling efficiency while preserving heterogeneous modeling, we propose to construct multiple local quadratic models using gradients and Hessians from a global GP, and select new sample points by solving the bound-constrained quadratic program. Additionally, we address the issue of vanishing gradients of GPs in high-dimensional spaces. We provide a convergence analysis and demonstrate through experimental results that our method enhances the efficacy of TuRBO and outperforms a wide range of high-dimensional BO techniques on synthetic functions and real-world applications.
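
The step "select new sample points by solving the bound-constrained quadratic program" can be sketched as follows. This is an illustration under stated assumptions, not the authors' solver: the gradient `g` and Hessian `H` are passed in as plain numbers (in the paper they come from a global GP), the box bounds stand in for a trust region, and the solver is simple projected gradient descent, chosen here only for brevity. The function name `solve_box_qp` is hypothetical.

```python
def solve_box_qp(g, H, lower, upper, iters=500):
    """Minimize g.s + 0.5 * s.H.s subject to lower <= s <= upper.

    Uses projected gradient descent. H is assumed symmetric; for an
    indefinite H this only finds a stationary point of the box problem.
    """
    n = len(g)
    # The maximum absolute row sum bounds the largest |eigenvalue| of H,
    # so 1 / lip is a safe step size.
    lip = max(sum(abs(v) for v in row) for row in H) or 1.0
    # Start from the projection of the origin onto the box.
    s = [min(max(0.0, lo), hi) for lo, hi in zip(lower, upper)]
    for _ in range(iters):
        grad = [g[i] + sum(H[i][j] * s[j] for j in range(n)) for i in range(n)]
        s = [min(max(s[i] - grad[i] / lip, lower[i]), upper[i]) for i in range(n)]
    return s


if __name__ == "__main__":
    H = [[2.0, 0.0], [0.0, 2.0]]
    # The unconstrained Newton step would be [1, 4]; the box clips it.
    print(solve_box_qp([-2.0, -8.0], H, [-1.0, -1.0], [1.0, 1.0]))
    # Here the Newton step [0.5, -0.25] already lies inside the box.
    print(solve_box_qp([-1.0, 0.5], H, [-1.0, -1.0], [1.0, 1.0]))
```

When the unconstrained Newton step falls inside the box, the solution coincides with it; otherwise the bounds become active, which is what keeps the proposed sample inside the local region.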

[324] VERIRL: Boosting the LLM-based Verilog Code Generation via Reinforcement Learning

Fu Teng, Miao Pan, Xuhong Zhang, Zhezhi He, Yiyao Yang, Xinyi Chai, Mengnan Qi, Liqiang Lu, Jianwei Yin

Main category: cs.LG

TL;DR: A reinforcement learning framework for Verilog code generation that addresses challenges in hardware description languages through a curated dataset, trace-back rescore mechanism, and sample-balanced weighting strategy, achieving state-of-the-art performance.

Details

Motivation: Hardware description languages like Verilog remain underexplored in code generation due to their concurrency semantics, syntactic rigidity, and simulation complexity, creating a need for specialized approaches.

Method: Developed Veribench-53K dataset from 700K+ Verilog problems, introduced trace-back rescore mechanism for better reward signals, implemented sample-balanced weighting strategy to prevent catastrophic forgetting, and created iterative RL pipeline co-evolving policy and reward models.

Result: Achieved state-of-the-art performance on Verilog generation tasks with substantial gains in test pass rate, functional correctness, and compilation robustness, outperforming methods like CraftRTL and DeepSeek-style approaches.

Conclusion: RL-driven approaches show strong potential for structured code generation in hardware-centric domains, demonstrating that high-quality datasets combined with RL optimization can outperform methods relying on large-scale closed-source model distillation.

Abstract: Recent advancements in code generation have shown remarkable success across software domains, yet hardware description languages (HDLs) such as Verilog remain underexplored due to their concurrency semantics, syntactic rigidity, and simulation complexity. In this work, we address these challenges by introducing a reinforcement learning (RL) framework tailored for Verilog code generation. We first construct Veribench-53K, a high-quality dataset curated from over 700K Verilog problems, enriched with structured prompts, complexity labels, and diverse testbenches. To tackle the problem of sparse and noisy reward signals, we propose a Trace-back based Rescore mechanism that leverages reasoning paths and iterative refinement to enhance feedback reliability and support reward model training. Furthermore, to mitigate catastrophic forgetting and overfitting during RL fine-tuning, we introduce a sample-balanced weighting strategy that adaptively balances learning dynamics based on reward-probability distributions. These innovations are integrated into an iterative RL pipeline that co-evolves the policy and reward models. In contrast to recent work such as CraftRTL, which relies on large-scale closed-source model distillation, and DeepSeek-style approaches that struggle with sparse feedback, our method demonstrates superior performance using a smaller but high-quality dataset combined with RL optimization. Experiments on Verilog generation tasks demonstrate state-of-the-art performance, with substantial gains in test pass rate, functional correctness, and compilation robustness. Our findings highlight the potential of RL-driven approaches for structured code generation in hardware-centric domains. VERIRL is publicly available at https://github.com/omniAI-Lab/VeriRL.

[325] DRTA: Dynamic Reward Scaling for Reinforcement Learning in Time Series Anomaly Detection

Bahareh Golchin, Banafsheh Rekabdar, Kunpeng Liu

Main category: cs.LG

TL;DR: Proposes DRTA, a reinforcement learning framework with dynamic reward shaping, VAE, and active learning for time series anomaly detection that outperforms state-of-the-art methods on benchmark datasets.

Details

Motivation: Traditional anomaly detection methods struggle with limited labeled data, high false-positive rates, and poor generalization to novel anomaly types in time series data across finance, healthcare, and industrial applications.

Method: Uses reinforcement learning with dynamic reward shaping that balances VAE-based reconstruction error and classification rewards through an adaptive mechanism, combined with active learning for effective anomaly detection in low-label systems.

Result: Experimental results on Yahoo A1 and A2 benchmark datasets show consistent outperformance over state-of-the-art unsupervised and semi-supervised approaches, achieving high precision and recall.

Conclusion: The DRTA framework provides a scalable and efficient solution for real-world anomaly detection tasks, effectively addressing challenges of limited labeled data and novel anomaly types.

Abstract: Anomaly detection in time series data is important for applications in finance, healthcare, sensor networks, and industrial monitoring. Traditional methods usually struggle with limited labeled data, high false-positive rates, and difficulty generalizing to novel anomaly types. To overcome these challenges, we propose a reinforcement learning-based framework that integrates dynamic reward shaping, Variational Autoencoder (VAE), and active learning, called DRTA. Our method uses an adaptive reward mechanism that balances exploration and exploitation by dynamically scaling the effect of VAE-based reconstruction error and classification rewards. This approach enables the agent to detect anomalies effectively in low-label systems while maintaining high precision and recall. Our experimental results on the Yahoo A1 and Yahoo A2 benchmark datasets demonstrate that the proposed method consistently outperforms state-of-the-art unsupervised and semi-supervised approaches. These findings show that our framework is a scalable and efficient solution for real-world anomaly detection tasks.

[326] Data Augmentation Improves Machine Unlearning

Andreza M. C. Falcao, Filipe R. Cordeiro

Main category: cs.LG

TL;DR: Data augmentation design significantly improves machine unlearning effectiveness, reducing performance gap to retrained models by up to 40.12% with TrivialAug.

Details

Motivation: To investigate the under-explored role of systematic augmentation design in machine unlearning and its impact on improving unlearning effectiveness while preserving model performance.

Method: Experiments on CIFAR-10 and CIFAR-100 datasets using various unlearning methods (SalUn, Random Label, Fine-Tuning) with different data augmentation strategies under varying forget rates.

Result: Proper augmentation design significantly improves unlearning effectiveness and narrows the performance gap to retrained models, with TrivialAug reducing the Average Gap unlearning metric by up to 40.12%.

Conclusion: Data augmentation not only helps reduce memorization but is crucial for achieving privacy-preserving and efficient machine unlearning.

Abstract: Machine Unlearning (MU) aims to remove the influence of specific data from a trained model while preserving its performance on the remaining data. Although a few works suggest connections between memorisation and augmentation, the role of systematic augmentation design in MU remains under-investigated. In this work, we investigate the impact of different data augmentation strategies on the performance of unlearning methods, including SalUn, Random Label, and Fine-Tuning. Experiments conducted on CIFAR-10 and CIFAR-100, under varying forget rates, show that proper augmentation design can significantly improve unlearning effectiveness, reducing the performance gap to retrained models. Results showed a reduction of up to 40.12% of the Average Gap unlearning Metric, when using TrivialAug augmentation. Our results suggest that augmentation not only helps reduce memorization but also plays a crucial role in achieving privacy-preserving and efficient unlearning.

[327] Breaking Through Barren Plateaus: Reinforcement Learning Initializations for Deep Variational Quantum Circuits

Yifeng Peng, Xinyi Li, Zhemin Zhang, Samuel Yen-Chi Chen, Zhiding Liang, Ying Wang

Main category: cs.LG

TL;DR: RL-based initialization strategy for VQAs to mitigate barren plateau problem by pre-training circuit parameters using reinforcement learning before gradient optimization.

Details

Motivation: Variational Quantum Algorithms suffer from barren plateau problem where gradients vanish exponentially with system size/circuit depth, hindering training effectiveness.

Method: Use reinforcement learning algorithms (DPG, SAC, PPO) to generate optimal initial circuit parameters that minimize VQA cost function before applying standard gradient-based optimization methods.

Result: RL-based initialization significantly improves convergence speed and final solution quality across various noise conditions and tasks, with multiple RL algorithms achieving comparable performance gains.

Conclusion: RL-driven parameter initialization offers a flexible and robust approach to accelerate VQA scalability and practical deployment, providing a promising integration of machine learning techniques into quantum algorithm design.

Abstract: Variational Quantum Algorithms (VQAs) have gained prominence as a viable framework for exploiting near-term quantum devices in applications ranging from optimization and chemistry simulation to machine learning. However, the effectiveness of VQAs is often constrained by the so-called barren plateau problem, wherein gradients diminish exponentially as system size or circuit depth increases, thereby hindering training. In this work, we propose a reinforcement learning (RL)-based initialization strategy to alleviate the barren plateau issue by reshaping the initial parameter landscape to avoid regions prone to vanishing gradients. In particular, we explore several RL algorithms (including Deterministic Policy Gradient, Soft Actor-Critic, and Proximal Policy Optimization) to generate the circuit parameters (treated as actions) that minimize the VQA cost function before standard gradient-based optimization. By pre-training with RL in this manner, subsequent optimization using methods such as gradient descent or Adam proceeds from a more favorable initial state. Extensive numerical experiments under various noise conditions and tasks consistently demonstrate that the RL-based initialization method significantly enhances both convergence speed and final solution quality. Moreover, comparisons among different RL algorithms highlight that multiple approaches can achieve comparable performance gains, underscoring the flexibility and robustness of our method. These findings shed light on a promising avenue for integrating machine learning techniques into quantum algorithm design, offering insights into how RL-driven parameter initialization can accelerate the scalability and practical deployment of VQAs. They also open up a promising path for the research community in machine learning for quantum computing, especially for barren plateau problems in VQAs.

[328] Quantifying The Limits of AI Reasoning: Systematic Neural Network Representations of Algorithms

Anastasis Kratsios, Dennis Zvigelsky, Bradd Hart

Main category: cs.LG

TL;DR: Neural networks can exactly emulate any computational circuit (Boolean, tropical, arithmetic, etc.) using ReLU activations, demonstrating they can perform any reasoning task without approximation when perfectly trained.

Details

Motivation: To quantify what forms of reasoning neural networks can perform when perfectly trained, addressing the open question about their reasoning capabilities beyond approximation.

Method: A meta-algorithm that converts any circuit into a feedforward neural network by iteratively replacing each gate with a canonical ReLU MLP emulator, preserving exact computation without approximation.

Result: Neural networks can exactly emulate circuits including Boolean logic, dynamic programming, symbolic math, and even randomized circuits, with parametric complexity scaling with circuit complexity.

Conclusion: No reasoning task lies beyond neural networks’ reach - they can emulate any circuit exactly, trading algorithmic runtime for space complexity (number of neurons). The result is strictly more powerful than a classical universal approximation theorem.

Abstract: A main open question in contemporary AI research is quantifying the forms of reasoning neural networks can perform when perfectly trained. This paper answers this by interpreting reasoning tasks as circuit emulation, where the gates define the type of reasoning; e.g. Boolean gates for predicate logic, tropical circuits for dynamic programming, arithmetic and analytic gates for symbolic mathematical representation, and hybrids thereof for deeper reasoning; e.g. higher-order logic. We present a systematic meta-algorithm that converts essentially any circuit into a feedforward neural network (NN) with ReLU activations by iteratively replacing each gate with a canonical ReLU MLP emulator. We show that, on any digital computer, our construction emulates the circuit exactly (no approximation, no rounding, modular overflow included), demonstrating that no reasoning task lies beyond the reach of neural networks. The number of neurons in the resulting network (parametric complexity) scales with the circuit’s complexity, and the network’s computational graph (structure) mirrors that of the emulated circuit. This formalizes the folklore that NNs trade algorithmic run-time (circuit runtime) for space complexity (number of neurons). We derive a range of applications of our main result, from emulating shortest-path algorithms on graphs with cubic-size NNs, to simulating stopped Turing machines with roughly quadratically large NNs, and even the emulation of randomized Boolean circuits. Lastly, we demonstrate that our result is strictly more powerful than a classical universal approximation theorem: any universal function approximator can be encoded as a circuit and directly emulated by an NN.
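
The gate-by-gate replacement described above can be illustrated on Boolean circuits. The sketch below uses standard textbook ReLU gadgets that are exact on inputs in {0, 1}; they are an assumption for illustration and not necessarily the canonical emulators constructed in the paper. A full adder is then built by substituting each gate with its ReLU counterpart, so the composed function mirrors the circuit's structure.

```python
def relu(z):
    return z if z > 0 else 0


def and_gate(a, b):
    # 1 only when a + b = 2
    return relu(a + b - 1)


def or_gate(a, b):
    # 0 only when a + b = 0
    return 1 - relu(1 - a - b)


def xor_gate(a, b):
    # a + b, with the case a + b = 2 pulled back down to 0
    return relu(a + b) - 2 * relu(a + b - 1)


def full_adder(a, b, c):
    """Full-adder circuit with every gate replaced by its ReLU gadget."""
    p = xor_gate(a, b)
    s = xor_gate(p, c)
    carry = or_gate(and_gate(a, b), and_gate(p, c))
    return s, carry


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                print(a, b, c, "->", full_adder(a, b, c))
```

All arithmetic here is on integers, so the emulation involves no rounding: for every input triple the outputs satisfy `a + b + c == s + 2 * carry`.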

[329] BTW: A Non-Parametric Variance Stabilization Framework for Multimodal Model Integration

Jun Hou, Le Wang, Xuan Wang

Main category: cs.LG

TL;DR: BTW is a parameter-free bi-level weighting framework that uses KL divergence and mutual information to dynamically adjust modality importance in multimodal MoE models, improving performance when additional modalities introduce noise.

Details

Motivation: Existing approaches like Partial Information Decomposition don't scale beyond two modalities and lack instance-level control, making them ineffective when additional modalities introduce more noise than useful information.

Method: BTW combines instance-level KL divergence (measuring divergence between unimodal and multimodal predictions) and modality-level mutual information (estimating global alignment) to dynamically weight modalities without additional parameters.

Result: Extensive experiments on sentiment regression and clinical classification show significant improvements in regression performance and multiclass classification accuracy.

Conclusion: BTW provides an effective, scalable solution for handling noisy modalities in multimodal MoE models through dynamic weighting based on both instance-specific and global modality alignment.

Abstract: Mixture-of-Experts (MoE) models have become increasingly powerful in multimodal learning by enabling modular specialization across modalities. However, their effectiveness remains unclear when additional modalities introduce more noise than complementary information. Existing approaches, such as the Partial Information Decomposition, struggle to scale beyond two modalities and lack the resolution needed for instance-level control. We propose Beyond Two-modality Weighting (BTW), a bi-level, non-parametric weighting framework that combines instance-level Kullback-Leibler (KL) divergence and modality-level mutual information (MI) to dynamically adjust modality importance during training. Our method does not require additional parameters and can be applied to an arbitrary number of modalities. Specifically, BTW computes per-example KL weights by measuring the divergence between each unimodal and the current multimodal prediction, and modality-wide MI weights by estimating global alignment between unimodal and multimodal outputs. Extensive experiments on sentiment regression and clinical classification demonstrate that our method significantly improves regression performance and multiclass classification accuracy.

[330] Enhancing Chemical Explainability Through Counterfactual Masking

Łukasz Janisiów, Marek Kochańczyk, Bartosz Zieliński, Tomasz Danel

Main category: cs.LG

TL;DR: Counterfactual masking framework replaces masked molecular substructures with chemically reasonable fragments from generative models, providing more realistic and actionable explanations than traditional masking methods.

Details

Motivation: Existing explainable AI methods for molecular property prediction rely on masking strategies that remove atoms/features, but these often fail to adhere to molecular distributions and yield unintuitive explanations.

Method: Proposes counterfactual masking that replaces masked substructures with chemically reasonable fragments sampled from generative models trained to complete molecular graphs, evaluating predictions against counterfactual molecules from the data distribution.

Result: The method provides molecular realism for robust explanations and meaningful counterfactuals that indicate how structural modifications affect properties, demonstrating effectiveness across multiple datasets and property prediction tasks.

Conclusion: The approach bridges explainability and molecular design, offering a principled generative path toward explainable machine learning in chemistry with more actionable insights.

Abstract: Molecular property prediction is a crucial task that guides the design of new compounds, including drugs and materials. While explainable artificial intelligence methods aim to scrutinize model predictions by identifying influential molecular substructures, many existing approaches rely on masking strategies that remove either atoms or atom-level features to assess importance via fidelity metrics. These methods, however, often fail to adhere to the underlying molecular distribution and thus yield unintuitive explanations. In this work, we propose counterfactual masking, a novel framework that replaces masked substructures with chemically reasonable fragments sampled from generative models trained to complete molecular graphs. Rather than evaluating masked predictions against implausible zeroed-out baselines, we assess them relative to counterfactual molecules drawn from the data distribution. Our method offers two key benefits: (1) molecular realism underpinning robust and distribution-consistent explanations, and (2) meaningful counterfactuals that directly indicate how structural modifications may affect predicted properties. We demonstrate that counterfactual masking is well-suited for benchmarking model explainers and yields more actionable insights across multiple datasets and property prediction tasks. Our approach bridges the gap between explainability and molecular design, offering a principled and generative path toward explainable machine learning in chemistry.

[331] A Note on Graphon-Signal Analysis of Graph Neural Networks

Levi Rauchwerger, Ron Levie

Main category: cs.LG

TL;DR: This paper extends previous graphon-signal analysis of MPNNs by addressing limitations in multidimensional signals, readout functions, generalization bounds, and non-symmetric graphons.

Details

Motivation: The previous paper by Levie on graphon-signal analysis of MPNNs had several limitations that restricted its practical applicability in graph machine learning settings.

Method: The authors introduce four key refinements: 1) extending to multidimensional signals, 2) extending Lipschitz continuity to MPNNs with readout functions, 3) improving generalization bounds using robustness-type bounds, and 4) extending analysis to non-symmetric graphons and kernels.

Result: The paper provides more comprehensive theoretical foundations for MPNNs that better align with practical graph machine learning applications by addressing the identified limitations.

Conclusion: These extensions make the graphon-signal analysis framework more applicable to real-world graph learning scenarios by supporting multidimensional features, readout operations, improved generalization guarantees, and asymmetric graph structures.

Abstract: A recent paper, “A Graphon-Signal Analysis of Graph Neural Networks”, by Levie, analyzed message passing graph neural networks (MPNNs) by embedding the input space of MPNNs, i.e., attributed graphs (graph-signals), to a space of attributed graphons (graphon-signals). Based on extensions of standard results in graphon analysis to graphon-signals, the paper proved a generalization bound and a sampling lemma for MPNNs. However, there are some missing ingredients in that paper, limiting its applicability in practical settings of graph machine learning. In the current paper, we introduce several refinements and extensions to existing results that address these shortcomings. In detail, 1) we extend the main results in the paper to graphon-signals with multidimensional signals (rather than 1D signals), 2) we extend the Lipschitz continuity to MPNNs with readout with respect to cut distance (rather than MPNNs without readout with respect to cut metric), 3) we improve the generalization bound by utilizing robustness-type generalization bounds, and 4) we extend the analysis to non-symmetric graphons and kernels.

[332] Improving Long-term Autoregressive Spatiotemporal Predictions: A Proof of Concept with Fluid Dynamics

Hao Zhou, Sibo Cheng

Main category: cs.LG

TL;DR: SPF framework enables multi-step learning with one-step training, reducing memory usage while improving long-term accuracy compared to autoregressive methods.

Details

Motivation: Address error accumulation in long-term forecasting and high GPU memory demands of autoregressive training while maintaining short-term performance.

Method: Stochastic PushForward (SPF) builds supplementary dataset from model predictions, combines with ground truth via stochastic acquisition strategy, and precomputes multi-step predictions between epochs.

Result: SPF achieves higher long-term accuracy than autoregressive methods on Burgers’ equation and Shallow Water benchmark while lowering memory requirements.

Conclusion: SPF is a promising approach for resource-limited and complex simulations, balancing short- and long-term performance with reduced memory usage.

Abstract: Data-driven methods are emerging as efficient alternatives to traditional numerical forecasting, offering fast inference and lower computational cost. Yet, for complex systems, long-term accuracy often deteriorates due to error accumulation, and autoregressive training (though effective) demands large GPU memory and may sacrifice short-term performance. We propose the Stochastic PushForward (SPF) framework, which retains one-step-ahead training while enabling multi-step learning. SPF builds a supplementary dataset from model predictions and combines it with ground truth via a stochastic acquisition strategy, balancing short- and long-term performance while reducing overfitting. Multi-step predictions are precomputed between epochs, keeping memory usage stable without storing full unrolled sequences. Experiments on the Burgers’ equation and the Shallow Water benchmark show that SPF achieves higher long-term accuracy than autoregressive methods while lowering memory requirements, making it promising for resource-limited and complex simulations.
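
One way to read the SPF recipe is sketched below: between epochs, a one-step model is rolled forward to build (predicted state, true next state) pairs, and each training sample is then drawn from either the ground-truth pairs or this supplementary set. The function names, the fixed mixing probability `p_supp`, and the scalar toy states are assumptions made for illustration; the paper's actual acquisition strategy may differ.

```python
import random


def rollout(model, x, steps):
    """Apply a one-step model repeatedly."""
    for _ in range(steps):
        x = model(x)
    return x


def build_supplementary(model, trajectory, steps):
    """Precompute pairs (model state after `steps` steps, true next state).

    Run once between epochs, so no unrolled sequence is kept in memory
    during the one-step training pass.
    """
    pairs = []
    for t in range(len(trajectory) - steps - 1):
        x_hat = rollout(model, trajectory[t], steps)
        pairs.append((x_hat, trajectory[t + steps + 1]))
    return pairs


def sample_batch(ground_truth, supplementary, p_supp, batch_size, rng):
    """Draw each sample from the supplementary set with probability p_supp."""
    batch = []
    for _ in range(batch_size):
        use_supp = bool(supplementary) and rng.random() < p_supp
        batch.append(rng.choice(supplementary if use_supp else ground_truth))
    return batch


traj = [1, 2, 4, 8, 16]                      # toy doubling dynamics
gt_pairs = list(zip(traj[:-1], traj[1:]))    # one-step ground-truth pairs
supp = build_supplementary(lambda x: 2 * x, traj, steps=2)
print(sample_batch(gt_pairs, supp, 0.5, 4, random.Random(0)))
```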

[333] Sparse Autoencoders for Low-$N$ Protein Function Prediction and Design

Darin Tsui, Kunal Talreja, Amirali Aghazadeh

Main category: cs.LG

TL;DR: Sparse autoencoders (SAEs) trained on protein language model embeddings outperform baseline models in low-data protein function prediction and enable more effective protein design by extracting interpretable biological features.

Details

Motivation: Protein function prediction from sequence is challenging in data-scarce regimes, and while protein language models provide useful embeddings, the effectiveness of sparse autoencoders for low-N function prediction and design hasn't been systematically studied.

Method: Evaluated SAEs trained on fine-tuned ESM2 embeddings across diverse fitness extrapolation and protein engineering tasks, comparing performance against ESM2 baselines with as few as 24 sequences.

Result: SAEs consistently outperform or compete with ESM2 baselines in fitness prediction, and steering predictive latents yields top-fitness variants in 83% of cases compared to designing with ESM2 alone.

Conclusion: SAEs provide compact, biologically meaningful representations that generalize effectively from limited data and enable more successful protein design by exploiting biological motifs in language model representations.

Abstract: Predicting protein function from amino acid sequence remains a central challenge in data-scarce (low-$N$) regimes, limiting machine learning-guided protein design when only small amounts of assay-labeled sequence-function data are available. Protein language models (pLMs) have advanced the field by providing evolutionary-informed embeddings and sparse autoencoders (SAEs) have enabled decomposition of these embeddings into interpretable latent variables that capture structural and functional features. However, the effectiveness of SAEs for low-$N$ function prediction and protein design has not been systematically studied. Herein, we evaluate SAEs trained on fine-tuned ESM2 embeddings across diverse fitness extrapolation and protein engineering tasks. We show that SAEs, with as few as 24 sequences, consistently outperform or compete with their ESM2 baselines in fitness prediction, indicating that their sparse latent space encodes compact and biologically meaningful representations that generalize more effectively from limited data. Moreover, steering predictive latents exploits biological motifs in pLM representations, yielding top-fitness variants in 83% of cases compared to designing with ESM2 alone.
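
For readers unfamiliar with sparse autoencoders, the sketch below shows a single forward pass of one common formulation: a ReLU encoder, a linear decoder, and an L1 penalty on the latent code. The weights, the tiny dimensions, and the choice of L1 (rather than, say, a top-k constraint) are assumptions; the abstract does not specify the SAE variant applied to the ESM2 embeddings.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def sae_forward(x, W_enc, b_enc, W_dec, l1_coeff):
    """Return (latent code, reconstruction, loss) for one embedding x."""
    pre = [h + b for h, b in zip(matvec(W_enc, x), b_enc)]
    z = [max(0.0, h) for h in pre]            # ReLU keeps the code non-negative
    x_hat = matvec(W_dec, z)
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    sparsity = l1_coeff * sum(abs(a) for a in z)
    return z, x_hat, recon + sparsity


# Toy 2-dimensional "embedding" expanded into 3 latents.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
z, x_hat, loss = sae_forward([1.0, -1.0], W_enc, b_enc, W_dec, l1_coeff=0.1)
print(z, x_hat, loss)
```

Only one latent is active for this input, which is the kind of sparse, inspectable code that a downstream low-N predictor could consume.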

[334] DrugReasoner: Interpretable Drug Approval Prediction with a Reasoning-augmented Language Model

Mohammadreza Ghaffarzadeh-Esfahani, Ali Motahharynia, Nahid Yousefian, Navid Mazrouei, Jafar Ghaisari, Yousof Gheisari

Main category: cs.LG

TL;DR: DrugReasoner is a reasoning-based LLM that predicts small-molecule drug approval likelihood by integrating molecular descriptors with comparative reasoning against similar compounds, achieving robust performance while providing interpretable rationales.

Details

Motivation: Early prediction of drug approval outcomes is critical for optimizing research investments, but existing ML/DL methods have limited interpretability, constraining their impact in drug discovery.

Method: Built on LLaMA architecture and fine-tuned with group relative policy optimization (GRPO), DrugReasoner integrates molecular descriptors with comparative reasoning against structurally similar approved/unapproved compounds, generating predictions with step-by-step rationales and confidence scores.

Result: Achieved AUC of 0.732 and F1 score of 0.729 on validation set, 0.725 and 0.718 on test set, outperforming conventional baselines. On external dataset, achieved AUC of 0.728 and F1-score of 0.774, outperforming both baseline and ChemAP model while maintaining high precision and balanced sensitivity.

Conclusion: DrugReasoner delivers competitive predictive accuracy while enhancing transparency through reasoning outputs, addressing a key bottleneck in AI-assisted drug discovery and demonstrating the potential of reasoning-augmented LLMs as interpretable tools for pharmaceutical decision-making.

Abstract: Drug discovery is a complex and resource-intensive process, making early prediction of approval outcomes critical for optimizing research investments. While classical machine learning and deep learning methods have shown promise in drug approval prediction, their limited interpretability constrains their impact. Here, we present DrugReasoner, a reasoning-based large language model (LLM) built on the LLaMA architecture and fine-tuned with group relative policy optimization (GRPO) to predict the likelihood of small-molecule approval. DrugReasoner integrates molecular descriptors with comparative reasoning against structurally similar approved and unapproved compounds, generating predictions alongside step-by-step rationales and confidence scores. DrugReasoner achieved robust performance with an AUC of 0.732 and an F1 score of 0.729 on the validation set and 0.725 and 0.718 on the test set, respectively. These results outperformed conventional baselines, including logistic regression, support vector machine, and k-nearest neighbors, and had competitive performance relative to XGBoost. On an external independent dataset, DrugReasoner outperformed both baseline and the recently developed ChemAP model, achieving an AUC of 0.728 and an F1-score of 0.774, while maintaining high precision and balanced sensitivity, demonstrating robustness in real-world scenarios. These findings demonstrate that DrugReasoner not only delivers competitive predictive accuracy but also enhances transparency through its reasoning outputs, thereby addressing a key bottleneck in AI-assisted drug discovery. This study highlights the potential of reasoning-augmented LLMs as interpretable and effective tools for pharmaceutical decision-making.
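
The core idea behind GRPO, as it is usually described, is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative normalization step; the reward values, the use of the population standard deviation, and the `eps` constant are illustrative assumptions, and nothing here reflects DrugReasoner's actual reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses to one prompt: two correct, two incorrect.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```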

[335] History Rhymes: Accelerating LLM Reinforcement Learning with RhymeRL

Jingkai He, Tianjian Li, Erhu Feng, Dong Du, Qian Liu, Tao Liu, Yubin Xia, Haibo Chen

Main category: cs.LG

TL;DR: RhymeRL is a novel RL system for LLMs that addresses GPU underutilization by leveraging historical rollout similarity through speculative decoding and two-tier scheduling, achieving 2.6x performance improvement without accuracy loss.

Details

Motivation: Current RL systems for LLMs suffer from significant GPU underutilization due to rollout stage dominance and rollout length imbalances, with existing solutions compromising accuracy for efficiency.

Method: Introduces RhymeRL with two key innovations: HistoSpec (speculative decoding using historical rollout token similarity for accurate drafts) and HistoPipe (two-tier scheduling leveraging historical rollout distribution similarity for workload balancing).

Result: RhymeRL demonstrates scalability from dozens to thousands of GPUs and achieves 2.6x performance improvement over existing methods while maintaining accuracy.

Conclusion: The system successfully addresses GPU underutilization in LLM RL training by exploiting historical rollout similarity, providing significant performance gains without modifying the RL paradigm or compromising accuracy.

Abstract: With the rapid advancement of large language models (LLMs), reinforcement learning (RL) has emerged as a pivotal methodology for enhancing the reasoning capabilities of LLMs. Unlike traditional pre-training approaches, RL encompasses multiple stages: rollout, reward, and training, which necessitates collaboration among various worker types. However, current RL systems continue to grapple with substantial GPU underutilization, due to two primary factors: (1) The rollout stage dominates the overall RL process due to test-time scaling; (2) Imbalances in rollout lengths (within the same batch) result in GPU bubbles. While prior solutions like asynchronous execution and truncation offer partial relief, they may compromise training accuracy for efficiency. Our key insight stems from a previously overlooked observation: rollout responses exhibit remarkable similarity across adjacent training epochs. Based on the insight, we introduce RhymeRL, an LLM RL system designed to accelerate RL training with two key innovations. First, to enhance rollout generation, we present HistoSpec, a speculative decoding inference engine that utilizes the similarity of historical rollout token sequences to obtain accurate drafts. Second, to tackle rollout bubbles, we introduce HistoPipe, a two-tier scheduling strategy that leverages the similarity of historical rollout distributions to balance workload among rollout workers. We have evaluated RhymeRL within a real production environment, demonstrating scalability from dozens to thousands of GPUs. Experimental results demonstrate that RhymeRL achieves a 2.6x performance improvement over existing methods, without compromising accuracy or modifying the RL paradigm.
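
The intuition behind drafting from history can be sketched as follows: look up where the current context last appeared in a previous epoch's response to the same prompt, propose the tokens that followed it, and keep only the prefix the target model agrees with. The suffix-matching rule, the function names, and the token-by-token greedy verification loop are assumptions made for readability; a real engine would verify the whole draft in one batched forward pass, and HistoSpec's actual drafting scheme may differ.

```python
def draft_from_history(generated, history, k, match_len=2):
    """Propose up to k tokens that followed the latest occurrence in
    `history` of the last `match_len` generated tokens."""
    if len(generated) < match_len:
        return []
    suffix = generated[-match_len:]
    for start in range(len(history) - match_len, -1, -1):
        if history[start:start + match_len] == suffix:
            return history[start + match_len:start + match_len + k]
    return []


def verify(draft, target_next_token, generated):
    """Accept draft tokens while they match the target model's greedy
    choice; on the first mismatch, emit the target's token and stop."""
    out = list(generated)
    for tok in draft:
        expected = target_next_token(out)
        if tok != expected:
            out.append(expected)
            return out
        out.append(tok)
    out.append(target_next_token(out))  # one extra token when all accepted
    return out


history = ["a", "b", "c", "d", "e"]           # last epoch's rollout
target = ["a", "b", "c", "x", "y", "z"]       # what the current policy would emit
draft = draft_from_history(["a", "b"], history, k=3)
print(verify(draft, lambda ctx: target[len(ctx)], ["a", "b"]))
```

Because every accepted token equals the target model's own choice, the output is unchanged; only the number of sequential decoding steps drops when history and the current policy agree.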

[336] Linear Trading Position with Sparse Spectrum

Zhao-Rong Lai, Haisheng Yang

Main category: cs.LG

TL;DR: Proposes a sparse spectrum linear trading position method with fixed-point optimization algorithm to improve principal portfolio trading robustness and diversification.

Details

Motivation: Principal portfolio approaches in signal-based trading lack diversification and robustness across different market situations, limiting their ability to explore key features of prediction matrices.

Method: Developed a novel linear trading position with sparse spectrum to explore larger spectral regions of prediction matrix, and implemented Krasnosel’skiĭ-Mann fixed-point algorithm for optimization with descent property and linear convergence rate.

Result: The proposed method achieves good and robust performance across various market situations as demonstrated through extensive experiments.

Conclusion: The sparse spectrum trading position with fixed-point optimization provides improved diversification, robustness, and theoretical convergence guarantees for principal portfolio trading strategies.

Abstract: The principal portfolio approach is an emerging method in signal-based trading. However, these principal portfolios may not be diversified to explore the key features of the prediction matrix or robust to different situations. To address this problem, we propose a novel linear trading position with sparse spectrum that can explore a larger spectral region of the prediction matrix. We also develop a Krasnosel’skiĭ-Mann fixed-point algorithm to optimize this trading position, which possesses the descent property and achieves a linear convergence rate in the objective value. This is a new theoretical result for this type of algorithm. Extensive experiments show that the proposed method achieves good and robust performance in various situations.
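
For orientation, the generic Krasnosel’skiĭ-Mann iteration averages the current point with its image under a nonexpansive operator T: x_{k+1} = (1 - alpha) x_k + alpha T(x_k). The sketch below applies it to a scalar toy operator (cosine); the operator, step size, and stopping rule are illustrative assumptions and are unrelated to the trading-position operator optimized in the paper.

```python
import math


def km_iterate(T, x0, alpha=0.5, tol=1e-12, max_iter=10_000):
    """Krasnosel'skii-Mann iteration for a scalar operator T.

    Returns the final iterate and the number of iterations used.
    """
    x = x0
    for k in range(max_iter):
        x_next = (1 - alpha) * x + alpha * T(x)
        if abs(x_next - x) < tol:
            return x_next, k + 1
        x = x_next
    return x, max_iter


# cos is nonexpansive on the reals, so the averaged iteration converges
# to its unique fixed point.
x_star, iters = km_iterate(math.cos, x0=1.0)
print(x_star, iters)
```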

[337] Uncertainty Awareness on Unsupervised Domain Adaptation for Time Series Data

Weide Liu, Xiaoyang Zhong, Lu Wang, Jingwen Hou, Yuemei Luo, Jiebin Yan, Yuming Fang

Main category: cs.LG

TL;DR: Proposes multi-scale feature extraction with uncertainty estimation for unsupervised domain adaptation in time series data, achieving state-of-the-art performance and better calibration.

Details

Motivation: Address distribution shifts between training and testing datasets in time series data to improve generalization on unlabeled test data.

Method: Multi-scale mixed input architecture for feature extraction at different scales, combined with uncertainty awareness mechanism using evidential learning with Dirichlet prior for both prediction and uncertainty estimation.

Result: Achieves state-of-the-art performance across multiple benchmark datasets with significantly lower Expected Calibration Error (ECE), indicating better-calibrated prediction confidence.

Conclusion: The combined approach of mixed input architecture with uncertainty awareness mechanism is highly effective for unsupervised domain adaptation in time series data, improving both performance and robustness.

Abstract: Unsupervised domain adaptation methods seek to generalize effectively on unlabeled test data, especially when encountering the common challenge in time series data that distribution shifts occur between training and testing datasets. In this paper, we propose incorporating multi-scale feature extraction and uncertainty estimation to improve the model’s generalization and robustness across domains. Our approach begins with a multi-scale mixed input architecture that captures features at different scales, increasing training diversity and reducing feature discrepancies between the training and testing domains. Based on the mixed input architecture, we further introduce an uncertainty awareness mechanism based on evidential learning by imposing a Dirichlet prior on the labels to facilitate both target prediction and uncertainty estimation. The uncertainty awareness mechanism enhances domain adaptation by aligning features with the same labels across different domains, which leads to significant performance improvements in the target domain. Additionally, our uncertainty-aware model demonstrates a much lower Expected Calibration Error (ECE), indicating better-calibrated prediction confidence. Our experimental results show that this combined approach of mixed input architecture with the uncertainty awareness mechanism achieves state-of-the-art performance across multiple benchmark datasets, underscoring its effectiveness in unsupervised domain adaptation for time series data.
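
A common way to turn evidential outputs into a prediction plus an uncertainty score is sketched below: non-negative per-class evidence is shifted by one to give Dirichlet parameters, the expected class probabilities are the normalized parameters, and uncertainty is the number of classes divided by the total Dirichlet strength. This parameterization follows the standard evidential deep learning formulation and is an assumption here; the abstract does not spell out the exact form used in the paper.

```python
def dirichlet_prediction(evidence):
    """Map per-class evidence to (expected probabilities, uncertainty)."""
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    alpha = [e + 1.0 for e in evidence]       # Dirichlet parameters
    strength = sum(alpha)                     # total evidence plus class count
    probs = [a / strength for a in alpha]
    uncertainty = len(alpha) / strength       # 1.0 when there is no evidence
    return probs, uncertainty


print(dirichlet_prediction([8.0, 1.0, 0.0]))  # confident in class 0
print(dirichlet_prediction([0.0, 0.0, 0.0]))  # no evidence: uniform, fully uncertain
```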

[338] STRATA-TS: Selective Knowledge Transfer for Urban Time Series Forecasting with Retrieval-Guided Reasoning

Yue Jiang, Chenxi Liu, Yile Chen, Qin Chao, Shuai Liu, Gao Cong

Main category: cs.LG

TL;DR: STRATA-TS is a selective transfer learning framework that uses target-aware retrieval and LLM reasoning to improve time series forecasting in data-scarce cities by identifying and transferring only relevant patterns from data-rich cities.

Details

Motivation: Urban forecasting models suffer from data imbalance where few cities have dense records while many have incomplete histories. Direct transfer learning risks negative transfer and noise introduction when indiscriminately transferring patterns from data-rich to data-scarce cities.

Method: STRATA-TS combines domain-adapted retrieval with large language models: 1) patch-based temporal encoder identifies semantically aligned source subsequences, 2) retrieval-guided reasoning stage where LLM performs structured inference, 3) distillation into compact open model via supervised fine-tuning for efficient deployment.

Result: Extensive experiments on three parking availability datasets across Singapore, Nottingham, and Glasgow show STRATA-TS consistently outperforms strong forecasting and transfer baselines while providing interpretable knowledge transfer pathways.

Conclusion: STRATA-TS successfully addresses data imbalance in urban forecasting through selective transfer via target-aware retrieval and LLM reasoning, demonstrating superior performance over existing methods while maintaining interpretability and efficiency.

Abstract: Urban forecasting models often face a severe data imbalance problem: only a few cities have dense, long-span records, while many others expose short or incomplete histories. Direct transfer from data-rich to data-scarce cities is unreliable because only a limited subset of source patterns truly benefits the target domain, whereas indiscriminate transfer risks introducing noise and negative transfer. We present STRATA-TS (Selective TRAnsfer via TArget-aware retrieval for Time Series), a framework that combines domain-adapted retrieval with reasoning-capable large models to improve forecasting in scarce data regimes. STRATA-TS employs a patch-based temporal encoder to identify source subsequences that are semantically and dynamically aligned with the target query. These retrieved exemplars are then injected into a retrieval-guided reasoning stage, where an LLM performs structured inference over target inputs and retrieved support. To enable efficient deployment, we distill the reasoning process into a compact open model via supervised fine-tuning. Extensive experiments on three parking availability datasets across Singapore, Nottingham, and Glasgow demonstrate that STRATA-TS consistently outperforms strong forecasting and transfer baselines, while providing interpretable knowledge transfer pathways.
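
The retrieval step can be pictured as a nearest-neighbour search over source-city patches. The sketch below uses raw patch values and cosine similarity as a stand-in for the learned patch-based temporal encoder; the function names and the similarity measure are assumptions, and the retrieved patches would then be handed to the LLM reasoning stage, which is not shown.

```python
import math


def split_patches(series, size):
    """Cut a series into non-overlapping patches of the given size."""
    return [series[i:i + size] for i in range(0, len(series) - size + 1, size)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def retrieve_top_k(query, source_patches, k):
    """Return the k (similarity, patch) pairs most similar to the query."""
    scored = [(cosine(query, p), p) for p in source_patches]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


source = split_patches([2, 4, 6, 3, 2, 1, -1, -2, -3], size=3)
print(retrieve_top_k([1, 2, 3], source, k=2))
```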

[339] Biologically Disentangled Multi-Omic Modeling Reveals Mechanistic Insights into Pan-Cancer Immunotherapy Resistance

Ifrah Tariq, Ernest Fraenkel

Main category: cs.LG

TL;DR: BDVAE is a biologically structured deep learning model that integrates multi-omics data to predict immune checkpoint inhibitor response and uncover resistance mechanisms with high accuracy and interpretability.

Details

Motivation: Current machine learning models for predicting immune checkpoint inhibitor responses lack interpretability and fail to effectively utilize the biological structure of multi-omics data, limiting understanding of resistance mechanisms.

Method: Developed Biologically Disentangled Variational Autoencoder (BDVAE) with modality- and pathway-specific encoders using variational inference to learn biologically meaningful latent features from transcriptomic and genomic data.

Result: Achieved AUC-ROC of 0.94 on test data, identified key resistance mechanisms (immune suppression, metabolic shifts, neuronal signaling), and showed resistance exists on a continuous biological spectrum rather than binary states.

Conclusion: BDVAE demonstrates the value of biologically structured machine learning for generating interpretable, clinically relevant insights into complex resistance patterns and guiding precision immunotherapy strategies.

Abstract: Immune checkpoint inhibitors (ICIs) have transformed cancer treatment, yet patient responses remain highly variable, and the biological mechanisms underlying resistance are poorly understood. While machine learning models hold promise for predicting responses to ICIs, most existing methods lack interpretability and do not effectively leverage the biological structure inherent to multi-omics data. Here, we introduce the Biologically Disentangled Variational Autoencoder (BDVAE), a deep generative model that integrates transcriptomic and genomic data through modality- and pathway-specific encoders. Unlike existing rigid, pathway-informed models, BDVAE employs a modular encoder architecture combined with variational inference to learn biologically meaningful latent features associated with immune, genomic, and metabolic processes. Applied to a pan-cancer cohort of 366 patients across four cancer types treated with ICIs, BDVAE accurately predicts treatment response (AUC-ROC = 0.94 on unseen test data) and uncovers critical resistance mechanisms, including immune suppression, metabolic shifts, and neuronal signaling. Importantly, BDVAE reveals that resistance spans a continuous biological spectrum rather than strictly binary states, reflecting gradations of tumor dysfunction. Several latent features correlate with survival outcomes and known clinical subtypes, demonstrating BDVAE’s capability to generate interpretable, clinically relevant insights. These findings underscore the value of biologically structured machine learning in elucidating complex resistance patterns and guiding precision immunotherapy strategies.

[340] FFT-MoE: Efficient Federated Fine-Tuning for Foundation Models via Large-scale Sparse MoE under Heterogeneous Edge

Gang Hu, Yinglei Teng, Pengfei Wu, Nan Wang

Main category: cs.LG

TL;DR: FFT MoE replaces LoRA with sparse Mixture of Experts adapters for federated fine-tuning, addressing structural incompatibility and non-IID data challenges through personalized expert selection and heterogeneity-aware routing regularization.

Details

Motivation: Address limitations of LoRA-based Federated Fine-Tuning in heterogeneous FL environments, including structural incompatibility across clients with varying configurations and poor adaptability to non-IID data distributions.

Method: Proposes FFT MoE framework that uses sparse Mixture of Experts adapters instead of LoRA. Each client trains lightweight gating network to activate personalized expert subsets. Introduces heterogeneity-aware auxiliary loss to balance expert utilization and ensure diversity.

Result: Extensive experiments show FFT MoE consistently outperforms state-of-the-art FFT baselines in both IID and non-IID conditions for generalization performance and training efficiency.

Conclusion: FFT MoE effectively addresses structural and data heterogeneity challenges in federated fine-tuning, providing better convergence, generalization, and resource efficiency compared to existing approaches.

Abstract: As foundation models (FMs) drive progress toward Artificial General Intelligence (AGI), fine-tuning them under privacy and resource constraints has become increasingly critical, particularly when high-quality training data resides on distributed edge devices. Federated Learning (FL) offers a compelling solution through Federated Fine-Tuning (FFT), which enables collaborative model adaptation without sharing raw data. Recent approaches incorporate Parameter-Efficient Fine-Tuning (PEFT) techniques such as Low Rank Adaptation (LoRA) to reduce computational overhead. However, LoRA-based FFT faces two major limitations in heterogeneous FL environments: structural incompatibility across clients with varying LoRA configurations and limited adaptability to non-IID data distributions, which hinders convergence and generalization. To address these challenges, we propose FFT MoE, a novel FFT framework that replaces LoRA with sparse Mixture of Experts (MoE) adapters. Each client trains a lightweight gating network to selectively activate a personalized subset of experts, enabling fine-grained adaptation to local resource budgets while preserving aggregation compatibility. To further combat the expert load imbalance caused by device and data heterogeneity, we introduce a heterogeneity-aware auxiliary loss that dynamically regularizes the routing distribution to ensure expert diversity and balanced utilization. Extensive experiments spanning both IID and non-IID conditions demonstrate that FFT MoE consistently outperforms state-of-the-art FFT baselines in generalization performance and training efficiency.
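
To make the routing idea concrete, the sketch below implements plain top-k gating and a generic load-balancing penalty of the kind popularized by Switch Transformer (number of experts times the sum over experts of dispatch fraction times mean gate probability). This generic penalty is swapped in for illustration only: it is not the heterogeneity-aware auxiliary loss proposed in the paper, and all function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits, k):
    """Pick the k highest-probability experts and renormalize their weights."""
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def load_balance_loss(batch_logits, k):
    """Generic balancing penalty: close to 1.0 when routing is even,
    larger when tokens pile onto a few experts."""
    n_experts = len(batch_logits[0])
    n_tokens = len(batch_logits)
    load = [0.0] * n_experts        # fraction of dispatch slots per expert
    mean_prob = [0.0] * n_experts   # average gate probability per expert
    for logits in batch_logits:
        for i in top_k_route(logits, k):
            load[i] += 1.0 / (n_tokens * k)
        for i, p in enumerate(softmax(logits)):
            mean_prob[i] += p / n_tokens
    return n_experts * sum(f * p for f, p in zip(load, mean_prob))


print(top_k_route([2.0, 1.0, 0.0], k=2))
print(load_balance_loss([[10.0, -10.0], [-10.0, 10.0]], k=1))  # balanced
print(load_balance_loss([[10.0, -10.0], [10.0, -10.0]], k=1))  # collapsed
```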

[341] Auditing Approximate Machine Unlearning for Differentially Private Models

Yuechun Gu, Jiajie He, Keke Chen

Main category: cs.LG

TL;DR: Existing approximate machine unlearning methods may compromise retained data privacy in differentially private models, requiring new differentially private unlearning algorithms.

Details

Motivation: Current machine unlearning methods assume retained data remains unaffected, but the privacy onion effect suggests this assumption may be incorrect, especially for differentially private models.

Method: Proposed holistic auditing approach for both unlearned and retained samples using differential privacy and membership inference attacks criteria. Developed efficient A-LiRA MIA with data augmentation to reduce shadow model training costs.

Result: Experimental findings show existing approximate unlearning algorithms can inadvertently compromise privacy of retained samples in differentially private models.

Conclusion: Differentially private unlearning algorithms are needed to properly protect retained data privacy when applying machine unlearning techniques.

Abstract: Approximate machine unlearning aims to remove the effect of specific data from trained models to ensure individuals’ privacy. Existing methods focus on the removed records and assume the retained ones are unaffected. However, recent studies on the \emph{privacy onion effect} indicate this assumption might be incorrect. Especially when the model is differentially private, no study has explored whether the retained ones still meet the differential privacy (DP) criterion under existing machine unlearning methods. This paper takes a holistic approach to auditing both unlearned and retained samples’ privacy risks after applying approximate unlearning algorithms. We propose the privacy criteria for unlearned and retained samples, respectively, based on the perspectives of DP and membership inference attacks (MIAs). To make the auditing process more practical, we also develop an efficient MIA, A-LiRA, utilizing data augmentation to reduce the cost of shadow model training. Our experimental findings indicate that existing approximate machine unlearning algorithms may inadvertently compromise the privacy of retained samples for differentially private models, and we need differentially private unlearning algorithms. For reproducibility, we have published our code: https://anonymous.4open.science/r/Auditing-machine-unlearning-CB10/README.md

[342] Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota

Main category: cs.LG

TL;DR: MoE models show different scaling behaviors: memorization improves with total parameters while reasoning performance saturates or regresses despite increasing parameters and training loss improvements.

Details

Motivation: Current scaling laws don't account for MoE sparsity effects, and there's a need to understand how sparsity impacts different capability regimes (memorization vs reasoning) in large language models.

Method: Trained families of MoE Transformers varying total parameters, active parameters, and top-k routing while keeping compute fixed. Measured pre-training loss, downstream task loss, and accuracy to separate generalization gaps.

Result: Memorization benchmarks improve with total parameters, mirroring training loss. Reasoning performance saturates and can regress despite parameter increases. Hyperparameters affect generalization similarly to sparsity, and neither post-training reinforcement learning (GRPO) nor extra test-time compute fixes reasoning deficits in overly sparse models.

Conclusion: MoE sparsity has distinct effects on memorization vs reasoning capabilities, with reasoning showing saturation effects that can’t be rescued by standard techniques, highlighting the need for specialized scaling approaches for different cognitive capabilities.

Abstract: Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization and reasoning. We train families of MoE Transformers that systematically vary total parameters, active parameters, and top-$k$ routing while holding the compute budget fixed. For every model we record pre-training loss, downstream task loss, and task accuracy, allowing us to separate the train-test generalization gap from the loss-accuracy gap. Memorization benchmarks improve monotonically with total parameters, mirroring training loss. By contrast, reasoning performance saturates and can even regress despite continued gains in both total parameters and training loss. Altering top-$k$ alone has little effect when active parameters are constant, and classic hyperparameters such as learning rate and initialization modulate the generalization gap in the same direction as sparsity. Neither post-training reinforcement learning (GRPO) nor extra test-time compute rescues the reasoning deficit of overly sparse models. Our model checkpoints, code and logs are open-source at https://github.com/rioyokotalab/optimal-sparsity.
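
The distinction between total and active parameters can be made concrete with a little arithmetic. The sketch below treats a model as a block of shared (always-used) parameters plus a pool of equally sized experts, of which top-k are used per token; this single-pool simplification and the definition of sparsity as 1 - k/E are assumptions for illustration and may not match the paper's exact definitions.

```python
def moe_param_counts(shared, per_expert, n_experts, top_k):
    """Return (total parameters, active parameters per token, sparsity)."""
    if not 1 <= top_k <= n_experts:
        raise ValueError("top_k must be between 1 and n_experts")
    total = shared + n_experts * per_expert
    active = shared + top_k * per_expert
    sparsity = 1.0 - top_k / n_experts
    return total, active, sparsity


# Same active parameters per token, different totals (arbitrary units).
print(moe_param_counts(shared=100, per_expert=10, n_experts=8, top_k=2))
print(moe_param_counts(shared=100, per_expert=10, n_experts=16, top_k=2))
```

Doubling the expert count raises total parameters and sparsity while leaving per-token active parameters unchanged, which is the axis along which the abstract reports memorization and reasoning diverging.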

[343] Utilizing Training Data to Improve LLM Reasoning for Tabular Understanding

Chufan Gao, Jintai Chen, Jimeng Sun

Main category: cs.LG

TL;DR: LRTab is a novel prompting-based approach that combines training data learning with chain-of-thought reasoning by retrieving relevant prompt conditions from training data to improve tabular reasoning performance.

Details

Motivation: Existing methods either fine-tune LLMs (losing generalizability) or use training-free prompting (not fully utilizing training data). LRTab aims to integrate the benefits of both approaches.

Method: First obtain CoT responses over the training data. For incorrect CoTs, prompt the LLM to predict Prompt Conditions that would avoid the error, validate those conditions on validation data, and at inference time retrieve the most relevant conditions as additional context.

Result: Comprehensive experiments on WikiTQ and Tabfact show LRTab outperforms previous baselines in tabular reasoning while being interpretable and cost-efficient.

Conclusion: LRTab successfully integrates training data learning with prompting-based reasoning, demonstrating improved performance in tabular understanding tasks compared to existing methods.

Abstract: Automated tabular understanding and reasoning are essential tasks for data scientists. Recently, large language models (LLMs) have become increasingly prevalent in tabular reasoning tasks. Previous work focuses on (1) finetuning LLMs using labeled data or (2) training-free prompting of LLM agents using chain-of-thought (CoT). Finetuning offers dataset-specific learning at the cost of generalizability. Training-free prompting is highly generalizable but does not take full advantage of training data. In this paper, we propose a novel prompting-based reasoning approach, Learn then Retrieve: LRTab, which integrates the benefits of both by retrieving relevant information learned from training data. We first use prompting to obtain CoT responses over the training data. For incorrect CoTs, we prompt the LLM to predict Prompt Conditions to avoid the error, learning insights from the data. We validate the effectiveness of Prompt Conditions using validation data. Finally, at inference time, we retrieve the most relevant Prompt Conditions for additional context for table understanding. We provide comprehensive experiments on WikiTQ and Tabfact, showing that LRTab is interpretable, cost-efficient, and can outperform previous baselines in tabular reasoning.
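The retrieval step described above can be illustrated with a minimal sketch. It assumes each learned Prompt Condition is stored next to the training question that produced it and that relevance is scored by word-overlap (Jaccard) similarity; the abstract does not specify the retriever, so `jaccard`, `retrieve_conditions`, and the sample store are hypothetical stand-ins, not LRTab's implementation.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two strings (a stand-in for a real retriever)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def retrieve_conditions(query: str, store: list[tuple[str, str]], k: int = 2) -> list[str]:
    """Return the k conditions whose source questions are most similar to the query.

    `store` holds (training_question, prompt_condition) pairs; this layout is assumed.
    """
    ranked = sorted(store, key=lambda item: jaccard(query, item[0]), reverse=True)
    return [condition for _, condition in ranked[:k]]


# Hypothetical conditions learned from incorrect CoTs on training data.
store = [
    ("what is the total population", "Sum only numeric cells"),
    ("which year had the highest sales", "Compare values as numbers, not strings"),
    ("who won the race", "Match names exactly"),
]
top = retrieve_conditions("which year had the lowest sales", store, k=1)
```

The retrieved conditions would then be appended to the inference prompt as additional context.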

[344] End to End Autoencoder MLP Framework for Sepsis Prediction

Hejiang Cai, Di Wu, Ji Xu, Xiang Liu, Yiziting Zhu, Xin Shu, Yujie Li, Bin Yi

Main category: cs.LG

TL;DR: End-to-end deep learning framework for sepsis detection using autoencoder feature extraction and MLP classifier, outperforming traditional ML methods with 74.6-93.5% accuracy across ICU cohorts.

Details

Motivation: Traditional ML methods struggle with irregular, incomplete time-series EHR data and require manual feature engineering for sepsis detection in ICU settings.

Method: Unsupervised autoencoder for automatic feature extraction plus a multilayer perceptron classifier, with a customized downsampling strategy for training and a non-overlapping dynamic sliding window for real-time inference. Time-series data are represented as fixed-dimension vectors with missingness indicators.

Result: Achieved accuracies of 74.6%, 80.6%, and 93.5% across three ICU cohorts, consistently outperforming traditional ML baselines (Naive Bayes, SVM, Random Forest, XGBoost).

Conclusion: The framework demonstrates superior robustness, generalizability, and clinical utility for early sepsis detection across heterogeneous ICU environments.

Abstract: Sepsis is a life-threatening condition that requires timely detection in intensive care settings. Traditional machine learning approaches, including Naive Bayes, Support Vector Machine (SVM), Random Forest, and XGBoost, often rely on manual feature engineering and struggle with irregular, incomplete time-series data commonly present in electronic health records. We introduce an end-to-end deep learning framework integrating an unsupervised autoencoder for automatic feature extraction with a multilayer perceptron classifier for binary sepsis risk prediction. To enhance clinical applicability, we implement a customized downsampling strategy that extracts high-information-density segments during training and a non-overlapping dynamic sliding window mechanism for real-time inference. Preprocessed time-series data are represented as fixed-dimension vectors with explicit missingness indicators, mitigating bias and noise. We validate our approach on three ICU cohorts. Our end-to-end model achieves accuracies of 74.6 percent, 80.6 percent, and 93.5 percent, respectively, consistently outperforming traditional machine learning baselines. These results demonstrate the framework’s superior robustness, generalizability, and clinical utility for early sepsis detection across heterogeneous ICU environments.
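The input representation described above can be sketched as follows. This is a simplified illustration, not the paper's preprocessing: it assumes a fixed window width (the paper's window is dynamic), zero imputation for missing readings, and a 0/1 mask appended to each window as the missingness indicator. The function names are hypothetical.

```python
def non_overlapping_windows(series, width):
    """Split a series into consecutive, non-overlapping windows of equal width.

    Trailing values that do not fill a whole window are dropped.
    """
    return [series[i:i + width] for i in range(0, len(series) - width + 1, width)]


def encode_window(values):
    """Encode one window as a fixed-dimension vector: imputed values followed by a mask.

    Missing readings (None) are imputed with 0.0 and flagged with 1.0 in the mask,
    so the classifier can tell a true zero from an absent measurement.
    """
    filled = [0.0 if v is None else float(v) for v in values]
    mask = [1.0 if v is None else 0.0 for v in values]
    return filled + mask


# Hypothetical hourly heart-rate readings with gaps.
heart_rate = [88, None, 93, 97, 101]
vectors = [encode_window(w) for w in non_overlapping_windows(heart_rate, 2)]
```

Each vector has length `2 * width` regardless of how many readings are missing, which is what allows a fixed-input autoencoder and MLP to consume irregular records.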

[345] Natural Image Classification via Quasi-Cyclic Graph Ensembles and Random-Bond Ising Models at the Nishimori Temperature

V. S. Usatyuk, D. A. Sapoznikov, S. I. Egorov

Main category: cs.LG

TL;DR: A physics-inspired framework combining statistical physics, coding theory, and algebraic topology for efficient multi-class image classification using compressed feature embeddings.

Details

Motivation: To develop highly efficient image classification by leveraging insights from statistical physics and topology to compress high-dimensional features while maintaining performance.

Method: Interpret feature vectors as spins on a sparse Multi-Edge Type quasi-cyclic LDPC graph forming a Random-Bond Ising Model, operate at Nishimori temperature for maximum separability, and use topology-guided graph design to suppress harmful trapping sets.

Result: Achieved 98.7% accuracy on ImageNet-10 and 82.7% on ImageNet-100 with 40x parameter compression (1280D to 32/64D features), plus 6x speed-up in temperature estimation.

Conclusion: Topology-guided graph design enables highly efficient physics-inspired embeddings with state-of-the-art performance despite massive compression.

Abstract: We present a unified framework combining statistical physics, coding theory, and algebraic topology for efficient multi-class image classification. High-dimensional feature vectors from a frozen MobileNetV2 backbone are interpreted as spins on a sparse Multi-Edge Type quasi-cyclic LDPC (MET-QC-LDPC) graph, forming a Random-Bond Ising Model (RBIM). We operate this RBIM at its Nishimori temperature, $\beta_N$, where the smallest eigenvalue of the Bethe-Hessian matrix vanishes, maximizing class separability. Our theoretical contribution establishes a correspondence between local trapping sets in the code’s graph and topological invariants (Betti numbers, bordism classes) of the feature manifold. A practical algorithm estimates $\beta_N$ efficiently with a quadratic interpolant and Newton correction, achieving a six-fold speed-up over bisection. Guided by topology, we design spherical and toroidal MET-QC-LDPC graph ensembles, using permanent bounds to suppress harmful trapping sets. This compresses 1280-dimensional features to 32 or 64 dimensions for ImageNet-10 and -100 subsets. Despite massive compression (40x fewer parameters), we achieve 98.7% accuracy on ImageNet-10 and 82.7% on ImageNet-100, demonstrating that topology-guided graph design yields highly efficient, physics-inspired embeddings with state-of-the-art performance.

[346] Beyond Tokens: Enhancing RTL Quality Estimation via Structural Graph Learning

Yi Liu, Hongji Zhang, Yiwen Wang, Dimitris Tsaras, Lei Chen, Mingxuan Yuan, Qiang Xu

Main category: cs.LG

TL;DR: StructRTL is a novel structure-aware graph self-supervised learning framework that uses control data flow graphs (CDFGs) to improve RTL design quality estimation, outperforming previous methods by incorporating structural semantics and knowledge distillation from post-mapping netlists.

Details

Motivation: Existing LLM-based approaches for RTL quality estimation overlook structural semantics, while CDFGs provide richer structural characteristics that are essential for accurate quality metrics like area and delay estimation without time-consuming logic synthesis.

Method: Proposes a structure-aware graph self-supervised learning framework that learns structure-informed representations from CDFGs, combined with a knowledge distillation strategy that transfers insights from post-mapping netlists into the CDFG predictor.

Result: Significantly outperforms prior methods on various RTL design quality estimation tasks and establishes new state-of-the-art results.

Conclusion: The combination of structural learning through CDFGs with cross-stage supervision via knowledge distillation is highly effective for RTL design quality estimation, demonstrating the importance of incorporating structural semantics in electronic design automation workflows.

Abstract: Estimating the quality of register transfer level (RTL) designs is crucial in the electronic design automation (EDA) workflow, as it enables instant feedback on key metrics like area and delay without the need for time-consuming logic synthesis. While recent approaches have leveraged large language models (LLMs) to derive embeddings from RTL code and achieved promising results, they overlook the structural semantics essential for accurate quality estimation. In contrast, the control data flow graph (CDFG) view exposes the design’s structural characteristics more explicitly, offering richer cues for representation learning. In this work, we introduce a novel structure-aware graph self-supervised learning framework, StructRTL, for improved RTL design quality estimation. By learning structure-informed representations from CDFGs, our method significantly outperforms prior art on various quality estimation tasks. To further boost performance, we incorporate a knowledge distillation strategy that transfers low-level insights from post-mapping netlists into the CDFG predictor. Experiments show that our approach establishes new state-of-the-art results, demonstrating the effectiveness of combining structural learning with cross-stage supervision.

[347] FLAegis: A Two-Layer Defense Framework for Federated Learning Against Poisoning Attacks

Enrique Mármol Campos, Aurora González Vidal, José Luis Hernández Ramos, Antonio Skarmeta

Main category: cs.LG

TL;DR: FLAegis - a two-stage defensive framework for FL that detects Byzantine clients using symbolic time series transformation and spectral clustering, with FFT-based aggregation for robustness against poisoning attacks.

Details

Motivation: Federated Learning's decentralized nature makes it vulnerable to Byzantine clients that can poison training through false model updates, requiring robust defense mechanisms.

Method: Two-layer framework. The detection layer applies symbolic time series transformation (SAX) to amplify differences between benign and malicious models, then uses spectral clustering to identify adversarial clients. A robust FFT-based aggregation function serves as the final layer, mitigating the impact of Byzantine clients that evade detection.

Result: Outperforms state-of-the-art defenses in detection precision and final model accuracy across five poisoning attacks (including label flipping and adaptive optimization-based strategies), maintaining high performance under strong adversarial conditions.

Conclusion: FLAegis provides an effective defense framework that enhances FL robustness against Byzantine attacks through innovative detection techniques and aggregation methods.

Abstract: Federated Learning (FL) has become a powerful technique for training Machine Learning (ML) models in a decentralized manner, preserving the privacy of the training datasets involved. However, the decentralized nature of FL limits the visibility of the training process, relying heavily on the honesty of participating clients. This assumption opens the door to malicious third parties, known as Byzantine clients, which can poison the training process by submitting false model updates. Such malicious clients may engage in poisoning attacks, manipulating either the dataset or the model parameters to induce misclassification. In response, this study introduces FLAegis, a two-stage defensive framework designed to identify Byzantine clients and improve the robustness of FL systems. Our approach leverages symbolic time series transformation (SAX) to amplify the differences between benign and malicious models, and spectral clustering, which enables accurate detection of adversarial behavior. Furthermore, we incorporate a robust FFT-based aggregation function as a final layer to mitigate the impact of those Byzantine clients that manage to evade prior defenses. We rigorously evaluate our method against five poisoning attacks, ranging from simple label flipping to adaptive optimization-based strategies. Notably, our approach outperforms state-of-the-art defenses in both detection precision and final model accuracy, maintaining consistently high performance even under strong adversarial conditions.
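For readers unfamiliar with SAX, a minimal sketch of the standard transformation follows: z-normalize the series, average it over equal-length segments (piecewise aggregate approximation), then map each segment mean to a symbol using equiprobable breakpoints of the standard normal distribution. How FLAegis turns model updates into series and which alphabet or segment sizes it uses are not stated in the abstract; treating a flattened update as the series and the defaults below are assumptions.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word of length n_segments."""
    if n_segments <= 0 or len(series) < n_segments:
        raise ValueError("need at least one value per segment")
    mu, sigma = mean(series), pstdev(series)
    # z-normalize; a constant series maps to all zeros
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]
    # piecewise aggregate approximation (trailing remainder is dropped)
    seg = len(z) // n_segments
    paa = [mean(z[i * seg:(i + 1) * seg]) for i in range(n_segments)]
    # breakpoints split the standard normal into equiprobable regions
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(j / k) for j in range(1, k)]
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)


word_step = sax([1, 1, 1, 1, 10, 10, 10, 10], 2)  # low half, then high half
word_ramp = sax([0, 1, 2, 3, 4, 5, 6, 7], 4)      # steadily increasing
```

Because the symbols discretize the normalized shape of the series, two updates with different magnitude profiles yield different words, which is the kind of difference a downstream clustering step can pick up.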

[348] Stability and Generalization for Bellman Residuals

Enoch H. Kang, Kyoungseok Jang

Main category: cs.LG

TL;DR: This paper provides statistical analysis of Bellman residual minimization (BRM) in offline reinforcement learning, achieving O(1/n) excess risk bound without additional regularization or restrictive assumptions.

Details

Motivation: Current offline RL and inverse RL methods struggle with Bellman consistency, and while BRM with stochastic gradient descent-ascent shows promise, its statistical behavior in offline settings remains poorly understood.

Method: The analysis introduces a single Lyapunov potential that couples SGDA runs on neighboring datasets, yielding O(1/n) on-average argument-stability bound for convex-concave saddle problems.

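Result: Establishes an O(1/n) on-average argument-stability bound for convex-concave saddle problems, doubling the best known sample-complexity exponent. The same stability constant yields an O(1/n) excess risk bound for BRM without variance reduction, extra regularization, or restrictive independence assumptions on minibatch sampling.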

Conclusion: The paper closes the statistical gap in BRM analysis, providing strong theoretical guarantees for standard neural-network parameterizations and minibatch SGD in offline reinforcement learning settings.

Abstract: Offline reinforcement learning and offline inverse reinforcement learning aim to recover near-optimal value functions or reward models from a fixed batch of logged trajectories, yet current practice still struggles to enforce Bellman consistency. Bellman residual minimization (BRM) has emerged as an attractive remedy, as a globally convergent stochastic gradient descent-ascent based method for BRM has been recently discovered. However, its statistical behavior in the offline setting remains largely unexplored. In this paper, we close this statistical gap. Our analysis introduces a single Lyapunov potential that couples SGDA runs on neighbouring datasets and yields an O(1/n) on-average argument-stability bound, doubling the best known sample-complexity exponent for convex-concave saddle problems. The same stability constant translates into the O(1/n) excess risk bound for BRM, without variance reduction, extra regularization, or restrictive independence assumptions on minibatch sampling. The results hold for standard neural-network parameterizations and minibatch SGD.

Jiajun Li, Ran Hou, Yu Ding, Yixuan Li, Shisi Guan, Jiahui Duan, Xiongwei Han, Tao Zhong, Vincent Chau, Weiwei Wu, Wanyuan Wang

Main category: cs.LG

TL;DR: Novel constraint-based model reduction approach for MILP problems that transforms inequality constraints to equalities, achieving 50% better solution quality and 17.47% faster computation.

Details

Motivation: Existing model reduction methods focus on variable reduction, while constraint reduction has been largely ignored despite its potential to reduce MILP complexity from a dual perspective.

Method: Proposes multi-modal representation technique using instance-level and abstract-level MILP formulations to identify critical tight-constraints, with heuristic selection of critical constraints labeled from optimal solutions.

Result: Improves solution quality by over 50% and reduces computation time by 17.47% compared to state-of-the-art methods.

Conclusion: Constraint-based model reduction is effective for accelerating MILP solving, demonstrating significant improvements in both solution quality and computational efficiency.

Abstract: Model reduction, which aims to learn a simpler model of the original mixed integer linear program (MILP), can solve large-scale MILP problems much faster. Most existing model reduction methods are based on variable reduction, which predicts a solution value for a subset of variables. From a dual perspective, constraint reduction that transforms a subset of inequality constraints into equalities can also reduce the complexity of MILP, but has been largely ignored. Therefore, this paper proposes a novel constraint-based model reduction approach for MILP. Constraint-based MILP reduction has two challenges: 1) which inequality constraints are critical such that reducing them can accelerate MILP solving while preserving feasibility, and 2) how to predict these critical constraints efficiently. To identify critical constraints, we first label the tight constraints at the optimal solution as potential critical constraints and design a heuristic rule to select a subset of critical tight constraints. To learn the critical tight constraints, we propose a multi-modal representation technique that leverages information from both instance-level and abstract-level MILP formulations. The experimental results show that, compared to the state-of-the-art methods, our method improves the quality of the solution by over 50% and reduces the computation time by 17.47%.
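The labeling step above rests on a simple notion: an inequality constraint a·x ≤ b is tight at a solution x when it holds with equality. A minimal sketch of that check, assuming constraints are given as a dense row list `A`, right-hand sides `b`, and a known optimal solution `x` (the function name and tolerance are illustrative, and the paper's heuristic for choosing a subset of tight constraints is not reproduced here):

```python
def tight_constraints(A, b, x, tol=1e-9):
    """Return indices i where row i of A x <= b holds with equality at x."""
    tight = []
    for i, (row, rhs) in enumerate(zip(A, b)):
        lhs = sum(a * xi for a, xi in zip(row, x))
        if abs(rhs - lhs) <= tol:
            tight.append(i)
    return tight


# Toy instance: x0 + x1 <= 4, x0 <= 3, x1 <= 5, with candidate optimum (3, 1).
A = [[1, 1], [1, 0], [0, 1]]
b = [4, 3, 5]
x_opt = [3, 1]
labels = tight_constraints(A, b, x_opt)
```

Here the first two constraints are tight and the third has slack, so only the first two are candidates for conversion into equalities.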

[350] UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning

Zihao Huang, Yu Bao, Qiyang Min, Siyan Chen, Ran Guo, Hongzhi Huang, Defa Zhu, Yutao Zeng, Banggu Wu, Xun Zhou, Siyuan Qiao

Main category: cs.LG

TL;DR: UltraMemV2 achieves performance parity with 8-expert MoE models while significantly reducing memory access costs, with particular improvements on memory-intensive tasks.

Details

Motivation: Mixture of Experts (MoE) models suffer from high memory access costs during inference, and previous memory-layer architectures like UltraMem only matched 2-expert MoE performance, falling short of state-of-the-art 8-expert configurations.

Method: Five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios.

Result: UltraMemV2 achieves performance parity with 8-expert MoE models under the same computation and parameters but with significantly lower memory access. Shows superior performance on memory-intensive tasks: +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. Validated at scale with models up to 2.5B activated parameters from 120B total parameters.

Conclusion: UltraMemV2 brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation, with activation density having greater impact on performance than total sparse parameter count.

Abstract: While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with very few memory accesses, but previous attempts like UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves performance parity with 8-expert MoE models under the same computation and parameters but with significantly lower memory access. Notably, UltraMemV2 shows superior performance on memory-intensive tasks, with improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. We validate our approach at scale with models up to 2.5B activated parameters from 120B total parameters, and establish that activation density has greater impact on performance than total sparse parameter count. Our work brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation.

[351] Governance-as-a-Service: A Multi-Agent Framework for AI System Compliance and Policy Enforcement

Helen Pervez, Suyash Gaurav, Jukka Heikkonen, Jatin Chaudhary

Main category: cs.LG

TL;DR: GaaS is a modular governance system that enforces policies on AI agents at runtime without requiring agent cooperation, using declarative rules and trust scoring to block high-risk behaviors while maintaining system throughput.

Details

Motivation: As AI systems evolve into distributed ecosystems with autonomous execution and multi-agent coordination, existing oversight mechanisms are reactive, brittle, and embedded within agent architectures, making them non-auditable and hard to generalize.

Method: Governance-as-a-Service (GaaS) employs declarative rules and a Trust Factor mechanism that scores agents based on compliance and severity-weighted violations. It enables coercive, normative, and adaptive interventions with graduated enforcement and dynamic trust modulation.

Result: Evaluation across three simulation regimes with open-source models (LLaMA3, Qwen3, DeepSeek-R1) shows GaaS reliably blocks or redirects high-risk behaviors while preserving throughput. Trust scores effectively track rule adherence and isolate untrustworthy components.

Conclusion: GaaS establishes infrastructure-level alignment for interoperable agent ecosystems by positioning governance as a runtime service, enforcing ethics rather than teaching them to agents.

Abstract: As AI systems evolve into distributed ecosystems with autonomous execution, asynchronous reasoning, and multi-agent coordination, the absence of scalable, decoupled governance poses a structural risk. Existing oversight mechanisms are reactive, brittle, and embedded within agent architectures, making them non-auditable and hard to generalize across heterogeneous deployments. We introduce Governance-as-a-Service (GaaS): a modular, policy-driven enforcement layer that regulates agent outputs at runtime without altering model internals or requiring agent cooperation. GaaS employs declarative rules and a Trust Factor mechanism that scores agents based on compliance and severity-weighted violations. It enables coercive, normative, and adaptive interventions, supporting graduated enforcement and dynamic trust modulation. To evaluate GaaS, we conduct three simulation regimes with open-source models (LLaMA3, Qwen3, DeepSeek-R1) across content generation and financial decision-making. In the baseline, agents act without governance; in the second, GaaS enforces policies; in the third, adversarial agents probe robustness. All actions are intercepted, evaluated, and logged for analysis. Results show that GaaS reliably blocks or redirects high-risk behaviors while preserving throughput. Trust scores track rule adherence, isolating and penalizing untrustworthy components in multi-agent systems. By positioning governance as a runtime service akin to compute or storage, GaaS establishes infrastructure-level alignment for interoperable agent ecosystems. It does not teach agents ethics; it enforces them.
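To make the idea of a trust score driving graduated enforcement concrete, here is a minimal sketch. The abstract does not give the Trust Factor formula or the enforcement thresholds, so everything below is a hypothetical illustration: the score is one minus the mean severity-weighted violation per intercepted action, and the `warn_below` and `block_below` cut-offs are invented for the example.

```python
def trust_factor(n_actions, violation_severities):
    """Hypothetical trust score in [0, 1]; 1.0 means no recorded violations.

    n_actions: total intercepted actions for the agent.
    violation_severities: one severity weight in (0, 1] per violation.
    """
    if n_actions <= 0:
        return 1.0
    penalty = sum(violation_severities) / n_actions
    return max(0.0, 1.0 - penalty)


def intervention(trust, warn_below=0.8, block_below=0.5):
    """Map a trust score to a graduated response (thresholds are assumptions)."""
    if trust < block_below:
        return "block"
    if trust < warn_below:
        return "warn"
    return "allow"


mostly_compliant = trust_factor(10, [0.5, 1.0])    # two violations in ten actions
repeat_offender = trust_factor(4, [1.0, 1.0, 0.5])  # three violations in four actions
```

Because the score is computed from logged actions alone, a layer like this can sit outside the agent and needs no access to model internals, which matches the decoupling the abstract emphasizes.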

[352] Predicting Drug-Drug Interactions Using Heterogeneous Graph Neural Networks: HGNN-DDI

Hongbo Liu, Siyi Li, Zheng Yu

Main category: cs.LG

TL;DR: HGNN-DDI is a heterogeneous graph neural network model that predicts drug-drug interactions by integrating multiple drug-related data sources, outperforming existing methods in accuracy and robustness.

Details

Motivation: Drug-drug interactions are a major clinical concern causing reduced efficacy or adverse effects, and traditional computational approaches struggle to capture complex relationships among drugs, targets, and biological entities.

Method: HGNN-DDI uses heterogeneous graph neural networks with graph representation learning to model biomedical networks, enabling effective information propagation across diverse node and edge types.

Result: Experimental results on benchmark DDI datasets show HGNN-DDI outperforms state-of-the-art baselines in prediction accuracy and robustness.

Conclusion: HGNN-DDI demonstrates potential to support safer drug development and precision medicine through improved DDI prediction capabilities.

Abstract: Drug-drug interactions (DDIs) are a major concern in clinical practice, as they can lead to reduced therapeutic efficacy or severe adverse effects. Traditional computational approaches often struggle to capture the complex relationships among drugs, targets, and biological entities. In this work, we propose HGNN-DDI, a heterogeneous graph neural network model designed to predict potential DDIs by integrating multiple drug-related data sources. HGNN-DDI leverages graph representation learning to model heterogeneous biomedical networks, enabling effective information propagation across diverse node and edge types. Experimental results on benchmark DDI datasets demonstrate that HGNN-DDI outperforms state-of-the-art baselines in prediction accuracy and robustness, highlighting its potential to support safer drug development and precision medicine.

[353] Federated Learning with Heterogeneous and Private Label Sets

Adam Breitholtz, Edvin Listo Zec, Fredrik D. Johansson

Main category: cs.LG

TL;DR: This paper investigates federated learning with heterogeneous client label sets, comparing public vs private label settings and proposing adaptations of standard FL methods that maintain performance while increasing privacy.

Details

Motivation: Heterogeneous client label sets are common in real-world FL applications but rarely studied, especially in private label settings where clients don't share label information with each other, only with the central server.

Method: Applied classical classifier combination methods to FL with centralized tuning, adapted common FL methods for private label set setting, and conducted experiments comparing public vs private label scenarios.

Result: Reducing available labels per client substantially harms performance. Centralized tuning helps but increases variance. Proposed FL adaptations perform similarly in private setting as standard methods in public setting.

Conclusion: Clients can gain privacy by keeping their label sets private at little cost to model accuracy, as the adapted FL methods perform comparably to standard methods in the public label setting.

Abstract: Although common in real-world applications, heterogeneous client label sets are rarely investigated in federated learning (FL). Furthermore, in the cases they are, clients are assumed to be willing to share their entire label sets with other clients. Federated learning with private label sets, shared only with the central server, adds further constraints on learning algorithms and is, in general, a more difficult problem to solve. In this work, we study the effects of label set heterogeneity on model performance, comparing the public and private label settings – when the union of label sets in the federation is known to clients and when it is not. We apply classical methods for the classifier combination problem to FL using centralized tuning, adapt common FL methods to the private label set setting, and discuss the justification of both approaches under practical assumptions. Our experiments show that reducing the number of labels available to each client harms the performance of all methods substantially. Centralized tuning of client models for representational alignment can help remedy this, but often at the cost of higher variance. Throughout, our proposed adaptations of standard FL methods perform well, showing similar performance in the private label setting as the standard methods achieve in the public setting. This shows that clients can enjoy increased privacy at little cost to model accuracy.

[354] SWiFT: Soft-Mask Weight Fine-tuning for Bias Mitigation

Junyu Yan, Feng Chen, Yuyang Xue, Yuning Du, Konstantinos Vilouras, Sotirios A. Tsaftaris, Steven McDonagh

Main category: cs.LG

TL;DR: SWiFT is a debiasing framework that improves fairness while preserving model performance with minimal data and training requirements, outperforming state-of-the-art methods on medical imaging tasks.

Details

Motivation: Machine learning models exhibit bias in healthcare applications, risking unfairness and social discrimination. Existing debiasing methods require extensive retraining and show trade-offs between fairness and performance.

Method: Soft-Mask Weight Fine-Tuning (SWiFT) identifies parameter contributions to bias vs performance, then uses two-step fine-tuning with different gradient flows based on parameter importance.

Result: SWiFT consistently reduces bias across gender, skin tone, and age attributes in dermatological and chest X-ray datasets, achieving competitive or superior accuracy while improving generalization on OOD datasets.

Conclusion: SWiFT provides an efficient debiasing solution that requires minimal external data and few training epochs, effectively addressing bias without sacrificing model performance in medical applications.

Abstract: Recent studies have shown that Machine Learning (ML) models can exhibit bias in real-world scenarios, posing significant challenges in ethically sensitive domains such as healthcare. Such bias can negatively affect model fairness and generalization ability, and risks amplifying social discrimination. There is a need to remove biases from trained models. Existing debiasing approaches often necessitate access to original training data and need extensive model retraining; they also typically exhibit trade-offs between model fairness and discriminative performance. To address these challenges, we propose Soft-Mask Weight Fine-Tuning (SWiFT), a debiasing framework that efficiently improves fairness while preserving discriminative performance at much lower debiasing cost. Notably, SWiFT requires only a small external dataset and only a few epochs of model fine-tuning. The idea behind SWiFT is to first find the relative, and yet distinct, contributions of model parameters to both bias and predictive performance. Then, a two-step fine-tuning process updates each parameter with different gradient flows defined by its contribution. Extensive experiments with three bias-sensitive attributes (gender, skin tone, and age) across four dermatological and two chest X-ray datasets demonstrate that SWiFT can consistently reduce model bias while achieving competitive or even superior diagnostic accuracy under common fairness and accuracy metrics, compared to the state-of-the-art. Specifically, we demonstrate improved model generalization ability as evidenced by superior performance on several out-of-distribution (OOD) datasets.

[355] DRMD: Deep Reinforcement Learning for Malware Detection under Concept Drift

Shae McFadden, Myles Foley, Mario D’Onghia, Chris Hicks, Vasilios Mavroudis, Nicola Paoletti, Fabio Pierazzi

Main category: cs.LG

TL;DR: DRL-based malware detection agent (DRMD) that jointly optimizes classification and rejection for manual labeling, achieving better concept drift resilience than traditional methods in Android malware detection.

Details

Motivation: Traditional malware classifiers struggle with concept drift and lack mechanisms to optimize when to defer decisions to manual labeling. Real-world malware detection needs to handle evolving threats with limited labeling budgets and uncertain predictions.

Method: Formulated malware detection as a one-step Markov Decision Process and trained a deep reinforcement learning agent to simultaneously optimize classification performance and reject high-risk samples for manual labeling.

Result: The DRMD agent achieved average AUT improvements of 5.18±5.44 (classification only), 14.49±12.86 (classification with rejection), and 10.06±10.81 (classification with rejection and active learning) over standard classification approaches. The abstract does not state the units of these figures.

Conclusion: DRL can effectively facilitate malware detection and improve resilience to concept drift in dynamic Android malware environments, demonstrating superior performance over traditional classification methods.

Abstract: Malware detection in real-world settings must deal with evolving threats, limited labeling budgets, and uncertain predictions. Traditional classifiers, without additional mechanisms, struggle to maintain performance under concept drift in malware domains, as their supervised learning formulation cannot optimize when to defer decisions to manual labeling and adaptation. Modern malware detection pipelines combine classifiers with monthly active learning (AL) and rejection mechanisms to mitigate the impact of concept drift. In this work, we develop a novel formulation of malware detection as a one-step Markov Decision Process and train a deep reinforcement learning (DRL) agent, simultaneously optimizing sample classification performance and rejecting high-risk samples for manual labeling. We evaluated the joint detection and drift mitigation policy learned by the DRL-based Malware Detection (DRMD) agent through time-aware evaluations on Android malware datasets subject to realistic drift requiring multi-year performance stability. The policies learned under these conditions achieve a higher Area Under Time (AUT) performance compared to standard classification approaches used in the domain, showing improved resilience to concept drift. Specifically, the DRMD agent achieved a $5.18\pm5.44$, $14.49\pm12.86$, and $10.06\pm10.81$ average AUT performance improvement for the classification only, classification with rejection, and classification with rejection and AL settings, respectively. Our results demonstrate for the first time that DRL can facilitate effective malware detection and improved resiliency to concept drift in the dynamic environment of the Android malware domain.
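
The abstract reports results as Area Under Time (AUT) without defining it. As a rough sketch, assuming the trapezoidal definition commonly used in time-aware malware evaluation, AUT averages a per-period metric (for example monthly F1) over the test timeline. The function name `area_under_time` and the sample scores are hypothetical and are not taken from the paper.

```python
def area_under_time(scores):
    """Trapezoidal average of a per-period metric (e.g. monthly F1).

    Assumption: scores[k] is the metric on the k-th test period, in
    chronological order. The result stays within the range of the inputs.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two test periods")
    # Sum the areas of the trapezoids between consecutive periods.
    total = sum((scores[k] + scores[k + 1]) / 2.0 for k in range(n - 1))
    # Normalise by the number of intervals so a perfect detector scores 1.0.
    return total / (n - 1)


if __name__ == "__main__":
    monthly_f1 = [0.9, 0.8, 0.6, 0.5]  # made-up scores that decay under drift
    print(area_under_time(monthly_f1))
```

A detector whose score decays quickly after training gets a lower AUT than one that holds steady, which is why the metric is used as a summary of drift resilience.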

[356] Recycling History: Efficient Recommendations from Contextual Dueling Bandits

Suryanarayana Sankagiri, Jalal Etesami, Pouria Fatemi, Matthias Grossglauser

Main category: cs.LG

TL;DR: A new bandit model where users compare recommended items with past consumed items, enabling better feedback without additional regret cost, achieving O(√T) regret.

Details

Motivation: Traditional contextual duelling bandits only capture implicit choices but don't utilize post-consumption comparisons, which provide more reliable feedback from users after they've actually experienced items.

Method: Proposes an algorithm that recommends one item per time step, then asks users to compare it with an item from their consumption history. Uses initial random exploration to build diverse history, then leverages matrix concentration bounds for analysis.

Result: Achieves O(√T) regret guarantee. Simulations show significantly lower regret compared to methods that only compare simultaneously recommended items, demonstrating the benefit of reusing past items for comparisons.

Conclusion: Leveraging user consumption history for post-consumption comparisons provides more reliable feedback and leads to better performance than traditional approaches, with strong theoretical guarantees.

Abstract: The contextual duelling bandit problem models adaptive recommender systems, where the algorithm presents a set of items to the user, and the user’s choice reveals their preference. This setup is well suited for implicit choices users make when navigating a content platform, but does not capture other possible comparison queries. Motivated by the fact that users provide more reliable feedback after consuming items, we propose a new bandit model that can be described as follows. The algorithm recommends one item per time step; after consuming that item, the user is asked to compare it with another item chosen from the user’s consumption history. Importantly, in our model, this comparison item can be chosen without incurring any additional regret, potentially leading to better performance. However, the regret analysis is challenging because of the temporal dependency in the user’s history. To overcome this challenge, we first show that the algorithm can construct informative queries provided the history is rich, i.e., satisfies a certain diversity condition. We then show that a short initial random exploration phase is sufficient for the algorithm to accumulate a rich history with high probability. This result, proven via matrix concentration bounds, yields $O(\sqrt{T})$ regret guarantees. Additionally, our simulations show that reusing past items for comparisons can lead to significantly lower regret than only comparing between simultaneously recommended items.
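
The comparison query described above can be sketched as follows. The abstract does not state the feedback model, so this assumes a linear utility with a logistic link, a common choice in dueling-bandit work; `theta`, the feature vectors, and both function names are hypothetical.

```python
import math
import random


def preference_probability(theta, x_new, x_past):
    """P(user prefers the new item over a past item).

    Assumption: utility is linear in the item features and the choice
    follows a logistic (Bradley-Terry style) link on the utility gap.
    """
    gap = sum(t * (a - b) for t, a, b in zip(theta, x_new, x_past))
    return 1.0 / (1.0 + math.exp(-gap))


def query_comparison(theta, x_new, x_past, rng):
    """Simulate one post-consumption comparison; True if the new item wins."""
    return rng.random() < preference_probability(theta, x_new, x_past)


if __name__ == "__main__":
    rng = random.Random(0)
    theta = [1.0, -0.5]                    # hidden user preference (made up)
    history = [[0.2, 0.3], [0.9, 0.1]]     # features of previously consumed items
    recommended = [0.6, 0.4]               # the item recommended this round
    # The comparison item comes from the history, so choosing it costs no regret.
    past = rng.choice(history)
    print(query_comparison(theta, recommended, past, rng))
```

In this reading, only the recommended item contributes to regret; the history item is free to choose for informativeness.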

[357] C-Flat++: Towards a More Efficient and Powerful Framework for Continual Learning

Wei Li, Hangjie Yuan, Zixiang Zhao, Yifan Zhu, Aojun Lu, Tao Feng, Yanan Sun

Main category: cs.LG

TL;DR: C-Flat is a plug-and-play continual learning method that promotes flatter loss landscapes to improve stability and performance across various CL settings, with an efficient variant C-Flat++ that reduces computational costs.

Details

Motivation: Sharpness-aware minimization, as adopted in continual learning, relies on zeroth-order sharpness, which alone may favor sharper minima over flatter ones in certain settings, leading to less robust and potentially suboptimal solutions.

Method: Proposes C-Flat, a method that promotes flatter loss landscapes tailored for continual learning, with plug-and-play compatibility. Also introduces C-Flat++ framework with selective flatness-driven promotion to reduce update costs.

Result: C-Flat consistently improves performance across a wide range of continual learning settings, methods, datasets, and scenarios. C-Flat++ significantly reduces the update cost while maintaining effectiveness.

Conclusion: The proposed C-Flat and C-Flat++ approaches demonstrate effectiveness and efficiency in continual learning by promoting flatter loss landscapes, offering easy integration and improved performance across diverse CL paradigms.

Abstract: Balancing sensitivity to new tasks and stability for retaining past knowledge is crucial in continual learning (CL). Recently, sharpness-aware minimization has proven effective in transfer learning and has also been adopted in CL to improve memory retention and learning efficiency. However, relying on zeroth-order sharpness alone may favor sharper minima over flatter ones in certain settings, leading to less robust and potentially suboptimal solutions. In this paper, we propose Continual Flatness (C-Flat), a method that promotes flatter loss landscapes tailored for CL. C-Flat offers plug-and-play compatibility, enabling easy integration with minimal modifications to the code pipeline. In addition, we present a general framework that integrates C-Flat into all major CL paradigms and conduct comprehensive comparisons with loss-minima optimizers and flat-minima-based CL methods. Our results show that C-Flat consistently improves performance across a wide range of settings. We also introduce C-Flat++, an efficient yet effective framework that leverages selective flatness-driven promotion, significantly reducing the update cost required by C-Flat. Extensive experiments across multiple CL methods, datasets, and scenarios demonstrate the effectiveness and efficiency of our proposed approaches. Code is available at https://github.com/WanNaa/C-Flat.
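
For context, the zeroth-order sharpness mentioned in the abstract is the quantity that plain sharpness-aware minimization (SAM) works with: the loss increase at a nearby perturbed point. The sketch below shows one generic SAM step on a toy quadratic. It is not the C-Flat or C-Flat++ update, and the loss, learning rate, and radius `rho` are illustrative assumptions.

```python
import math


def loss(w):
    """Toy quadratic with one sharp direction (w[0]) and one flat one (w[1])."""
    return 0.5 * (4.0 * w[0] ** 2 + 0.25 * w[1] ** 2)


def grad(w):
    """Analytic gradient of the toy loss."""
    return [4.0 * w[0], 0.25 * w[1]]


def sam_step(w, lr=0.1, rho=0.05):
    """One generic SAM update; returns (new weights, zeroth-order sharpness)."""
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0  # avoid division by zero
    # Move to the approximate worst-case point inside a ball of radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Zeroth-order sharpness: how much the loss rises at the perturbed point.
    sharpness = loss(w_adv) - loss(w)
    # Descend using the gradient taken at the perturbed point.
    g_adv = grad(w_adv)
    w_next = [wi - lr * gi for wi, gi in zip(w, g_adv)]
    return w_next, sharpness


if __name__ == "__main__":
    w = [1.0, 1.0]
    for _ in range(50):
        w, sharpness = sam_step(w)
    print(w, sharpness)
```

The returned sharpness only measures a loss difference at one perturbed point, which is the limitation the abstract points to when it argues for a flatness criterion tailored to CL.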

[358] MOCHA: Discovering Multi-Order Dynamic Causality in Temporal Point Processes

Yunyang Cao, Juekai Lin, Wenhao Li, Bo Jin

Main category: cs.LG

TL;DR: MOCHA is a novel framework for discovering multi-order dynamic causality in temporal point processes that models time-varying causal structures as multi-hop paths on a latent evolving graph, achieving state-of-the-art event prediction while providing interpretable causal insights.

Details

Motivation: Existing methods for causal discovery in temporal point processes rely on static or first-order causal structures, which overlook the multi-order and time-varying nature of real-world causal relationships in event sequences.

Method: MOCHA introduces a time-varying directed acyclic graph (DAG) with learnable structural weights, enforcing acyclicity and sparsity constraints. It uses an end-to-end differentiable framework that jointly models causal discovery and TPP dynamics through multi-hop causal paths on a latent time-evolving graph.

Result: Extensive experiments on real-world datasets show that MOCHA achieves state-of-the-art performance in event prediction while also revealing meaningful and interpretable causal structures that capture dynamic multi-order dependencies.

Conclusion: MOCHA successfully addresses the limitations of existing methods by modeling multi-order dynamic causality in temporal point processes, providing both accurate event prediction and interpretable causal insights through its novel time-varying DAG framework.

Abstract: Discovering complex causal dependencies in temporal point processes (TPPs) is critical for modeling real-world event sequences. Existing methods typically rely on static or first-order causal structures, overlooking the multi-order and time-varying nature of causal relationships. In this paper, we propose MOCHA, a novel framework for discovering multi-order dynamic causality in TPPs. MOCHA characterizes multi-order influences as multi-hop causal paths over a latent time-evolving graph. To model such dynamics, we introduce a time-varying directed acyclic graph (DAG) with learnable structural weights, where acyclicity and sparsity constraints are enforced to ensure structural validity. We design an end-to-end differentiable framework that jointly models causal discovery and TPP dynamics, enabling accurate event prediction and revealing interpretable structures. Extensive experiments on real-world datasets demonstrate that MOCHA not only achieves state-of-the-art performance in event prediction, but also reveals meaningful and interpretable causal structures.
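
The abstract does not say which acyclicity constraint MOCHA uses. As a sketch under that caveat, the snippet below uses a trace-of-powers penalty from the continuous DAG-learning literature (it is zero exactly when the weighted graph has no directed cycle) and reads multi-hop influence off matrix powers. The function names and the example weights are hypothetical.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def acyclicity_penalty(weights):
    """Sum over k=1..d of trace((W*W)^k) / k!, with W*W taken elementwise.

    trace(A^k) counts weighted closed walks of length k, so the sum is
    zero if and only if the graph has no directed cycle.
    """
    d = len(weights)
    squared = [[w * w for w in row] for row in weights]
    power, penalty, factorial = identity(d), 0.0, 1.0
    for k in range(1, d + 1):
        power = matmul(power, squared)
        factorial *= k
        penalty += sum(power[i][i] for i in range(d)) / factorial
    return penalty


def k_hop_influence(weights, k):
    """Entry [i][j] sums the products of |weights| along k-hop paths i -> j."""
    d = len(weights)
    magnitude = [[abs(w) for w in row] for row in weights]
    power = identity(d)
    for _ in range(k):
        power = matmul(power, magnitude)
    return power


if __name__ == "__main__":
    chain = [[0.0, 0.5, 0.0],   # event type 0 -> 1 -> 2, no cycle
             [0.0, 0.0, 0.4],
             [0.0, 0.0, 0.0]]
    print(acyclicity_penalty(chain))
    print(k_hop_influence(chain, 2)[0][2])
```

In a time-varying setting, one such weight matrix would exist per time step or window, with the penalty applied to each.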

[359] HAEPO: History-Aggregated Exploratory Policy Optimization

Gaurish Trivedi, Alakh Sharma, Kartikey Singh Bhandari, Dhruv Kumar, Pratik Narang, Jagat Sesh Challa

Main category: cs.LG

TL;DR: HAEPO is a new policy optimization method that uses cumulative logarithmic likelihoods and Plackett-Luce softmax to enable better exploration in long-horizon tasks compared to DPO and GRPO.

Details

Motivation: Existing methods like DPO and GRPO often limit exploration on long-horizon tasks by either using full sequence log-likelihoods or aggregating per-token ratios, which restricts thorough exploration.

Method: HAEPO compresses trajectories into cumulative logarithmic likelihoods, applies Plackett-Luce softmax across trajectories for normalized weights proportional to returns, and adds entropy regularization and soft KL penalty for stability.

Result: HAEPO converges fast, explores thoroughly, aligns closely with true rewards, and performs better than or on par with PPO, GRPO, and DPO across diverse tasks.

Conclusion: HAEPO provides a stable and interpretable framework that explicitly leverages full-trajectory history while effectively balancing exploration and stability.

Abstract: Exploration is essential in modern learning, from reinforcement learning environments with small neural policies to large language models (LLMs). Existing work, such as DPO, leverages full sequence log-likelihoods to capture an entire trajectory of the model’s decisions, while methods like GRPO aggregate per-token ratios into a trajectory-level update. However, both often limit exploration on long-horizon tasks. We introduce History-Aggregated Exploratory Policy Optimization (HAEPO), a history-aware exploratory loss to combat these shortcomings. HAEPO compresses each trajectory into the sum of its logarithmic probabilities (a cumulative logarithmic likelihood), and applies a Plackett-Luce softmax across trajectories to obtain normalized weights proportional to their returns, thus encouraging broader exploration. We add entropy regularization to stabilize the aggressive updates and prevent premature collapse, and a soft KL penalty relative to a frozen copy of the previous (reference) policy. Empirically, HAEPO converges fast, explores thoroughly, aligns closely with true rewards, and demonstrates robust learning behavior better than or on par with PPO, GRPO, and DPO across diverse tasks. Thus, HAEPO provides a stable and interpretable framework by explicitly leveraging full-trajectory history while balancing exploration and stability.
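
The abstract leaves the exact loss unspecified, so the following is only one possible reading of the description: each trajectory is reduced to its cumulative log-likelihood, a softmax across trajectories gives normalized weights, and the objective rewards weight on high-return trajectories plus an entropy bonus. The soft KL penalty is omitted, and `haepo_style_loss` and `entropy_coef` are hypothetical names.

```python
import math


def cumulative_log_likelihood(token_probs):
    """Compress one trajectory into the sum of its log-probabilities."""
    return sum(math.log(p) for p in token_probs)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def haepo_style_loss(trajectories, returns, entropy_coef=0.01):
    """Return (loss, weights) for a batch of trajectories.

    trajectories: one list of per-step action probabilities per trajectory.
    returns: one scalar return per trajectory.
    Assumption: loss = -(sum_i w_i * R_i + entropy_coef * H(w)).
    """
    scores = [cumulative_log_likelihood(t) for t in trajectories]
    weights = softmax(scores)
    expected_return = sum(w * r for w, r in zip(weights, returns))
    entropy = -sum(w * math.log(w) for w in weights if w > 0.0)
    return -(expected_return + entropy_coef * entropy), weights


if __name__ == "__main__":
    batch = [[0.9, 0.8, 0.7], [0.5, 0.5, 0.5], [0.2, 0.9, 0.4]]  # made-up
    loss, weights = haepo_style_loss(batch, returns=[1.0, 0.2, 0.6])
    print(loss, weights)
```

Because the weights are normalized across the batch, raising the likelihood of one trajectory lowers the weight of the others, which is where the entropy term matters for avoiding collapse onto a single trajectory.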

[360] pyFAST: A Modular PyTorch Framework for Time Series Modeling with Multi-source and Sparse Data

Zhijin Wang, Senzhen Wu, Yue Hu, Xiufeng Liu

Main category: cs.LG

TL;DR: pyFAST is a PyTorch-based time series framework that decouples data processing from model computation, supports complex data scenarios including irregular/sparse data, and provides modular deep learning architectures for flexible research experimentation.

Details

Motivation: Existing Python time series libraries lack modularity and native support for irregular, multi-source, or sparse data, creating limitations for modern time series research needs.

Method: Develops a framework with decoupled data processing engine supporting multi-source loading, protein sequences, padding, dynamic normalization, and mask-based modeling. Integrates LLM-inspired architectures for sparse data fusion and provides specialized metrics, losses, and training utilities.

Result: Created pyFAST - a comprehensive PyTorch framework with classical and deep learning models (Linears, CNNs, RNNs, Transformers, GNNs) in a modular architecture that supports complex time series scenarios.

Conclusion: pyFAST provides a compact yet powerful MIT-licensed platform that facilitates rapid experimentation and advances time series research through its flexible, efficient, and extensible design.

Abstract: Modern time series analysis demands frameworks that are flexible, efficient, and extensible. However, many existing Python libraries exhibit limitations in modularity and in their native support for irregular, multi-source, or sparse data. We introduce pyFAST, a research-oriented PyTorch framework that explicitly decouples data processing from model computation, fostering a cleaner separation of concerns and facilitating rapid experimentation. Its data engine is engineered for complex scenarios, supporting multi-source loading, protein sequence handling, efficient sequence- and patch-level padding, dynamic normalization, and mask-based modeling for both imputation and forecasting. pyFAST integrates LLM-inspired architectures for the alignment-free fusion of sparse data sources and offers native sparse metrics, specialized loss functions, and flexible exogenous data fusion. Training utilities include batch-based streaming aggregation for evaluation and device synergy to maximize computational efficiency. A comprehensive suite of classical and deep learning models (Linears, CNNs, RNNs, Transformers, and GNNs) is provided within a modular architecture that encourages extension. Released under the MIT license at GitHub, pyFAST provides a compact yet powerful platform for advancing time series research and applications.

[361] Distance-informed Neural Processes

Aishwarya Venkataramanan, Joachim Denzler

Main category: cs.LG

TL;DR: DNP improves Neural Processes by combining global and distance-aware local latent structures with bi-Lipschitz regularization for better uncertainty estimation and calibration.

Details

Motivation: Standard Neural Processes struggle with uncertainty calibration and capturing local data dependencies, needing improved modeling of both global task variations and local input relationships.

Method: Introduces global latent variable for task-level variations and local latent variable with distance-preserving latent space using bi-Lipschitz regularization to bound input relationship distortions.

Result: DNP achieves strong predictive performance and improved uncertainty calibration across regression and classification tasks, better distinguishing in-distribution from out-of-distribution data.

Conclusion: The proposed Distance-informed Neural Process successfully addresses limitations of standard NPs by combining global and local latent structures with distance preservation, leading to superior uncertainty estimation.

Abstract: We propose the Distance-informed Neural Process (DNP), a novel variant of Neural Processes that improves uncertainty estimation by combining global and distance-aware local latent structures. Standard Neural Processes (NPs) often rely on a global latent variable and struggle with uncertainty calibration and capturing local data dependencies. DNP addresses these limitations by introducing a global latent variable to model task-level variations and a local latent variable to capture input similarity within a distance-preserving latent space. This is achieved through bi-Lipschitz regularization, which bounds distortions in input relationships and encourages the preservation of relative distances in the latent space. This modeling approach allows DNP to produce better-calibrated uncertainty estimates and more effectively distinguish in- from out-of-distribution data. Empirical results demonstrate that DNP achieves strong predictive performance and improved uncertainty calibration across regression and classification tasks.
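
A map f is bi-Lipschitz with constants 0 < L1 <= L2 when L1·‖x − x'‖ <= ‖f(x) − f(x')‖ <= L2·‖x − x'‖ for all input pairs. The sketch below checks this empirically on a set of points and turns violations into a penalty. It illustrates the property only; the bounds `lower` and `upper` and the hinge form are assumptions, not the regularizer used in DNP.

```python
import math
from itertools import combinations


def distance_ratios(inputs, latents):
    """Latent distance divided by input distance for every distinct pair."""
    ratios = []
    for i, j in combinations(range(len(inputs)), 2):
        d_in = math.dist(inputs[i], inputs[j])
        if d_in > 0.0:  # skip duplicate inputs
            ratios.append(math.dist(latents[i], latents[j]) / d_in)
    return ratios


def bilipschitz_penalty(inputs, latents, lower=0.5, upper=2.0):
    """Hinge penalty on pairs whose distortion leaves [lower, upper]."""
    return sum(max(0.0, lower - r) + max(0.0, r - upper)
               for r in distance_ratios(inputs, latents))


if __name__ == "__main__":
    xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
    stretched = [(1.5 * a, 1.5 * b) for a, b in xs]   # distances preserved up to 1.5x
    collapsed = [(0.0, 0.0)] * len(xs)                # all points mapped together
    print(bilipschitz_penalty(xs, stretched))
    print(bilipschitz_penalty(xs, collapsed))
```

The lower bound is what prevents distinct inputs, including out-of-distribution ones, from collapsing onto the same latent point.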

[362] Enhancing Model Privacy in Federated Learning with Random Masking and Quantization

Zhibo Xu, Jianhao Zhu, Jingwen Xu, Changze Lv, Zisu Huang, Xiaohua Wang, Muling Wu, Qi Qian, Xiaoqing Zheng, Xuanjing Huang

Main category: cs.LG

TL;DR: Proposes a federated learning approach (random masking and quantization, per the title) that maintains strong model performance while protecting model parameters better than baseline methods.

Details

Motivation: To address the need for both performance maintenance and improved security in federated learning environments, where model parameters need protection against potential attacks.

Method: Proposes a federated learning approach that protects model parameters while maintaining performance. The title points to random masking and quantization, but the available abstract text does not describe the technique.

Result: Experimental results across various models and tasks show the approach successfully maintains strong model performance while achieving enhanced protection of model parameters compared to baseline methods.

Conclusion: The proposed method effectively balances performance and security in federated learning, offering superior parameter protection without compromising model effectiveness.

Abstract: Experimental results across various models and tasks demonstrate that our approach not only maintains strong model performance in federated learning settings but also achieves enhanced protection of model parameters compared to baseline methods.

[363] Generalization Bound for a General Class of Neural Ordinary Differential Equations

Madhusudan Verma, Manoj Kumar

Main category: cs.LG

TL;DR: First generalization bounds for neural ODEs with general nonlinear dynamics, covering both time-dependent and time-independent cases under Lipschitz continuity.

Details

Motivation: Previous work only analyzed linear dynamics or bounds dependent on sampling intervals, leaving a gap for understanding generalization in neural ODEs with general nonlinear dynamics.

Method: Analyzes neural ODEs whose nonlinear dynamics function is Lipschitz continuous in the state, shows that their solutions have bounded variation, and establishes generalization bounds that account for overparameterization and domain constraints.

Result: Derived generalization bounds for neural ODEs with general nonlinear dynamics, demonstrating how overparameterization and domain constraints affect these bounds.

Conclusion: This work provides the first generalization error bounds for neural ODEs with general nonlinear dynamics, advancing theoretical understanding of continuous-depth models’ generalization capabilities.

Abstract: Neural ordinary differential equations (neural ODEs) are a popular type of deep learning model that operates with continuous-depth architectures. To assess how well such models perform on unseen data, it is crucial to understand their generalization error bounds. Previous research primarily focused on the linear case for the dynamics function in neural ODEs (Marion, 2023), or provided bounds for Neural Controlled ODEs that depend on the sampling interval (Bleistein et al., 2023). In this work, we analyze a broader class of neural ODEs where the dynamics function is a general nonlinear function, either time dependent or time independent, and is Lipschitz continuous with respect to the state variables. We show that, under this Lipschitz condition, the solutions to neural ODEs have bounded variation. Based on this observation, we establish generalization bounds for both time-dependent and time-independent cases and investigate how overparameterization and domain constraints influence these bounds. To our knowledge, this is the first derivation of generalization bounds for neural ODEs with general nonlinear dynamics.
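
The bounded-variation observation can be illustrated numerically. The sketch below integrates a toy scalar ODE with forward Euler and measures the total variation of the path. The dynamics function is an assumption chosen to be Lipschitz in the state and bounded by 1.5 in magnitude, so the path's variation cannot exceed 1.5 times the horizon. This is an illustration, not the paper's argument.

```python
import math


def dynamics(t, x, a=-2.0):
    """Toy time-dependent dynamics: |a|-Lipschitz in x and |f| <= 1.5."""
    return math.tanh(a * x) + 0.5 * math.sin(t)


def euler_path(x0, horizon=2.0, steps=2000):
    """Forward-Euler solution of dx/dt = dynamics(t, x) on [0, horizon]."""
    h = horizon / steps
    path, x = [x0], x0
    for k in range(steps):
        x = x + h * dynamics(k * h, x)
        path.append(x)
    return path


def total_variation(path):
    """Sum of absolute increments along the discretised path."""
    return sum(abs(b - a) for a, b in zip(path, path[1:]))


if __name__ == "__main__":
    path = euler_path(x0=1.0)
    # Each increment is h * |f|, so the total is at most 1.5 * horizon.
    print(total_variation(path))
```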

[364] HierCVAE: Hierarchical Attention-Driven Conditional Variational Autoencoders for Multi-Scale Temporal Modeling

Yao Wu

Main category: cs.LG

TL;DR: HierCVAE integrates hierarchical attention with conditional variational autoencoders for temporal modeling, achieving 15-40% accuracy improvement with better uncertainty calibration.

Details

Motivation: Temporal modeling in complex systems requires capturing multi-scale dependencies while managing uncertainties, which existing methods struggle with.

Method: Three-tier attention structure (local, global, cross-temporal) with multi-modal condition encoding, ResFormer blocks in latent space, and explicit uncertainty quantification via prediction heads.

Result: 15-40% improvement in prediction accuracy and superior uncertainty calibration on energy consumption datasets, excelling in long-term forecasting and complex multi-variate dependencies.

Conclusion: HierCVAE effectively addresses temporal modeling challenges by combining hierarchical attention with variational autoencoders, demonstrating significant performance gains and robust uncertainty handling.

Abstract: Temporal modeling in complex systems requires capturing dependencies across multiple time scales while managing inherent uncertainties. We propose HierCVAE, a novel architecture that integrates hierarchical attention mechanisms with conditional variational autoencoders to address these challenges. HierCVAE employs a three-tier attention structure (local, global, cross-temporal) combined with multi-modal condition encoding to capture temporal, statistical, and trend information. The approach incorporates ResFormer blocks in the latent space and provides explicit uncertainty quantification via prediction heads. Through evaluations on energy consumption datasets, HierCVAE demonstrates a 15-40% improvement in prediction accuracy and superior uncertainty calibration compared to state-of-the-art methods, excelling in long-term forecasting and complex multi-variate dependencies.

[365] Energy-Based Flow Matching for Generating 3D Molecular Structure

Wenyin Zhou, Christopher Iliffe Sprague, Vsevolod Viliuga, Matteo Tadiello, Arne Elofsson, Hossein Azizpour

Main category: cs.LG

TL;DR: Energy-based flow matching for molecular structure generation that iteratively maps random configurations to target structures, outperforming diffusion models and other flow matching baselines on protein docking and backbone generation tasks.

Details

Motivation: Molecular structure generation is crucial for biological applications like molecular docking and protein folding. Recent generative models (diffusion/flow matching) treat molecular conformations as distributions, but there's room for improvement in training and inference efficiency.

Method: Adopts energy-based perspective for flow matching, learning a deep network that iteratively maps random source configurations to target structures. The approach is theoretically justified with connections to idempotency, stability, and AlphaFold’s structure refinement techniques.

Result: Outperforms recent baselines of task-associated flow matching and diffusion models on protein docking and protein backbone generation tasks using similar computational budgets.

Conclusion: The energy-based flow matching approach provides a conceptually simple yet empirically effective framework for molecular structure generation, demonstrating superior performance over existing methods while maintaining computational efficiency.

Abstract: Molecular structure generation is a fundamental problem that involves determining the 3D positions of molecules’ constituents. It has crucial biological applications, such as molecular docking, protein folding, and molecular design. Recent advances in generative modeling, such as diffusion models and flow matching, have made great progress on these tasks by modeling molecular conformations as a distribution. In this work, we focus on flow matching and adopt an energy-based perspective to improve training and inference of structure generation models. Our view results in a mapping function, represented by a deep network, that is directly learned to iteratively map random configurations, i.e. samples from the source distribution, to target structures, i.e. points in the data manifold. This yields a conceptually simple and empirically effective flow matching setup that is theoretically justified and has interesting connections to fundamental properties such as idempotency and stability, as well as to empirically useful techniques such as structure refinement in AlphaFold. Experiments on protein docking as well as protein backbone generation consistently demonstrate the method’s effectiveness, where it outperforms recent baselines of task-associated flow matching and diffusion models, using a similar computational budget.

[366] Estimating Conditional Covariance between labels for Multilabel Data

Laurence A. F. Park, Jesse Read

Main category: cs.LG

TL;DR: Compares three models (Multivariate Probit, Multivariate Bernoulli, Staged Logit) for estimating conditional label covariance in multilabel data. All perform similarly, but all falsely detect dependent covariance when the covariance is actually constant; Multivariate Probit has the lowest error rate.

Details

Motivation: Multilabel data analysis requires understanding label dependence, but current methods like multivariate Probit may not reliably estimate constant vs dependent covariance due to their copula covariance estimation approach.

Method: Compared three statistical models (Multivariate Probit, Multivariate Bernoulli, and Staged Logit) through experiments to observe their measurement of conditional label covariance in multilabel data.

Result: All three models measured constant and dependent covariance equally well depending on covariance strength, but all falsely detected dependent covariance when constant covariance was actually present. Multivariate Probit had the lowest error rate among the three.

Conclusion: While all models can measure conditional covariance, they tend to misidentify constant covariance as dependent covariance. Multivariate Probit performs best but still has limitations in distinguishing between constant and dependent covariance types.

Abstract: Multilabel data should be analysed for label dependence before applying multilabel models. Independence between multilabel data labels cannot be measured directly from the label values due to their dependence on the set of covariates $\vec{x}$, but can be measured by examining the conditional label covariance using a multivariate Probit model. Unfortunately, the multivariate Probit model provides an estimate of its copula covariance, and so might not be reliable in estimating constant covariance and dependent covariance. In this article, we compare three models (Multivariate Probit, Multivariate Bernoulli and Staged Logit) for estimating the constant and dependent multilabel conditional label covariance. We provide an experiment that allows us to observe each model’s measurement of conditional covariance. We found that all models measure constant and dependent covariance equally well, depending on the strength of the covariance, but the models all falsely detect that dependent covariance is present for data where constant covariance is present. Of the three models, the Multivariate Probit model had the lowest error rate.
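
For two binary labels, the conditional covariance given x is P(Y1=1, Y2=1 | x) − P(Y1=1 | x)·P(Y2=1 | x). The sketch below estimates it by simple counting for a discrete covariate, which makes the distinction between constant covariance (the same value for every x) and dependent covariance (a value that changes with x) concrete. It is a plug-in estimate on made-up data, not any of the three models compared in the paper.

```python
from collections import defaultdict


def conditional_label_covariance(samples):
    """Empirical Cov(Y1, Y2 | x) for each value of a discrete covariate.

    samples: iterable of (x, y1, y2) with y1, y2 in {0, 1}.
    Returns a dict mapping x to p12 - p1 * p2 computed within that group.
    """
    groups = defaultdict(list)
    for x, y1, y2 in samples:
        groups[x].append((y1, y2))
    result = {}
    for x, pairs in groups.items():
        n = len(pairs)
        p1 = sum(y1 for y1, _ in pairs) / n          # P(Y1=1 | x)
        p2 = sum(y2 for _, y2 in pairs) / n          # P(Y2=1 | x)
        p12 = sum(y1 * y2 for y1, y2 in pairs) / n   # P(Y1=1, Y2=1 | x)
        result[x] = p12 - p1 * p2
    return result


if __name__ == "__main__":
    data = [
        (0, 1, 1), (0, 0, 0), (0, 1, 1), (0, 0, 0),  # labels move together at x=0
        (1, 1, 0), (1, 0, 1), (1, 1, 1), (1, 0, 0),  # labels unrelated at x=1
    ]
    print(conditional_label_covariance(data))
```

With a continuous covariate there are no groups to count within, which is why a parametric model of the conditional covariance is needed in practice.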

[367] On the Generalisation of Koopman Representations for Chaotic System Control

Kyriakos Hjikakou, Juan Diego Cardenas Cartagena, Matthia Sabatelli

Main category: cs.LG

TL;DR: Koopman embeddings outperform PCA baselines on the chaotic Lorenz system and transfer from prediction to control: freezing the pre-trained transformer during fine-tuning causes no performance degradation.

Details

Motivation: To investigate the generalizability of Koopman-based representations for chaotic dynamical systems and their transferability across prediction and control tasks.

Method: Three-stage methodology: 1) learning Koopman embeddings through autoencoding, 2) pre-training transformer on next-state prediction, 3) fine-tuning for safety-critical control using Lorenz system as testbed.

Result: Koopman embeddings outperform both standard and physics-informed PCA baselines, achieving accurate and data-efficient performance. Freezing the pre-trained transformer weights during fine-tuning causes no performance degradation.

Conclusion: Koopman embeddings capture reusable dynamical structure rather than task-specific patterns, supporting their use as foundation for multi-task learning in physics-informed machine learning.

Abstract: This paper investigates the generalisability of Koopman-based representations for chaotic dynamical systems, focusing on their transferability across prediction and control tasks. Using the Lorenz system as a testbed, we propose a three-stage methodology: learning Koopman embeddings through autoencoding, pre-training a transformer on next-state prediction, and fine-tuning for safety-critical control. Our results show that Koopman embeddings outperform both standard and physics-informed PCA baselines, achieving accurate and data-efficient performance. Notably, fixing the pre-trained transformer weights during fine-tuning leads to no performance degradation, indicating that the learned representations capture reusable dynamical structure rather than task-specific patterns. These findings support the use of Koopman embeddings as a foundation for multi-task learning in physics-informed machine learning. A project page is available at https://kikisprdx.github.io/.
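
The testbed itself is easy to reproduce. The sketch below generates Lorenz trajectories with a fourth-order Runge-Kutta step, the kind of data an autoencoder or next-state predictor could be trained on. The classic parameters (sigma=10, rho=28, beta=8/3) and the step size are assumptions, since the abstract does not state the paper's settings.

```python
def lorenz_rhs(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations for state (x, y, z)."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(state, dt=0.01):
    """Advance the state by one classical Runge-Kutta (RK4) step."""
    def shifted(base, slope, scale):
        return tuple(b + scale * s for b, s in zip(base, slope))

    k1 = lorenz_rhs(state)
    k2 = lorenz_rhs(shifted(state, k1, dt / 2.0))
    k3 = lorenz_rhs(shifted(state, k2, dt / 2.0))
    k4 = lorenz_rhs(shifted(state, k3, dt))
    return tuple(s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))


def simulate(initial, steps, dt=0.01):
    """Return a list of steps + 1 states, starting from the initial state."""
    trajectory = [tuple(initial)]
    for _ in range(steps):
        trajectory.append(rk4_step(trajectory[-1], dt))
    return trajectory


if __name__ == "__main__":
    trajectory = simulate((1.0, 1.0, 1.0), steps=1000)
    # Consecutive pairs (s_t, s_{t+1}) form a next-state prediction dataset.
    pairs = list(zip(trajectory[:-1], trajectory[1:]))
    print(len(pairs), trajectory[-1])
```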

[368] PAX-TS: Model-agnostic multi-granular explanations for time series forecasting via localized perturbations

Tim Kreuzer, Jelena Zdravkovic, Panagiotis Papapetrou

Main category: cs.LG

TL;DR: PAX-TS is a model-agnostic post-hoc explanation method for time series forecasting models that uses localized input perturbations to generate multi-granular explanations and capture cross-channel correlations in multivariate forecasts.

Details

Motivation: Modern time series forecasting models (transformers, LLMs) are opaque and lack explainability, while existing methods like LIME are unsuitable for forecasting contexts, creating a need for specialized explanation techniques.

Method: PAX-TS uses localized input perturbations to create explanations, providing multi-granular insights and characterizing cross-channel correlations for multivariate time series. The method is model-agnostic and works post-hoc.

Result: The method was tested on 7 algorithms and 10 datasets, showing that explanations differ between high/low-performing models, effectively capturing model behavior. 6 distinct pattern classes were identified that correlate with forecasting performance differences.

Conclusion: PAX-TS successfully provides detailed explanations for time series forecasts, reveals model behavior patterns, and demonstrates practical utility for understanding cross-channel correlations and answering forecast-related questions.

Abstract: Time series forecasting has seen considerable improvement during the last years, with transformer models and large language models driving advancements of the state of the art. Modern forecasting models are generally opaque and do not provide explanations for their forecasts, while well-known post-hoc explainability methods like LIME are not suitable for the forecasting context. We propose PAX-TS, a model-agnostic post-hoc algorithm to explain time series forecasting models and their forecasts. Our method is based on localized input perturbations and results in multi-granular explanations. Further, it is able to characterize cross-channel correlations for multivariate time series forecasts. We clearly outline the algorithmic procedure behind PAX-TS, demonstrate it on a benchmark with 7 algorithms and 10 diverse datasets, compare it with two other state-of-the-art explanation algorithms, and present the different explanation types of the method. We found that the explanations of high-performing and low-performing algorithms differ on the same datasets, highlighting that the explanations of PAX-TS effectively capture a model’s behavior. Based on time step correlation matrices resulting from the benchmark, we identify 6 classes of patterns that repeatedly occur across different datasets and algorithms. We found that the patterns are indicators of performance, with noticeable differences in forecasting error between the classes. Lastly, we outline a multivariate example where PAX-TS demonstrates how the forecasting model takes cross-channel correlations into account. With PAX-TS, time series forecasting models’ mechanisms can be illustrated in different levels of detail, and its explanations can be used to answer practical questions on forecasts.
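The idea of explaining a forecaster through localized input perturbations can be sketched in a few lines. This is a generic sensitivity probe, not the PAX-TS algorithm itself: the function names, the additive bump of size `delta`, and the toy forecaster are all assumptions made for illustration.

```python
def perturbation_sensitivity(forecast, window, delta=1.0):
    """Sensitivity of each forecast step to a local bump at each input step.

    Entry [i][j] of the result is the change in forecast step j per unit
    shift applied to input step i.
    """
    baseline = forecast(window)
    matrix = []
    for i in range(len(window)):
        perturbed = list(window)
        perturbed[i] += delta  # localized perturbation at a single time step
        shifted = forecast(perturbed)
        matrix.append([(s - b) / delta for s, b in zip(shifted, baseline)])
    return matrix


def toy_forecast(window):
    """Hypothetical stand-in model: two-step forecast from the last two inputs."""
    last, previous = window[-1], window[-2]
    return [0.5 * last + 0.5 * previous, last]


window = [1.0, 2.0, 3.0, 4.0]
sensitivity = perturbation_sensitivity(toy_forecast, window)
```

Because the probe only calls `forecast`, it treats the model as a black box, which is what makes this style of explanation model-agnostic and post-hoc. The resulting matrix is loosely analogous to a time step correlation matrix, though the paper's exact construction may differ.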

[369] FedProtoKD: Dual Knowledge Distillation with Adaptive Class-wise Prototype Margin for Heterogeneous Federated Learning

Md Anwar Hossen, Fatema Siddika, Wensheng Zhang, Anuj Sharma, Ali Jannesari

Main category: cs.LG

TL;DR: FedProtoKD addresses prototype shrinking in heterogeneous federated learning through dual-knowledge distillation and contrastive learning, achieving significant accuracy improvements.

Details

Motivation: Existing prototype-based HFL methods suffer from sub-optimal global knowledge due to weighted averaging of prototypes, causing prototype shrinking that degrades performance in heterogeneous models and non-IID data scenarios.

Method: Proposes FedProtoKD with enhanced dual-knowledge distillation using clients’ logits and prototype features, contrastive learning-based trainable server prototype with class-wise adaptive margins, and importance assessment of public samples based on prototype closeness.

Result: Achieved average accuracy improvements of 1.13% to 34.13% across various settings, significantly outperforming state-of-the-art HFL methods.

Conclusion: FedProtoKD effectively resolves the prototype margin-shrinking problem and enhances learning performance in heterogeneous federated learning environments.

Abstract: Heterogeneous Federated Learning (HFL) has gained attention for its ability to accommodate diverse models and heterogeneous data across clients. Prototype-based HFL methods emerge as a promising solution to address statistical heterogeneity and privacy challenges, paving the way for new advancements in HFL research. These methods share only class-representative prototypes among heterogeneous clients. However, the prototypes are often aggregated on the server using weighted averaging, leading to sub-optimal global knowledge; this causes the aggregated prototypes to shrink, which negatively affects model performance when models are heterogeneous and data distributions are extremely non-IID. We propose FedProtoKD in a Heterogeneous Federated Learning setting, using an enhanced dual-knowledge distillation mechanism to improve the system performance with clients’ logits and prototype feature representation. We aim to resolve the prototype margin-shrinking problem using a contrastive learning-based trainable server prototype by leveraging a class-wise adaptive prototype margin. Furthermore, we assess the importance of public samples using the closeness of the sample’s prototype to its class representative prototypes, which enhances learning performance. FedProtoKD achieved average accuracy improvements of 1.13% to 34.13% across various settings and significantly outperforms existing state-of-the-art HFL methods.
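A toy calculation may help show why weighted averaging can shrink the margin between class prototypes. The prototype vectors and weights below are hypothetical, and reading "shrinking" as a reduced distance between the aggregated class prototypes is an interpretation of the summary, not the paper's formal definition.

```python
import math


def weighted_average(prototypes, weights):
    """Aggregate per-client prototypes of one class by weighted averaging."""
    total = sum(weights)
    dim = len(prototypes[0])
    return [
        sum(w * p[d] for p, w in zip(prototypes, weights)) / total
        for d in range(dim)
    ]


def distance(a, b):
    """Euclidean distance between two prototype vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical prototypes of two classes as seen by two heterogeneous clients.
class_a = [[2.0, 0.0], [0.0, 2.0]]    # client 1, client 2
class_b = [[-2.0, 0.0], [0.0, -2.0]]  # client 1, client 2
weights = [1, 1]                      # e.g. per-client sample counts

# Margin between the classes on each client, before aggregation.
local_margins = [distance(a, b) for a, b in zip(class_a, class_b)]

# Margin between the classes after server-side averaging.
global_margin = distance(
    weighted_average(class_a, weights), weighted_average(class_b, weights)
)
```

Here every client separates the two classes by a distance of 4, yet the averaged prototypes are only about 2.83 apart, because the clients place the classes along different directions. A trainable server prototype with an explicit margin is one way to counteract this effect.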

[370] STDiff: A State Transition Diffusion Framework for Time Series Imputation in Industrial Systems

Gary Simethy, Daniel Ortiz-Arroyo, Petar Durdevic

Main category: cs.LG

TL;DR: STDiff reframes time series imputation as learning system evolution using a conditional denoising diffusion model with causal bias, outperforming window-based methods especially for long gaps in industrial data.

Details

Motivation: Traditional deep learning methods treat imputation as pattern completion within fixed time windows, which fails in industrial systems with non-stationary dynamics, control actions, and long uninterrupted gaps.

Method: STDiff uses a conditional denoising diffusion model with causal bias aligned to control theory, generating missing values step-by-step based on the most recent known state and relevant control/environmental inputs.

Result: On a public wastewater treatment dataset with simulated missing blocks, STDiff achieves the lowest errors, with its advantage increasing for longer gaps. On raw industrial data, it produces dynamically plausible trajectories, whereas window-based models flatten or over-smooth.

Conclusion: Dynamics-aware, explicitly conditioned imputation is a robust approach for industrial time series, with STDiff demonstrating superior performance particularly for challenging long-gap scenarios.

Abstract: Most deep learning methods for imputing missing values treat the task as completing patterns within a fixed time window. This assumption often fails in industrial systems, where dynamics are driven by control actions, are highly non-stationary, and can experience long, uninterrupted gaps. We propose STDiff, which reframes imputation as learning how the system evolves from one state to the next. STDiff uses a conditional denoising diffusion model with a causal bias aligned to control theory, generating missing values step-by-step based on the most recent known state and relevant control or environmental inputs. On a public wastewater treatment dataset with simulated missing blocks, STDiff consistently achieves the lowest errors, with its advantage increasing for longer gaps. On a raw industrial dataset with substantial real gaps, it produces trajectories that remain dynamically plausible, in contrast to window-based models that tend to flatten or over-smooth. These results support dynamics-aware, explicitly conditioned imputation as a robust approach for industrial time series, and we discuss computational trade-offs and extensions to broader domains.

[371] Learning with springs and sticks

Luis Mantilla Calderón, Alán Aspuru-Guzik

Main category: cs.LG

TL;DR: A physical system using springs and sticks can approximate any continuous function through piecewise-linear approximation and energy minimization, achieving performance comparable to neural networks while revealing thermodynamic learning barriers.

Details

Motivation: To understand learning as a physical process by creating a simple mechanical system that can perform function approximation and study its thermodynamic properties.

Method: Using sticks to create piecewise-linear approximations of functions, encoding mean squared error loss through spring potential energy, and converging to minimum-energy configurations via dissipation.

Result: The system achieves regression performance comparable to multi-layer perceptrons and reveals a thermodynamic learning barrier where environmental fluctuations prevent learning when free energy change hits a certain threshold.

Conclusion: This simple physical model provides insights into learning systems from a thermodynamic perspective, showing that physical constraints and environmental fluctuations create fundamental learning barriers.

Abstract: Learning is a physical process. Here, we aim to study a simple dynamical system composed of springs and sticks capable of arbitrarily approximating any continuous function. The main idea of our work is to use the sticks to mimic a piecewise-linear approximation of the given function, use the potential energy of springs to encode a desired mean squared error loss function, and converge to a minimum-energy configuration via dissipation. We apply the proposed simulation system to regression tasks and show that its performance is comparable to that of multi-layer perceptrons. In addition, we study the thermodynamic properties of the system and find a relation between the free energy change of the system and its ability to learn an underlying data distribution. We empirically find a thermodynamic learning barrier for the system caused by the fluctuations of the environment, whereby the system cannot learn if its change in free energy hits such a barrier. We believe this simple model can help us better understand learning systems from a physical point of view.
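The three ingredients named in the abstract (a piecewise-linear curve, a spring energy that equals a squared-error loss, and dissipative relaxation) can be sketched as follows. Fixed joint positions, purely vertical springs, overdamped dynamics, and the absence of thermal noise are simplifying assumptions of this sketch and may not match the paper's construction.

```python
def interpolate(knots, heights, x):
    """Evaluate the piecewise-linear 'stick' curve at x.

    Returns the curve value, the segment index k, and the position t in [0, 1]
    along that segment.
    """
    for k in range(len(knots) - 1):
        left, right = knots[k], knots[k + 1]
        if left <= x <= right:
            t = (x - left) / (right - left)
            return (1 - t) * heights[k] + t * heights[k + 1], k, t
    raise ValueError("x lies outside the knot range")


def spring_energy(knots, heights, data, stiffness=1.0):
    """Total potential energy of vertical springs between data and curve."""
    return 0.5 * stiffness * sum(
        (interpolate(knots, heights, x)[0] - y) ** 2 for x, y in data
    )


def relax(knots, heights, data, stiffness=1.0, step=0.1, iterations=2000):
    """Overdamped relaxation: move each joint along the net spring force."""
    heights = list(heights)
    for _ in range(iterations):
        force = [0.0] * len(heights)
        for x, y in data:
            value, k, t = interpolate(knots, heights, x)
            stretch = value - y
            # Each spring pulls on the two joints of the stick it touches.
            force[k] -= stiffness * stretch * (1 - t)
            force[k + 1] -= stiffness * stretch * t
        heights = [h + step * f for h, f in zip(heights, force)]
    return heights


knots = [0.0, 0.5, 1.0]
data = [(i / 10, (i / 10) ** 2) for i in range(11)]  # samples of y = x^2
initial = [0.0, 0.0, 0.0]
fitted = relax(knots, initial, data)
```

Minimizing the spring energy here is the same as least-squares regression with a piecewise-linear model, which is why the relaxed configuration can be read as a trained regressor.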

[372] Working My Way Back to You: Resource-Centric Next-Activity Prediction

Kelly Kurowski, Xixi Lu, Hajo A Reijers

Main category: cs.LG

TL;DR: Resource-centric next-activity prediction using LightGBM and Transformer models with 2-gram encoding outperforms traditional control-flow approaches, enabling better resource allocation and workforce planning.

Details

Motivation: Existing predictive process monitoring focuses on control-flow perspective, but resource-centric prediction offers additional benefits like improved work organization, workload balancing, and capacity forecasting. Resource information's role in next-activity prediction remains unexplored despite its proven value in process performance analysis.

Method: Evaluated four prediction models (LightGBM, Transformer, Random Forest, and baseline) with three encoding strategies across four real-life datasets. Tested 2-gram activity transitions encoding and combined encoding with activity repetition features.

Result: LightGBM and Transformer models performed best with 2-gram activity transitions encoding. Random Forest benefited most from combined encoding (2-gram transitions + activity repetition features), which achieved the highest average accuracy. Resource-centric approach outperformed baseline.

Conclusion: Resource-centric next-activity prediction shows significant potential for smarter resource allocation, strategic workforce planning, and personalized employee support. The findings open new research directions in predictive process monitoring beyond traditional control-flow approaches.

Abstract: Predictive Process Monitoring (PPM) aims to train models that forecast upcoming events in process executions. These predictions support early bottleneck detection, improved scheduling, proactive interventions, and timely communication with stakeholders. While existing research adopts a control-flow perspective, we investigate next-activity prediction from a resource-centric viewpoint, which offers additional benefits such as improved work organization, workload balancing, and capacity forecasting. Although resource information has been shown to enhance tasks such as process performance analysis, its role in next-activity prediction remains unexplored. In this study, we evaluate four prediction models and three encoding strategies across four real-life datasets. Compared to the baseline, our results show that LightGBM and Transformer models perform best with an encoding based on 2-gram activity transitions, while Random Forest benefits most from an encoding that combines 2-gram transitions and activity repetition features. This combined encoding also achieves the highest average accuracy. This resource-centric approach could enable smarter resource allocation, strategic workforce planning, and personalized employee support by analyzing individual behavior rather than case-level progression. The findings underscore the potential of resource-centric next-activity prediction, opening up new avenues for research on PPM.
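A 2-gram activity-transition encoding can be illustrated with a short sketch. The activity names are hypothetical, and treating "activity repetition features" as simple occurrence counts is an assumption; the summary does not give the paper's exact feature definitions.

```python
from collections import Counter


def two_gram_encoding(activities):
    """Count consecutive activity pairs in one resource's event sequence."""
    return Counter(zip(activities, activities[1:]))


def repetition_features(activities):
    """Count how often each activity occurs in the sequence."""
    return Counter(activities)


# Hypothetical activity history of a single resource, ordered by timestamp.
history = ["register", "check", "approve", "check", "approve", "notify"]

transitions = two_gram_encoding(history)
repeats = repetition_features(history)
```

The key point is that the sequence follows one resource across cases rather than one case across resources, so the counts describe an individual's working pattern.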

[373] Metric Matters: A Formal Evaluation of Similarity Measures in Active Learning for Cyber Threat Intelligence

Sidahmed Benabderrahmane, Talal Rahwan

Main category: cs.LG

TL;DR: A novel active learning framework using similarity search and attention-based autoencoder for APT detection, with evaluation showing similarity metric choice significantly impacts detection performance and label efficiency.

Details

Motivation: Address the challenges of stealthy APT behavior and extreme class imbalance in cyber defense datasets through improved anomaly detection with minimal supervision.

Method: Active learning-based anomaly detection framework leveraging similarity search with Attention-Based Autoencoder, using feature-space similarity to identify normal/anomaly instances and iteratively refine decision space.

Result: Experiments on diverse datasets (including DARPA APT traces) show similarity metric choice significantly impacts model convergence, detection accuracy, and label efficiency.

Conclusion: Provides actionable insights for selecting similarity functions in active learning pipelines for threat intelligence and cyber defense applications.

Abstract: Advanced Persistent Threats (APTs) pose a severe challenge to cyber defense due to their stealthy behavior and the extreme class imbalance inherent in detection datasets. To address these issues, we propose a novel active learning-based anomaly detection framework that leverages similarity search to iteratively refine the decision space. Built upon an Attention-Based Autoencoder, our approach uses feature-space similarity to identify normal-like and anomaly-like instances, thereby enhancing model robustness with minimal oracle supervision. Crucially, we perform a formal evaluation of various similarity measures to understand their influence on sample selection and anomaly ranking effectiveness. Through experiments on diverse datasets, including DARPA Transparent Computing APT traces, we demonstrate that the choice of similarity metric significantly impacts model convergence, anomaly detection accuracy, and label efficiency. Our results offer actionable insights for selecting similarity functions in active learning pipelines tailored for threat intelligence and cyber defense.
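To see why the choice of similarity measure can change which samples are selected, consider the small sketch below. The latent vectors are hypothetical, and the two measures shown (cosine similarity and an inverse-distance score) are common examples rather than the specific set evaluated in the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; ignores their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_similarity(a, b):
    """Map Euclidean distance into (0, 1]; larger means more similar."""
    return 1.0 / (1.0 + math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))))


def rank_candidates(query, candidates, similarity):
    """Return candidate indices ordered from most to least similar."""
    scores = [similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Hypothetical latent vectors: one labeled anomaly and three unlabeled instances.
labeled_anomaly = [1.0, 1.0]
unlabeled = [[10.0, 10.0], [1.0, 0.5], [0.9, 1.2]]

by_cosine = rank_candidates(labeled_anomaly, unlabeled, cosine_similarity)
by_euclidean = rank_candidates(labeled_anomaly, unlabeled, euclidean_similarity)
```

The first unlabeled vector points in the same direction as the labeled anomaly but is far away, so cosine similarity ranks it first while the distance-based score ranks it last. An active learning loop that queries the top-ranked instance would therefore ask the oracle about different samples under the two measures.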

[374] GRADSTOP: Early Stopping of Gradient Descent via Posterior Sampling

Arash Jamshidi, Lauri Seppäläinen, Katsiaryna Haitsiukevich, Hoang Phuc Hau Luu, Anton Björklund, Kai Puolamäki

Main category: cs.LG

TL;DR: GradStop is a novel early stopping method that uses gradient information instead of validation sets to prevent overfitting, allowing full dataset usage for training.

Details

Motivation: Traditional early stopping requires hold-out validation sets which reduce training data. This is problematic in data-limited settings like transfer learning.

Method: Estimates Bayesian posterior using gradient information, defines early stopping as sampling from this posterior, and uses approximated posterior for stopping criterion.

Result: Achieves small test loss and performs favorably compared to validation-set-based stopping, with minimal computational overhead.

Conclusion: GradStop enables full dataset training without validation sets, making it particularly valuable for data-limited scenarios while maintaining performance.

Abstract: Machine learning models are often learned by minimising a loss function on the training data using a gradient descent algorithm. These models often suffer from overfitting, leading to a decline in predictive performance on unseen data. A standard solution is early stopping using a hold-out validation set, which halts the minimisation when the validation loss stops decreasing. However, this hold-out set reduces the data available for training. This paper presents GradStop, a novel stochastic early stopping method that only uses information in the gradients, which are produced by the gradient descent algorithm “for free.” Our main contributions are that we estimate the Bayesian posterior by the gradient information, define the early stopping problem as drawing a sample from this posterior, and use the approximated posterior to obtain a stopping criterion. Our empirical evaluation shows that GradStop achieves a small loss on test data and compares favourably to a validation-set-based stopping criterion. By leveraging the entire dataset for training, our method is particularly advantageous in data-limited settings, such as transfer learning. It can be incorporated as an optional feature in gradient descent libraries with only a small computational overhead. The source code is available at https://github.com/edahelsinki/gradstop.

[375] Tackling Federated Unlearning as a Parameter Estimation Problem

Antonio Balordi, Lorenzo Manini, Fabio Stella, Alessio Merlo

Main category: cs.LG

TL;DR: Federated Unlearning framework using Hessian information to selectively reset parameters sensitive to forgotten data, enabling efficient data erasure in federated learning without full retraining.

Details

Motivation: Privacy regulations require data erasure from deep learning models, which is challenging in Federated Learning where data remains on clients and full retraining is often infeasible.

Method: Uses second-order Hessian information to identify and selectively reset parameters most sensitive to the data being forgotten, followed by minimal federated retraining. Model-agnostic approach without requiring server access to raw client data.

Result: Strong privacy (MIA success near random, categorical knowledge erased) and high performance (Normalized Accuracy ≈ 0.9 against re-trained benchmarks). Effectively neutralizes backdoor attacks and restores model integrity.

Conclusion: Provides a practical solution for efficient data forgetting in Federated Learning with strong privacy guarantees and high performance while being more efficient than complete retraining.

Abstract: Privacy regulations require the erasure of data from deep learning models. This is a significant challenge that is amplified in Federated Learning, where data remains on clients, making full retraining or coordinated updates often infeasible. This work introduces an efficient Federated Unlearning framework based on information theory, modeling leakage as a parameter estimation problem. Our method uses second-order Hessian information to identify and selectively reset only the parameters most sensitive to the data being forgotten, followed by minimal federated retraining. This model-agnostic approach supports categorical and client unlearning without requiring server access to raw client data after initial information aggregation. Evaluations on benchmark datasets demonstrate strong privacy (MIA success near random, categorical knowledge erased) and high performance (Normalized Accuracy against re-trained benchmarks of $\approx$ 0.9), while aiming for increased efficiency over complete retraining. Furthermore, in a targeted backdoor attack scenario, our framework effectively neutralizes the malicious trigger, restoring model integrity. This offers a practical solution for data forgetting in FL.

[376] When recalling in-context, Transformers are not SSMs

Destiny Okpekpe, Antonio Orvieto

Main category: cs.LG

TL;DR: Modern recurrent models like SSMs show subquadratic complexity but underperform transformers on reasoning tasks. This paper analyzes associative recall performance, revealing critical learning rate sensitivity in recurrent models, contrasting scaling behaviors, and unexpected induction head formation in 1-layer transformers.

Details

Motivation: Recent studies show recurrent models (e.g., state-space models) underperform transformers on reasoning and memorization tasks despite their computational efficiency. The authors investigate associative recall as a key benchmark to understand performance gaps and optimization issues.

Method: The study conducts detailed analysis of associative recall performance across different architectures. It examines learning rate sensitivity in recurrent models, compares scaling effects (width vs depth), analyzes 1-layer transformer training dynamics, and performs architectural ablations on Transformer and Mamba models.

Result: Recurrent models show critical sensitivity to learning rate choice, unlike transformers. Attention-based models struggle with single-layer AR tasks but show surprising induction head formation dynamics. Recurrent and attention models exhibit opposite scaling benefits - recurrent models benefit more from width while attention benefits from depth.

Conclusion: The findings reveal significant optimization stability issues in modern recurrent models that affect reported performance. The contrasting scaling behaviors and unexpected training dynamics in 1-layer transformers suggest need for further research into stabilizing training and understanding architectural differences between recurrent and attention-based approaches.

Abstract: Despite the advantageous subquadratic complexity of modern recurrent deep learning models – such as state-space models (SSMs) – recent studies have highlighted their potential shortcomings compared to transformers on reasoning and memorization tasks. In this paper, we dive deeper into one such benchmark: associative recall (AR), which has been shown to correlate well with language modeling performance, and inspect in detail the effects of scaling and optimization issues in recently proposed token mixing strategies. We first demonstrate that, unlike standard transformers, the choice of learning rate plays a critical role in the performance of modern recurrent models: an issue that can severely affect reported performance in previous works and suggests further research is needed to stabilize training. Next, we show that recurrent and attention-based models exhibit contrasting benefits when scaling in width as opposed to depth, with attention being notably unable to solve AR when limited to a single layer. We then further inspect 1-layer transformers, revealing that despite their poor performance, their training dynamics surprisingly resemble the formation of induction heads, a phenomenon previously observed only in their 2-layer counterparts. Finally, through architectural ablations, we study how individual components affect Transformer and Mamba’s performance and optimization stability.
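For orientation, an associative recall example can be generated as in the sketch below. This follows a common formulation of the task (a context of key-value pairs followed by a query key); the vocabulary sizes, token names, and sequence layout are assumptions and may differ from the benchmark configuration used in the paper.

```python
import random


def make_recall_example(num_pairs, key_vocab, value_vocab, rng):
    """Build one sequence of key-value pairs followed by a query key.

    Returns the token sequence and the value the model should recall.
    """
    keys = rng.sample(key_vocab, num_pairs)  # distinct keys, no ambiguity
    values = [rng.choice(value_vocab) for _ in keys]
    context = [token for pair in zip(keys, values) for token in pair]
    query_index = rng.randrange(num_pairs)
    return context + [keys[query_index]], values[query_index]


rng = random.Random(0)
key_vocab = [f"k{i}" for i in range(16)]
value_vocab = [f"v{i}" for i in range(16)]
sequence, target = make_recall_example(4, key_vocab, value_vocab, rng)
```

Solving the task requires finding the earlier occurrence of the query key and copying the token that follows it, which is the lookup pattern associated with induction heads.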

[377] Dynamic Triangulation-Based Graph Rewiring for Graph Neural Networks

Hugo Attali, Thomas Papastergiou, Nathalie Pernelle, Fragkiskos D. Malliaros

Main category: cs.LG

TL;DR: TRIGON is a novel graph rewiring framework that constructs enriched triangulations by learning to select relevant triangles from multiple graph views, addressing oversquashing and oversmoothing issues in GNNs.

Details

Motivation: Graph Neural Networks suffer from performance limitations due to graph topology issues like oversquashing and oversmoothing, which existing rewiring methods aim to mitigate.

Method: TRIGON constructs non-planar triangulations by jointly optimizing triangle selection from multiple graph views and downstream classification performance.

Result: The method produces rewired graphs with improved structural properties (reduced diameter, increased spectral gap, lower effective resistance) and outperforms state-of-the-art approaches on node classification tasks across homophilic and heterophilic benchmarks.

Conclusion: TRIGON effectively addresses GNN limitations through intelligent graph rewiring via learned triangle selection, demonstrating superior performance on various graph learning benchmarks.

Abstract: Graph Neural Networks (GNNs) have emerged as the leading paradigm for learning over graph-structured data. However, their performance is limited by issues inherent to graph topology, most notably oversquashing and oversmoothing. Recent advances in graph rewiring aim to mitigate these limitations by modifying the graph topology to promote more effective information propagation. In this work, we introduce TRIGON, a novel framework that constructs enriched, non-planar triangulations by learning to select relevant triangles from multiple graph views. By jointly optimizing triangle selection and downstream classification performance, our method produces a rewired graph with markedly improved structural properties such as reduced diameter, increased spectral gap, and lower effective resistance compared to existing rewiring methods. Empirical results demonstrate that TRIGON outperforms state-of-the-art approaches on node classification tasks across a range of homophilic and heterophilic benchmarks.
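The effect of triangle-based rewiring on one of the structural properties mentioned above, graph diameter, can be checked on a toy graph. The triangles here are picked by hand; TRIGON learns which triangles to select, and that learning step is not reproduced in this sketch.

```python
from collections import deque


def diameter(num_nodes, edges):
    """Longest shortest-path length in a connected, undirected graph."""
    neighbors = {v: set() for v in range(num_nodes)}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    longest = 0
    for source in range(num_nodes):
        # Breadth-first search gives hop distances from the source node.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        longest = max(longest, max(dist.values()))
    return longest


# A path graph 0-1-2-3-4-5-6, where distant nodes are many hops apart.
path_edges = [(i, i + 1) for i in range(6)]

# Hand-picked triangles (a, b, c); each one adds the closing edge (a, c).
triangles = [(0, 1, 2), (2, 3, 4), (4, 5, 6)]
rewired_edges = path_edges + [(a, c) for a, _, c in triangles]

before = diameter(7, path_edges)
after = diameter(7, rewired_edges)
```

Three added edges halve the diameter from 6 to 3, so a message-passing network needs fewer layers for the end nodes to exchange information.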

[378] Breaking the Black Box: Inherently Interpretable Physics-Informed Machine Learning for Imbalanced Seismic Data

Vemula Sreenath, Filippo Gatti, Pierre Jehel

Main category: cs.LG

TL;DR: A transparent machine learning approach for ground motion prediction using HazBinLoss function to address black-box limitations and data imbalance issues in seismic risk assessment.

Details

Motivation: Traditional ML-based ground motion models are black boxes that lack interpretability and suffer from data imbalances (few large damaging records vs abundant small ones), limiting their use in high-stakes seismic decisions.

Method: Developed a transparent ML architecture where each input (magnitude, distance, interaction terms) is processed separately and added linearly to show exact contributions. Used HazBinLoss function to weight critical near-field large magnitude records higher during training.

Result: The model captures known seismological principles and achieves comparable performance with established ground motion models while maintaining full transparency and interpretability.

Conclusion: This framework enables broader adoption of ML approaches for seismic risk assessment by providing both accuracy and transparency, addressing trust issues in high-stakes decision-making scenarios.

Abstract: Ground motion models (GMMs) predict how strongly the ground will shake during an earthquake. They are essential for structural analysis, seismic design, and seismic risk assessment studies. Traditional machine learning (ML) approaches are popular for developing GMMs because large earthquake databases are available worldwide. However, they operate as “black boxes,” which are hard to interpret and trust, limiting their use in high-stakes decisions. Additionally, these databases suffer from significant data imbalances: fewer large, critically damaging records near the fault compared to abundant, less severely damaging distant records. These two limitations are addressed in this work by developing a transparent ML architecture using the HazBinLoss function. Each input (e.g., magnitude, distance, their interaction term) is processed separately and added linearly to obtain the output, so that the exact contribution of each term is available. During training, the HazBinLoss function assigns higher weights to critical near-field large magnitude records and lower weights to less-critical far-field smaller magnitude records, to prevent underprediction of the most damaging scenarios. Our model captures known seismological principles and achieves comparable performance with established GMMs while maintaining transparency. This framework enables broader adoption of ML-based approaches for risk assessment studies and disaster planning.

[379] Automated discovery of finite volume schemes using Graph Neural Networks

Paul Garnier, Jonathan Viquerat, Elie Hachem

Main category: cs.LG

TL;DR: GNNs can extrapolate beyond training data to generate numerical schemes, recovering first-order and second-order finite volume schemes for the heat equation through symbolic regression and unsupervised PINN-like training.

Details

Motivation: To demonstrate that GNNs can go beyond traditional approximation roles and actively contribute to developing numerical methods, particularly by generating numerical schemes that extrapolate to out-of-distribution scenarios.

Method: Train GNNs on minimal datasets (two-node graphs) and use symbolic regression to extract analytical formulations. Extend to unsupervised training using PINN-like residual loss without ground-truth data. Test with different GNN architectures (2-hop and 2-layer) to discover higher-order schemes.

Result: GNNs successfully recover first-order finite volume scheme with O(ε) error. In unsupervised setting, they recover first-order scheme using only residual loss. Higher-order GNNs discover second-order correction terms and classic second-order midpoint scheme.

Conclusion: GNNs are not just approximators but active contributors to numerical method development, capable of rediscovering and generating traditional numerical schemes through machine learning approaches.

Abstract: Graph Neural Networks (GNNs) have deeply modified the landscape of numerical simulations by demonstrating strong capabilities in approximating solutions of physical systems. However, their ability to extrapolate beyond their training domain (e.g., larger or structurally different graphs) remains uncertain. In this work, we establish that GNNs can serve purposes beyond their traditional role, and be exploited to generate numerical schemes, in conjunction with symbolic regression. First, we show numerically and theoretically that a GNN trained on a dataset consisting solely of two-node graphs can extrapolate a first-order Finite Volume (FV) scheme for the heat equation on out-of-distribution, unstructured meshes. Specifically, if a GNN achieves a loss $\varepsilon$ on such a dataset, it implements the FV scheme with an error of $\mathcal{O}(\varepsilon)$. Using symbolic regression, we show that the network effectively rediscovers the exact analytical formulation of the standard first-order FV scheme. We then extend this approach to an unsupervised context: the GNN recovers the first-order FV scheme using only a residual loss similar to Physics-Informed Neural Networks (PINNs) with no access to ground-truth data. Finally, we push the methodology further by considering higher-order schemes: we train (i) a 2-hop and (ii) a 2-layer GNN using the same PINN loss, which autonomously discover (i) a second-order correction term to the initial scheme using a 2-hop stencil, and (ii) the classic second-order midpoint scheme. These findings follow a recent paradigm in scientific computing: GNNs are not only strong approximators, but can be active contributors to the development of novel numerical methods.
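As a reference point for what a first-order finite volume scheme for the heat equation looks like on a mesh graph, here is a minimal sketch using the standard two-point flux approximation with explicit time stepping. The variable names, the 1D mesh, and the explicit Euler update are assumptions chosen for illustration; the abstract does not spell out the paper's exact discretization.

```python
def fv_heat_step(u, volumes, faces, alpha, dt):
    """One explicit step of a two-point-flux finite volume heat scheme.

    `faces` holds tuples (i, j, area, distance) for each pair of adjacent
    cells, where `distance` is the distance between the two cell centers.
    """
    net_flux = [0.0] * len(u)
    for i, j, area, distance in faces:
        flux = alpha * area * (u[j] - u[i]) / distance  # heat flowing into i
        net_flux[i] += flux
        net_flux[j] -= flux  # the same heat leaves cell j
    return [ui + dt * f / v for ui, f, v in zip(u, net_flux, volumes)]


# Hypothetical 1D mesh: five cells of width 0.2 with insulated ends.
dx = 0.2
volumes = [dx] * 5
faces = [(i, i + 1, 1.0, dx) for i in range(4)]
u = [0.0, 0.0, 1.0, 0.0, 0.0]

for _ in range(200):
    u = fv_heat_step(u, volumes, faces, alpha=1.0, dt=0.01)

total_heat = sum(ui * v for ui, v in zip(u, volumes))
```

Each flux depends only on the two cells sharing a face, so the scheme is a sum of pairwise interactions over graph edges. That locality is plausibly why a model trained on two-node graphs can carry over to larger meshes. The antisymmetric flux update also conserves total heat exactly, up to rounding.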

[380] APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration

Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang

Main category: cs.LG

TL;DR: APT-LLM is a comprehensive acceleration scheme for arbitrary precision LLMs that achieves significant speedups through novel data formats, matrix multiplication methods, memory management, and kernel optimization.

Details

Motivation: Large language models have enormous computational demands that limit deployment and real-time performance. Existing quantization methods face challenges with GPU Tensor Core support, inefficient memory management, and inflexible kernel optimizations for ultra-low-bit quantized LLMs.

Method: Proposes APT-LLM with: 1) bipolar-INT data format for efficient conversion and parallel computation, 2) bit-level matrix multiplication method for arbitrary precision and Tensor Core optimization, 3) memory management system using shared memory for data recovery, and 4) dynamic kernel mapping for optimal hyperparameter selection.

Result: Achieves up to 3.99× speedup vs FP16 baselines and 2.16× speedup over NVIDIA CUTLASS INT4 on RTX 3090. On RTX 4090 and H800, achieves up to 2.44× speedup over FP16 and 1.65× speedup over CUTLASS integer baselines.

Conclusion: APT-LLM provides an effective solution for accelerating arbitrary precision LLM inference by addressing GPU-specific limitations through comprehensive optimization techniques, demonstrating significant performance improvements across different hardware platforms.

Abstract: Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance. Quantization methods can help reduce computational costs; however, attaining the extreme efficiency associated with ultra-low-bit quantized LLMs at arbitrary precision presents challenges on GPUs. This is primarily due to the limited support for GPU Tensor Cores, inefficient memory management, and inflexible kernel optimizations. To tackle these challenges, we propose a comprehensive acceleration scheme for arbitrary precision LLMs, namely APT-LLM. First, we introduce a novel data format, bipolar-INT, which allows for efficient and lossless conversion with signed INT, while also being more conducive to parallel computation. We also develop a matrix multiplication (MatMul) method allowing for arbitrary precision by dismantling and reassembling matrices at the bit level. This method provides flexible precision and optimizes the utilization of GPU Tensor Cores. In addition, we propose a memory management system focused on data recovery, which strategically employs fast shared memory to substantially increase kernel execution speed and reduce memory access latency. Finally, we develop a kernel mapping method that dynamically selects the optimal configurable hyperparameters of kernels for varying matrix sizes, enabling optimal performance across different LLM architectures and precision settings. In LLM inference, APT-LLM achieves up to a 3.99$\times$ speedup compared to FP16 baselines and a 2.16$\times$ speedup over NVIDIA CUTLASS INT4 acceleration on RTX 3090. On RTX 4090 and H800, APT-LLM achieves up to 2.44$\times$ speedup over FP16 and 1.65$\times$ speedup over CUTLASS integer baselines.
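
Illustration: the idea of dismantling and reassembling matrices at the bit level can be sketched on the CPU with plain unsigned integers. This is only a rough model of the principle. It does not implement the paper's bipolar-INT format, its Tensor Core kernels, or its memory management, and all function names are hypothetical.

```python
def bit_planes(matrix, bits):
    """Split a matrix of unsigned ints into `bits` binary matrices (least significant first)."""
    return [[[(x >> b) & 1 for x in row] for row in matrix] for b in range(bits)]


def matmul(a, b):
    """Plain integer matrix product, standing in for a 1-bit hardware MatMul."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def bitwise_matmul(a, b, a_bits, b_bits):
    """Compute a @ b as a shift-weighted sum of products of binary bit planes."""
    rows, cols = len(a), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for i, a_plane in enumerate(bit_planes(a, a_bits)):
        for j, b_plane in enumerate(bit_planes(b, b_bits)):
            partial = matmul(a_plane, b_plane)
            for r in range(rows):
                for c in range(cols):
                    # Bit i of a times bit j of b carries weight 2**(i + j).
                    out[r][c] += partial[r][c] << (i + j)
    return out


a = [[3, 1], [2, 5]]   # entries fit in 3 bits
b = [[1, 2], [3, 0]]   # entries fit in 2 bits
print(bitwise_matmul(a, b, a_bits=3, b_bits=2))  # [[6, 6], [17, 4]]
```

Since the bit widths of the two operands are independent loop bounds, any precision pairing (for example 3-bit by 2-bit) reduces to the same binary kernel, which is the flexibility the abstract refers to.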

[381] Active Query Selection for Crowd-Based Reinforcement Learning

Jonathan Erskine, Taku Yamagata, Raúl Santos-Rodríguez

Main category: cs.LG

TL;DR: Novel framework combining probabilistic crowd modeling and active learning for preference-based RL to handle noisy human feedback and reduce annotation costs.

Details

Motivation: Address limitations of preference-based RL where human feedback is costly, scarce, and potentially noisy, especially in domains requiring expert input.

Method: Extend Advise algorithm to support multiple trainers with online reliability estimation, incorporate entropy-based query selection for active learning, and use probabilistic crowd modeling for noisy feedback handling.

Result: Agents trained with feedback on uncertain trajectories show faster learning in most tasks, and outperform baselines in blood glucose control using UVA/Padova simulator.

Conclusion: The proposed framework effectively reduces human annotation burden while improving learning efficiency, particularly demonstrating success in medical applications like diabetes management.

Abstract: Preference-based reinforcement learning has gained prominence as a strategy for training agents in environments where the reward signal is difficult to specify or misaligned with human intent. However, its effectiveness is often limited by the high cost and low availability of reliable human input, especially in domains where expert feedback is scarce or errors are costly. To address this, we propose a novel framework that combines two complementary strategies: probabilistic crowd modelling to handle noisy, multi-annotator feedback, and active learning to prioritize feedback on the most informative agent actions. We extend the Advise algorithm to support multiple trainers, estimate their reliability online, and incorporate entropy-based query selection to guide feedback requests. We evaluate our approach in a set of environments that span both synthetic and real-world-inspired settings, including 2D games (Taxi, Pacman, Frozen Lake) and a blood glucose control task for Type 1 Diabetes using the clinically approved UVA/Padova simulator. Our preliminary results demonstrate that agents trained with feedback on uncertain trajectories exhibit faster learning in most tasks, and we outperform the baselines for the blood glucose control task.
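
Illustration: entropy-based query selection can be sketched as ranking states by the entropy of the agent's action distribution and spending the feedback budget on the most uncertain ones. The abstract does not give the exact selection rule, so the top-`budget` ranking and the names `entropy` and `select_queries` are assumptions made for this example.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_queries(action_dists, budget):
    """Return the indices of the `budget` states whose action distribution is most uncertain."""
    ranked = sorted(
        range(len(action_dists)),
        key=lambda s: entropy(action_dists[s]),
        reverse=True,
    )
    return ranked[:budget]


# Three states: one certain, one maximally uncertain, one mildly uncertain.
dists = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
print(select_queries(dists, budget=2))  # [1, 2]
```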

[382] Saddle Hierarchy in Dense Associative Memory

Robin Thériault, Daniele Tantari

Main category: cs.LG

TL;DR: This paper analyzes dense associative memory (DAM) models using statistical mechanics, develops a novel regularization scheme for stable training, and proposes a network-growing algorithm that leverages saddle-point hierarchy to reduce computational costs.

Details

Motivation: DAM models are gaining attention for their robustness to adversarial examples and connections to modern ML paradigms like transformers and diffusion models, but training stability and computational efficiency remain challenges.

Method: Statistical mechanics analysis of three-layer Boltzmann machines with Potts hidden units, derivation of saddle-point equations, development of a novel regularization scheme, and implementation of a network-growing algorithm based on saddle-point hierarchy.

Result: The proposed regularization significantly improves training stability, DAM learns interpretable solutions for both supervised and unsupervised classification, and the network-growing algorithm drastically reduces computational costs.

Conclusion: The statistical mechanics approach provides deep insights into DAM behavior, enabling more stable training and efficient network growth while maintaining interpretability and performance in classification tasks.

Abstract: Dense associative memory (DAM) models have been attracting renewed attention since they were shown to be robust to adversarial examples and closely related to state-of-the-art machine learning paradigms, such as the attention mechanisms in transformers and generative diffusion models. We study a DAM built upon a three-layer Boltzmann machine with Potts hidden units, which represent data clusters and classes. Through a statistical mechanics analysis, we derive saddle-point equations that characterize both the stationary points of DAMs trained on real data and the fixed points of DAMs trained on synthetic data within a teacher-student framework. Based on these results, we propose a novel regularization scheme that makes training significantly more stable. Moreover, we show empirically that our DAM learns interpretable solutions to both supervised and unsupervised classification problems. Pushing our theoretical analysis further, we find that the weights learned by relatively small DAMs correspond to unstable saddle points in larger DAMs. We implement a network-growing algorithm that leverages this saddle-point hierarchy to drastically reduce the computational cost of training dense associative memory.

[383] Get Global Guarantees: On the Probabilistic Nature of Perturbation Robustness

Wenchuan Mu, Kwan Hui Lim

Main category: cs.LG

TL;DR: Proposes tower robustness - a novel hypothesis testing-based metric for efficient and precise probabilistic robustness assessment in safety-critical deep learning applications.

Details

Motivation: Existing robustness assessment methods suffer from significant trade-offs between computational cost and measurement precision, limiting practical utility in safety-critical applications.

Method: Conducts comprehensive comparative analysis of existing robustness definitions and methodologies, then proposes tower robustness - a practical metric based on hypothesis testing to quantitatively evaluate probabilistic robustness.

Result: Extensive comparative evaluation demonstrates advantages and applicability of the proposed tower robustness approach for more rigorous and efficient pre-deployment assessments.

Conclusion: The tower robustness metric advances systematic understanding and enhancement of model robustness in safety-critical deep learning applications by enabling more efficient and precise robustness evaluation.

Abstract: In safety-critical deep learning applications, robustness measures the ability of neural models to handle imperceptible perturbations in input data, which may lead to potential safety hazards. Existing pre-deployment robustness assessment methods typically suffer from significant trade-offs between computational cost and measurement precision, limiting their practical utility. To address these limitations, this paper conducts a comprehensive comparative analysis of existing robustness definitions and associated assessment methodologies. We propose tower robustness, a novel, practical metric based on hypothesis testing that quantitatively evaluates probabilistic robustness, enabling more rigorous and efficient pre-deployment assessments. Our extensive comparative evaluation illustrates the advantages and applicability of our proposed approach, thereby advancing the systematic understanding and enhancement of model robustness in safety-critical deep learning applications.
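
Illustration: the abstract does not define tower robustness, so the sketch below is not that metric. It shows only the generic idea of certifying probabilistic robustness with a hypothesis test: sample perturbations, count failures, and run an exact one-sided binomial test against a tolerated failure rate. The names `binomial_p_value` and `certify`, and the thresholds used, are hypothetical.

```python
from math import comb


def binomial_p_value(n, failures, max_failure_rate):
    """P(at most `failures` failures in n trials) if the true failure rate equals max_failure_rate.

    A small value is evidence against H0: "failure rate >= max_failure_rate".
    """
    q = max_failure_rate
    return sum(comb(n, i) * q**i * (1 - q) ** (n - i) for i in range(failures + 1))


def certify(n, failures, max_failure_rate, alpha=0.01):
    """Reject H0 (and so accept the model as robust enough) when the p-value is at most alpha."""
    return binomial_p_value(n, failures, max_failure_rate) <= alpha


# With zero observed failures, 10 samples are too few to certify a 5% bound; 100 are enough.
print(certify(10, 0, 0.05), certify(100, 0, 0.05))  # False True
```

The example makes the cost and precision trade-off concrete: tighter failure bounds or smaller significance levels require more sampled perturbations before the test can reject.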

[384] Emotions as Ambiguity-aware Ordinal Representations

Jingyao Wu, Matthew Barthet, David Melhart, Georgios N. Yannakakis

Main category: cs.LG

TL;DR: Novel ambiguity-aware ordinal emotion representations that capture both annotation ambiguity and temporal dynamics through rate of change modeling, outperforming conventional methods on unbounded emotion traces.

Details

Motivation: Existing continuous emotion recognition approaches ignore emotion ambiguity or treat it as static, failing to capture the inherently ambiguous and dynamic nature of emotions over time.

Method: Proposed ambiguity-aware ordinal emotion representations that model emotion ambiguity through its rate of change, evaluated on RECOLA and GameVibe corpora for both bounded (arousal, valence) and unbounded (engagement) continuous traces.

Result: Ordinal representations outperformed conventional ambiguity-aware models on unbounded labels with highest CCC and SDA scores, and excelled in SDA for bounded traces, demonstrating superior ability to capture relative changes in emotion dynamics.

Conclusion: The proposed ordinal framework effectively captures both emotion annotation ambiguity and temporal dynamics, showing particular strength in modeling relative changes and outperforming existing approaches, especially for unbounded emotion traces.

Abstract: Emotions are inherently ambiguous and dynamic phenomena, yet existing continuous emotion recognition approaches either ignore their ambiguity or treat ambiguity as an independent and static variable over time. Motivated by this gap in the literature, in this paper we introduce ambiguity-aware ordinal emotion representations, a novel framework that captures both the ambiguity present in emotion annotation and the inherent temporal dynamics of emotional traces. Specifically, we propose approaches that model emotion ambiguity through its rate of change. We evaluate our framework on two affective corpora – RECOLA and GameVibe – testing our proposed approaches on both bounded (arousal, valence) and unbounded (engagement) continuous traces. Our results demonstrate that ordinal representations outperform conventional ambiguity-aware models on unbounded labels, achieving the highest Concordance Correlation Coefficient (CCC) and Signed Differential Agreement (SDA) scores, highlighting their effectiveness in modeling the traces’ dynamics. For bounded traces, ordinal representations excel in SDA, revealing their superior ability to capture relative changes of annotated emotion traces.
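
Illustration: the Concordance Correlation Coefficient reported above has a standard closed form, sketched here with population (divide-by-n) statistics. SDA is not sketched because the abstract does not define it. The function name `ccc` is chosen for this example, and constant traces (zero variance and equal means) are not handled.

```python
def ccc(x, y):
    """Concordance Correlation Coefficient between two equally long traces.

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)


trace = [1.0, 2.0, 3.0]
shifted = [2.0, 3.0, 4.0]  # same shape, constant offset of 1
print(ccc(trace, trace), ccc(trace, shifted))
```

Unlike Pearson correlation, which would be 1 for both pairs, CCC drops to 4/7 for the shifted trace because the mean offset is penalized.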

[385] Understanding Tool-Integrated Reasoning

Heng Lin, Zhongwen Xu

Main category: cs.LG

TL;DR: First formal proof that Tool-Integrated Reasoning (TIR) fundamentally expands LLM capabilities by breaking pure-text limitations through tools like Python interpreters, with new ASPO algorithm improving tool usage behavior.

Details

Motivation: While LLMs with tools show promise, there's been no principled theory explaining why this paradigm is effective. The work aims to provide the first formal explanation for TIR's success.

Method: Introduces Advantage Shaping Policy Optimization (ASPO) algorithm to guide model behavior without compromising training stability. Uses Python interpreter as external tool and conducts experiments on mathematical benchmarks.

Result: TIR model decisively outperforms pure-text counterpart on pass@k metric. Advantage extends beyond computational problems to those requiring abstract insight. Shows improved tool usage with early code invocation and more interactive turns.

Conclusion: Provides first principled explanation for TIR’s success, demonstrating tools enable strict expansion of model capabilities by unlocking otherwise impossible problem-solving strategies, shifting focus from whether tools work to why and how they enable more powerful reasoning.

Abstract: We study why Tool-Integrated Reasoning (TIR) makes Large Language Models (LLMs) more capable. While LLMs integrated with tools like Python code interpreters show great promise, a principled theory explaining why this paradigm is effective has been missing. This work provides the first formal proof that TIR fundamentally expands an LLM’s capabilities. We demonstrate that tools enable a strict expansion of the model’s empirical and feasible support, breaking the capability ceiling of pure-text models by unlocking problem-solving strategies that are otherwise impossible or intractably verbose. To guide model behavior without compromising training stability and performance, we also introduce Advantage Shaping Policy Optimization (ASPO), a novel algorithm that directly modifies the advantage function to guide the policy behavior. We conduct comprehensive experiments on challenging mathematical benchmarks, leveraging a Python interpreter as the external tool. Our results show that the TIR model decisively outperforms its pure-text counterpart on the pass@k metric. Crucially, this advantage is not confined to computationally-intensive problems but extends to those requiring significant abstract insight. We further identify the emergent cognitive patterns that illustrate how models learn to think with tools. Finally, we report improved tool usage behavior with early code invocation and much more interactive turns with ASPO. Overall, our work provides the first principled explanation for TIR’s success, shifting the focus from the mere fact that tools work to why and how they enable more powerful reasoning.
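
Illustration: the abstract reports pass@k without saying how it is estimated. A common choice, assumed here, is the unbiased estimator computed from n sampled solutions of which c are correct. The function name `pass_at_k` is chosen for this example.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without replacement from n, is correct.

    n -- total samples generated for a problem
    c -- number of those samples that are correct
    k -- evaluation budget (k <= n)
    """
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: every draw of k contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=4, c=2, k=2))  # 1 - C(2,2)/C(4,2) = 5/6
```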

[386] Predicting the Order of Upcoming Tokens Improves Language Modeling

Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji

Main category: cs.LG

TL;DR: Token Order Prediction (TOP) is proposed as a better auxiliary objective than Multi-Token Prediction (MTP) for language model training, using a learning-to-rank approach to order upcoming tokens by proximity instead of exact prediction.

Details

Motivation: Multi-Token Prediction shows inconsistent improvements and underperforms in standard NLP benchmarks because exact future token prediction is too difficult as an auxiliary loss.

Method: Propose Token Order Prediction (TOP) which trains models to order upcoming tokens by their proximity using a learning-to-rank loss, requiring only a single additional unembedding layer compared to MTP’s multiple transformer layers.

Result: Pretrained models of 340M, 1.8B, and 7B parameters show that TOP overall outperforms both standard next-token prediction (NTP) and MTP across eight standard NLP benchmarks, even at scale.

Conclusion: TOP is a more effective auxiliary objective than MTP for language model training, providing better performance with simpler architecture requirements.

Abstract: Multi-Token Prediction (MTP) has been proposed as an auxiliary objective to improve next-token prediction (NTP) in language model training but shows inconsistent improvements, underperforming in standard NLP benchmarks. We argue that MTP’s exact future token prediction is too difficult as an auxiliary loss. Instead, we propose Token Order Prediction (TOP), which trains models to order upcoming tokens by their proximity using a learning-to-rank loss. TOP requires only a single additional unembedding layer compared to MTP’s multiple transformer layers. We pretrain models of 340M, 1.8B, and 7B parameters using NTP, MTP, and TOP objectives. Results on eight standard NLP benchmarks show that TOP overall outperforms both NTP and MTP even at scale. Our code is available at https://github.com/zaydzuhri/token-order-prediction
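
Illustration: ordering upcoming tokens by proximity can be pictured as building, for each position, a score per vocabulary entry that is larger the sooner that token appears next. The scoring convention below (window size minus offset, zero for tokens absent from the window) and the name `proximity_targets` are assumptions for this sketch. The authors' exact target construction and ranking loss are in their repository and are not reproduced here.

```python
def proximity_targets(tokens, position, window, vocab_size):
    """Score each vocabulary id by how soon it appears after `position`.

    The token at position + 1 gets score `window`, the next one `window - 1`, and so on.
    Tokens that do not occur within the window keep score 0 (assumed convention).
    """
    scores = [0] * vocab_size
    upcoming = tokens[position + 1 : position + 1 + window]
    for offset, tok in enumerate(upcoming):
        if scores[tok] == 0:  # keep only the nearest occurrence of a repeated token
            scores[tok] = window - offset
    return scores


# Sequence of token ids; look ahead 3 tokens from position 0.
print(proximity_targets([0, 2, 1, 2, 3], position=0, window=3, vocab_size=4))  # [0, 2, 3, 0]
```

A ranking loss then only needs the model to order token 2 above token 1 above the rest, which is a weaker requirement than predicting each future token exactly.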

[387] Beyond Discriminant Patterns: On the Robustness of Decision Rule Ensembles

Xin Du, Subramanian Ramamoorthy, Wouter Duivesteijn, Jin Tian, Mykola Pechenizkiy

Main category: cs.LG

TL;DR: Proposes a causal knowledge-based method to learn robust local decision rule ensembles that maintain performance under distributional shifts in deployment environments.

Details

Motivation: Local decision rules are considered explainable but lack robustness against distributional shifts, which is critical for high-stakes domains like healthcare and finance where models need to perform reliably in different environments.

Method: Leverages causal knowledge by treating distributional shifts as interventions, and proposes two causal-based regularization terms to search for optimal and stable rules that work across different environments.

Result: Experiments on synthetic and benchmark datasets demonstrate the method’s effectiveness and robustness against distributional shifts in multiple environments.

Conclusion: The proposed causal knowledge-based approach successfully creates robust local decision rule ensembles that maintain performance under distributional shifts, addressing a critical gap in deploying ML models in real-world high-stakes applications.

Abstract: Local decision rules are commonly understood to be more explainable, due to the local nature of the patterns involved. With numerical optimization methods such as gradient boosting, ensembles of local decision rules can gain good predictive performance on data involving global structure. Meanwhile, machine learning models are being increasingly used to solve problems in high-stakes domains including healthcare and finance. Here, there is an emerging consensus regarding the need for practitioners to understand whether and how those models could perform robustly in the deployment environments, in the presence of distributional shifts. Past research on local decision rules has focused mainly on maximizing discriminant patterns, without due consideration of robustness against distributional shifts. In order to fill this gap, we propose a new method to learn and ensemble local decision rules that are robust both in the training and deployment environments. Specifically, we propose to leverage causal knowledge by regarding the distributional shifts in subpopulations and deployment environments as the results of interventions on the underlying system. We propose two regularization terms based on causal knowledge to search for optimal and stable rules. Experiments on both synthetic and benchmark datasets show that our method is effective and robust against distributional shifts in multiple environments.

[388] Retrieval Enhanced Feedback via In-context Neural Error-book

Jongyeop Hyun, Bumsoo Kim

Main category: cs.LG

TL;DR: REFINE is a teacher-student framework that uses structured error analysis and targeted feedback to improve multimodal reasoning in LLMs, optimizing retrieval efficiency and reducing computational costs.

Details

Motivation: Existing methods for learning from errors in multimodal LLMs lack structured frameworks for error analysis and mitigation, particularly when integrating visual and textual inputs, leading to inefficiencies and poor performance.

Method: Proposes REFINE framework with three systematic queries (Feed-Target, Feed-Check, Feed-Path) to structure errors, provide targeted feedback, prioritize visual information, diagnose failures, and formulate corrective actions while optimizing retrieval efficiency.

Result: Demonstrates substantial speedup, reduced computational costs, successful generalization, and improved multimodal reasoning performance compared to previous approaches with redundant retrievals.

Conclusion: REFINE provides an effective structured framework for error analysis and feedback in multimodal LLMs, offering improved efficiency, scalability, and reasoning capabilities through optimized retrieval and systematic error handling.

Abstract: Recent advancements in Large Language Models (LLMs) have significantly improved reasoning capabilities, with in-context learning (ICL) emerging as a key technique for adaptation without retraining. While previous works have focused on leveraging correct examples, recent research highlights the importance of learning from errors to enhance performance. However, existing methods lack a structured framework for analyzing and mitigating errors, particularly in Multimodal Large Language Models (MLLMs), where integrating visual and textual inputs adds complexity. To address this issue, we propose REFINE: Retrieval-Enhanced Feedback via In-context Neural Error-book, a teacher-student framework that systematically structures errors and provides targeted feedback. REFINE introduces three systematic queries to construct structured feedback – Feed-Target, Feed-Check, and Feed-Path – to enhance multimodal reasoning by prioritizing relevant visual information, diagnosing critical failure points, and formulating corrective actions. Unlike prior approaches that rely on redundant retrievals, REFINE optimizes structured feedback retrieval, improving inference efficiency, token usage, and scalability. Our results demonstrate substantial speedup, reduced computational costs, and successful generalization, highlighting REFINE’s potential for enhancing multimodal reasoning.

[389] Rethinking Distribution Shifts: Empirical Analysis and Inductive Modeling for Tabular Data

Tianyu Wang, Jiashuo Liu, Peng Cui, Hongseok Namkoong

Main category: cs.LG

TL;DR: Empirical analysis of distribution shifts reveals Y|X-shifts are most prevalent (contrary to ML literature’s focus on X-shifts), shows robust algorithms perform no better than vanilla methods, and finds implementation details matter more than theoretical robustness approaches.

Details

Motivation: To address the gap between theoretical robust algorithm development and empirical validation, and to understand what types of distribution shifts actually occur in practice versus what the literature focuses on.

Method: Built an empirical testbed with 8 tabular datasets, 172 distribution pairs, testing 45 methods across 90,000 configurations including ERM and DRO methods, followed by in-depth analysis of implementation details.

Result: Y|X-shifts are most prevalent (not X-shifts as commonly assumed), robust algorithms perform similarly to vanilla methods, and implementation choices (model class, hyperparameters) have greater impact than theoretical robustness parameters.

Conclusion: A data-driven, inductive approach to understanding distribution shifts provides a more effective path for algorithm development than theoretical assumptions, with implementation details being critically important.

Abstract: Different distribution shifts require different interventions, and algorithms must be grounded in the specific shifts they address. However, methodological development for robust algorithms typically relies on structural assumptions that lack empirical validation. Advocating for an empirically grounded data-driven approach to algorithm development, we build an empirical testbed comprising natural shifts across 8 tabular datasets, 172 distribution pairs over 45 methods and 90,000 method configurations encompassing empirical risk minimization and distributionally robust optimization (DRO) methods. We find $Y|X$-shifts are most prevalent in our testbed, in stark contrast to the heavy focus on $X$ (covariate)-shifts in the ML literature, and that the performance of robust algorithms is no better than that of vanilla methods. To understand why, we conduct an in-depth empirical analysis of DRO methods and find that underlooked implementation details – such as the choice of underlying model class (e.g., LightGBM) and hyperparameter selection – have a bigger impact on performance than the ambiguity set or its radius. We illustrate via case studies how a data-driven, inductive understanding of distribution shifts can provide a new approach to algorithm development.

Jongwoo Kim, Seongyeub Chu, Hyeongmin Park, Bryan Wong, Keejun Han, Mun Yong Yi

Main category: cs.LG

TL;DR: MF2Vec introduces multi-faceted paths instead of predefined meta-paths for heterogeneous graph analysis, outperforming existing methods in node classification, link prediction, and clustering tasks.

Details

Motivation: Existing heterogeneous GNN methods rely on domain-specific predefined meta-paths that are coarse-grained and limited to node types, restricting their ability to capture complex interactions in networks.

Method: MF2Vec extracts paths via random walks and generates multi-faceted vectors without predefined schemas, learning diverse aspects of nodes and relationships to construct homogeneous networks for embedding creation.

Result: Extensive experiments demonstrate that MF2Vec outperforms existing methods across various tasks including classification, link prediction, and clustering.

Conclusion: MF2Vec provides a more flexible and comprehensive framework for analyzing complex networks by using fine-grained multi-faceted paths instead of traditional predefined meta-paths.

Abstract: Recent advancements in graph neural networks (GNNs) and heterogeneous GNNs (HGNNs) have advanced node embeddings and relationship learning for various tasks. However, existing methods often rely on domain-specific predefined meta-paths, which are coarse-grained and focus solely on aspects like node type, limiting their ability to capture complex interactions. We introduce MF2Vec, a model that uses multi-faceted (fine-grained) paths instead of predefined meta-paths. MF2Vec extracts paths via random walks and generates multi-faceted vectors, ignoring predefined schemas. This method learns diverse aspects of nodes and their relationships, constructs a homogeneous network, and creates node embeddings for classification, link prediction, and clustering. Extensive experiments show that MF2Vec outperforms existing methods, offering a more flexible and comprehensive framework for analyzing complex networks. The code is available at https://anonymous.4open.science/r/MF2Vec-6ABC.
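
Illustration: the path extraction step relies on random walks that ignore any predefined schema. A uniform random walk over an adjacency list, sketched below, is the simplest version of that step. The facet vectors and the later homogeneous-network construction are not sketched, and the function name and toy graph are hypothetical.

```python
import random


def random_walk(adjacency, start, length, rng):
    """Uniform random walk of at most `length` nodes, with no meta-path constraint on node types."""
    walk = [start]
    while len(walk) < length:
        neighbors = adjacency.get(walk[-1], [])
        if not neighbors:
            break  # dead end: stop early
        walk.append(rng.choice(neighbors))
    return walk


# Toy heterogeneous graph: authors (a*), papers (p*), and a venue (v1) share one adjacency list.
graph = {"a1": ["p1"], "p1": ["a1", "a2", "v1"], "a2": ["p1"], "v1": ["p1"]}
walk = random_walk(graph, "a1", length=5, rng=random.Random(0))
```

Because the walk never checks node types, it can produce paths such as author, paper, venue, paper, author without anyone having declared that pattern in advance.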

[391] Overcoming label shift with target-aware federated learning

Edvin Listo Zec, Adam Breitholtz, Fredrik D. Johansson

Main category: cs.LG

TL;DR: FedPALS addresses label shift problems in federated learning by proposing a target-aware model aggregation scheme that leverages server knowledge of label distributions to improve performance when client and target domain label distributions differ.

Details

Motivation: Existing federated learning algorithms assume the target domain shares data distribution with client aggregates, but this is often violated in practice due to label shift, which significantly degrades performance.

Method: FedPALS is a principled model aggregation scheme that adapts to label shifts by leveraging knowledge of label distributions at the central server, ensuring unbiased updates under federated stochastic gradient descent.

Result: Extensive experiments on image classification tasks show FedPALS consistently outperforms baselines by aligning model aggregation with the target domain, especially in cases of extreme label sparsity on clients.

Conclusion: Conventional federated learning methods suffer severely from label shift, highlighting the critical need for target-aware aggregation approaches like FedPALS to ensure robust generalization across clients with diverse, label-shifted data.

Abstract: Federated learning enables multiple actors to collaboratively train models without sharing private data. Existing algorithms are successful and well-justified in this task when the intended target domain, where the trained model will be used, shares data distribution with the aggregate of clients, but this is often violated in practice. A common reason is label shift – that the label distributions differ between clients and the target domain. We demonstrate empirically that this can significantly degrade performance. To address this problem, we propose FedPALS, a principled and practical model aggregation scheme that adapts to label shifts to improve performance in the target domain by leveraging knowledge of label distributions at the central server. Our approach ensures unbiased updates under federated stochastic gradient descent which yields robust generalization across clients with diverse, label-shifted data. Extensive experiments on image classification tasks demonstrate that FedPALS consistently outperforms baselines by aligning model aggregation with the target domain. Our findings reveal that conventional federated learning methods suffer severely in cases of extreme label sparsity on clients, highlighting the critical need for target-aware aggregation as offered by FedPALS.
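
Illustration: one way to picture target-aware aggregation is to choose client weights on the probability simplex so that the weighted mixture of client label distributions is as close as possible to the target label distribution. The sketch below does this with projected gradient descent. It is an assumption-laden toy rather than the FedPALS objective, which the abstract does not specify, and all function names are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1} (sort-based algorithm)."""
    theta, cumulative = 0.0, 0.0
    for i, u in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u
        candidate = (cumulative - 1.0) / i
        if u - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def target_aware_weights(client_label_dists, target_dist, steps=500, lr=0.1):
    """Minimise || sum_c w_c * p_c - target ||^2 over simplex weights w."""
    k, labels = len(client_label_dists), range(len(target_dist))
    w = [1.0 / k] * k  # start from uniform averaging
    for _ in range(steps):
        mixture = [sum(w[c] * client_label_dists[c][y] for c in range(k)) for y in labels]
        residual = [m - t for m, t in zip(mixture, target_dist)]
        grad = [2 * sum(client_label_dists[c][y] * residual[y] for y in labels) for c in range(k)]
        w = project_to_simplex([wi - lr * g for wi, g in zip(w, grad)])
    return w


# Extreme label sparsity: each client holds a single class; the target is 30% / 70%.
weights = target_aware_weights([[1.0, 0.0], [0.0, 1.0]], [0.3, 0.7])
```

In this toy case uniform averaging would implicitly assume a 50/50 target, while the fitted weights move toward 0.3 and 0.7, matching the target label proportions.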

[392] Secure Reinforcement Learning via Shuffle Privacy Model

Shaojie Bai, Mohammad Sadegh Talebi, Chengcheng Zhao, Peng Cheng, Jiming Chen

Main category: cs.LG

TL;DR: First shuffle model-based RL algorithm (SDP-PE) for CPS that achieves near-optimal regret with strong privacy guarantees, outperforming local DP models.

Details

Motivation: Privacy concerns in RL for Cyber-Physical Systems, with existing DP models being inadequate - centralized requires trusted server (single point of failure), local causes performance degradation unsuitable for control applications.

Method: Shuffle Differentially Private Policy Elimination (SDP-PE) algorithm with novel exponential batching schedule and “forgetting” mechanism to balance privacy and learning performance under shuffle privacy model.

Result: Achieves near-optimal regret bound, demonstrating superior privacy-regret trade-off that significantly outperforms local model.

Conclusion: Establishes viability of shuffle model for secure data-driven control in advanced CPS, providing strong privacy guarantees without centralized trust assumption.

Abstract: Reinforcement learning (RL) is a powerful tool for sequential decision-making, but its application is often hindered by privacy concerns arising from its interaction data. This challenge is particularly acute in advanced Cyber-Physical Systems (CPS), where learning from operational and user data can expose systems to privacy inference attacks. Existing differential privacy (DP) models for RL are often inadequate: the centralized model requires a fully trusted server, creating a single point of failure risk, while the local model incurs significant performance degradation that is unsuitable for many control applications. This paper addresses this gap by leveraging the emerging shuffle model of privacy, an intermediate trust model that provides strong privacy guarantees without a centralized trust assumption. We present Shuffle Differentially Private Policy Elimination (SDP-PE), the first generic policy elimination-based algorithm for episodic RL under the shuffle model. Our method introduces a novel exponential batching schedule and a “forgetting” mechanism to balance the competing demands of privacy and learning performance. Our analysis shows that SDP-PE achieves a near-optimal regret bound, demonstrating a superior privacy-regret trade-off that significantly outperforms the local model. This work establishes the viability of the shuffle model for secure data-driven control in advanced CPS.
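
Illustration: an exponential batching schedule can be pictured as batches of episodes whose lengths grow geometrically, so that the policy is updated (and privatized data released) only a logarithmic number of times. The abstract does not state the growth factor or the first batch size, so the doubling rule and the name `exponential_batches` below are assumptions.

```python
def exponential_batches(total_episodes, first_batch=1):
    """Split `total_episodes` into batches whose sizes double; the last batch is truncated."""
    batches = []
    remaining, size = total_episodes, first_batch
    while remaining > 0:
        current = min(size, remaining)
        batches.append(current)
        remaining -= current
        size *= 2  # assumed growth factor
    return batches


print(exponential_batches(10))  # [1, 2, 4, 3]
```

With doubling batches, 1,000 episodes need only 10 updates, which limits how often noisy, privatized statistics have to be aggregated.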

[393] Hierarchical Object-Oriented POMDP Planning for Object Rearrangement

Rajesh Mangannavar, Alan Fern, Prasad Tadepalli

Main category: cs.LG

TL;DR: Online planning framework for multi-object rearrangement in partially observable multi-room environments using hierarchical POMDP approach with new MultiRoomR benchmark.

Details

Motivation: Current object rearrangement solutions lack adaptability to diverse challenges in partially observable environments, requiring more flexible planning methods.

Method: Hierarchical Object-Oriented POMDP (HOO-POMDP) approach with object-oriented planner generating sub-goals, low-level policies for sub-goal achievement, and abstraction system for continuous-to-abstract representation conversion.

Result: System effectively handles complex multi-room scenarios with 10-30% initial visibility, blocked paths, obstructed goals, and 10-20 objects across 2-4 rooms, maintaining robust performance with imperfect perception.

Conclusion: The proposed HOO-POMDP framework successfully addresses multi-object rearrangement challenges in partially observable multi-room environments and demonstrates promising results on both existing benchmarks and the new MultiRoomR dataset.

Abstract: We present an online planning framework and a new benchmark dataset for solving multi-object rearrangement problems in partially observable, multi-room environments. Current object rearrangement solutions, primarily based on Reinforcement Learning or hand-coded planning methods, often lack adaptability to diverse challenges. To address this limitation, we introduce a novel Hierarchical Object-Oriented Partially Observed Markov Decision Process (HOO-POMDP) planning approach. This approach comprises (a) an object-oriented POMDP planner generating sub-goals, (b) a set of low-level policies for sub-goal achievement, and (c) an abstraction system converting the continuous low-level world into a representation suitable for abstract planning. To enable rigorous evaluation of rearrangement challenges, we introduce MultiRoomR, a comprehensive benchmark featuring diverse multi-room environments with varying degrees of partial observability (10-30% initial visibility), blocked paths, obstructed goals, and multiple objects (10-20) distributed across 2-4 rooms. Experiments demonstrate that our system effectively handles these complex scenarios while maintaining robust performance even with imperfect perception, achieving promising results across both existing benchmarks and our new MultiRoomR dataset.

[394] Branch and Bound for Piecewise Linear Neural Network Verification

Rudy Bunel, Jingyue Lu, Ilker Turkaslan, Philip H. S. Torr, Pushmeet Kohli, M. Pawan Kumar

Main category: cs.LG

TL;DR: Proposes Branch-and-Bound algorithms for neural network verification using Mixed Integer Linear Programming, achieving state-of-the-art performance and handling high-dimensional convolutional networks.

Details

Motivation: Address the scalability limitations of existing neural network verification methods for safety-critical applications by developing more efficient formal verification techniques.

Method: Uses Mixed Integer Linear Programming formulation with Branch-and-Bound framework, introducing new branching strategies on ReLU non-linearities and combining strengths of multiple existing approaches.

Result: Significant performance improvements over previous state-of-the-art methods, successful verification of high-dimensional convolutional networks where previous methods failed, and comprehensive benchmark datasets.

Conclusion: The Branch-and-Bound framework enables effective neural network verification, provides new insights into verification hardness factors, and establishes a foundation for scalable verification of realistic neural networks.

Abstract: The success of Deep Learning and its potential use in many safety-critical applications has motivated research on formal verification of Neural Network (NN) models. In this context, verification involves proving or disproving that an NN model satisfies certain input-output properties. Despite the reputation of learned NN models as black boxes, and the theoretical hardness of proving useful properties about them, researchers have been successful in verifying some classes of models by exploiting their piecewise linear structure and taking insights from formal methods such as Satisfiability Modulo Theory. However, these methods are still far from scaling to realistic neural networks. To facilitate progress on this crucial area, we exploit the Mixed Integer Linear Programming (MIP) formulation of verification to propose a family of algorithms based on Branch-and-Bound (BaB). We show that our family contains previous verification methods as special cases. With the help of the BaB framework, we make three key contributions. Firstly, we identify new methods that combine the strengths of multiple existing approaches, accomplishing significant performance improvements over previous state of the art. Secondly, we introduce an effective branching strategy on ReLU non-linearities. This branching strategy allows us to efficiently and successfully deal with high input dimensional problems with convolutional network architecture, on which previous methods fail frequently. Finally, we propose comprehensive test data sets and benchmarks, which include a collection of previously released test cases. We use the data sets to conduct a thorough experimental comparison of existing and new algorithms and to provide an inclusive analysis of the factors impacting the hardness of verification problems.
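
To make the MIP view of a ReLU concrete, here is a small sketch of the textbook big-M encoding of y = max(0, x) with known bounds l <= x <= u and a binary variable a. It is the standard encoding, not necessarily the exact formulation used by the authors, and the function names are hypothetical. Branching on a ReLU then corresponds to fixing a to 0 or to 1.

```python
def relu_bigm_feasible(x, y, a, lower, upper, tol=1e-9):
    """Check the standard big-M constraints encoding y = max(0, x),
    assuming lower <= x <= upper with lower < 0 < upper:

        y >= 0,  y >= x,  y <= upper * a,  y <= x - lower * (1 - a),
        a in {0, 1}
    """
    return (
        a in (0, 1)
        and y >= -tol
        and y >= x - tol
        and y <= upper * a + tol
        and y <= x - lower * (1 - a) + tol
    )


def feasible_outputs(x, lower, upper, candidates):
    """Collect every candidate y that is feasible for some binary a."""
    return sorted(
        {y for y in candidates for a in (0, 1)
         if relu_bigm_feasible(x, y, a, lower, upper)}
    )


# Candidate outputs on a grid from -2.0 to 3.0 in steps of 0.5.
grid = [k * 0.5 for k in range(-4, 7)]
active_case = feasible_outputs(1.5, -2.0, 3.0, grid)     # x > 0
inactive_case = feasible_outputs(-1.0, -2.0, 3.0, grid)  # x < 0
```

For each input, the only feasible output is the ReLU value itself: fixing a = 1 forces y = x (active phase) and fixing a = 0 forces y = 0 (inactive phase), which is what makes the binary variable a natural branching target.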

[395] Sharp Lower Bounds on Interpolation by Deep ReLU Neural Networks at Irregularly Spaced Data

Jonathan W. Siegel

Main category: cs.LG

TL;DR: Deep ReLU networks require Ω(N) parameters to interpolate N datapoints in the unit ball when the separation distance δ is exponentially small in N, which also shows that the bit-extraction technique behind VC-dimension lower bounds does not extend to irregularly spaced data.

Details

Motivation: To understand the interpolation capabilities and parameter efficiency of deep ReLU neural networks, particularly for irregularly spaced datapoints in the unit ball.

Method: Theoretical analysis of the minimum number of parameters required for deep ReLU networks to interpolate N datapoints separated by distance δ in the unit ball, focusing on the regime where δ is exponentially small in N.

Result: Ω(N) parameters are necessary in the exponential separation regime, which is tight since O(N) parameters are always sufficient. This also demonstrates that bit-extraction VC dimension techniques cannot be applied to irregularly spaced datapoints.

Conclusion: Deep ReLU networks have fundamental limitations in parameter efficiency for interpolating exponentially separated datapoints, with implications for approximation rates in Sobolev spaces at embedding endpoints.

Abstract: We study the interpolation power of deep ReLU neural networks. Specifically, we consider the question of how efficiently, in terms of the number of parameters, deep ReLU networks can interpolate values at $N$ datapoints in the unit ball which are separated by a distance $\delta$. We show that $\Omega(N)$ parameters are required in the regime where $\delta$ is exponentially small in $N$, which gives the sharp result in this regime since $O(N)$ parameters are always sufficient. This also shows that the bit-extraction technique used to prove lower bounds on the VC dimension cannot be applied to irregularly spaced datapoints. Finally, as an application we give a lower bound on the approximation rates that deep ReLU neural networks can achieve for Sobolev spaces at the embedding endpoint.
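
The upper-bound side of the statement (that O(N) parameters always suffice) can be illustrated in one dimension with the standard piecewise-linear construction: one hidden ReLU unit per gap between consecutive points. This is only a sketch of that well-known 1D construction with hypothetical function names; it does not reproduce the paper's setting in the unit ball of general dimension, nor its lower-bound argument.

```python
def fit_relu_interpolant(xs, ys):
    """Build f(x) = bias + sum_i coeffs[i] * relu(x - knots[i]) that passes
    through all (x, y) pairs. Uses N - 1 hidden units for N distinct points,
    so the parameter count is linear in N regardless of the point spacing."""
    pts = sorted(zip(xs, ys))
    knots = [x for x, _ in pts[:-1]]
    # Slope of the interpolant on each interval between consecutive points.
    slopes = [(y1 - y0) / (x1 - x0) for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    # Each unit contributes the change in slope at its knot.
    coeffs = [s - prev for prev, s in zip([0.0] + slopes, slopes)]
    bias = pts[0][1]
    return bias, knots, coeffs


def evaluate(model, x):
    """Evaluate the shallow ReLU network returned by fit_relu_interpolant."""
    bias, knots, coeffs = model
    return bias + sum(c * max(0.0, x - k) for c, k in zip(coeffs, knots))


# Irregular spacing: two of the points are only 0.01 apart.
xs = [-0.9, -0.5, -0.49, 0.3, 0.8]
ys = [1.0, -2.0, 0.5, 0.5, 3.0]
model = fit_relu_interpolant(xs, ys)
max_error = max(abs(evaluate(model, x) - y) for x, y in zip(xs, ys))
```

Closely spaced points only make the coefficients large, not the network bigger, which is consistent with the claim that O(N) parameters are always sufficient.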

[396] Contraction Properties of the Global Workspace Primitive

Michaela Ennis, Leo Kozachkov, Jean-Jacques Slotine

Main category: cs.LG

TL;DR: This paper expands on provably stable multi-area RNN architectures, proving relaxed stability conditions for global workspace structures and demonstrating empirical success with sparse connectivity between modules.

Details

Motivation: To advance research on multi-area recurrent neural networks by building on existing provably stable RNN architectures and exploring how specialized connectivity structures can improve performance and resilience.

Method: Theoretical analysis of stability conditions for global workspace modular RNN structures, combined with empirical evaluation of Global Workspace Sparse Combo Nets with sparse inter-module connectivity and few trainable parameters.

Result: Achieved strong test performance and greater resilience to subnetwork removal with global workspace topology, while also improving state-of-the-art performance for stable RNNs on benchmark sequence tasks through sparsity exploration.

Conclusion: Specialized graph structures and stability preservation are crucial for successful modular RNN architectures, with global workspace topology and sparse connectivity proving particularly effective for both performance and robustness.

Abstract: To push forward the important emerging research field surrounding multi-area recurrent neural networks (RNNs), we expand theoretically and empirically on the provably stable RNNs of RNNs introduced by Kozachkov et al. in “RNNs of RNNs: Recursive Construction of Stable Assemblies of Recurrent Neural Networks”. We prove relaxed stability conditions for salient special cases of this architecture, most notably for a global workspace modular structure. We then demonstrate empirical success for Global Workspace Sparse Combo Nets with a small number of trainable parameters, not only through strong overall test performance but also greater resilience to removal of individual subnetworks. These empirical results for the global workspace inter-area topology are contingent on stability preservation, highlighting the relevance of our theoretical work for enabling modular RNN success. Further, by exploring sparsity in the connectivity structure between different subnetwork modules more broadly, we improve the state of the art performance for stable RNNs on benchmark sequence processing tasks, thus underscoring the general utility of specialized graph structures for multi-area RNNs.
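
As background on what a contraction condition looks like, the sketch below uses a generic sufficient condition for a single small RNN, dx/dt = -x + W tanh(x): if the maximum absolute row sum of W is below 1, any two trajectories approach each other in the infinity norm. This is a textbook-style illustration, not the relaxed conditions proved in the paper, and the weights, step size, and function names are assumptions chosen for the example.

```python
import math


def row_sum_norm(W):
    """Induced infinity norm of W: the maximum absolute row sum."""
    return max(sum(abs(w) for w in row) for row in W)


def euler_step(x, W, h):
    """One explicit Euler step of dx/dt = -x + W tanh(x)."""
    act = [math.tanh(v) for v in x]
    return [
        (1.0 - h) * xi + h * sum(w * a for w, a in zip(row, act))
        for xi, row in zip(x, W)
    ]


def inf_dist(x, z):
    """Infinity-norm distance between two states."""
    return max(abs(a - b) for a, b in zip(x, z))


def distance_trace(x, z, W, h=0.1, steps=50):
    """Distance between two trajectories started at x and z, per step."""
    trace = [inf_dist(x, z)]
    for _ in range(steps):
        x, z = euler_step(x, W, h), euler_step(z, W, h)
        trace.append(inf_dist(x, z))
    return trace


W = [[0.2, -0.5], [0.4, 0.3]]  # hypothetical weights, row sums about 0.7
trace = distance_trace([2.0, -1.0], [-1.5, 0.5], W)
```

Because tanh is 1-Lipschitz, each Euler step with 0 < h <= 1 shrinks the distance by at least the factor 1 - h * (1 - row_sum_norm(W)), so the trace decays geometrically for these weights.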

[397] Learning Optimal Classification Trees Robust to Distribution Shifts

Nathan Justin, Sina Aghaei, Andrés Gómez, Phebe Vayanos

Main category: cs.LG

TL;DR: A method for learning robust classification trees using mixed-integer robust optimization to handle distribution shifts between training and testing data, achieving up to 12.48% improvement in worst-case accuracy.

Details

Motivation: Address distribution shifts in high-stakes settings like public health and social work where self-reported survey data is sensitive to various factors like question framing, timing, location, and interviewee comfort levels.

Method: Cast the problem as a single-stage mixed-integer robust optimization with nonlinear objective, then reformulate as a two-stage linear robust optimization problem with constraint generation solution procedure.

Result: Up to 12.48% increase in worst-case accuracy and 4.85% increase in average-case accuracy across multiple datasets compared to non-robust optimal trees.

Conclusion: The proposed robust optimization approach effectively handles distribution shifts and significantly improves classification tree performance in real-world scenarios with data collection inconsistencies.

Abstract: We consider the problem of learning classification trees that are robust to distribution shifts between training and testing/deployment data. This problem arises frequently in high stakes settings such as public health and social work where data is often collected using self-reported surveys which are highly sensitive to e.g., the framing of the questions, the time when and place where the survey is conducted, and the level of comfort the interviewee has in sharing information with the interviewer. We propose a method for learning optimal robust classification trees based on mixed-integer robust optimization technology. In particular, we demonstrate that the problem of learning an optimal robust tree can be cast as a single-stage mixed-integer robust optimization problem with a highly nonlinear and discontinuous objective. We reformulate this problem equivalently as a two-stage linear robust optimization problem for which we devise a tailored solution procedure based on constraint generation. We evaluate the performance of our approach on numerous publicly available datasets, and compare the performance to a regularized, non-robust optimal tree. We show an increase of up to 12.48% in worst-case accuracy and of up to 4.85% in average-case accuracy across several datasets and distribution shifts from using our robust solution in comparison to the non-robust one.

[398] Uncertainty-Calibrated Test-Time Model Adaptation without Forgetting

Mingkui Tan, Guohao Chen, Jiaxiang Wu, Yifan Zhang, Yaofo Chen, Peilin Zhao, Shuaicheng Niu

Main category: cs.LG

TL;DR: EATA-C is an efficient test-time adaptation method that addresses computational costs and forgetting issues while improving calibration by handling model uncertainty and data uncertainty separately.

Details

Motivation: Existing test-time adaptation methods are computationally expensive (require backpropagation per sample) and suffer from performance degradation on in-distribution data after adaptation (forgetting problem). They also produce overconfident predictions for uncertain samples.

Method: Proposes EATA with active sample selection and Fisher regularization to prevent forgetting. EATA-C adds model uncertainty measurement via prediction divergence between full network and sub-networks, and data uncertainty handling through min-max entropy regularizer based on label disagreement.

Result: Experiments on image classification and semantic segmentation show the methods effectively improve test performance on out-of-distribution data while maintaining in-distribution performance and providing better calibrated predictions.

Conclusion: EATA-C successfully addresses computational efficiency, forgetting, and calibration issues in test-time adaptation by separately handling model and data uncertainties through divergence loss and adaptive entropy regularization.

Abstract: Test-time adaptation (TTA) seeks to tackle potential distribution shifts between training and test data by adapting a given model w.r.t. any test sample. Although recent TTA has shown promising performance, we still face two key challenges: 1) prior methods perform backpropagation for each test sample, resulting in prohibitive optimization costs for many applications; 2) while existing TTA methods can significantly improve the test performance on out-of-distribution data, they often suffer from severe performance degradation on in-distribution data after TTA (known as forgetting). To this end, we have proposed an Efficient Anti-Forgetting Test-Time Adaptation (EATA) method which develops an active sample selection criterion to identify reliable and non-redundant samples for test-time entropy minimization. To alleviate forgetting, EATA introduces a Fisher regularizer estimated from test samples to constrain important model parameters from drastic changes. However, in EATA, the adopted entropy loss consistently assigns higher confidence to predictions even for samples that are inherently uncertain, leading to overconfident predictions. To tackle this, we further propose EATA with Calibration (EATA-C) to separately exploit the reducible model uncertainty and the inherent data uncertainty for calibrated TTA. Specifically, we measure the model uncertainty by the divergence between predictions from the full network and its sub-networks, on which we propose a divergence loss to encourage consistent predictions instead of overconfident ones. To further recalibrate prediction confidence, we utilize the disagreement among predicted labels as an indicator of the data uncertainty, and then devise a min-max entropy regularizer to selectively increase and decrease prediction confidence for different samples. Experiments on image classification and semantic segmentation verify the effectiveness of our methods.
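
Two ingredients named in the abstract can be sketched in a few lines: keeping only low-entropy (reliable) test samples, and measuring the divergence between a full network's prediction and a sub-network's prediction. This is a simplified illustration under stated assumptions: the threshold value and the use of KL divergence are choices made here, the non-redundancy filter and the Fisher regularizer are omitted, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_reliable(batch_logits, threshold):
    """Indices of samples whose predictive entropy is below the threshold."""
    return [
        i for i, logits in enumerate(batch_logits)
        if entropy(softmax(logits)) < threshold
    ]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), used here as a stand-in for the prediction divergence
    between a full network (p) and one of its sub-networks (q)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q) if pi > 0.0
    )


batch = [[8.0, 0.0, 0.0], [0.1, 0.0, 0.05], [0.0, 5.0, 0.0]]
threshold = 0.4 * math.log(3)  # assumed fraction of the maximum entropy
reliable = select_reliable(batch, threshold)
```

Here the near-uniform middle sample is filtered out, so only confident samples would contribute to entropy minimization; a large divergence between full-network and sub-network predictions would instead flag model uncertainty.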

[399] Provably-Safe Neural Network Training Using Hybrid Zonotope Reachability Analysis

Long Kiu Chung, Shreyas Kousik

Main category: cs.LG

TL;DR: A training method for ReLU neural networks that ensures output safety by avoiding non-convex unsafe regions using reachability analysis with scaled hybrid zonotopes and MILP-based collision checks.

Details

Motivation: Neural networks are increasingly used in safety-critical control applications but lack methods to enforce output constraints and guarantee safety, especially for non-convex sets. Existing verification methods don't effectively correct unsafe networks.

Method: Uses reachability analysis with scaled hybrid zonotopes (modified hybrid zonotope representation) to enable parameterized scaling of non-convex polytopic sets. Employs differentiable collision checks via mixed-integer linear programs (MILPs) to train networks to avoid unsafe regions.

Result: Method shown to be effective and fast for networks with up to 240 neurons. Computational cost is dominated by inverting matrices whose size scales linearly with the number of neurons and the complexity of the input and unsafe sets. Successfully trained a forward-invariant controller for an affine dynamical system and generated safe reach-avoid plans for a black-box system.

Conclusion: The proposed approach trains ReLU networks so that the exact image of a non-convex input set avoids a non-convex unsafe region, addressing a gap in neural network safety verification and correction for control applications.

Abstract: Even though neural networks are being increasingly deployed in safety-critical control applications, it remains difficult to enforce constraints on their output, meaning that it is hard to guarantee safety in such settings. While many existing methods seek to verify a neural network’s satisfaction of safety constraints, few address how to correct an unsafe network. The handful of works that extract a training signal from verification cannot handle non-convex sets, and are either conservative or slow. To begin addressing these challenges, this work proposes a neural network training method that can encourage the exact image of a non-convex input set for a neural network with rectified linear unit (ReLU) nonlinearities to avoid a non-convex unsafe region. This is accomplished by reachability analysis with scaled hybrid zonotopes, a modification of the existing hybrid zonotope set representation that enables parameterized scaling of non-convex polytopic sets with a differentiable collision check via mixed-integer linear programs (MILPs). The proposed method was shown to be effective and fast for networks with up to 240 neurons, with the computational complexity dominated by inverse operations on matrices that scale linearly in size with the number of neurons and complexity of input and unsafe sets. We demonstrate the practicality of our method by training a forward-invariant neural network controller for an affine dynamical system with a non-convex input set, as well as generating safe reach-avoid plans for a black-box dynamical system.

[400] No-Regret M${}^{\natural}$-Concave Function Maximization: Stochastic Bandit Algorithms and Hardness of Adversarial Full-Information Setting

Taihei Oki, Shinsaku Sakaue

Main category: cs.LG

TL;DR: Online M♮-concave function maximization with bandit feedback, showing positive results for stochastic setting but impossibility in adversarial setting.

Details

Motivation: M♮-concave functions are fundamental in discrete math and economics, but perfect knowledge is often unavailable in practice, requiring interactive optimization based on feedback.

Method: Study online M♮-concave function maximization problems, present O(T^{-1/2})-simple regret and O(T^{2/3})-regret algorithms for stochastic bandit setting using unbiased noisy value oracles, and prove impossibility in adversarial setting via reduction from matroid intersection problem.

Result: Positive results: efficient algorithms achieve O(T^{-1/2}) simple regret and O(T^{2/3}) regret in the stochastic bandit setting. Negative result: in the adversarial setting, no algorithm running in polynomial time per round can achieve O(T^{1-c}) regret for any constant c > 0, even with full-information feedback.

Conclusion: M♮-concave function maximization is tractable in stochastic online settings but fundamentally hard in adversarial settings, with novel hardness proof approach via matroid intersection reduction.

Abstract: M${}^{\natural}$-concave functions, a.k.a. gross substitute valuation functions, play a fundamental role in many fields, including discrete mathematics and economics. In practice, perfect knowledge of M${}^{\natural}$-concave functions is often unavailable a priori, and we can optimize them only interactively based on some feedback. Motivated by such situations, we study online M${}^{\natural}$-concave function maximization problems, which are interactive versions of the problem studied by Murota and Shioura (1999). For the stochastic bandit setting, we present $O(T^{-1/2})$-simple regret and $O(T^{2/3})$-regret algorithms under $T$ times access to unbiased noisy value oracles of M${}^{\natural}$-concave functions. A key to proving these results is the robustness of the greedy algorithm to local errors in M${}^{\natural}$-concave function maximization, which is one of our main technical results. While we obtain those positive results for the stochastic setting, another main result of our work is an impossibility in the adversarial setting. We prove that, even with full-information feedback, no algorithms that run in polynomial time per round can achieve $O(T^{1-c})$ regret for any constant $c > 0$. Our proof is based on a reduction from the matroid intersection problem for three matroids, which would be a novel approach to establishing the hardness in online learning.
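
The greedy algorithm mentioned in the abstract can be sketched on a small example. The objective below, an additive term plus a concave function of the set size, belongs to the laminar concave class, which is known to be M♮-concave. The code uses an exact value oracle; in the bandit setting one would replace it with an average of noisy evaluations, which is where the robustness result would matter. The weights and function names are hypothetical, and this is not the paper's algorithm.

```python
from itertools import combinations


def greedy_maximize(ground_set, value):
    """Repeatedly add the element with the largest positive marginal gain;
    stop when no element improves the value."""
    items = sorted(ground_set)
    chosen = frozenset()
    current = value(chosen)
    while True:
        best_gain, best_item = 0.0, None
        for item in items:
            if item in chosen:
                continue
            gain = value(chosen | {item}) - current
            if gain > best_gain:
                best_gain, best_item = gain, item
        if best_item is None:
            return chosen, current
        chosen = chosen | {best_item}
        current = value(chosen)


def brute_force_max(ground_set, value):
    """Exhaustive maximum over all subsets, for checking small instances."""
    items = sorted(ground_set)
    return max(
        value(frozenset(subset))
        for r in range(len(items) + 1)
        for subset in combinations(items, r)
    )


weights = {0: 6.0, 1: 4.5, 2: 1.0}  # hypothetical item weights


def example_value(subset):
    """Additive weights minus a convex size penalty (concave in |S|)."""
    return sum(weights[i] for i in subset) - len(subset) ** 2


best_set, best_value = greedy_maximize(weights.keys(), example_value)
```

On this instance the greedy choice coincides with the exhaustive optimum, as expected for an M♮-concave objective.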

[401] StagFormer: Time Staggering Transformer Decoding for Running Layers In Parallel

Dylan Cutler, Arun Kandoor, Nishanth Dikkala, Nikunj Saunshi, Xin Wang, Rina Panigrahy

Main category: cs.LG

TL;DR: StagFormer enables parallel decoding in Transformer models by staggering execution along sequence axis, breaking layer dependencies to allow different model sections to run simultaneously without quality loss.

Details

Motivation: Transformer decoding is inherently sequential - each token must pass through all layers before next token generation, creating latency bottlenecks in autoregressive generation.

Method: Proposes a staggered architecture that breaks the dependency of the token representation at time step i in layer l on the layer l-1 representations of tokens up to time step i. Instead, it allows a dependency only on token representations up to time step i-1, so different sections of the model can execute in parallel.

Result: Achieves potential speedup in decoding while maintaining quality. Enables weight-sharing for memory efficiency, bounded window attention for latency gains, and scalability across multiple sections. Also shows quality gains for short generations with recurrent approximation.

Conclusion: StagFormer successfully enables parallel decoding in Transformers through staggered execution, offering latency improvements without sacrificing quality, with additional benefits from weight-sharing and attention optimizations.

Abstract: Decoding in a Transformer based language model is inherently sequential as a token’s embedding needs to pass through all the layers in the network before the generation of the next token can begin. In this work, we propose a new architecture StagFormer (Staggered Transformer), which staggers execution along the sequence axis and thereby enables parallelizing the decoding process along the depth of the model. We achieve this by breaking the dependency of the token representation at time step $i$ in layer $l$ upon the representations of tokens until time step $i$ from layer $l-1$. Instead, we stagger the execution and only allow a dependency on token representations until time step $i-1$. The later sections of the Transformer still get access to the “rich” representations from the prior section but only from those token positions which are one time step behind. StagFormer allows for different sections of the model to be executed in parallel yielding a potential speedup in decoding while being quality neutral in our simulations. We also explore many natural extensions of this idea. We present how weight-sharing across the different sections being staggered can be more practical in settings with limited memory. We explore the efficacy of using a bounded window attention to pass information from one section to another which helps drive further latency gains for some applications. We also explore the scalability of the staggering idea over more than 2 sections of the Transformer. Finally, we show how one can approximate a recurrent model during inference using weight-sharing. This variant can lead to substantial gains in quality for short generations while being neutral in its latency impact.
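
The change in dependencies described above can be pictured as a change of attention mask between sections. The sketch below contrasts an ordinary causal mask (position i may read positions up to i) with a staggered one (position i may read the prior section's outputs only up to i-1). It illustrates the dependency pattern only; the function names are hypothetical, and how the first position is handled in the actual model is not stated in the abstract.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def staggered_mask(n):
    """mask[i][j] is True when position i may read the prior section's
    output at position j <= i - 1, i.e. one time step behind."""
    return [[j <= i - 1 for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as rows of 1s (allowed) and 0s (blocked)."""
    return ["".join("1" if cell else "0" for cell in row) for row in mask]


standard = show(causal_mask(4))
staggered = show(staggered_mask(4))
```

Because the later section at step i never needs the earlier section's output at step i, the two sections can work on the same time step concurrently, which is the source of the potential decoding speedup.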

[402] TopoBench: A Framework for Benchmarking Topological Deep Learning

Lev Telyatnikov, Guillermo Bernardez, Marco Montagna, Mustafa Hajij, Martin Carrasco, Pavlo Vasylenko, Mathilde Papillon, Ghada Zamzmi, Michael T. Schaub, Jonas Verhellen, Pavel Snopov, Bertran Miquel-Oliver, Manel Gil-Sorribes, Alexis Molina, Victor Guallar, Theodore Long, Julian Suk, Patryk Rygiel, Alexander Nikitin, Giordan Escalona, Michael Banf, Dominik Filipiak, Max Schattauer, Liliya Imasheva, Alvaro Martinez, Halley Fritze, Marissa Masden, Valentina Sánchez, Manuel Lecha, Andrea Cavallo, Claudio Battiloro, Matt Piekenbrock, Mauricio Tec, George Dasoulas, Nina Miolane, Simone Scardapane, Theodore Papamarkou

Main category: cs.LG

TL;DR: TopoBench is an open-source library for standardizing benchmarking in topological deep learning, featuring modular design and support for transformations across topological domains.

Details

Motivation: To accelerate research in topological deep learning by providing standardized benchmarking tools and modular components for data processing and model evaluation.

Method: Decomposes topological deep learning into independent modules for data generation, loading, transformation, processing, model training, optimization, and evaluation. Supports transformations across topological domains including mapping graph topology to higher-order domains like simplicial and cell complexes.

Result: Successfully demonstrated applicability by benchmarking several TDL architectures across diverse tasks and datasets, enabling richer data representations and more fine-grained analyses.

Conclusion: TopoBench provides a flexible, modular framework that facilitates adaptation and optimization of various topological deep learning pipelines, accelerating research in this field through standardized benchmarking.

Abstract: This work introduces TopoBench, an open-source library designed to standardize benchmarking and accelerate research in topological deep learning (TDL). TopoBench decomposes TDL into a sequence of independent modules for data generation, loading, transforming and processing, as well as model training, optimization and evaluation. This modular organization provides flexibility for modifications and facilitates the adaptation and optimization of various TDL pipelines. A key feature of TopoBench is its support for transformations and lifting across topological domains. Mapping the topology and features of a graph to higher-order topological domains, such as simplicial and cell complexes, enables richer data representations and more fine-grained analyses. The applicability of TopoBench is demonstrated by benchmarking several TDL architectures across diverse tasks and datasets.
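
One common way to lift a graph to a simplicial complex is the clique lifting, where every (k+1)-clique becomes a k-simplex. The sketch below is a plain-Python illustration of that idea on an edge list; it is not TopoBench's API, and the function name and the `max_dim` parameter are assumptions made for the example.

```python
from itertools import combinations


def clique_lift(edges, max_dim=2):
    """Lift an undirected graph to its clique complex up to max_dim.

    Returns a dict mapping dimension k to a sorted list of k-simplices,
    each a tuple of k + 1 mutually adjacent nodes. Only nodes that appear
    in some edge are included; self-loops are ignored.
    """
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    nodes = sorted(adjacency)
    simplices = {0: [(node,) for node in nodes]}
    for dim in range(1, max_dim + 1):
        simplices[dim] = [
            candidate
            for candidate in combinations(nodes, dim + 1)
            if all(b in adjacency[a] for a, b in combinations(candidate, 2))
        ]
    return simplices


# A triangle 0-1-2 with a pendant edge 2-3.
complex_ = clique_lift([(0, 1), (1, 2), (0, 2), (2, 3)])
```

The exhaustive clique search is exponential in the worst case, so it suits small graphs only; the triangle in the example becomes a single 2-simplex while the pendant edge stays a 1-simplex.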

[403] Large Language Model Aided QoS Prediction for Service Recommendation

Huiying Liu, Zekun Zhang, Honghao Li, Qilin Wu, Yiwen Zhang

Main category: cs.LG

TL;DR: LLMs used for web service recommendation via QoS prediction, overcoming data sparsity issues and outperforming baselines.

Details

Motivation: Leverage LLMs' text understanding capabilities to extract useful features from user/service attributes described in natural language for web service recommendation.

Method: Proposed llmQoS model uses LLMs to extract information from user/service attributes via descriptive sentences, combined with historical QoS values to predict QoS for user-service pairs.

Result: On WSDream dataset, llmQoS overcomes data sparsity issues and consistently outperforms comparable baseline models.

Conclusion: LLMs show practical potential for web service recommendation by effectively extracting and utilizing textual attribute information to improve QoS prediction accuracy.

Abstract: Large language models (LLMs) have seen rapid improvement in recent years, and have been used in a wider range of applications. After being trained on large text corpora, LLMs obtain the capability of extracting rich features from textual data. Such capability is potentially useful for the web service recommendation task, where the web users and services have intrinsic attributes that can be described using natural language sentences and are useful for recommendation. In this paper, we explore the possibility and practicality of using LLMs for web service recommendation. We propose the large language model aided QoS prediction (llmQoS) model, which uses LLMs to extract useful information from attributes of web users and services via descriptive sentences. This information is then used in combination with the QoS values of historical interactions of users and services, to predict QoS values for any given user-service pair. On the WSDream dataset, llmQoS is shown to overcome the data sparsity issue inherent to the QoS prediction problem, and outperforms comparable baseline models consistently.

[404] UniGenX: a unified generative foundation model that couples sequence, structure and function to accelerate scientific design across proteins, molecules and materials

Gongbo Zhang, Yanting Li, Renqian Luo, Pipi Hu, Yang Yang, Zeru Zhao, Lingbo Li, Guoqing Liu, Zun Wang, Ran Bi, Kaiyuan Gao, Liya Guo, Yu Xie, Chang Liu, Jia Zhang, Tian Xie, Robert Pinsler, Claudio Zeni, Ziheng Lu, Hongxia Hao, Yingce Xia, Marwin Segler, Maik Riechert, Wei Yang, Hao Jiang, Wen-Bin Zhang, Zhijun Zeng, Yi Zhu, Li Dong, Xiuyuan Hu, Li Yuan, Lei Chen, Haiguang Liu, Tao Qin

Main category: cs.LG

TL;DR: UniGenX is a unified generative foundation model that co-generates sequences and 3D structures under functional objectives across proteins, molecules, and materials, achieving state-of-the-art performance in multi-property conditional generation.

Details

Motivation: Current generative models suffer from limitations: they don't directly target function, optimize sequences and coordinates separately, and under-model conformational ensembles. There's a need for a unified approach that bridges discrete sequences and continuous coordinates for function-aware generation.

Method: UniGenX represents heterogeneous inputs as mixed symbolic and numeric tokens. It uses a decoder-only autoregressive transformer for global context and a conditional diffusion head to generate numeric fields guided by task-specific tokens. The model enables joint training of discrete and continuous representations.

Result: Achieves new state-of-the-art results: 436 crystal candidates meeting triple constraints (11 with novel compositions), new benchmarks on 5 chemical property targets and on conformer ensemble generation on GEOM, an over 23-fold improvement in success at modeling protein induced fit (RMSD < 2 Å), and enhanced EC-conditioned enzyme design. Demonstrates successful cross-domain transfer.

Conclusion: UniGenX represents a significant advance from prediction to controllable, function-aware generation. Ablation studies confirm the benefits of joint discrete-continuous training, establishing it as a powerful foundation model for multi-domain functional design.

Abstract: Function in natural systems arises from one-dimensional sequences forming three-dimensional structures with specific properties. However, current generative models suffer from critical limitations: training objectives seldom target function directly, discrete sequences and continuous coordinates are optimized in isolation, and conformational ensembles are under-modeled. We present UniGenX, a unified generative foundation model that addresses these gaps by co-generating sequences and coordinates under direct functional and property objectives across proteins, molecules, and materials. UniGenX represents heterogeneous inputs as a mixed stream of symbolic and numeric tokens, where a decoder-only autoregressive transformer provides global context and a conditional diffusion head generates numeric fields steered by task-specific tokens. Beyond new state-of-the-art results on structure prediction tasks, the model demonstrates state-of-the-art or competitive performance for function-aware generation across domains: in materials, it achieves “conflicted” multi-property conditional generation, yielding 436 crystal candidates meeting triple constraints, including 11 with novel compositions; in chemistry, it sets new benchmarks on five property targets and conformer ensemble generation on GEOM; and in biology, it improves success in modeling protein induced fit (RMSD < 2 Å) by over 23-fold and enhances EC-conditioned enzyme design. Ablation studies and cross-domain transfer substantiate the benefits of joint discrete-continuous training, establishing UniGenX as a significant advance from prediction to controllable, function-aware generation.

[405] Activation degree thresholds and expressiveness of polynomial neural networks

Bella Finkel, Jose Israel Rodriguez, Chenxi Wu, Thomas Yahl

Main category: cs.LG

TL;DR: The expressive power of deep polynomial neural networks is analyzed through the geometry of their neurovariety. The paper introduces the activation degree threshold and proves that it exists for all networks without width-one bottlenecks, with an upper bound that is quadratic in the network width.

Details

Motivation: To understand the expressive capabilities of deep polynomial neural networks by examining the geometry of their neurovariety and determining when these networks achieve maximum theoretical expressiveness.

Method: Introducing the concept of activation degree threshold, proving its existence for polynomial neural networks without width-one bottlenecks, and analyzing structured architectures like equi-width networks.

Result: Proved existence of activation degree threshold with universal quadratic upper bound in network width, confirmed high activation degree conjecture, and showed equi-width architectures are maximally expressive with threshold of one.

Conclusion: Polynomial neural networks without width-one bottlenecks have well-defined activation degree thresholds, and equi-width architectures are particularly expressive as they achieve maximum neurovariety dimension at low activation degrees.

Abstract: We study the expressive power of deep polynomial neural networks through the geometry of their neurovariety. We introduce the notion of the activation degree threshold of a network architecture to express when the dimension of the neurovariety achieves its theoretical maximum. We prove the existence of the activation degree threshold for all polynomial neural networks without width-one bottlenecks and demonstrate a universal upper bound that is quadratic in the width of largest size. In doing so, we prove the high activation degree conjecture of Kileel, Trager, and Bruna. Certain structured architectures have exceptional activation degree thresholds, making them especially expressive in the sense of their neurovariety dimension. In this direction, we prove that polynomial neural networks with equi-width architectures are maximally expressive by showing their activation degree threshold is one.

[406] Noise-based reward-modulated learning

Jesús García Fernández, Nasir Ahmad, Marcel van Gerven

Main category: cs.LG

TL;DR: A novel noise-based learning rule inspired by biological neural circuits that uses reward prediction errors and eligibility traces to handle delayed rewards, outperforming reward-modulated Hebbian learning and achieving BP-comparable performance with better biological plausibility.

Details

Motivation: Biological neural systems efficiently learn from delayed rewards despite noisy synapses and lack of centralized optimization, while artificial networks rely on backpropagation which is unsuitable for resource-constrained systems. Existing noise-based alternatives struggle with temporal delays and hierarchical processing.

Method: Derived a noise-based learning rule using reward prediction errors as optimization target, incorporating eligibility traces for retrospective credit assignment. The method uses local information only, making it biologically plausible and suitable for neuromorphic implementation.

Result: Significantly outperforms reward-modulated Hebbian learning (RMHL) and achieves performance comparable to backpropagation (BP) on reinforcement tasks with both immediate and delayed rewards, though with slower convergence due to noise-driven updates.

Conclusion: The approach demonstrates potential for low-power adaptive systems where energy efficiency and biological plausibility are priorities, and provides insights into how dopamine-like signals and synaptic stochasticity enable learning in biological networks.

Abstract: Biological neural systems efficiently learn from delayed rewards despite relying on noisy synaptic transmission and lacking centralized optimization mechanisms. In contrast, artificial neural networks trained with reinforcement learning typically rely on backpropagation (BP), which limits their use in resource-constrained systems or with non-differentiable components. While noise-based alternatives, like reward-modulated Hebbian learning (RMHL), provide a biologically grounded framework for credit assignment, they struggle with temporal delays and hierarchical processing, both key challenges in real-world learning. In this work, we derive a novel noise-based learning rule to address these challenges. Drawing inspiration from biological neural circuits, our method uses reward prediction errors as its optimization target to generate increasingly advantageous behavior, and incorporates an eligibility trace to facilitate retrospective credit assignment. Its formulation relies on local information, aligning with biological constraints and enabling neuromorphic implementation. Experimental validation on reinforcement tasks (immediate and delayed rewards) shows our approach significantly outperforms RMHL and achieves performance comparable to BP, although with slower convergence due to its noise-driven updates. While tested on simple architectures, the results highlight the potential of noise-driven, brain-inspired learning for low-power adaptive systems, particularly in scenarios where energy efficiency and biological plausibility are a priority. These findings also offer mechanistic insights into how dopamine-like signals and synaptic stochasticity may jointly enable learning in biological networks, bridging computational models with neurobiological principles.
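
Illustrative sketch (not the authors' rule): the ingredients named in the abstract, namely exploratory noise, an eligibility trace, and a reward prediction error, can be combined in a generic node-perturbation update. The toy task, the function name `train_delayed_reward`, the target gain of 2.0, and all hyperparameters below are assumptions made for illustration. The reward arrives only at the end of each short episode, so the trace has to carry the credit back to earlier steps.

```python
import random

def train_delayed_reward(episodes=3000, horizon=3, lr=0.01, noise_std=0.5, seed=0):
    """Learn a scalar weight w so that w * x tracks target_gain * x, using only
    output noise, an eligibility trace, and a reward delivered at episode end."""
    rng = random.Random(seed)
    target_gain = 2.0        # hypothetical task: the ideal value of w
    w = 0.0                  # the single learnable parameter
    reward_baseline = 0.0    # running estimate of the expected episode reward
    for _episode in range(episodes):
        trace = 0.0          # eligibility trace: accumulated noise-input products
        total_reward = 0.0
        for _step in range(horizon):
            x = rng.uniform(-1.0, 1.0)
            xi = rng.gauss(0.0, noise_std)        # exploratory output noise
            y = w * x + xi
            total_reward += -(y - target_gain * x) ** 2
            trace += xi * x / noise_std ** 2      # uses local quantities only
        # Reward prediction error: how much better than expected the episode went.
        rpe = total_reward - reward_baseline
        w += lr * rpe * trace                      # reward-modulated update
        reward_baseline += 0.05 * (total_reward - reward_baseline)
    return w

if __name__ == "__main__":
    print(train_delayed_reward())  # should settle near the target gain of 2.0
```

In expectation the product of the prediction error and the trace follows the gradient of the expected reward, which is why no backward pass is needed. The price is high variance per update, consistent with the slower convergence the abstract reports for noise-driven learning.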

[407] Instruction-Based Molecular Graph Generation with Unified Text-Graph Diffusion Model

Yuran Xiang, Haiteng Zhao, Chang Ma, Zhi-Hong Deng

Main category: cs.LG

TL;DR: UTGDiff is a novel framework that uses language models for discrete graph diffusion to generate molecular graphs from textual instructions, outperforming sequence-based methods with fewer parameters.

Details

Motivation: Current methods for text-based molecule synthesis primarily use molecular sequences with pre-trained LLMs, but integrating graph generation with textual instructions remains complex and challenging.

Method: UTGDiff utilizes a unified text-graph transformer as the denoising network, derived from pre-trained language models with minimal modifications to process graph data through attention bias for discrete graph diffusion.

Result: UTGDiff consistently outperforms sequence-based baselines in instruction-based molecule generation and editing tasks, achieving superior performance with fewer parameters given equivalent pretraining corpus.

Conclusion: The proposed UTGDiff framework successfully addresses the challenge of integrating graph generation with textual instructions, demonstrating that discrete graph diffusion with language models is effective for molecular graph generation from instructions.

Abstract: Recent advancements in computational chemistry have increasingly focused on synthesizing molecules based on textual instructions. Integrating graph generation with these instructions is complex, leading most current methods to use molecular sequences with pre-trained large language models. In response to this challenge, we propose a novel framework, named UTGDiff (Unified Text-Graph Diffusion Model), which utilizes language models for discrete graph diffusion to generate molecular graphs from instructions. UTGDiff features a unified text-graph transformer as the denoising network, derived from pre-trained language models and minimally modified to process graph data through attention bias. Our experimental results demonstrate that UTGDiff consistently outperforms sequence-based baselines in tasks involving instruction-based molecule generation and editing, achieving superior performance with fewer parameters given an equivalent level of pretraining corpus. Our code is available at https://github.com/ran1812/UTGDiff.

[408] PinnDE: Physics-Informed Neural Networks for Solving Differential Equations

Jason Matthews, Alex Bihlo

Main category: cs.LG

TL;DR: PinnDE is an open-source Python library that provides tools for solving differential equations using both Physics-Informed Neural Networks (PINNs) and Deep Operator Networks (DeepONets).

Details

Motivation: The growing interest in deep learning for solving differential equations has created a need for accessible tools that implement both PINNs and DeepONets approaches in a unified framework.

Method: The paper introduces PinnDE, a Python library that implements both PINN and DeepONet methodologies for approximating solutions to differential equations, including package structure and usage examples.

Result: The library provides worked examples demonstrating its effectiveness in approximating solutions of systems of differential equations using both PINNs and DeepONets.

Conclusion: PinnDE serves as a valuable open-source resource that makes advanced deep learning approaches for differential equations more accessible to researchers and practitioners.

Abstract: In recent years, the study of deep learning for solving differential equations has grown substantially. Physics-informed neural networks (PINNs) and deep operator networks (DeepONets) have emerged as two of the most useful approaches for approximating differential equation solutions using machine learning. Here, we introduce PinnDE, an open-source Python library for solving differential equations with both PINNs and DeepONets. We give a brief review of both PINNs and DeepONets, introduce PinnDE along with the structure and usage of the package, and present worked examples to show PinnDE’s effectiveness in approximating solutions of systems of differential equations with both PINNs and DeepONets.
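
Illustrative sketch (this does not use or reproduce the PinnDE API): the objective a PINN minimizes is a mean squared equation residual at collocation points plus a penalty for violating the initial or boundary conditions. The example below assumes the toy ODE u'(t) + u(t) = 0 with u(0) = 1 and replaces the neural network with a hypothetical two-parameter trial function, so that the loss can be checked by hand.

```python
import math

def pinn_style_loss(a, b, collocation_points, ic_weight=1.0):
    """Physics-informed loss for u'(t) + u(t) = 0 with u(0) = 1.

    The trial solution u(t) = a * exp(b * t) is a two-parameter stand-in
    for a neural network. Its derivative is taken analytically here,
    where a PINN would use automatic differentiation.
    """
    residuals = []
    for t in collocation_points:
        u = a * math.exp(b * t)
        du_dt = a * b * math.exp(b * t)
        residuals.append(du_dt + u)            # zero wherever the ODE holds
    ode_term = sum(r * r for r in residuals) / len(residuals)
    ic_term = (a * math.exp(0.0) - 1.0) ** 2   # mismatch with u(0) = 1
    return ode_term + ic_weight * ic_term

if __name__ == "__main__":
    points = [0.0, 0.25, 0.5, 0.75, 1.0]
    print(pinn_style_loss(1.0, -1.0, points))  # exact solution exp(-t)
    print(pinn_style_loss(0.0, -1.0, points))  # u = 0 solves the ODE but not u(0) = 1
```

The second call shows why the initial-condition term matters: the zero function satisfies the differential equation everywhere, and only the penalty on u(0) rules it out.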

[409] Gradient Boosting Decision Trees on Medical Diagnosis over Tabular Data

A. Yarkın Yıldız, Asli Kalayci

Main category: cs.LG

TL;DR: GBDT ensemble methods (XGBoost, CatBoost, LightGBM) outperform traditional ML and deep learning models in medical diagnosis tasks on tabular data, offering superior performance with lower computational costs.

Details

Motivation: Medical diagnosis requires highly accurate classification as incorrect decisions can have catastrophic consequences. While various ML and DL methods exist, ensemble methods offer promising alternatives for successful medical decision-making.

Method: Investigated Gradient Boosting Decision Tree (GBDT) algorithms including XGBoost, CatBoost, and LightGBM on multiple benchmark tabular medical diagnosis datasets, comparing them against traditional ML methods and deep neural networks.

Result: GBDT methods demonstrated superior performance over traditional ML and deep learning architectures, achieving the highest average rank across several medical datasets while requiring significantly less computational power than DL models.

Conclusion: GBDT ensemble methods provide the optimal methodology for medical classification tasks, offering high performance with lower complexity and computational requirements compared to both traditional ML and deep learning approaches.

Abstract: Medical diagnosis is a crucial task in the medical field, in terms of providing accurate classification and respective treatments. Near-precise decisions based on a correct diagnosis can affect a patient’s life itself, and a misclassification may, in extreme cases, result in a catastrophe. Several traditional machine learning (ML) methods, such as support vector machines (SVMs) and logistic regression, and state-of-the-art tabular deep learning (DL) methods, including TabNet and TabTransformer, have been proposed and used over tabular medical datasets. Additionally, owing to their superior performance, lower computational costs, and easier optimization over different tasks, ensemble methods have been used in the field more recently. They offer a powerful alternative in terms of providing successful medical decision-making processes in several diagnosis tasks. In this study, we investigated the benefits of ensemble methods, especially the Gradient Boosting Decision Tree (GBDT) algorithms in medical classification tasks over tabular data, focusing on XGBoost, CatBoost, and LightGBM. The experiments demonstrate that GBDT methods outperform traditional ML and deep neural network architectures and have the highest average rank over several benchmark tabular medical diagnosis datasets. Furthermore, they require much less computational power compared to DL models, creating the optimal methodology in terms of high performance and lower complexity.
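
Illustrative sketch: the "average rank" used to compare methods across datasets can be computed as below. The function name, the dataset and model labels, and the scores are placeholders invented for the example and are not results from the paper. Ties are not given shared ranks in this simple version.

```python
def average_ranks(scores):
    """scores: {dataset: {model: score}}, higher is better.

    Returns {model: mean rank across datasets}, where rank 1 is best.
    Every dataset is assumed to report a score for every model.
    """
    totals = {}
    for per_model in scores.values():
        ordered = sorted(per_model, key=per_model.get, reverse=True)
        for position, model in enumerate(ordered, start=1):
            totals[model] = totals.get(model, 0) + position
    return {model: total / len(scores) for model, total in totals.items()}

if __name__ == "__main__":
    # Placeholder numbers, purely to show the mechanics.
    placeholder_scores = {
        "dataset_1": {"model_a": 0.9, "model_b": 0.8, "model_c": 0.7},
        "dataset_2": {"model_a": 0.6, "model_b": 0.9, "model_c": 0.5},
    }
    print(average_ranks(placeholder_scores))
```

A lower average rank means a method is more consistently near the top, which is a different claim from having the best score on any single dataset.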

[410] fLSA: Learning Semantic Structures in Document Collections Using Foundation Models

Weijia Xu, Nebojsa Jojic, Nicolas Le Roux

Main category: cs.LG

TL;DR: fLSA is a foundation-model-based Latent Semantic Analysis method that clusters and tags document segments to model latent structure, enabling better text reconstruction and hierarchical sampling for improved problem-solving.

Details

Motivation: To enable large language models to induce high-level strategies from example solutions like humans do, by extracting and modeling the latent structure of documents for better adaptation to unseen problems.

Method: Iterative clustering and tagging of document segments based on document-level contexts using foundation models, creating hierarchical tags that capture semantic structure.

Result: fLSA tags are more informative for text reconstruction than existing methods and enable hierarchical sampling that expands the output space in directions leading to correct solutions more frequently.

Conclusion: fLSA successfully models latent document structure through foundation-model-based clustering and tagging, demonstrating improved performance in story writing, math, and reasoning tasks compared to direct sampling and existing tagging approaches.

Abstract: Humans can learn to solve new tasks by inducing high-level strategies from example solutions to similar problems and then adapting these strategies to solve unseen problems. Can we use large language models to induce such high-level structure from example documents or solutions? We introduce fLSA, a foundation-model-based Latent Semantic Analysis method that iteratively clusters and tags document segments based on document-level contexts. These tags can be used to model the latent structure of given documents and for hierarchical sampling of new texts. Our experiments on story writing, math, and multi-step reasoning datasets demonstrate that fLSA tags are more informative in reconstructing the original texts than existing tagging methods. Moreover, when used for hierarchical sampling, fLSA tags help expand the output space in the right directions that lead to correct solutions more often than direct sampling and hierarchical sampling with existing tagging methods. Code: https://github.com/microsoft/fLSA

[411] Direct Preference Optimization for Primitive-Enabled Hierarchical Reinforcement Learning

Utsav Singh, Souradip Chakraborty, Wesley A. Suttle, Brian M. Sadler, Derrik E. Asher, Anit Kumar Sahu, Mubarak Shah, Vinay P. Namboodiri, Amrit Singh Bedi

Main category: cs.LG

TL;DR: DIPPER is a hierarchical RL framework that uses bi-level optimization and direct preference optimization to address non-stationarity and infeasible subgoal problems in HRL, achieving up to 40% improvement over SOTA baselines in sparse reward scenarios.

Details

Motivation: HRL methods suffer from non-stationarity caused by changing lower-level policies during training and generation of infeasible subgoals that lower-level policies cannot achieve.

Method: Formulates hierarchical policy learning as bi-level optimization problem, leverages direct preference optimization (DPO) to train higher-level policy using preference feedback, and incorporates regularization to ensure subgoal feasibility.

Result: Achieves up to 40% improvement over state-of-the-art baselines in sparse reward scenarios on challenging robotic navigation and manipulation benchmarks.

Conclusion: DIPPER effectively overcomes longstanding limitations of HRL by mitigating non-stationarity and ensuring subgoal feasibility through DPO-based optimization and regularization techniques.

Abstract: Hierarchical reinforcement learning (HRL) enables agents to solve complex, long-horizon tasks by decomposing them into manageable sub-tasks. However, HRL methods often suffer from two fundamental challenges: (i) non-stationarity, caused by the changing behavior of the lower-level policy during training, which destabilizes higher-level policy learning, and (ii) the generation of infeasible subgoals that lower-level policies cannot achieve. In this work, we introduce DIPPER, a novel HRL framework that formulates hierarchical policy learning as a bi-level optimization problem and leverages direct preference optimization (DPO) to train the higher-level policy using preference feedback. By optimizing the higher-level policy with DPO, we decouple higher-level learning from the non-stationary lower-level reward signal, thus mitigating non-stationarity. To further address the infeasible subgoal problem, DIPPER incorporates a regularization that tries to ensure the feasibility of subgoal tasks within the capabilities of the lower-level policy. Extensive experiments on challenging robotic navigation and manipulation benchmarks demonstrate that DIPPER achieves up to 40% improvement over state-of-the-art baselines in sparse reward scenarios, highlighting its effectiveness in overcoming longstanding limitations of HRL.
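
Illustrative sketch: the standard DPO objective for a single preference pair is the negative log-sigmoid of a scaled log-probability margin between the preferred and the rejected option, each measured relative to a reference policy. Reading the two options as competing subgoal proposals from the higher-level policy is an assumption about how this maps onto DIPPER. The feasibility regularizer and the bi-level formulation are not shown, and the function and argument names are hypothetical.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_*     : log-probabilities under the policy being trained
    ref_logp_* : log-probabilities under the frozen reference policy
    beta       : strength of the implicit pull toward the reference policy
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) in a numerically stable form: softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

if __name__ == "__main__":
    print(dpo_loss(0.0, 0.0, 0.0, 0.0))      # no preference expressed yet: log(2)
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))  # policy already favors the winner
```

Because the loss depends only on preference comparisons and log-probabilities, it does not consume the lower-level reward directly, which matches the abstract's argument for decoupling higher-level learning from a non-stationary reward signal.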

[412] Concept-Guided Interpretability via Neural Chunking

Shuchen Wu, Stephan Alaniz, Shyamgopal Karthik, Peter Dayan, Eric Schulz, Zeynep Akata

Main category: cs.LG

TL;DR: The paper challenges the black box view of neural networks by proposing the Reflection Hypothesis - that neural activity patterns mirror training data regularities. It introduces three chunking methods to extract interpretable concept-encoding entities from neural population dynamics.

Details

Motivation: To address the challenge of understanding neural networks' internal workings and move beyond the black box perspective by showing that neural activity reflects training data patterns, enabling interpretability through cognitive chunking principles.

Method: Three complementary chunking methods: Discrete Sequence Chunking (DSC) for learning entity dictionaries in lower-dimensional space, Population Averaging (PA) for extracting labeled entities, and Unsupervised Chunk Discovery (UCD) for unlabeled data. Applied to both RNNs and LLMs.

Result: Successfully extracted concept-encoding entities across different model architectures, including concrete (words), abstract (POS tags), and structural (narrative schema) concepts. Demonstrated causal role through grafting experiments that produced controlled behavioral changes.

Conclusion: The work provides a new interpretability direction by combining cognitive principles with naturalistic data structure to reveal hidden computations in complex learning systems, transforming them from black boxes to understandable systems.

Abstract: Neural networks are often described as black boxes, reflecting the significant challenge of understanding their internal workings and interactions. We propose a different perspective that challenges the prevailing view: rather than being inscrutable, neural networks exhibit patterns in their raw population activity that mirror regularities in the training data. We refer to this as the Reflection Hypothesis and provide evidence for this phenomenon in both simple recurrent neural networks (RNNs) and complex large language models (LLMs). Building on this insight, we propose to leverage our cognitive tendency of chunking to segment high-dimensional neural population dynamics into interpretable units that reflect underlying concepts. We propose three methods to extract recurring chunks on a neural population level, complementing each other based on label availability and neural data dimensionality. Discrete sequence chunking (DSC) learns a dictionary of entities in a lower-dimensional neural space; population averaging (PA) extracts recurring entities that correspond to known labels; and unsupervised chunk discovery (UCD) can be used when labels are absent. We demonstrate the effectiveness of these methods in extracting concept-encoding entities agnostic to model architectures. These concepts can be concrete (words), abstract (POS tags), or structural (narrative schema). Additionally, we show that extracted chunks play a causal role in network behavior, as grafting them leads to controlled and predictable changes in the model’s behavior. Our work points to a new direction for interpretability, one that harnesses both cognitive principles and the structure of naturalistic data to reveal the hidden computations of complex learning systems, gradually transforming them from black boxes into systems we can begin to understand.

[413] Generalization, Expressivity, and Universality of Graph Neural Networks on Attributed Graphs

Levi Rauchwerger, Stefanie Jegelka, Ron Levie

Main category: cs.LG

TL;DR: The paper analyzes universality and generalization of graph neural networks (GNNs) on attributed graphs by proposing pseudometrics that capture GNN expressivity, proving Lipschitz continuity, separation power, and relative compactness, leading to universal approximation theorems and generalization bounds.

Details

Motivation: To address limitations in existing approaches that either handle only non-attributed graphs, lack separation power, or fail to achieve relative compactness - preventing comprehensive theoretical analysis of GNN universality and generalization.

Method: Proposes hierarchical optimal transport-based pseudometrics between computation trees to measure similarity between attributed graph structures, enabling analysis of GNN properties.

Result: Proves GNNs are Lipschitz continuous with respect to the proposed metrics, can separate distant attributed graphs, and the space of attributed graphs is relatively compact, enabling universal approximation theorems and generalization bounds.

Conclusion: The work provides a unified theoretical framework for analyzing GNN expressivity on attributed graphs, overcoming previous limitations and establishing foundations for universal approximation and generalization guarantees.

Abstract: We analyze the universality and generalization of graph neural networks (GNNs) on attributed graphs, i.e., with node attributes. To this end, we propose pseudometrics over the space of all attributed graphs that describe the fine-grained expressivity of GNNs. Namely, GNNs are both Lipschitz continuous with respect to our pseudometrics and can separate attributed graphs that are distant in the metric. Moreover, we prove that the space of all attributed graphs is relatively compact with respect to our metrics. Based on these properties, we prove a universal approximation theorem for GNNs and generalization bounds for GNNs on any data distribution of attributed graphs. The proposed metrics compute the similarity between the structures of attributed graphs via a hierarchical optimal transport between computation trees. Our work extends and unites previous approaches which either derived theory only for graphs with no attributes, derived compact metrics under which GNNs are continuous but without separation power, or derived metrics under which GNNs are continuous and separate points but the space of graphs is not relatively compact, which prevents universal approximation and generalization analysis.

[414] Graph Neural Network Based Action Ranking for Planning

Rajesh Mangannavar, Stefan Lee, Alan Fern, Prasad Tadepalli

Main category: cs.LG

TL;DR: A novel Graph Neural Network approach that learns to rank actions for relational planning, generalizing to problems larger than the training instances and outperforming baseline methods.

Details

Motivation: To develop a more sample-efficient approach for classical planning that can generalize to larger problem instances than those used in training, overcoming limitations of value-function based methods that require globally consistent functions.

Method: Proposes a new graph representation capturing action information and uses a GNN architecture with Gated Recurrent Units (GRUs) to learn action rankings. Trained on small problem instances solved by planners and applied to larger instances.

Result: Experimental results show the approach generalizes better to problems larger than the training instances and outperforms multiple baseline methods (both value function and action ranking) in success rate and plan quality.

Conclusion: Action ranking with GNNs and GRUs provides a more sample-efficient alternative to value-function approaches for relational planning, enabling effective generalization from small to large problem instances.

Abstract: We propose a novel approach to learn relational policies for classical planning based on learning to rank actions. We introduce a new graph representation that explicitly captures action information and propose a Graph Neural Network (GNN) architecture augmented with Gated Recurrent Units (GRUs) to learn action rankings. Unlike value-function based approaches that must learn a globally consistent function, our action ranking method only needs to learn locally consistent ranking, which is more sample-efficient. Our model is trained on data generated from small problem instances that are easily solved by planners and is applied to significantly larger instances where planning is computationally prohibitive. Experimental results across standard planning benchmarks demonstrate that our action-ranking approach not only achieves better generalization to larger problems than those used in training but also outperforms multiple baseline (value function and action ranking) methods in terms of success rate and plan quality.

[415] KNN and K-means in Gini Prametric Spaces

Cassandra Mussard, Arthur Charpentier, Stéphane Mussard

Main category: cs.LG

TL;DR: Enhanced K-means and KNN algorithms using Gini prametric spaces that combine value and rank information for improved robustness to noise and outliers.

Details

Motivation: Traditional distance metrics in clustering and classification algorithms are sensitive to noise and outliers. The paper aims to develop more robust algorithms by incorporating rank-based measures alongside value distances.

Method: Developed a Gini prametric that captures both value-based and rank-based measures, then created Gini K-means (provably convergent) and Gini KNN algorithms based on this prametric space.

Result: Experimental evaluations on 16 UCI datasets show superior performance and efficiency in both clustering and classification tasks, with Gini KNN performing competitively with state-of-the-art approaches like Hassanat’s distance in noisy environments.

Conclusion: Gini prametric spaces offer effective robustness to noise and open new directions for rank-based prametrics in machine learning and statistical analysis applications.

Abstract: This paper introduces enhancements to the K-means and K-nearest neighbors (KNN) algorithms based on the concept of Gini prametric spaces, instead of traditional metric spaces. Unlike standard distance metrics, Gini prametrics incorporate both value-based and rank-based measures, offering robustness to noise and outliers. The main contributions include: (1) a Gini prametric that captures rank information alongside value distances; (2) a Gini K-means algorithm that is provably convergent and resilient to noisy data; and (3) a Gini KNN method that performs competitively with state-of-the-art approaches like Hassanat’s distance in noisy environments. Experimental evaluations on 16 UCI datasets demonstrate the superior performance and efficiency of the Gini-based algorithms in clustering and classification tasks. This work opens new directions for rank-based prametrics in machine learning and statistical analysis.

[416] Spectra-to-Structure and Structure-to-Spectra Inference Across the Periodic Table

Yufeng Wang, Peiyao Wang, Lu Wei, Lu Ma, Yuewei Lin, Qun Liu, Haibin Ling

Main category: cs.LG

TL;DR: XAStruct is a machine learning system that enables bidirectional prediction between X-ray absorption spectroscopy (XAS) spectra and crystal structures, supporting over 70 elements without element-specific tuning.

Details

Motivation: XAS interpretation traditionally requires expert analysis, expensive simulations, and element-specific heuristics, limiting its accessibility and scalability.

Method: Deep neural networks combined with efficient baseline models trained on large-scale dataset spanning 70+ elements, enabling both spectrum prediction from structures and structural descriptor inference from spectra.

Result: First ML approach for predicting neighbor atom types directly from XAS spectra and generalizable regression model for mean nearest-neighbor distance without element-specific tuning.

Conclusion: XAStruct provides a scalable, extensible solution for data-driven XAS analysis and local structure inference across diverse chemistries and bonding environments.

Abstract: X-ray Absorption Spectroscopy (XAS) is a powerful technique for probing local atomic environments, yet its interpretation remains limited by the need for expert-driven analysis, computationally expensive simulations, and element-specific heuristics. Recent advances in machine learning have shown promise for accelerating XAS interpretation, but many existing models are narrowly focused on specific elements, edge types, or spectral regimes. In this work, we present XAStruct, a learning-based system capable of both predicting XAS spectra from crystal structures and inferring local structural descriptors from XAS input. XAStruct is trained on a large-scale dataset spanning over 70 elements across the periodic table, enabling generalization to a wide variety of chemistries and bonding environments. The framework includes the first machine learning approach for predicting neighbor atom types directly from XAS spectra, as well as a generalizable regression model for mean nearest-neighbor distance that requires no element-specific tuning. By combining deep neural networks for complex structure property mappings with efficient baseline models for simpler tasks, XAStruct offers a scalable and extensible solution for data-driven XAS analysis and local structure inference. The source code will be released upon paper acceptance.

[417] Keep your distance: learning dispersed embeddings on $\mathbb{S}_m$

Evgeniia Tokarchuk, Hua Chang Bakker, Vlad Niculae

Main category: cs.LG

TL;DR: This paper analyzes dispersion methods for learning well-separated features in high-dimensional spaces, connecting existing approaches and proposing new methods including an MMD reinterpretation, online Lloyd’s algorithm variant, and hypersphere-specific dispersion technique.

Details

Motivation: Learning well-separated features is crucial for ML applications, but existing theoretical solutions are inapplicable to high-dimensional representation learning where dispersion must be balanced with task objectives.

Method: The paper provides an overview of existing dispersion methods, proposes MMD reinterpretation of pairwise dispersion, introduces online variant of Lloyd’s algorithm as regularizer, and derives novel hypersphere-specific dispersion method.

Result: Experiments demonstrate the importance of dispersion in image classification and NLP tasks, showing different algorithms exhibit varying trade-offs across different regimes.

Conclusion: The work connects disconnected literature on dispersion methods and provides new effective approaches for achieving feature separation in high-dimensional representation learning.

Abstract: Learning well-separated features in high-dimensional spaces, such as text or image embeddings, is crucial for many machine learning applications. Achieving such separation can be effectively accomplished through the dispersion of embeddings, where unrelated vectors are pushed apart as much as possible. By constraining features to be on a hypersphere, we can connect dispersion to well-studied problems in mathematics and physics, where optimal solutions are known for limited low-dimensional cases. However, in representation learning we typically deal with a large number of features in high-dimensional space, and moreover, dispersion is usually traded off with some other task-oriented training objective, making existing theoretical and numerical solutions inapplicable. Therefore, it is common to rely on gradient-based methods to encourage dispersion, usually by minimizing some function of the pairwise distances. In this work, we first give an overview of existing methods from disconnected literature, making new connections and highlighting similarities. Next, we introduce some new angles. We propose to reinterpret pairwise dispersion using a maximum mean discrepancy (MMD) motivation. We then propose an online variant of the celebrated Lloyd’s algorithm, of K-Means fame, as an effective alternative regularizer for dispersion on generic domains. Finally, we derive a novel dispersion method that directly exploits properties of the hypersphere. Our experiments show the importance of dispersion in image classification and natural language processing tasks, and how algorithms exhibit different trade-offs in different regimes.
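
Illustrative sketch: one common way to "minimize some function of the pairwise distances" on a hypersphere is a Gaussian-kernel penalty over all pairs of normalized embeddings. This is a generic pairwise regularizer chosen for illustration, not necessarily any of the specific methods the paper proposes or surveys. The function names and the temperature `t` are assumptions.

```python
import math

def normalize(v):
    """Project a nonzero vector onto the unit hypersphere."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def pairwise_dispersion_penalty(vectors, t=2.0):
    """Log of the mean Gaussian kernel value over all distinct pairs.

    Lower is better: the penalty is near 0 when points coincide and
    becomes more negative as they spread out over the sphere.
    """
    points = [normalize(v) for v in vectors]
    kernel_values = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            kernel_values.append(math.exp(-t * sq_dist))
    return math.log(sum(kernel_values) / len(kernel_values))

if __name__ == "__main__":
    clustered = [[1.0, 0.0], [1.0, 0.01], [1.0, -0.01]]
    spread = [[1.0, 0.0], [-0.5, 0.866], [-0.5, -0.866]]  # roughly 120 degrees apart
    print(pairwise_dispersion_penalty(clustered))
    print(pairwise_dispersion_penalty(spread))
```

In training, a term like this would be added to the task loss with a weight, which is the trade-off between dispersion and the task objective that the abstract describes. The double loop is quadratic in the number of embeddings, so a practical version would work on minibatches.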

[418] General Intelligence Requires Reward-based Pretraining

Seungwook Han, Jyothish Pari, Samuel J. Gershman, Pulkit Agrawal

Main category: cs.LG

TL;DR: LLMs show utility but struggle with adaptive reasoning and generalization. The paper proposes disentangling knowledge from reasoning through RL pretraining, synthetic curriculum tasks, and small context windows to improve transferability.

Details

Motivation: Current LLMs demonstrate artificial useful intelligence but lack robust reasoning capabilities needed for AGI. They overfit to training data and fail to generalize algorithmic understanding across novel contexts.

Method: Proposes three approaches: 1) RL pretraining from scratch instead of next-token prediction, 2) Synthetic task curriculum to build reasoning prior, 3) Small context windows to reduce spurious correlations. Combines with retrieval system and external memory bank.

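Result: Experiments on algorithmic tasks in esoteric programming languages show that LLM reasoning overfits to the training data and transfers poorly. The three proposed directions are put forward as a research agenda for building a more generalizable reasoning system.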

Conclusion: Disentangling knowledge and reasoning through these methods can help transition from artificial useful intelligence to more robust artificial general intelligence with better reasoning transferability.

Abstract: Large Language Models (LLMs) have demonstrated impressive real-world utility, exemplifying artificial useful intelligence (AUI). However, their ability to reason adaptively and robustly – the hallmarks of artificial general intelligence (AGI) – remains fragile. While LLMs seemingly succeed in commonsense reasoning, programming, and mathematics, they struggle to generalize algorithmic understanding across novel contexts. Our experiments with algorithmic tasks in esoteric programming languages reveal that LLMs’ reasoning overfits to the training data and is limited in its transferability. We hypothesize that the core issue underlying such limited transferability is the coupling of reasoning and knowledge in LLMs. To transition from AUI to AGI, we propose disentangling knowledge and reasoning through three key directions: (1) pretraining to reason using RL from scratch as an alternative to the widely used next-token prediction pretraining, (2) using a curriculum of synthetic tasks to ease the learning of a reasoning prior for RL that can then be transferred to natural language tasks, and (3) learning more generalizable reasoning functions using a small context window to reduce exploiting spurious correlations between tokens. Such a reasoning system coupled with a trained retrieval system and a large external memory bank as a knowledge store can overcome several limitations of existing architectures at learning to reason in novel scenarios.

[419] Seal Your Backdoor with Variational Defense

Ivan Sabolić, Matej Grcić, Siniša Šegvić

Main category: cs.LG

TL;DR: VIBE is a model-agnostic framework that uses variational inference and EM algorithm to train classifiers resilient to backdoor attacks by treating malicious inputs as observed variables and recovering clean labels.

Details

Motivation: To develop a defense mechanism against backdoor attacks that can work with any classifier model and handle malicious inputs and corrupted labels in training data.

Method: Uses variational inference to recover clean label posterior through EM algorithm: E-step infers clean pseudolabels via entropy-regularized optimal transport, M-step updates classifier parameters via gradient descent. Integrates with self-supervised representation learning.

Result: Outperforms previous defenses across standard datasets, large-scale setups (1k classes), and datasets poisoned with multiple attacks. Consistently effective against contemporary backdoor attacks.

Conclusion: VIBE provides a modular, model-agnostic framework that successfully trains resilient classifiers against backdoor attacks through variational inference and EM optimization, demonstrating superior performance across various attack scenarios.

Abstract: We propose VIBE, a model-agnostic framework that trains classifiers resilient to backdoor attacks. The key concept behind our approach is to treat malicious inputs and corrupted labels from the training dataset as observed random variables, while the actual clean labels are latent. VIBE then recovers the corresponding latent clean label posterior through variational inference. The resulting training procedure follows the expectation-maximization (EM) algorithm. The E-step infers the clean pseudolabels by solving an entropy-regularized optimal transport problem, while the M-step updates the classifier parameters via gradient descent. Being modular, VIBE can seamlessly integrate with recent advancements in self-supervised representation learning, which enhance its ability to resist backdoor attacks. We experimentally validate the method’s effectiveness against contemporary backdoor attacks on standard datasets, a large-scale setup with 1k classes, and a dataset poisoned with multiple attacks. VIBE consistently outperforms previous defenses across all tested scenarios.
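Entropy-regularized optimal transport problems of the kind named in the E-step are commonly solved with Sinkhorn iterations. The sketch below is a generic instance under stated assumptions: the cost is the negative log-probability from a classifier, samples carry uniform mass, and the class marginal is supplied by the caller. The function name and these choices are hypothetical and are not taken from the VIBE paper.

```python
import math

def sinkhorn_pseudolabels(log_probs, class_marginal, eps=1.0, iters=200):
    """Soft pseudolabels from entropy-regularized optimal transport.

    log_probs[i][j] is the classifier log-probability of class j for
    sample i (assumed cost: -log_probs). Each sample carries mass 1/n and
    class j must receive total mass class_marginal[j].
    """
    n, k = len(log_probs), len(class_marginal)
    # Gibbs kernel exp(-cost / eps) with cost = -log_probs.
    kernel = [[math.exp(lp / eps) for lp in row] for row in log_probs]
    row_mass = 1.0 / n
    u = [1.0] * n
    v = [1.0] * k
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(k))
             for i in range(n)]
        v = [class_marginal[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(k)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]
    # Normalize each row so it reads as a distribution over classes.
    return [[p / sum(row) for p in row] for row in plan]

probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
log_probs = [[math.log(p) for p in row] for row in probs]
pseudolabels = sinkhorn_pseudolabels(log_probs, class_marginal=[0.5, 0.5])
```

Here the classifier favours class 0 for every sample, but the balanced marginal forces half of the total mass onto class 1, so the least confident samples are pulled toward it.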

[420] VectorLiteRAG: Latency-Aware and Fine-Grained Resource Partitioning for Efficient RAG

Junkyum Kim, Divya Mahajan

Main category: cs.LG

TL;DR: VectorLiteRAG is a deployment-friendly RAG system that optimizes GPU resource allocation between vector search and LLM inference to prevent performance degradation without requiring additional hardware.

Details

Motivation: RAG systems face performance challenges when co-locating vector search (memory/I/O intensive) and LLM inference (throughput/latency sensitive) on shared GPU infrastructure, leading to severe degradation under high load.

Method: Introduces fine-grained GPU resource allocation based on performance modeling and access pattern analysis, estimating search latency and query hit rates to find optimal CPU-GPU index partitioning to minimize contention.

Result: Consistently expands SLO compliant request rate range across all configurations (small/large LLMs and vector databases), improving attainable SLO throughput by up to 1.5x without compromising quality or requiring extra resources.

Conclusion: VectorLiteRAG enables latency-compliant RAG inference on existing hardware through intelligent resource partitioning, making RAG systems more practical for production deployment without additional compute investment.

Abstract: Retrieval-Augmented Generation (RAG) systems combine vector similarity search with large language models (LLMs) to deliver accurate, context-aware responses. However, co-locating the vector retriever and the LLM on shared GPU infrastructure introduces significant challenges: vector search is memory and I/O intensive, while LLM inference demands high throughput and low latency. Naive resource sharing often leads to severe performance degradation, particularly under high request load or large index sizes. We present VectorLiteRAG, a deployment-friendly RAG system that achieves latency-compliant inference without requiring additional hardware resources. VectorLiteRAG introduces a fine-grained GPU resource allocation mechanism based on detailed performance modeling and access pattern analysis. By estimating search latency and query hit rate distributions, it identifies an optimal index partitioning point across CPU and GPU tiers to minimize contention and maximize throughput. Our evaluations show that VectorLiteRAG consistently expands the SLO compliant request rate range across all tested configurations, including both small and large LLMs, and small and large vector databases compared to naive baselines and state of the art alternatives. In the best case, VectorLiteRAG improves the attainable SLO throughput by up to 1.5 times without compromising generation quality or requiring additional compute resources.

[421] Apple Intelligence Foundation Language Models: Tech Report 2025

Ethan Li, Anders Boesen Lindbo Larsen, Chen Zhang, Xiyou Zhou, Jun Qin, Dian Ang Yap, Narendran Raghavan, Xuankai Chang, Margit Bowler, Eray Yildiz, John Peebles, Hannah Gillis Coleman, Matteo Ronchi, Peter Gray, Keen You, Anthony Spalvieri-Kruse, Ruoming Pang, Reed Li, Yuli Yang, Emad Soroush, Zhiyun Lu, Crystal Xiao, Rong Situ, Jordan Huffaker, David Griffiths, Zaid Ahmed, Peng Zhang, Daniel Parilla, Asaf Liberman, Jennifer Mallalieu, Parsa Mazaheri, Qibin Chen, Manjot Bilkhu, Aonan Zhang, Eric Wang, Dave Nelson, Michael FitzMaurice, Thomas Voice, Jeremy Liu, Josh Shaffer, Shiwen Zhao, Prasanth Yadla, Farzin Rasteh, Pengsheng Guo, Arsalan Farooq, Jeremy Snow, Stephen Murphy, Tao Lei, Minsik Cho, George Horrell, Sam Dodge, Lindsay Hislop, Sumeet Singh, Alex Dombrowski, Aiswarya Raghavan, Sasha Sirovica, Mandana Saebi, Faye Lao, Max Lam, TJ Lu, Zhaoyang Xu, Karanjeet Singh, Marc Kirchner, David Mizrahi, Rajat Arora, Haotian Zhang, Henry Mason, Lawrence Zhou, Yi Hua, Ankur Jain, Felix Bai, Joseph Astrauskas, Floris Weers, Josh Gardner, Mira Chiang, Yi Zhang, Pulkit Agrawal, Tony Sun, Quentin Keunebroek, Matthew Hopkins, Bugu Wu, Tao Jia, Chen Chen, Xingyu Zhou, Nanzhu Wang, Peng Liu, Ruixuan Hou, Rene Rauch, Yuan Gao, Afshin Dehghan, Jonathan Janke, Zirui Wang, Cha Chen, Xiaoyi Ren, Feng Nan, Josh Elman, Dong Yin, Yusuf Goren, Jeff Lai, Yiran Fei, Syd Evans, Muyang Yu, Guoli Yin, Yi Qin, Erin Feldman, Isha Garg, Aparna Rajamani, Karla Vega, Walker Cheng, TJ Collins, Hans Han, Raul Rea Menacho, Simon Yeung, Sophy Lee, Phani Mutyala, Ying-Chang Cheng, Zhe Gan, Sprite Chu, Justin Lazarow, Alessandro Pappalardo, Federico Scozzafava, Jing Lu, Erik Daxberger, Laurent Duchesne, Jen Liu, David Güera, Stefano Ligas, Mary Beth Kery, Brent Ramerth, Ciro Sannino, Marcin Eichner, Haoshuo Huang, Rui Qian, Moritz Schwarzer-Becker, David Riazati, Mingfei Gao, Bailin Wang, Jack Cackler, Yang Lu, Ransen Niu, John Dennison, Guillaume Klein, Jeffrey Bigham, Deepak Gopinath, Navid 
Shiee, Darren Botten, Guillaume Tartavel, Alex Guillen Garcia, Sam Xu, Victoria MönchJuan Haladjian, Zi-Yi Dou, Matthias Paulik, Adolfo Lopez Mendez, Zhen Li, Hong-You Chen, Chao Jia, Dhaval Doshi, Zhengdong Zhang, Raunak Manjani, Aaron Franklin, Zhile Ren, David Chen, Artsiom Peshko, Nandhitha Raghuram, Hans Hao, Jiulong Shan, Kavya Nerella, Ramsey Tantawi, Vivek Kumar, Saiwen Wang, Brycen Wershing, Bhuwan Dhingra, Dhruti Shah, Ob Adaranijo, Xin Zheng, Tait Madsen, Hadas Kotek, Chang Liu, Yin Xia, Hanli Li, Suma Jayaram, Yanchao Sun, Ahmed Fakhry, Vasileios Saveris, Dustin Withers, Yanghao Li, Alp Aygar, Andres Romero Mier Y Teran, Kaiwei Huang, Mark Lee, Xiujun Li, Yuhong Li, Tyler Johnson, Jay Tang, Joseph Yitan Cheng, Futang Peng, Andrew Walkingshaw, Lucas Guibert, Abhishek Sharma, Cheng Shen, Piotr Maj, Yasutaka Tanaka, You-Cyuan Jhang, Vivian Ma, Tommi Vehvilainen, Kelvin Zou, Jeff Nichols, Matthew Lei, David Qiu, Yihao Qian, Gokul Santhanam, Wentao Wu, Yena Han, Dominik Moritz, Haijing Fu, Mingze Xu, Vivek Rathod, Jian Liu, Louis D’hauwe, Qin Ba, Haitian Sun, Haoran Yan, Philipp Dufter, Anh Nguyen, Yihao Feng, Emma Wang, Keyu He, Rahul Nair, Sanskruti Shah, Jiarui Lu, Patrick Sonnenberg, Jeremy Warner, Yuanzhi Li, Bowen Pan, Ziyi Zhong, Joe Zhou, Sam Davarnia, Olli Saarikivi, Irina Belousova, Rachel Burger, Shang-Chen Wu, Di Feng, Bas Straathof, James Chou, Yuanyang Zhang, Marco Zuliani, Eduardo Jimenez, Abhishek Sundararajan, Xianzhi Du, Chang Lan, Nilesh Shahdadpuri, Peter Grasch, Sergiu Sima, Josh Newnham, Varsha Paidi, Jianyu Wang, Kaelen Haag, Alex Braunstein, Daniele Molinari, Richard Wei, Brenda Yang, Nicholas Lusskin, Joanna Arreaza-Taylor, Meng Cao, Nicholas Seidl, Simon Wang, Jiaming Hu, Yiping Ma, Mengyu Li, Kieran Liu, Hang Su, Sachin Ravi, Chong Wang, Xin Wang, Kevin Smith, Haoxuan You, Binazir Karimzadeh, Rui Li, Jinhao Lei, Wei Fang, Alec Doane, Sam Wiseman, Ismael Fernandez, Jane Li, Andrew Hansen, Javier Movellan, Christopher Neubauer, 
Hanzhi Zhou, Chris Chaney, Nazir Kamaldin, Valentin Wolf, Fernando Bermúdez-Medina, Joris Pelemans, Peter Fu, Howard Xing, Xiang Kong, Wayne Shan, Gabriel Jacoby-Cooper, Dongcai Shen, Tom Gunter, Guillaume Seguin, Fangping Shi, Shiyu Li, Yang Xu, Areeba Kamal, Dan Masi, Saptarshi Guha, Qi Zhu, Jenna Thibodeau, Changyuan Zhang, Rebecca Callahan, Charles Maalouf, Wilson Tsao, Boyue Li, Qingqing Cao, Naomy Sabo, Cheng Leong, Yi Wang, Anupama Mann Anupama, Colorado Reed, Kenneth Jung, Zhifeng Chen, Mohana Prasad Sathya Moorthy, Yifei He, Erik Hornberger, Devi Krishna, Senyu Tong, Michael, Lee, David Haldimann, Yang Zhao, Bowen Zhang, Chang Gao, Chris Bartels, Sushma Rao, Nathalie Tran, Simon Lehnerer, Co Giang, Patrick Dong, Junting Pan, Biyao Wang, Dongxu Li, Mehrdad Farajtabar, Dongseong Hwang, Grace Duanmu, Eshan Verma, Sujeeth Reddy, Qi Shan, Hongbin Gao, Nan Du, Pragnya Sridhar, Forrest Huang, Yingbo Wang, Nikhil Bhendawade, Diane Zhu, Sai Aitharaju, Fred Hohman, Lauren Gardiner, Chung-Cheng Chiu, Yinfei Yang, Alper Kokmen, Frank Chu, Ke Ye, Kaan Elgin, Oron Levy, John Park, Donald Zhang, Eldon Schoop, Nina Wenzel, Michael Booker, Hyunjik Kim, Chinguun Erdenebileg, Nan Dun, Eric Liang Yang, Priyal Chhatrapati, Vishaal Mahtani, Haiming Gang, Kohen Chia, Deepa Seshadri, Donghan Yu, Yan Meng, Kelsey Peterson, Zhen Yang, Yongqiang Wang, Carina Peng, Doug Kang, Anuva Agarwal, Albert Antony, Juan Lao Tebar, Albin Madappally Jose, Regan Poston, Andy De Wang, Gerard Casamayor, Elmira Amirloo, Violet Yao, Wojciech Kryscinski, Kun Duan, Lezhi L

Main category: cs.LG

TL;DR: Apple introduces two multilingual multimodal foundation models: a 3B-parameter on-device model optimized for Apple silicon with KV-cache sharing and 2-bit quantization, and a scalable server model using Parallel-Track Mixture-of-Experts architecture. Both models outperform comparably sized baselines in benchmarks.

Details

Motivation: To power Apple Intelligence features across devices and services with efficient, high-quality multilingual and multimodal AI capabilities that respect user privacy and responsible AI principles.

Method: Developed two models: 1) On-device 3B model with KV-cache sharing and 2-bit quantization-aware training; 2) Server model using Parallel-Track MoE transformer with track parallelism, sparse computation, and interleaved global-local attention. Trained on multilingual/multimodal datasets with web crawling, licensed data, and synthetic data, refined with SFT and RL.

Result: Both models match or surpass comparably sized open baselines in public benchmarks and human evaluations. They support additional languages while understanding images and executing tool calls. A Swift framework enables easy integration for developers.

Conclusion: Apple successfully developed efficient foundation models that deliver competitive performance while maintaining privacy safeguards through Private Cloud Compute and responsible AI practices, making advanced AI capabilities accessible across Apple ecosystem.

Abstract: We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: (i) a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and (ii) a scalable server model built on a novel Parallel-Track Mixture-of-Experts (PT-MoE) transformer that combines track parallelism, mixture-of-experts sparse computation, and interleaved global-local attention to deliver high quality with competitive cost on Apple’s Private Cloud Compute platform. Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling, licensed corpora, and high-quality synthetic data, then further refined with supervised fine-tuning and reinforcement learning on a new asynchronous platform. The resulting models support several additional languages while understanding images and executing tool calls. In public benchmarks and human evaluations, both the server model and the on-device model match or surpass comparably sized open baselines. A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning, allowing developers to integrate these capabilities with a few lines of code. The latest advancements in Apple Intelligence models are grounded in our Responsible AI approach with safeguards like content filtering and locale-specific evaluation, as well as our commitment to protecting our users’ privacy with innovations like Private Cloud Compute.

[422] ChemKANs for Combustion Chemistry Modeling and Acceleration

Benjamin C. Koenig, Suyong Kim, Sili Deng

Main category: cs.LG

TL;DR: ChemKANs is a novel neural network framework that combines Kolmogorov Arnold Network ODEs with chemical kinetic knowledge for improved combustion model inference and simulation acceleration, demonstrating robustness to noise and parameter efficiency.

Details

Motivation: Chemical kinetic model inference in combustion faces challenges with large ODE systems, widely separated time scales, strong nonlinearity, numerical stiffness, and noisy data. Existing machine learning approaches struggle with these complexities.

Method: ChemKANs augments KAN-ODEs with knowledge of information flow through kinetic and thermodynamic laws. This chemistry-specific structure provides inductive bias, streamlined training, and parameter sparsity through shared information across inputs and outputs.

Result: ChemKANs showed no overfitting or degradation with up to 15% noise and large parameterizations. A lean 344-parameter model accurately represented hydrogen combustion with 2x acceleration over detailed chemistry, while being generalizable to turbulent flow simulations.

Conclusion: ChemKANs demonstrate potential as robust, expressive, and efficient tools for model inference and simulation acceleration in combustion physics and chemical kinetics, addressing common deep learning failure modes.

Abstract: Efficient chemical kinetic model inference and application in combustion are challenging due to large ODE systems and widely separated time scales. Machine learning techniques have been proposed to streamline these models, though strong nonlinearity and numerical stiffness combined with noisy data sources make their application challenging. Here, we introduce ChemKANs, a novel neural network framework with applications both in model inference and simulation acceleration for combustion chemistry. ChemKAN’s novel structure augments the generic Kolmogorov Arnold Network Ordinary Differential Equations (KAN-ODEs) with knowledge of the information flow through the relevant kinetic and thermodynamic laws. This chemistry-specific structure combined with the expressivity and rapid neural scaling of the underlying KAN-ODE algorithm instills in ChemKANs a strong inductive bias, streamlined training, and higher accuracy predictions compared to standard benchmarks, while facilitating parameter sparsity through shared information across all inputs and outputs. In a model inference investigation, we benchmark the robustness of ChemKANs to sparse data containing up to 15% added noise, and superfluously large network parameterizations. We find that ChemKANs exhibit no overfitting or model degradation in any of these training cases, demonstrating significant resilience to common deep learning failure modes. Next, we find that a remarkably parameter-lean ChemKAN (344 parameters) can accurately represent hydrogen combustion chemistry, providing a 2x acceleration over the detailed chemistry in a solver that is generalizable to larger-scale turbulent flow simulations. These demonstrations indicate the potential for ChemKANs as robust, expressive, and efficient tools for model inference and simulation acceleration for combustion physics and chemical kinetics.

[423] Breaking Data Silos: Towards Open and Scalable Mobility Foundation Models via Generative Continual Learning

Yuan Yuan, Yukun Liu, Chonghua Han, Jie Feng, Yong Li

Main category: cs.LG

TL;DR: MoveGCL is a privacy-preserving framework for training mobility foundation models using generative continual learning without sharing raw data, achieving performance comparable to joint training while protecting privacy.

Details

Motivation: Foundation models have transformed NLP and computer vision, but building similar models for human mobility is challenging due to privacy concerns and data silos across institutions.

Method: Uses generative continual learning with synthetic trajectory replay from frozen teacher models, Mixture-of-Experts Transformer with mobility-aware routing, and layer-wise progressive adaptation to prevent catastrophic forgetting.

Result: Outperforms federated learning baselines and achieves performance comparable to joint training on six real-world urban datasets while providing strong privacy protection.

Conclusion: MoveGCL enables practical, scalable, and privacy-preserving development of mobility foundation models, representing a crucial step forward in this domain.

Abstract: Foundation models have revolutionized fields such as natural language processing and computer vision by enabling general-purpose learning across diverse tasks and datasets. However, building analogous models for human mobility remains challenging due to the privacy-sensitive nature of mobility data and the resulting data silos across institutions. To bridge this gap, we propose MoveGCL, a scalable and privacy-preserving framework for training mobility foundation models via generative continual learning. Without sharing raw data, MoveGCL enables decentralized and progressive model evolution by replaying synthetic trajectories generated from a frozen teacher model, and reinforces knowledge retention through a tailored distillation strategy that mitigates catastrophic forgetting. To address the heterogeneity of mobility patterns, MoveGCL incorporates a Mixture-of-Experts Transformer with a mobility-aware expert routing mechanism, and employs a layer-wise progressive adaptation strategy to stabilize continual updates. Experiments on six real-world urban datasets demonstrate that MoveGCL achieves performance comparable to joint training and significantly outperforms federated learning baselines, while offering strong privacy protection. MoveGCL marks a crucial step toward unlocking foundation models for mobility, offering a practical blueprint for open, scalable, and privacy-preserving model development in the era of foundation models. To facilitate reproducibility and future research, we have released the code and models at https://github.com/tsinghua-fib-lab/MoveGCL.

[424] Local Learning Rules for Out-of-Equilibrium Physical Generative Models

Cyrill Bösch, Geoffrey Roeder, Marc Serra-Garcia, Ryan P. Adams

Main category: cs.LG

TL;DR: Score-based generative models can be learned through local learning rules using force measurements or observed dynamics, demonstrated with nonlinear oscillator networks for Gaussian mixture sampling and MNIST digit generation.

Details

Motivation: To develop a method for learning out-of-equilibrium driving protocols in score-based generative models using local learning rules that can be implemented in physical systems like oscillator networks.

Method: Using local learning rules to compute gradients for driving protocol parameters from force measurements or observed system dynamics, implemented in networks of driven nonlinear overdamped oscillators coupled to a thermal bath.

Result: Successfully applied to sample from 2D Gaussian mixtures and trained a 12x12 oscillator network to generate handwritten digits 0 and 1 from MNIST dataset.

Conclusion: Score-based generative models can be effectively learned through local learning rules and implemented in physical oscillator networks for practical generative tasks.

Abstract: We show that the out-of-equilibrium driving protocol of score-based generative models (SGMs) can be learned via local learning rules. The gradient with respect to the parameters of the driving protocol is computed directly from force measurements or from observed system dynamics. As a demonstration, we implement an SGM in a network of driven, nonlinear, overdamped oscillators coupled to a thermal bath. We first apply it to the problem of sampling from a mixture of two Gaussians in 2D. Finally, we train a 12x12 oscillator network on the MNIST dataset to generate images of handwritten digits 0 and 1.
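For intuition about the sampling dynamics only, the sketch below runs overdamped Langevin dynamics with the analytic score of a two-Gaussian mixture. It is a 1D simplification of the 2D demonstration, it uses a known score and no learning rule, and all parameter values are assumptions chosen for the example.

```python
import math
import random

def mixture_score(x, mu=2.0, sigma=0.5):
    """d/dx log p(x) for p = 0.5*N(-mu, sigma^2) + 0.5*N(+mu, sigma^2)."""
    centers = (-mu, mu)
    logits = [-(x - c) ** 2 / (2.0 * sigma ** 2) for c in centers]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    pulls = [-(x - c) / sigma ** 2 for c in centers]
    return sum(w * g for w, g in zip(weights, pulls)) / sum(weights)

def langevin_samples(n_particles=1000, n_steps=300, dt=0.01, seed=0):
    """Euler-Maruyama steps: x += score(x)*dt + sqrt(2*dt)*noise."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    noise_scale = math.sqrt(2.0 * dt)
    for _ in range(n_steps):
        xs = [x + mixture_score(x) * dt + noise_scale * rng.gauss(0.0, 1.0)
              for x in xs]
    return xs

samples = langevin_samples()
mean_abs = sum(abs(x) for x in samples) / len(samples)
mean = sum(samples) / len(samples)
```

With these settings the particles should settle around the two modes at -2 and +2; the paper's contribution is learning the driving protocol that plays the role of the score here, which this sketch does not attempt.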

[425] Deep Generative Methods and Tire Architecture Design

Fouad Oubari, Raphael Meunier, Rodrigue Décatoire, Mathilde Mougeot

Main category: cs.LG

TL;DR: Comparative analysis of 5 deep generative models for industrial tire design generation, with diffusion models showing best overall performance and novel categorical inpainting method for conditional generation.

Details

Motivation: Industrial practitioners lack guidance on which deep generative models work best for complex manufacturing design tasks like tire architecture generation.

Method: Evaluated 5 models (VAE, GAN, MMVAE, DDPM, MDM) across three industrial scenarios: unconditional generation, component-conditioned generation, and dimension-constrained generation. Introduced categorical inpainting for conditional diffusion models.

Result: Diffusion models achieved strongest overall performance. Masking-trained VAE outperformed MMVAE+ on component-conditioned metrics. MDM led in-distribution while DDPM generalized better to out-of-distribution constraints.

Conclusion: Diffusion models are most effective for industrial design generation tasks, with specific model strengths depending on the scenario (in-distribution vs out-of-distribution requirements).

Abstract: As deep generative models proliferate across the AI landscape, industrial practitioners still face critical yet unanswered questions about which deep generative models best suit complex manufacturing design tasks. This work addresses this question through a complete study of five representative models (Variational Autoencoder, Generative Adversarial Network, multimodal Variational Autoencoder, Denoising Diffusion Probabilistic Model, and Multinomial Diffusion Model) on industrial tire architecture generation. Our evaluation spans three key industrial scenarios: (i) unconditional generation of complete multi-component designs, (ii) component-conditioned generation (reconstructing architectures from partial observations), and (iii) dimension-constrained generation (creating designs that satisfy specific dimensional requirements). To enable discrete diffusion models to handle conditional scenarios, we introduce categorical inpainting, a mask-aware reverse diffusion process that preserves known labels without requiring additional training. Our evaluation employs geometry-aware metrics specifically calibrated for industrial requirements, quantifying spatial coherence, component interaction, structural connectivity, and perceptual fidelity. Our findings reveal that diffusion models achieve the strongest overall performance; a masking-trained VAE nonetheless outperforms the multimodal variant MMVAE+ on nearly all component-conditioned metrics, and within the diffusion family MDM leads in-distribution whereas DDPM generalises better to out-of-distribution dimensional constraints.
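The idea of a mask-aware reverse process that preserves known labels can be sketched as a loop that re-imposes the known entries after every reverse step. Everything below is illustrative: `toy_reverse_step` is a hypothetical stand-in for a trained discrete denoiser, and clamping the clean labels directly (rather than a noised version of them) is a simplifying assumption, not the paper's procedure.

```python
import random

def toy_reverse_step(labels, step, rng):
    """Hypothetical stand-in for one reverse step of a trained model.

    A real model would condition on `step`; this toy ignores it and lets
    each cell copy its left neighbour (wrapping around) half of the time.
    """
    return [labels[i - 1] if rng.random() < 0.5 else lab
            for i, lab in enumerate(labels)]

def categorical_inpaint(known, mask, num_classes, num_steps=20, seed=0):
    """Generate labels while keeping the entries where mask is True fixed."""
    rng = random.Random(seed)
    # Start from uniform categorical noise over the label set.
    labels = [rng.randrange(num_classes) for _ in known]
    for step in range(num_steps, 0, -1):
        labels = toy_reverse_step(labels, step, rng)
        # Mask-aware correction: overwrite observed positions with known labels.
        labels = [k if m else x for x, k, m in zip(labels, known, mask)]
    return labels

known = [2, 2, 0, 0, 0, 1, 1, 0]
mask = [True, True, False, False, False, True, True, False]
result = categorical_inpaint(known, mask, num_classes=3)
```

Because the correction happens only at sampling time, the denoiser itself needs no retraining, which matches the "without requiring additional training" property described in the abstract.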

[426] Multi-Component VAE with Gaussian Markov Random Field

Fouad Oubari, Mohamed El-Baha, Raphael Meunier, Rodrigue Décatoire, Mathilde Mougeot

Main category: cs.LG

TL;DR: GMRF MCVAE: A novel multi-component VAE that uses Gaussian Markov Random Fields to model cross-component relationships, achieving state-of-the-art performance on complex datasets with intricate dependencies.

Details

Motivation: Current multi-component VAEs use simplified aggregation strategies that neglect critical nuances and compromise structural coherence across generated components, especially for datasets with complex dependencies like industrial assemblies or multi-modal imaging.

Method: Embed Gaussian Markov Random Fields into both prior and posterior distributions of a multi-component VAE to explicitly model cross-component relationships and enable richer representation of complex interactions.

Result: Achieves state-of-the-art performance on synthetic Copula dataset, competitive results on PolyMNIST benchmark, and significantly enhances structural coherence on real-world BIKED dataset.

Conclusion: GMRF MCVAE is particularly well-suited for practical applications requiring robust and realistic modeling of multi-component coherence, effectively addressing the limitations of existing approaches.

Abstract: Multi-component datasets with intricate dependencies, like industrial assemblies or multi-modal imaging, challenge current generative modeling techniques. Existing Multi-component Variational AutoEncoders typically rely on simplified aggregation strategies, neglecting critical nuances and consequently compromising structural coherence across generated components. To explicitly address this gap, we introduce the Gaussian Markov Random Field Multi-Component Variational AutoEncoder (GMRF MCVAE), a novel generative framework embedding Gaussian Markov Random Fields into both prior and posterior distributions. This design choice explicitly models cross-component relationships, enabling richer representation and faithful reproduction of complex interactions. Empirically, our GMRF MCVAE achieves state-of-the-art performance on a synthetic Copula dataset specifically constructed to evaluate intricate component relationships, demonstrates competitive results on the PolyMNIST benchmark, and significantly enhances structural coherence on the real-world BIKED dataset. Our results indicate that the GMRF MCVAE is especially suited for practical applications demanding robust and realistic modeling of multi-component coherence.

[427] From Taylor Series to Fourier Synthesis: The Periodic Linear Unit

Shiko Kudo

Main category: cs.LG

TL;DR: PLU activation function uses learnable sine waves for periodic non-monotonicity, enabling minimal MLPs to solve complex tasks like spiral classification that standard activations cannot handle.

Details

Motivation: Current neural networks rely on simple monotonic activations like ReLU, requiring massive parameterization to approximate complex functions. The authors seek to improve parameter efficiency through more expressive neuron design.

Method: Introduces Periodic Linear Unit (PLU) - a learnable sine-wave based activation with periodic non-monotonicity. Uses Repulsive Reparameterization to prevent collapse into linear functions and ensure numerical stability.

Result: A minimal MLP with only two PLU neurons can solve the spiral classification task, which is impossible for equivalent networks using standard activation functions.

Conclusion: PLU enables a paradigm shift from piecewise Taylor-like approximators to Fourier-like function synthesizers, achieving exponential gains in parameter efficiency by making neurons more intelligent.

Abstract: The dominant paradigm in modern neural networks relies on simple, monotonically-increasing activation functions like ReLU. While effective, this paradigm necessitates large, massively-parameterized models to approximate complex functions. In this paper, we introduce the Periodic Linear Unit (PLU), a learnable sine-wave based activation with periodic non-monotonicity. PLU is designed for maximum expressive power and numerical stability, achieved through its formulation and a paired innovation we term Repulsive Reparameterization, which prevents the activation from collapsing into a non-expressive linear function. We demonstrate that a minimal MLP with only two PLU neurons can solve the spiral classification task, a feat impossible for equivalent networks using standard activations. This suggests a paradigm shift from networks as piecewise Taylor-like approximators to powerful Fourier-like function synthesizers, achieving exponential gains in parameter efficiency by placing intelligence in the neuron itself.

[428] On Task Vectors and Gradients

Luca Zhou, Daniele Solombrino, Donato Crisostomi, Maria Sofia Bucarelli, Giuseppe Alessio D’Inverno, Fabrizio Silvestri, Emanuele Rodolà

Main category: cs.LG

TL;DR: Task arithmetic works because task vectors approximate negative gradients of task losses, with first-epoch gradients dominating the finetuning process, making single-epoch models sufficient for effective merging.

Details

Motivation: Despite empirical success of task arithmetic for model merging, there was no clear theoretical explanation for why and when it works effectively.

Method: Established theoretical connection between task vectors and gradients of task losses, proved equivalence under gradient descent, bounded error terms for multi-epoch settings, and conducted empirical analysis across seven vision benchmarks.

Result: Under standard gradient descent, a task vector from one epoch of finetuning equals the negative gradient of the task loss scaled by the learning rate. First-epoch gradients dominate the finetuning trajectory in both norm and direction. Merging single-epoch models often yields performance comparable to merging fully converged models.

Conclusion: Task arithmetic is a form of approximate multitask learning, with early training dynamics playing a critical role in model merging effectiveness.

Abstract: Task arithmetic has emerged as a simple yet powerful technique for model merging, enabling the combination of multiple finetuned models into one. Despite its empirical success, a clear theoretical explanation of why and when it works is lacking. This paper provides a rigorous theoretical foundation for task arithmetic by establishing a connection between task vectors and gradients of the task losses. We show that under standard gradient descent, a task vector generated from one epoch of finetuning is exactly equivalent to the negative gradient of the loss, scaled by the learning rate. For the practical multi-epoch setting, we prove that this equivalence holds approximately, with a second-order error term that we explicitly bound for feed-forward networks. Our empirical analysis across seven vision benchmarks corroborates our theory, demonstrating that the first-epoch gradient dominates the finetuning trajectory in both norm and direction. A key implication is that merging models finetuned for only a single epoch often yields performance comparable to merging fully converged models. These findings reframe task arithmetic as a form of approximate multitask learning, providing a clear rationale for its effectiveness and highlighting the critical role of early training dynamics in model merging.
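The one-epoch equivalence can be checked numerically on a toy problem. The sketch below treats one epoch as a single full-batch gradient descent step on a linear model. The data, the names grad_mse and gd_step, and the learning rate are illustrative assumptions, not the paper's setup.

```python
def grad_mse(theta, data):
    """Full-batch gradient of 0.5 * mean((theta . x - y)^2) for a linear model."""
    n = len(data)
    g = [0.0] * len(theta)
    for x, y in data:
        err = sum(t * xi for t, xi in zip(theta, x)) - y
        for j, xi in enumerate(x):
            g[j] += err * xi / n
    return g

def gd_step(theta, data, lr):
    """One full-batch gradient descent step (one 'epoch' in this toy setting)."""
    return [t - lr * gj for t, gj in zip(theta, grad_mse(theta, data))]

theta0 = [0.5, -0.25]  # shared "pretrained" weights (made up)
lr = 0.1
task_a = [([1.0, 0.0], 2.0), ([0.0, 1.0], 1.0)]
task_b = [([1.0, 1.0], 0.0), ([2.0, -1.0], 3.0)]

# Task vector = finetuned weights minus pretrained weights.
tau_a = [f - p for f, p in zip(gd_step(theta0, task_a, lr), theta0)]
tau_b = [f - p for f, p in zip(gd_step(theta0, task_b, lr), theta0)]

# Claim 1: the task vector equals the negative gradient scaled by the learning rate.
neg_grad_a = [-lr * g for g in grad_mse(theta0, task_a)]

# Claim 2: adding both task vectors equals one step on the summed task loss.
merged = [p + a + b for p, a, b in zip(theta0, tau_a, tau_b)]
g_a, g_b = grad_mse(theta0, task_a), grad_mse(theta0, task_b)
multitask = [p - lr * (ga + gb) for p, ga, gb in zip(theta0, g_a, g_b)]
```

In this single-step setting, merging by task arithmetic and taking one multitask gradient step coincide up to floating-point rounding. This is the sense in which the abstract calls task arithmetic approximate multitask learning.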

[429] LNN-PINN: A Unified Physics-Only Training Framework with Liquid Residual Blocks

Ze Tao, Hanxuan Wang, Fujun Liu

Main category: cs.LG

TL;DR: LNN-PINN is a novel physics-informed neural network framework that uses liquid residual gating to improve predictive accuracy while maintaining the original physics modeling pipeline.

Details

Motivation: Standard PINNs often show limited predictive accuracy in complex problems, requiring architectural improvements without changing the fundamental physics modeling approach.

Method: Incorporates lightweight liquid residual gating mechanism within hidden-layer mapping while keeping sampling strategy, loss composition, and hyperparameters unchanged.

Result: Consistently reduced RMSE and MAE across four benchmark problems under identical training conditions, with absolute error plots confirming accuracy gains. Shows strong adaptability across varying dimensions, boundary conditions, and operator characteristics.

Conclusion: LNN-PINN provides an effective architectural enhancement that improves PINN predictive accuracy for complex scientific and engineering problems while preserving the original physics modeling framework.

Abstract: Physics-informed neural networks (PINNs) have attracted considerable attention for their ability to integrate partial differential equation priors into deep learning frameworks; however, they often exhibit limited predictive accuracy when applied to complex problems. To address this issue, we propose LNN-PINN, a physics-informed neural network framework that incorporates a liquid residual gating architecture while preserving the original physics modeling and optimization pipeline to improve predictive accuracy. The method introduces a lightweight gating mechanism solely within the hidden-layer mapping, keeping the sampling strategy, loss composition, and hyperparameter settings unchanged to ensure that improvements arise purely from architectural refinement. Across four benchmark problems, LNN-PINN consistently reduced RMSE and MAE under identical training conditions, with absolute error plots further confirming its accuracy gains. Moreover, the framework demonstrates strong adaptability and stability across varying dimensions, boundary conditions, and operator characteristics. In summary, LNN-PINN offers a concise and effective architectural enhancement for improving the predictive accuracy of physics-informed neural networks in complex scientific and engineering problems.

[430] Prototype-Guided Diffusion: Visual Conditioning without External Memory

Bilal Faye, Hanane Azzag, Mustapha Lebbah

Main category: cs.LG

TL;DR: PDM integrates prototype learning into diffusion models for efficient visual conditioning without external memory, using compact visual prototypes instead of retrieval systems.

Details

Motivation: Current diffusion models are computationally intensive, and retrieval-based methods require costly storage infrastructure and lack adaptability during training.

Method: Constructs dynamic visual prototypes from clean image features using contrastive learning, which guide denoising by aligning noisy representations with relevant visual patterns.

Result: Maintains high generation quality while reducing computational and storage overhead compared to retrieval-based methods.

Conclusion: PDM offers a scalable alternative to retrieval-based conditioning in diffusion models through efficient prototype-based guidance.

Abstract: Diffusion models have emerged as a leading framework for high-quality image generation, offering stable training and strong performance across diverse domains. However, they remain computationally intensive, particularly during the iterative denoising process. Latent-space models like Stable Diffusion alleviate some of this cost by operating in compressed representations, though at the expense of fine-grained detail. More recent approaches such as Retrieval-Augmented Diffusion Models (RDM) address efficiency by conditioning denoising on similar examples retrieved from large external memory banks. While effective, these methods introduce drawbacks: they require costly storage and retrieval infrastructure, depend on static vision-language models like CLIP for similarity, and lack adaptability during training. We propose the Prototype Diffusion Model (PDM), a method that integrates prototype learning directly into the diffusion process for efficient and adaptive visual conditioning - without external memory. Instead of retrieving reference samples, PDM constructs a dynamic set of compact visual prototypes from clean image features using contrastive learning. These prototypes guide the denoising steps by aligning noisy representations with semantically relevant visual patterns, enabling efficient generation with strong semantic grounding. Experiments show that PDM maintains high generation quality while reducing computational and storage overhead, offering a scalable alternative to retrieval-based conditioning in diffusion models.

[431] CALR: Corrective Adaptive Low-Rank Decomposition for Efficient Large Language Model Layer Compression

Muchammad Daniyal Kautsar, Afra Majida Hariono, Widyawan, Syukron Abu Ishaq Alfarozi, Kuntpong Woraratpanya

Main category: cs.LG

TL;DR: CALR introduces a corrective low-rank decomposition method that combines SVD compression with learnable corrective modules to maintain functional performance while significantly reducing LLM size.

Details

Motivation: Large LLMs are computationally expensive to deploy, and existing compression methods like SVD cause substantial performance degradation by focusing only on matrix reconstruction error rather than functional performance.

Method: CALR uses a two-component approach: primary SVD-compressed layers plus parallel learnable low-rank corrective modules trained to recover functional residual error.

Result: CALR reduces parameters by 26.93-51.77% while retaining 59.45-90.42% of original performance, outperforming LaCo, ShortGPT, and LoSparse on multiple models.

Conclusion: Treating functional information loss as a learnable signal is an effective compression paradigm that enables smaller, more efficient LLMs for practical deployment.

Abstract: Large Language Models (LLMs) present significant deployment challenges due to their immense size and computational requirements. Model compression techniques are essential for making these models practical for resource-constrained environments. A prominent compression strategy is low-rank factorization via Singular Value Decomposition (SVD) to reduce model parameters by approximating weight matrices. However, standard SVD focuses on minimizing matrix reconstruction error, often leading to a substantial loss of the model’s functional performance. This performance degradation occurs because existing methods do not adequately correct for the functional information lost during compression. To address this gap, we introduce Corrective Adaptive Low-Rank Decomposition (CALR), a two-component compression approach. CALR combines a primary path of SVD-compressed layers with a parallel, learnable, low-rank corrective module that is explicitly trained to recover the functional residual error. Our experimental evaluation on SmolLM2-135M, Qwen3-0.6B, and Llama-3.2-1B demonstrates that CALR can reduce parameter counts by 26.93% to 51.77% while retaining 59.45% to 90.42% of the original model’s performance, consistently outperforming LaCo, ShortGPT, and LoSparse. CALR’s success shows that treating functional information loss as a learnable signal is a highly effective compression paradigm. This approach enables the creation of significantly smaller, more efficient LLMs, advancing their accessibility and practical deployment in real-world applications.
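One hedged way to write down the two-path layer described above is shown below. The notation (W, r, k, A, B) and the expected-output-error objective are assumptions chosen for illustration; the paper's exact loss and rank-selection rule are not stated in this summary.

```latex
% Assumed notation: W \in \mathbb{R}^{m \times n} is an original weight matrix,
% r is the retained SVD rank, k is the rank of the corrective module.
W \;\approx\; \underbrace{U_r \Sigma_r V_r^{\top}}_{\text{SVD path}}
      \;+\; \underbrace{A B}_{\text{corrective path}},
\qquad A \in \mathbb{R}^{m \times k},\quad B \in \mathbb{R}^{k \times n}

% Reconstruction view (plain SVD): minimize the weight error \| W - U_r \Sigma_r V_r^{\top} \|_F.
% Functional view (assumed form of the corrective objective): fit A, B on layer inputs x.
\min_{A,\,B}\; \mathbb{E}_{x}\,
  \bigl\| W x - \bigl( U_r \Sigma_r V_r^{\top} + A B \bigr) x \bigr\|_2^2

% Parameter count, folding \Sigma_r into one factor:
m n \;\longrightarrow\; (r + k)(m + n)
```

Under these assumptions the layer is smaller than the original whenever (r + k)(m + n) < mn, so the corrective rank k trades directly against the SVD rank r within a fixed parameter budget.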

[432] Comparison of Data Reduction Criteria for Online Gaussian Processes

Thore Wietzke, Knut Graichen

Main category: cs.LG

TL;DR: Comparison of reduction criteria for online Gaussian Processes to handle streaming data efficiently by removing redundant datapoints while maintaining performance.

Details

Motivation: Gaussian Processes have high computational complexity that limits their use to small datasets, making them intractable for streaming scenarios where data accumulates continuously.

Method: Analyzed several reduction criteria for online GPs, evaluating computational complexity and reduction behavior on benchmark functions and real-world datasets including dynamic system identification tasks. Proposed additional acceptance criteria to filter redundant datapoints.

Result: Provided comprehensive comparison of reduction criteria performance, yielding practical guidelines for selecting appropriate criteria in online GP algorithms.

Conclusion: This work offers valuable insights and practical recommendations for choosing effective reduction criteria to enable efficient online Gaussian Process regression in streaming data scenarios.

Abstract: Gaussian Processes (GPs) are widely used for regression and system identification due to their flexibility and ability to quantify uncertainty. However, their computational complexity limits their applicability to small datasets. Moreover, in a streaming scenario, more and more datapoints accumulate, which is intractable even for Sparse GPs. Online GPs aim to alleviate this problem by, e.g., defining a maximum budget of datapoints and removing redundant datapoints. This work provides a unified comparison of several reduction criteria, analyzing both their computational complexity and reduction behavior. The criteria are evaluated on benchmark functions and real-world datasets, including dynamic system identification tasks. Additionally, acceptance criteria are proposed to further filter out redundant datapoints. This work yields practical guidelines for choosing a suitable criterion for an online GP algorithm.
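The budget-plus-criteria idea can be sketched without a full GP. The specific criteria compared in the paper are not named in this summary, so the kernel-similarity rules below are stand-ins, and the names rbf, most_redundant, online_update, budget, and accept_below are hypothetical. A real online GP would typically base these decisions on posterior quantities.

```python
import math

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)

def most_redundant(points):
    """Index of the point with the highest kernel similarity to any other point."""
    best_i, best_sim = 0, -1.0
    for i, p in enumerate(points):
        sim = max(rbf(p, q) for j, q in enumerate(points) if j != i)
        if sim > best_sim:
            best_i, best_sim = i, sim
    return best_i

def online_update(points, x_new, budget=5, accept_below=0.99):
    """Process one streamed input under a fixed datapoint budget."""
    # Acceptance criterion (assumed): skip inputs nearly identical to a stored one.
    if points and max(rbf(x_new, p) for p in points) > accept_below:
        return points
    points = points + [x_new]
    # Reduction criterion (assumed): over budget, drop the most redundant point.
    if len(points) > budget:
        points.pop(most_redundant(points))
    return points

stream = [0.0, 0.01, 1.0, 2.0, 2.05, 3.0, 4.0, 5.0, 5.5]
kept = []
for x in stream:
    kept = online_update(kept, x)
```

The acceptance step keeps near-duplicates such as 0.01 and 2.05 from ever entering the set, and the reduction step only runs once the budget is exceeded. Both steps scan the stored points, which is why the cost of a criterion matters in a streaming setting.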

[433] Finite-Width Neural Tangent Kernels from Feynman Diagrams

Max Guillen, Philipp Misof, Jan E. Gerken

Main category: cs.LG

TL;DR: The paper introduces Feynman diagrams to compute finite-width corrections to neural tangent kernels (NTKs), enabling analysis of training dynamics beyond the infinite-width limit where important properties like NTK evolution and feature learning are absent.

Details

Motivation: While NTKs provide powerful analytical tools for deep neural networks in the infinite-width limit, this limit lacks important training properties like NTK evolution and feature learning. Finite-width effects need to be incorporated to better understand real-world network behavior.

Method: The authors develop Feynman diagrams for computing finite-width corrections to NTK statistics. This approach simplifies algebraic manipulations and enables computation of layer-wise recursive relations for various statistics including preactivations, NTKs, and higher-derivative tensors (dNTK and ddNTK).

Result: The framework enables extension of stability results from preactivations to NTKs and proves absence of finite-width corrections for scale-invariant nonlinearities like ReLU on the diagonal of the NTK Gram matrix. Numerical experiments validate the theoretical results.

Conclusion: Feynman diagrams provide an effective framework for systematically computing finite-width corrections to NTK statistics, bridging the gap between infinite-width theory and practical finite-width neural network training dynamics.

Abstract: Neural tangent kernels (NTKs) are a powerful tool for analyzing deep, non-linear neural networks. In the infinite-width limit, NTKs can easily be computed for most common architectures, yielding full analytic control over the training dynamics. However, at infinite width, important properties of training such as NTK evolution or feature learning are absent. Nevertheless, finite width effects can be included by computing corrections to the Gaussian statistics at infinite width. We introduce Feynman diagrams for computing finite-width corrections to NTK statistics. These dramatically simplify the necessary algebraic manipulations and enable the computation of layer-wise recursive relations for arbitrary statistics involving preactivations, NTKs and certain higher-derivative tensors (dNTK and ddNTK) required to predict the training dynamics at leading order. We demonstrate the feasibility of our framework by extending stability results for deep networks from preactivations to NTKs and proving the absence of finite-width corrections for scale-invariant nonlinearities such as ReLU on the diagonal of the Gram matrix of the NTK. We validate our results with numerical experiments.

[434] Cohort-Aware Agents for Individualized Lung Cancer Risk Prediction Using a Retrieval-Augmented Model Selection Framework

Chongyu Qu, Allen J. Luna, Thomas Z. Li, Junchao Zhu, Junlin Guo, Juming Xiong, Kim L. Sandler, Bennett A. Landman, Yuankai Huo

Main category: cs.LG

TL;DR: Personalized lung cancer risk prediction agent that dynamically selects optimal model for each patient using cohort retrieval and LLM reasoning.

Details

Motivation: Address variability in lung cancer risk prediction across different patient populations and clinical settings where no single model performs best for all cohorts.

Method: Two-stage pipeline: 1) FAISS-based similarity search to retrieve most relevant patient cohort from multi-institutional database, 2) LLM prompting with retrieved cohort and performance metrics to recommend optimal prediction algorithm from pool of 8 models including classical, temporal, and multi-modal approaches.

Result: Enables dynamic, cohort-aware risk prediction personalized to each patient’s profile using CT scans and structured metadata.

Conclusion: Provides flexible, cohort-driven model selection across diverse clinical populations, offering practical individualized risk assessment in lung cancer screening.

Abstract: Accurate lung cancer risk prediction remains challenging due to substantial variability across patient populations and clinical settings – no single model performs best for all cohorts. To address this, we propose a personalized lung cancer risk prediction agent that dynamically selects the most appropriate model for each patient by combining cohort-specific knowledge with modern retrieval and reasoning techniques. Given a patient’s CT scan and structured metadata – including demographic, clinical, and nodule-level features – the agent first performs cohort retrieval using FAISS-based similarity search across nine diverse real-world cohorts to identify the most relevant patient population from a multi-institutional database. Second, a Large Language Model (LLM) is prompted with the retrieved cohort and its associated performance metrics to recommend the optimal prediction algorithm from a pool of eight representative models, including classical linear risk models (e.g., Mayo, Brock), temporally-aware models (e.g., TD-VIT, DLSTM), and multi-modal computer vision-based approaches (e.g., Liao, Sybil, DLS, DLI). This two-stage agent pipeline – retrieval via FAISS and reasoning via LLM – enables dynamic, cohort-aware risk prediction personalized to each patient’s profile. Building on this architecture, the agent supports flexible and cohort-driven model selection across diverse clinical populations, offering a practical path toward individualized risk assessment in real-world lung cancer screening.
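The two-stage flow (retrieve a cohort, then choose a model for it) can be sketched as follows. This is a simplified stand-in, not the authors' pipeline: brute-force nearest-centroid search replaces the FAISS index, a plain argmax over a metric table replaces the LLM recommendation, and every feature vector and AUC value below is a made-up placeholder. Only the model names Mayo and Sybil come from the abstract.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical cohort summaries in a 3-dimensional feature space (placeholder values).
COHORT_CENTROIDS = {
    "cohort_A": [0.2, 0.7, 0.1],
    "cohort_B": [0.8, 0.3, 0.5],
}

# Hypothetical per-cohort validation AUCs (placeholder values, not reported results).
COHORT_AUC = {
    "cohort_A": {"Mayo": 0.70, "Sybil": 0.80},
    "cohort_B": {"Mayo": 0.75, "Sybil": 0.72},
}

def retrieve_cohort(patient):
    """Stage 1: nearest cohort by distance (stand-in for the similarity search)."""
    return min(COHORT_CENTROIDS, key=lambda c: squared_l2(patient, COHORT_CENTROIDS[c]))

def select_model(patient):
    """Stage 2: pick the best-scoring model for the retrieved cohort."""
    cohort = retrieve_cohort(patient)
    metrics = COHORT_AUC[cohort]
    return cohort, max(metrics, key=metrics.get)

choice_1 = select_model([0.25, 0.65, 0.15])
choice_2 = select_model([0.9, 0.2, 0.4])
```

In the described system an LLM is prompted with the retrieved cohort and its performance metrics instead of applying a fixed argmax, which lets the recommendation weigh more than one metric.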

[435] Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning

Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Jiale Zhao, Jingwen Yang, Jianwei Lv, Kongcheng Zhang, Yihe Zhou, Hengtong Lu, Wei Chen, Yan Xie, Mingli Song

Main category: cs.LG

TL;DR: RuscaRL is a novel reinforcement learning framework that uses checklist-style rubrics to break the exploration bottleneck in LLM reasoning, achieving significant performance improvements on benchmarks like HealthBench-500.

Details

Motivation: To address the fundamental dilemma where RL improvement in LLMs relies on high-quality samples but exploration is limited by the model's inherent capabilities, creating a cycle where what cannot be explored cannot be learned.

Method: Introduces Rubric-Scaffolded Reinforcement Learning (RuscaRL) with checklist-style rubrics as explicit scaffolding for exploration during rollout generation and verifiable rewards for exploitation during training. The guidance is gradually decayed to encourage internalization of reasoning patterns.

Result: Significant performance improvements across various benchmarks, boosting Qwen2.5-7B-Instruct from 23.6 to 50.3 on HealthBench-500 (surpassing GPT-4.1) and achieving 61.1 with Qwen3-30B-A3B-Instruct, outperforming leading LLMs including OpenAI-o3.

Conclusion: RuscaRL effectively expands reasoning boundaries and breaks the exploration bottleneck for general LLM reasoning. The work is still in progress; the authors plan to release the code, models, and datasets.

Abstract: Recent advances in Large Language Models (LLMs) have underscored the potential of Reinforcement Learning (RL) to facilitate the emergence of reasoning capabilities. Despite the encouraging results, a fundamental dilemma persists as RL improvement relies on learning from high-quality samples, yet the exploration for such samples remains bounded by the inherent limitations of LLMs. This, in effect, creates an undesirable cycle in which what cannot be explored cannot be learned. In this work, we propose Rubric-Scaffolded Reinforcement Learning (RuscaRL), a novel instructional scaffolding framework designed to break the exploration bottleneck for general LLM reasoning. Specifically, RuscaRL introduces checklist-style rubrics as (1) explicit scaffolding for exploration during rollout generation, where different rubrics are provided as external guidance within task instructions to steer diverse high-quality responses. This guidance is gradually decayed over time, encouraging the model to internalize the underlying reasoning patterns; (2) verifiable rewards for exploitation during model training, where we can obtain robust LLM-as-a-Judge scores using rubrics as references, enabling effective RL on general reasoning tasks. Extensive experiments demonstrate the superiority of the proposed RuscaRL across various benchmarks, effectively expanding reasoning boundaries under the best-of-N evaluation. Notably, RuscaRL significantly boosts Qwen2.5-7B-Instruct from 23.6 to 50.3 on HealthBench-500, surpassing GPT-4.1. Furthermore, our fine-tuned variant on Qwen3-30B-A3B-Instruct achieves 61.1 on HealthBench-500, outperforming leading LLMs including OpenAI-o3. This work is still in progress, and we will release the code, the models, and the datasets soon.
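The two roles of a rubric described above (decaying scaffold during rollouts, reward reference during training) can be sketched as follows. The linear decay schedule, the prompt layout, and the names scaffold_ratio, build_prompt, and rubric_reward are assumptions. The LLM judge is replaced by a list of booleans, and the rubric items are invented examples.

```python
def scaffold_ratio(step, total_steps):
    """Fraction of rubric items shown to the policy; linear decay is an assumed schedule."""
    return max(0.0, 1.0 - step / total_steps)

def build_prompt(question, rubric, step, total_steps):
    """Attach a shrinking checklist to the task instruction during rollout generation."""
    n_shown = round(scaffold_ratio(step, total_steps) * len(rubric))
    hints = rubric[:n_shown]
    if not hints:
        return question
    return question + "\nChecklist:\n" + "\n".join("- " + h for h in hints)

def rubric_reward(judgements):
    """Reward = fraction of rubric criteria the judge marked as satisfied."""
    return sum(judgements) / len(judgements)

rubric = [
    "states the likely cause",
    "lists warning signs",
    "recommends follow-up",
    "avoids definitive diagnosis",
]
early = build_prompt("What could cause this symptom?", rubric, step=0, total_steps=100)
middle = build_prompt("What could cause this symptom?", rubric, step=50, total_steps=100)
late = build_prompt("What could cause this symptom?", rubric, step=100, total_steps=100)

# Stand-in for LLM-as-a-Judge output: one boolean per rubric criterion.
reward = rubric_reward([True, True, False, True])
```

Early in training the policy sees the full checklist, so high-quality responses are easier to sample. By the end it sees only the question, while the reward is still computed against the full rubric.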

[436] Curvature Learning for Generalization of Hyperbolic Neural Networks

Xiaomeng Fan, Yuwei Wu, Zhi Gao, Mehrtash Harandi, Yunde Jia

Main category: cs.LG

TL;DR: This paper develops a theoretical foundation for curvature’s role in hyperbolic neural networks (HNNs), proposes a sharpness-aware curvature learning method to improve generalization, and validates it through extensive experiments.

Details

Motivation: Curvature plays a crucial role in HNN performance but inappropriate curvatures can cause suboptimal convergence. The theoretical understanding of curvature's effect on HNN generalization was lacking.

Method: Derived PAC-Bayesian generalization bound for HNNs, designed scope sharpness measure for curvatures, implemented bi-level optimization with implicit differentiation algorithm for efficient gradient approximation.

Result: The method shows improved performance across four settings: classification, long-tailed data learning, noisy data learning, and few-shot learning. Approximation error is upper-bounded and convergence is guaranteed.

Conclusion: The proposed sharpness-aware curvature learning method effectively smooths the loss landscape and improves HNN generalization, with theoretical guarantees and empirical validation across multiple learning scenarios.

Abstract: Hyperbolic neural networks (HNNs) have demonstrated notable efficacy in representing real-world data with hierarchical structures via exploiting the geometric properties of hyperbolic spaces characterized by negative curvatures. Curvature plays a crucial role in optimizing HNNs. Inappropriate curvatures may cause HNNs to converge to suboptimal parameters, degrading overall performance. So far, the theoretical foundation of the effect of curvatures on HNNs has not been developed. In this paper, we derive a PAC-Bayesian generalization bound of HNNs, highlighting the role of curvatures in the generalization of HNNs via their effect on the smoothness of the loss landscape. Driven by the derived bound, we propose a sharpness-aware curvature learning method to smooth the loss landscape, thereby improving the generalization of HNNs. In our method, we design a scope sharpness measure for curvatures, which is minimized through a bi-level optimization process. Then, we introduce an implicit differentiation algorithm that efficiently solves the bi-level optimization by approximating gradients of curvatures. We present the approximation error and convergence analyses of the proposed method, showing that the approximation error is upper-bounded, and the proposed method can converge by bounding gradients of HNNs. Experiments on four settings: classification, learning from long-tailed data, learning from noisy data, and few-shot learning show that our method can improve the performance of HNNs.

[437] Quantum Graph Attention Network: A Novel Quantum Multi-Head Attention Mechanism for Graph Learning

An Ning, Tai Yue Li, Nan Yow Chen

Main category: cs.LG

TL;DR: QGAT integrates variational quantum circuits into graph attention mechanisms, using quantum parallelism to generate multiple attention coefficients simultaneously, reducing computational overhead while improving expressiveness and robustness.

Details

Motivation: To enhance graph neural networks by leveraging quantum computing advantages - specifically quantum parallelism and expressive nonlinear interactions - to improve attention mechanisms while reducing computational complexity.

Method: Uses strongly entangling quantum circuits with amplitude-encoded node features, with a single quantum circuit generating multiple attention coefficients simultaneously. Classical projection weights and quantum parameters are optimized end-to-end.

Result: Demonstrates effectiveness in capturing complex structural dependencies, improved generalization in inductive scenarios, enhanced robustness against feature and structural noise, and reduced computational overhead.

Conclusion: QGAT shows potential for scalable quantum-enhanced learning across domains like chemistry and biology, with modular design allowing easy integration into existing classical attention-based architectures.

Abstract: We propose the Quantum Graph Attention Network (QGAT), a hybrid graph neural network that integrates variational quantum circuits into the attention mechanism. At its core, QGAT employs strongly entangling quantum circuits with amplitude-encoded node features to enable expressive nonlinear interactions. Distinct from classical multi-head attention that separately computes each head, QGAT leverages a single quantum circuit to simultaneously generate multiple attention coefficients. This quantum parallelism facilitates parameter sharing across heads, substantially reducing computational overhead and model complexity. Classical projection weights and quantum circuit parameters are optimized jointly in an end-to-end manner, ensuring flexible adaptation to learning tasks. Empirical results demonstrate QGAT’s effectiveness in capturing complex structural dependencies and improved generalization in inductive scenarios, highlighting its potential for scalable quantum-enhanced learning across domains such as chemistry, biology, and network analysis. Furthermore, experiments confirm that quantum embedding enhances robustness against feature and structural noise, suggesting advantages in handling real-world noisy data. The modularity of QGAT also ensures straightforward integration into existing architectures, allowing it to easily augment classical attention-based models.

[438] Unlearning as Ablation: Toward a Falsifiable Benchmark for Generative Scientific Discovery

Robert Yang

Main category: cs.LG

TL;DR: Proposes unlearning-as-ablation as a method to test whether LLMs can generate new scientific knowledge or just remix memorized content by systematically removing target results and evaluating re-derivation capability.

Details

Motivation: To address the epistemic question of whether large language models truly generate new knowledge or merely remix memorized fragments, particularly in scientific contexts where bold claims about AI's role are being made.

Method: Unlearning-as-ablation: systematically remove a target result along with its forget-closure (supporting lemmas, paraphrases, and multi-hop entailments), then evaluate if the model can re-derive the result using only permitted axioms and tools.

Result: Proposes a conceptual framework and outlines a minimal pilot in mathematics and algorithms to illustrate feasibility. Success would indicate generative capability beyond recall; failure would expose current limits. No empirical results are reported.

Conclusion: Position paper offering conceptual and methodological contribution to stimulate discussion on using principled ablation tests to distinguish knowledge reconstruction from retrieval, and to guide next-generation AI-for-Science benchmarks.

Abstract: Bold claims about AI’s role in science, from “AGI will cure all diseases” to promises of radically accelerated discovery, raise a central epistemic question: do large language models (LLMs) truly generate new knowledge, or do they merely remix memorized fragments? We propose unlearning-as-ablation as a falsifiable probe of constructive scientific discovery. The idea is to systematically remove a target result together with its forget-closure (supporting lemmas, paraphrases, and multi-hop entailments) and then evaluate whether the model can re-derive the result from only permitted axioms and tools. Success would indicate generative capability beyond recall; failure would expose current limits. Unlike prevailing motivations for unlearning (privacy, copyright, or safety), our framing repositions it as an epistemic probe for AI-for-Science. We outline a minimal pilot in mathematics and algorithms to illustrate feasibility, and sketch how the same approach could later be extended to domains such as physics or chemistry. This is a position paper: our contribution is conceptual and methodological, not empirical. We aim to stimulate discussion on how principled ablation tests could help distinguish models that reconstruct knowledge from those that merely retrieve it, and how such probes might guide the next generation of AI-for-Science benchmarks.
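The forget-closure can be read as a reachability set over a dependency graph. The sketch below assumes such a graph is available; the node names, the related mapping, and the function forget_closure are hypothetical, and the paper does not prescribe this construction.

```python
from collections import deque

def forget_closure(target, related, permitted):
    """Collect the target plus everything reachable from it, except permitted axioms.

    `related` maps an item to the lemmas, paraphrases, and entailments tied to it.
    """
    closure, queue = set(), deque([target])
    while queue:
        item = queue.popleft()
        if item in closure or item in permitted:
            continue  # already scheduled for removal, or explicitly allowed to stay
        closure.add(item)
        queue.extend(related.get(item, ()))
    return closure

# Hypothetical dependency graph for a target theorem.
related = {
    "theorem_T": ["lemma_1", "paraphrase_T"],
    "lemma_1": ["lemma_2", "axiom_A"],
    "lemma_2": ["axiom_B"],
    "paraphrase_T": [],
}
permitted = {"axiom_A", "axiom_B"}

to_forget = forget_closure("theorem_T", related, permitted)
```

Everything in to_forget would be removed before asking the model to re-derive theorem_T from the permitted axioms. Multi-hop items such as lemma_2 are included because leaving them in place would let the model recall the result rather than reconstruct it.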

[439] Ada-TransGNN: An Air Quality Prediction Model Based On Adaptive Graph Convolutional Networks

Dan Wang, Feng Jiang, Zhanquan Wang

Main category: cs.LG

TL;DR: Transformer-based spatiotemporal model (Ada-TransGNN) for air quality prediction that integrates global spatial semantics and temporal behavior using multi-head attention and graph convolutional networks with adaptive graph structure learning.

Details

Motivation: Address low prediction accuracy and slow real-time updates in existing air quality prediction models that lead to lagging results.

Method: Combines multi-head attention mechanism and graph convolutional network in spatiotemporal blocks, with adaptive graph structure learning module and auxiliary task learning module to capture spatial relationships and temporal dependencies.

Result: Outperforms existing state-of-the-art prediction models in both short-term and long-term predictions on benchmark and novel Mete-air datasets.

Conclusion: The proposed Ada-TransGNN model effectively captures spatiotemporal dependencies and improves air quality prediction accuracy through adaptive graph learning and integrated spatial-temporal feature extraction.

Abstract: Accurate air quality prediction is becoming increasingly important in the environmental field. To address issues such as low prediction accuracy and slow real-time updates in existing models, which lead to lagging prediction results, we propose a Transformer-based spatiotemporal data prediction method (Ada-TransGNN) that integrates global spatial semantics and temporal behavior. The model constructs an efficient and collaborative spatiotemporal block set comprising a multi-head attention mechanism and a graph convolutional network to extract dynamically changing spatiotemporal dependency features from complex air quality monitoring data. Considering the interaction relationships between different monitoring points, we propose an adaptive graph structure learning module, which combines spatiotemporal dependency features in a data-driven manner to learn the optimal graph structure, thereby more accurately capturing the spatial relationships between monitoring points. Additionally, we design an auxiliary task learning module that enhances the decoding capability of temporal relationships by integrating spatial context information into the optimal graph structure representation, effectively improving the accuracy of prediction results. We conducted comprehensive evaluations on a benchmark dataset and a novel dataset (Mete-air). The results demonstrate that our model outperforms existing state-of-the-art prediction models in short-term and long-term predictions.

[440] Generative Feature Imputing – A Technique for Error-resilient Semantic Communication

Jianhao Huang, Qunsong Zeng, Hongyang Du, Kaibin Huang

Main category: cs.LG

TL;DR: Proposes generative feature imputing framework for robust semantic communication, using spatial error concentration, diffusion-based reconstruction, and semantic-aware power allocation to improve transmission reliability.

Details

Motivation: Semantic communication faces challenges in ensuring robustness against transmission errors that distort semantically critical content when deployed over digital systems.

Method: Three key techniques: 1) Spatial error concentration packetization strategy, 2) Generative feature imputing using diffusion models to reconstruct missing features, 3) Semantic-aware power allocation for unequal error protection.

Result: Outperforms conventional approaches like DJSCC and JPEG2000 under block fading conditions, achieving higher semantic accuracy and lower LPIPS scores.

Conclusion: The proposed framework effectively addresses robustness challenges in semantic communication systems and demonstrates superior performance compared to existing methods.

Abstract: Semantic communication (SemCom) has emerged as a promising paradigm for achieving unprecedented communication efficiency in sixth-generation (6G) networks by leveraging artificial intelligence (AI) to extract and transmit the underlying meanings of source data. However, deploying SemCom over digital systems presents new challenges, particularly in ensuring robustness against transmission errors that may distort semantically critical content. To address this issue, this paper proposes a novel framework, termed generative feature imputing, which comprises three key techniques. First, we introduce a spatial error concentration packetization strategy that spatially concentrates feature distortions by encoding feature elements based on their channel mappings, a property crucial for both the effectiveness and reduced complexity of the subsequent techniques. Second, building on this strategy, we propose a generative feature imputing method that utilizes a diffusion model to efficiently reconstruct missing features caused by packet losses. Finally, we develop a semantic-aware power allocation scheme that enables unequal error protection by allocating transmission power according to the semantic importance of each packet. Experimental results demonstrate that the proposed framework outperforms conventional approaches, such as Deep Joint Source-Channel Coding (DJSCC) and JPEG2000, under block fading conditions, achieving higher semantic accuracy and lower Learned Perceptual Image Patch Similarity (LPIPS) scores.
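The unequal error protection idea can be illustrated with a minimal sketch. It assumes the simplest possible rule, splitting a fixed power budget in proportion to per-packet importance scores; the paper's actual allocation scheme is not given in this summary and likely also accounts for channel conditions. The function name and the example scores are hypothetical.

```python
def allocate_power(importance, total_power):
    """Split total_power across packets in proportion to importance scores.

    Proportional splitting is an assumption made for illustration only.
    """
    if total_power < 0 or any(w < 0 for w in importance):
        raise ValueError("power and importance must be non-negative")
    total = sum(importance)
    if total == 0:
        # No packet is marked important: fall back to equal protection.
        return [total_power / len(importance)] * len(importance)
    return [total_power * w / total for w in importance]


# Hypothetical importance scores for three packets and a budget of 10 units.
powers = allocate_power([0.5, 0.3, 0.2], total_power=10.0)
```

Packets carrying more semantically important features receive more power, so they are less likely to be lost under fading.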

[441] CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics

Weida Wang, Dongchen Huang, Jiatong Li, Tengchao Yang, Ziyang Zheng, Di Zhang, Dong Han, Benteng Chen, Binzhao Luo, Zhiyu Liu, Kunling Liu, Zhiyuan Gao, Shiqi Geng, Wei Ma, Jiaming Su, Xin Li, Shuchen Pu, Yuhan Shui, Qianjia Cheng, Zhihao Dou, Dongfei Cui, Changyong He, Jin Zeng, Zeke Xie, Mao Su, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang, Yunqi Cai, Xi Dai, Shufei Zhang, Lei Bai, Jinguang Cheng, Zhong Fang, Hongming Weng

Main category: cs.LG

TL;DR: CMPhysBench is a new benchmark with 520+ graduate-level condensed matter physics calculation problems to evaluate LLMs’ capabilities in this domain, using a novel SEED scoring system that provides fine-grained partial credit.

Details

Motivation: To assess Large Language Models' proficiency in Condensed Matter Physics, a practical and frontier domain where traditional physics benchmarks may not adequately measure specialized knowledge and problem-solving abilities.

Method: Created a benchmark with 520+ curated graduate-level calculation problems covering major subfields. Introduced Scalable Expression Edit Distance (SEED) score using tree-based expression representations for fine-grained partial credit assessment of model solutions.

Result: Even the best model (Grok-4) achieved an average SEED score of only 36 and 28% accuracy, revealing significant capability gaps in condensed matter physics problem-solving.

Conclusion: LLMs show substantial limitations in condensed matter physics despite advances, highlighting the need for specialized benchmarks and improved capabilities in this practical, frontier physics domain.

Abstract: We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in Condensed Matter Physics. CMPhysBench is composed of more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, strongly correlated systems, etc. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Meanwhile, leveraging tree-based representations of expressions, we introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of similarity between prediction and ground-truth. Our results show that even the best model, Grok-4, reaches only an average SEED score of 36 and 28% accuracy on CMPhysBench, underscoring a significant capability gap, especially for this practical and frontier domain relative to traditional physics. The code and dataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.
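To give a feel for non-binary partial credit on expressions, here is a rough stand-in, not the SEED score itself. It parses two arithmetic expressions with Python's `ast` module, serializes each tree in preorder, and converts a plain Levenshtein distance over those node labels into a 0 to 100 score. A true tree edit distance, and whatever normalization SEED uses, may behave differently; all function names are hypothetical.

```python
import ast


def preorder_labels(expr):
    """Parse an arithmetic expression and list its tree nodes in preorder."""
    labels = []

    def visit(node):
        if isinstance(node, ast.BinOp):
            labels.append(type(node.op).__name__)
            visit(node.left)
            visit(node.right)
        elif isinstance(node, ast.UnaryOp):
            labels.append(type(node.op).__name__)
            visit(node.operand)
        elif isinstance(node, ast.Name):
            labels.append(node.id)
        elif isinstance(node, ast.Constant):
            labels.append(repr(node.value))
        else:
            raise ValueError(f"unsupported syntax: {type(node).__name__}")

    visit(ast.parse(expr, mode="eval").body)
    return labels


def edit_distance(a, b):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def partial_credit(pred, truth):
    """Score in [0, 100]; 100 means the serialized trees are identical."""
    p, t = preorder_labels(pred), preorder_labels(truth)
    return 100.0 * (1 - edit_distance(p, t) / max(len(p), len(t)))


exact = partial_credit("a*b + c", "a*b + c")
near = partial_credit("a*b + c", "a*b - c")  # one operator differs
```

A prediction with a single wrong operator still earns most of the credit, which is the behaviour a binary right-or-wrong metric cannot express.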

cs.MA

[442] Consensus Is All You Need: Gossip-Based Reasoning Among Large Language Models

Saksham Arora

Main category: cs.MA

TL;DR: Gossip-based consensus method where multiple LLMs exchange answers and reasoning to reach collective decisions, improving accuracy and robustness over single models.

Details

Motivation: No single LLM excels in all areas - each has strengths and weaknesses. Instead of relying on one model, the paper takes inspiration from gossip protocols in distributed systems to leverage collective intelligence.

Method: Models act as nodes in a peer-to-peer network, exchanging answers and thought processes through gossip protocols until reaching consensus on solutions.

Result: The gossip-based consensus approach leads to robust, resilient, and accurate multi-agent AI reasoning, overcoming individual model weaknesses and leveraging collective strengths.

Conclusion: This collaborative approach mimics human consensus-building, making AI more trustworthy and collaborative rather than functioning as a black-box system.

Abstract: Large language models have advanced rapidly, but no single model excels in every area – each has its strengths and weaknesses. Instead of relying on one model alone, we take inspiration from gossip protocols in distributed systems, where information is exchanged with peers until they all come to an agreement. In this setup, models exchange answers and gradually work toward a shared solution. Each LLM acts as a node in a peer-to-peer network, sharing responses and thought processes to reach a collective decision. Our results show that this “gossip-based consensus” leads to robust, resilient, and accurate multi-agent AI reasoning. It helps overcome the weaknesses of individual models and brings out their collective strengths. This approach is similar to how humans build consensus, making AI seem more collaborative and trustworthy instead of just a black-box program.
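The gossip idea can be sketched as a small simulation. This is an assumption-heavy toy: each node's answer is a fixed string rather than an LLM output, no reasoning is exchanged, and the update rule (poll a few random peers, switch only when strictly outvoted) is chosen for illustration rather than taken from the paper. The function name and parameters are hypothetical.

```python
import random
from collections import Counter


def gossip_consensus(answers, fanout=2, max_rounds=50, seed=0):
    """Run majority-vote gossip; return (final_answers, rounds_used)."""
    rng = random.Random(seed)
    answers = list(answers)
    n = len(answers)
    for round_no in range(max_rounds):
        if len(set(answers)) == 1:
            return answers, round_no  # every node agrees
        updated = []
        for i in range(n):
            others = [j for j in range(n) if j != i]
            peers = rng.sample(others, min(fanout, len(others)))
            votes = Counter([answers[i]] + [answers[j] for j in peers])
            top, top_count = votes.most_common(1)[0]
            # Keep the node's own answer on ties; change only when outvoted.
            updated.append(top if top_count > votes[answers[i]] else answers[i])
        answers = updated
    return answers, max_rounds


# Five stubbed "models", one of which starts with a dissenting answer.
final, rounds = gossip_consensus(["A", "A", "B", "A", "A"])
```

In a real deployment each update step would re-query the model with its peers' answers and reasoning, so the cost per round is one model call per node.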

[443] Murakkab: Resource-Efficient Agentic Workflow Orchestration in Cloud Platforms

Gohar Irfan Chaudhry, Esha Choukse, Haoran Qiu, Íñigo Goiri, Rodrigo Fonseca, Adam Belay, Ricardo Bianchini

Main category: cs.MA

TL;DR: Murakkab is a resource-efficient serving system for agentic workflows that decouples workflow specification from execution configuration, enabling cross-layer optimization to reduce GPU usage, energy consumption, and cost while maintaining service-level objectives.

Details

Motivation: Current frameworks for serving agentic workflows are inefficient because they treat workflows as opaque sequences of model and tool calls, tightly coupling agent logic with model and hardware choices, leading to resource waste and degraded performance.

Method: Murakkab introduces a declarative abstraction that separates workflow specification from execution configuration, uses a profile-guided optimizer and adaptive runtime to manage the full stack, orchestrates workflow components, maps them to models and hardware, and dynamically reconfigures execution to meet SLOs.

Result: Evaluation shows Murakkab reduces GPU usage by up to 2.8×, energy consumption by 3.7×, and cost by 4.3× while maintaining service-level objectives across diverse workflows.

Conclusion: Murakkab successfully addresses the inefficiencies in serving agentic workflows by exposing internal structure for cross-layer optimization, achieving significant resource savings without compromising performance objectives.

Abstract: Agentic workflows commonly coordinate multiple models and tools with complex control logic. They are quickly becoming the dominant paradigm for AI applications. However, serving them remains inefficient with today’s frameworks. The key problem is that they expose workflows as opaque sequences of model and tool calls that tightly couple agent logic with model and hardware choices. Often, these workflow components are fragmented across different entities, preventing systems from reasoning about trade-offs across accuracy, latency, energy, and cost. This leads to resource waste and degraded service-level objectives (SLOs). We present Murakkab, a resource-efficient serving system for agentic workflows. Murakkab introduces a declarative abstraction that decouples workflow specification from execution configuration. A profile-guided optimizer and adaptive runtime jointly manage the full stack: orchestrating workflow components, mapping them to models and hardware, and dynamically reconfiguring execution to satisfy user-defined SLOs. By exposing the internal structure of agentic workflows, Murakkab enables cross-layer optimization that existing frameworks and cloud schedulers cannot achieve. Our evaluation on diverse workflows shows that Murakkab reduces GPU usage by up to 2.8×, energy consumption by 3.7×, and cost by 4.3× while maintaining SLOs.
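The separation between what a workflow does and how it runs can be sketched as follows. This is a hypothetical illustration, not Murakkab's API: the workflow is a list of step names with no model or hardware attached, and a toy optimizer picks, per step, the cheapest profiled configuration that meets a latency SLO. The `Profile` fields, model names, and numbers are all invented, and applying the SLO per step is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """One measured (model, hardware) option for a workflow step."""
    model: str
    hardware: str
    latency_s: float
    cost_usd: float


def pick_config(profiles, latency_slo_s):
    """Return the cheapest profiled configuration that meets the latency SLO."""
    feasible = [p for p in profiles if p.latency_s <= latency_slo_s]
    if not feasible:
        raise ValueError("no profiled configuration satisfies the SLO")
    return min(feasible, key=lambda p: p.cost_usd)


# Declarative part: the steps, with no model or hardware named.
workflow = ["transcribe", "summarize"]

# Execution part: hypothetical profiling results, kept separate from the spec.
profiles = {
    "transcribe": [Profile("asr-small", "cpu", 4.0, 0.01),
                   Profile("asr-large", "gpu", 1.0, 0.05)],
    "summarize": [Profile("llm-small", "gpu", 2.0, 0.02),
                  Profile("llm-large", "gpu", 3.0, 0.10)],
}

plan = {step: pick_config(profiles[step], latency_slo_s=2.5) for step in workflow}
```

Because the specification never names a model, the plan can be recomputed when the SLO or the profiles change without touching the workflow itself.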

Ryan Hare, Ying Tang

Main category: cs.MA

TL;DR: A neuro-symbolic multi-agent framework combining RL-based tutor and LLM-powered peer agents to support learner-centered education through structured scaffolding and social interaction.

Details

Motivation: To empower students to take ownership of their learning by addressing challenges in goal-setting, progress tracking, and strategy adaptation through AI-powered digital learning environments.

Method: A multi-agent neuro-symbolic framework with specialized pedagogical roles: RL-based tutor agent for non-verbal scaffolding and LLM-powered peer agent for social learning dimensions, unified through a central educational ontology.

Result: The framework demonstrated adaptability across domains through case studies in both college-level and middle school settings, showing successful implementation of combined authoritative and social learning support.

Conclusion: The unified multi-agent approach provides a transformative opportunity for AI-driven learning environments, with future directions focusing on advancing scalable, cross-domain learning support through neuro-symbolic AI systems.

Abstract: One of the enduring challenges in education is how to empower students to take ownership of their learning by setting meaningful goals, tracking their progress, and adapting their strategies when faced with setbacks. Research has shown that this form of learner-centered learning is best cultivated through structured, supportive environments that promote guided practice, scaffolded inquiry, and collaborative dialogue. In response, educational efforts have increasingly embraced artificial-intelligence (AI)-powered digital learning environments, ranging from educational apps and virtual labs to serious games. Recent advances in large language models (LLMs) and neuro-symbolic systems, meanwhile, offer a transformative opportunity to reimagine how support is delivered in digital learning environments. LLMs are enabling socially interactive learning experiences and scalable, cross-domain learning support that can adapt instructional strategies across varied subjects and contexts. In parallel, neuro-symbolic AI provides new avenues for designing these agents that are not only adaptive but also scalable across domains. Based on these remarks, this paper presents a multi-agent, neuro-symbolic framework designed to resolve the aforementioned challenges. The framework assigns distinct pedagogical roles to specialized agents: an RL-based ‘tutor’ agent provides authoritative, non-verbal scaffolding, while a proactive, LLM-powered ‘peer’ agent facilitates the social dimensions of learning. While prior work has explored such agents in isolation, our framework’s novelty lies in unifying them through a central educational ontology. Through case studies in both college-level and middle school settings, we demonstrate the framework’s adaptability across domains. We conclude by outlining key insights and future directions for advancing AI-driven learning environments.

[445] Skill-Aligned Fairness in Multi-Agent Learning for Collaboration in Healthcare

Promise Osaine Ekpo, Brian La, Thomas Wiener, Saesha Agarwal, Arshia Agrawal, Gonzalo Gonzalez-Pumariega, Lekan P. Molu, Angelique Taylor

Main category: cs.MA

TL;DR: FairSkillMARL framework combines workload balance and skill-task alignment for fairness in healthcare MARL, with experiments showing equal workload alone causes task-skill mismatches.

Details

Motivation: Current MARL fairness approaches focus only on workload balance, ignoring agent expertise and structured coordination needed in real-world domains like healthcare where equitable task allocation must prevent burnout and optimize skilled agent usage.

Method: Proposed FairSkillMARL framework defining fairness as dual objective of workload balance and skill-task alignment, and created MARLHospital environment for modeling team compositions and energy-constrained scheduling impacts.

Result: Experiments comparing FairSkillMARL with four standard MARL methods and two state-of-the-art fairness metrics showed that fairness based solely on equal workload leads to task-skill mismatches.

Conclusion: The work provides tools and foundation for studying fairness in heterogeneous multi-agent systems where aligning effort with expertise is critical, highlighting need for more robust metrics that capture skill-task misalignment.

Abstract: Fairness in multi-agent reinforcement learning (MARL) is often framed as a workload balance problem, overlooking agent expertise and the structured coordination required in real-world domains. In healthcare, equitable task allocation requires workload balance or expertise alignment to prevent burnout and overuse of highly skilled agents. Workload balance refers to distributing an approximately equal number of subtasks or equalised effort across healthcare workers, regardless of their expertise. We make two contributions to address this problem. First, we propose FairSkillMARL, a framework that defines fairness as the dual objective of workload balance and skill-task alignment. Second, we introduce MARLHospital, a customizable healthcare-inspired environment for modeling team compositions and energy-constrained scheduling impacts on fairness, as no existing simulators are well-suited for this problem. We conducted experiments to compare FairSkillMARL in conjunction with four standard MARL methods, and against two state-of-the-art fairness metrics. Our results suggest that fairness based solely on equal workload might lead to task-skill mismatches and highlight the need for more robust metrics that capture skill-task misalignment. Our work provides tools and a foundation for studying fairness in heterogeneous multi-agent systems where aligning effort with expertise is critical.
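The claim that equal workload can hide task-skill mismatches is easy to reproduce with two toy metrics. The sketch below uses Jain's fairness index for workload balance and a simple matched-task fraction for alignment. Both are stand-ins chosen for illustration; the summary does not say which metrics FairSkillMARL uses, and the agents, skills, and assignments are invented.

```python
def jain_index(loads):
    """Jain's fairness index: 1.0 when all loads are equal, 1/n at worst."""
    total = sum(loads)
    squares = sum(x * x for x in loads)
    return 1.0 if squares == 0 else total * total / (len(loads) * squares)


def skill_alignment(assignments, skills):
    """Fraction of tasks given to an agent whose skill set covers the task."""
    matched = sum(1 for agent, task in assignments if task in skills[agent])
    return matched / len(assignments)


# Hypothetical two-agent team and a schedule with equal task counts.
skills = {"nurse": {"triage", "vitals"}, "surgeon": {"surgery"}}
assignments = [("nurse", "surgery"), ("nurse", "vitals"),
               ("surgeon", "triage"), ("surgeon", "surgery")]

loads = [sum(1 for agent, _ in assignments if agent == name) for name in skills]
balance = jain_index(loads)                       # workload term
alignment = skill_alignment(assignments, skills)  # skill-task term
```

Here the workload is perfectly balanced while half of the tasks go to an agent without the matching skill, which is the kind of case a workload-only fairness metric would score as ideal.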

[446] Optimizing Highway Traffic Flow in Mixed Autonomy: A Multiagent Truncated Rollout Approach

Lu Liu, Chi Xie, Xi Xiong

Main category: cs.MA

TL;DR: Multiagent truncated rollout approach for CAV speed coordination in mixed autonomy traffic to improve highway throughput while reducing computational overhead.

Details

Motivation: Address challenges in CAV coordination with human-driven vehicles due to heterogeneous driving behaviors in mixed autonomy environments.

Method: Formulates traffic density evolution equation, establishes distributed coordination framework, uses neighbor kinematic information with sequential solution mechanism, and introduces truncated rollout scheme to shorten optimization horizon adaptively.

Result: Outperforms conventional MPC methods by reducing average travel time in bottleneck areas and overall computational time in large-scale mixed traffic simulations.

Conclusion: The method provides efficient real-time CAV coordination with theoretical stability guarantees and strong potential for practical deployment in mixed autonomy traffic systems.

Abstract: The development of connected and autonomous vehicles (CAVs) offers substantial opportunities to enhance traffic efficiency. However, in mixed autonomy environments where CAVs coexist with human-driven vehicles (HDVs), achieving efficient coordination among CAVs remains challenging due to heterogeneous driving behaviors. To address this, this paper proposes a multiagent truncated rollout approach that enhances CAV speed coordination to improve highway throughput while reducing computational overhead. In this approach, a traffic density evolution equation is formulated that comprehensively accounts for the presence or absence of CAVs, and a distributed coordination control framework is established accordingly. By incorporating kinematic information from neighbor agents and employing an agent-by-agent sequential solution mechanism, our method enables explicit cooperation among CAVs. Furthermore, we introduce a truncated rollout scheme that adaptively shortens the optimization horizon based on the evaluation of control sequences. This significantly reduces the time complexity, thereby improving real-time performance and scalability. Theoretical analysis provides rigorous guarantees on the stability and performance improvement of the system. Simulations conducted on real-world bottleneck scenarios demonstrate that, in large-scale mixed traffic flows, the proposed method outperforms conventional model predictive control methods by reducing both the average travel time in the bottleneck area and overall computational time, highlighting its strong potential for practical deployment.

[447] Safe Multiagent Coordination via Entropic Exploration

Ayhan Alp Aydeniz, Enrico Marchesini, Robert Loftin, Christopher Amato, Kagan Tumer

Main category: cs.MA

TL;DR: E2C uses entropic exploration with team constraints to improve safety and performance in multiagent reinforcement learning, reducing unsafe behaviors by up to 50% while maintaining task effectiveness.

Details

Motivation: Real-world multiagent learning often involves safety concerns, but existing safe RL algorithms limit exploration needed for discovering cooperative behaviors. Current multiagent approaches use individual constraints rather than joint team constraints.

Method: Proposes E2C (entropic exploration for constrained multiagent RL) that leverages observation entropy maximization to incentivize exploration while learning safe cooperative behaviors under team constraints.

Result: Experiments show E2C agents match or surpass both unconstrained and constrained baselines in task performance while reducing unsafe behaviors by up to 50% across increasingly complex domains.

Conclusion: Team constraints combined with entropic exploration effectively address the exploration-safety tradeoff in multiagent RL, enabling safer and more effective cooperative behaviors.

Abstract: Many real-world multiagent learning problems involve safety concerns. In these setups, typical safe reinforcement learning algorithms constrain agents' behavior, limiting exploration – a crucial component for discovering effective cooperative multiagent behaviors. Moreover, the multiagent literature typically models individual constraints for each agent and has yet to investigate the benefits of using joint team constraints. In this work, we analyze these team constraints from a theoretical and practical perspective and propose entropic exploration for constrained multiagent reinforcement learning (E2C) to address the exploration issue. E2C leverages observation entropy maximization to incentivize exploration and facilitate learning safe and effective cooperative behaviors. Experiments across increasingly complex domains show that E2C agents match or surpass common unconstrained and constrained baselines in task performance while reducing unsafe behaviors by up to 50%.
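Observation entropy can be made concrete with a count-based estimate. The sketch below computes the Shannon entropy of the empirical distribution over discrete observations; a higher value means the agents have spread their visits more evenly. This is only an illustrative estimator for discrete observations, and how E2C estimates or maximizes entropy is not described in this summary. The function name and sample data are hypothetical.

```python
import math
from collections import Counter


def observation_entropy(observations):
    """Shannon entropy (in nats) of the empirical observation distribution."""
    counts = Counter(observations)
    n = len(observations)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# An agent that keeps revisiting one cell versus one that covers four cells.
narrow = observation_entropy(["cell_0"] * 4)
wide = observation_entropy(["cell_0", "cell_1", "cell_2", "cell_3"])
```

Adding such a term to the learning objective rewards visiting under-explored observations, while the safety constraints are handled separately.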

[448] PE-MA: Parameter-Efficient Co-Evolution of Multi-Agent Systems

Yingfan Deng, Anhao Zhou, Yuan Yuan, Xiao Zhang, Yifei Zou, Dongxiao Yu

Main category: cs.MA

TL;DR: PE-MA framework enables efficient multi-agent collaboration with personalized adapters and shared optimization, achieving optimal convergence rates.

Details

Motivation: Address challenges in multi-agent systems including high communication overhead and insufficient agent-level personalization for collaborative learning.

Method: Each agent maintains lightweight personalized adapter for specific behavior, while shared adapter is collaboratively optimized across neighboring agents.

Result: Achieves asymptotically optimal convergence rate of O(1/(NK)^(1/2)) where N is number of agents and K is local update steps.

Conclusion: PE-MA provides efficient, scalable, and personalized co-evolution framework that balances global coordination with local adaptation in heterogeneous environments.

Abstract: Multi-Agent Systems have recently emerged as a promising paradigm for collaborative reasoning and solving complex tasks. However, the design of collaborative learning algorithms in multi-agent systems faces several challenges, including high communication overhead and insufficient agent-level personalization. In this paper, we propose PE-MA (Parameter-Efficient Multi-Agent Co-Evolution), a novel collaboration framework that supports efficient, scalable, and personalized co-evolution in multi-agent systems. In PE-MA, each agent maintains a lightweight personalized adapter to support agent-specific behavior, while a shared adapter is collaboratively optimized across neighboring agents. This design balances global coordination with local adaptation under heterogeneous environments. We achieve an asymptotically optimal convergence rate of O(1/(NK)^(1/2)), where N is the number of agents and K is the number of local update steps.
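The split between shared and personalized adapters can be sketched with plain lists standing in for adapter weights. The sketch assumes the shared adapter is combined by simple averaging with each agent's neighbors, which is one common choice and not necessarily the update PE-MA uses; the personalized adapters are never communicated. All names and values are hypothetical.

```python
def mix_shared_adapters(shared, neighbors):
    """Average each agent's shared adapter with those of its neighbors."""
    mixed = []
    for i, own in enumerate(shared):
        group = [own] + [shared[j] for j in neighbors[i]]
        mixed.append([sum(vals) / len(group) for vals in zip(*group)])
    return mixed


# Three agents: a two-parameter shared adapter and a one-parameter personal one.
shared = [[0.0, 3.0], [3.0, 0.0], [6.0, 6.0]]
personal = [[1.0], [2.0], [3.0]]  # stays local, never exchanged
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

shared = mix_shared_adapters(shared, neighbors)
```

Only the small shared adapter crosses the network, which is where the communication savings come from, and each agent keeps its own personalized parameters for local behavior.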

cs.MM

[449] adder-viz: Real-Time Visualization Software for Transcoding Event Video

Andrew C. Freeman, Luke Reinkensmeyer

Main category: cs.MM

TL;DR: This paper presents improvements to adder-viz software for visualizing real-time event video transcoding processes, building on the ADDER representation to address limitations in flexibility, speed, and compressibility of event video formats.

Details

Motivation: Existing event video representations have shown limitations in flexibility, speed, and compressibility. The authors previously proposed the ADDER representation to address these concerns, and this work focuses on improving visualization tools for real-time event transcoding processes.

Method: The paper introduces numerous improvements to the adder-viz software for visualizing real-time event transcode processes and applications in-the-loop. The software is MIT-licensed and available from a centralized repository.

Result: The improved adder-viz software enables better visualization of real-time event video transcoding, supporting the ADDER representation which aims to overcome limitations of traditional event camera representations.

Conclusion: The enhanced visualization tools provide better support for working with the ADDER event video representation, facilitating real-time transcoding processes and applications in computer vision research involving neuromorphic event cameras.

Abstract: Recent years have brought about a surge in neuromorphic “event” video research, primarily targeting computer vision applications. Event video eschews video frames in favor of asynchronous, per-pixel intensity samples. While much work has focused on a handful of representations for specific event cameras, these representations have shown limitations in flexibility, speed, and compressibility. We previously proposed the unified ADDER representation to address these concerns. This paper introduces numerous improvements to the adder-viz software for visualizing real-time event transcode processes and applications in-the-loop. The MIT-licensed software is available from a centralized repository at https://github.com/ac-freeman/adder-codec-rs.

eess.AS

[450] Toward Responsible ASR for African American English Speakers: A Scoping Review of Bias and Equity in Speech Technology

Jay L. Cunningham, Adinawa Adjagbodjou, Jeffrey Basoah, Jainaba Jawara, Kowe Kadoma, Aaleyah Lewis

Main category: eess.AS

TL;DR: This scoping review analyzes 44 papers on fairness, bias, and equity in ASR systems for African American English speakers, identifying gaps in governance approaches and proposing a governance-centered lifecycle framework.

Details

Motivation: To examine how fairness and equity are conceptualized in ASR technologies for linguistically diverse communities, particularly African American English speakers, and identify gaps in current approaches.

Method: Scoping literature review of 44 peer-reviewed publications across HCI, ML/NLP, and Sociolinguistics, analyzing four major areas: understanding ASR-related harms, inclusive data practices, methodological approaches, and design recommendations.

Result: Identified that while technical fairness interventions are growing, there’s a critical gap in governance-centered approaches that emphasize community agency, linguistic justice, and participatory accountability.

Conclusion: Proposes a governance-centered ASR lifecycle framework for responsible development and provides implications for addressing language marginalization in speech AI systems through interdisciplinary collaboration.

Abstract: This scoping literature review examines how fairness, bias, and equity are conceptualized and operationalized in Automatic Speech Recognition (ASR) and adjacent speech and language technologies (SLT) for African American English (AAE) speakers and other linguistically diverse communities. Drawing from 44 peer-reviewed publications across Human-Computer Interaction (HCI), Machine Learning/Natural Language Processing (ML/NLP), and Sociolinguistics, we identify four major areas of inquiry: (1) how researchers understand ASR-related harms; (2) inclusive data practices spanning collection, curation, annotation, and model training; (3) methodological and theoretical approaches to linguistic inclusion; and (4) emerging practices and design recommendations for more equitable systems. While technical fairness interventions are growing, our review highlights a critical gap in governance-centered approaches that foreground community agency, linguistic justice, and participatory accountability. We propose a governance-centered ASR lifecycle as an emergent interdisciplinary framework for responsible ASR development and offer implications for researchers, practitioners, and policymakers seeking to address language marginalization in speech AI systems.

[451] EAI-Avatar: Emotion-Aware Interactive Talking Head Generation

Haijie Yang, Zhenyu Zhang, Hao Tang, Jianjun Qian, Jian Yang

Main category: eess.AS

TL;DR: EAI-Avatar is an emotion-aware talking head generation framework for bidirectional conversations that uses LLMs for dialogue generation and Transformer-based mask generation for consistent motion, with an interactive talking tree structure for emotional state transitions.

Details

Motivation: Existing talking head generation methods focus on one-way animation and lack precise emotion-adaptive capabilities for bidirectional conversational interactions, limiting practical applicability.

Method: Uses LLMs (e.g., GPT-4) for dialogue generation, Transformer-based head mask generator for consistent motion features in latent space, and interactive talking tree structure with reverse-level traversal to extract historical emotional cues for expression synthesis.

Result: The method produces temporally consistent virtual avatars with rich emotional variations that seamlessly transition between speaking and listening states.

Conclusion: Extensive experiments demonstrate superior performance and effectiveness of the proposed EAI-Avatar framework for emotion-aware dyadic interaction generation.

Abstract: Generative models have advanced rapidly, enabling impressive talking head generation that brings AI to life. However, most existing methods focus solely on one-way portrait animation. Even the few that support bidirectional conversational interactions lack precise emotion-adaptive capabilities, significantly limiting their practical applicability. In this paper, we propose EAI-Avatar, a novel emotion-aware talking head generation framework for dyadic interactions. Leveraging the dialogue generation capability of large language models (LLMs, e.g., GPT-4), our method produces temporally consistent virtual avatars with rich emotional variations that seamlessly transition between speaking and listening states. Specifically, we design a Transformer-based head mask generator that learns temporally consistent motion features in a latent mask space, capable of generating arbitrary-length, temporally consistent mask sequences to constrain head motions. Furthermore, we introduce an interactive talking tree structure to represent dialogue state transitions, where each tree node contains information such as child/parent/sibling nodes and the current character’s emotional state. By performing reverse-level traversal, we extract rich historical emotional cues from the current node to guide expression synthesis. Extensive experiments demonstrate the superior performance and effectiveness of our method.

[452] On the Application of Diffusion Models for Simultaneous Denoising and Dereverberation

Adrian Meise, Tobias Cord-Landwehr, Reinhold Haeb-Umbach

Main category: eess.AS

TL;DR: Study compares different diffusion model approaches for simultaneous denoising and dereverberation of speech, finding cascaded models work best when applied in order of dominant distortion, while single models trained on mixed distortion data provide the best compromise.

Details

Motivation: Diffusion models have shown promise for speech enhancement but their capability for simultaneous denoising and dereverberation - the most common practical scenario - has not been well studied.

Method: Examined cascaded application of models (each trained on only one distortion) vs single models trained on: 1) both noisy and reverberated data, or 2) mixed data subsets (purely noisy, purely reverberated, and noisy reverberant speech). Tests on artificial and real recordings.

Result: Cascaded models only achieve satisfactory results when applied in order of dominating distortion. For single models, the best compromise is training on the three subsets of degraded speech data.

Conclusion: For practical applications requiring handling of both noise and reverberation, a single diffusion model trained on mixed distortion data provides the optimal balance of performance across different degradation scenarios.

Abstract: Diffusion models have been shown to achieve natural-sounding enhancement of speech degraded by noise or reverberation. However, their simultaneous denoising and dereverberation capability has so far not been studied much, although this is arguably the most common scenario in a practical application. In this work, we investigate different approaches to enhance noisy and/or reverberant speech. We examine the cascaded application of models, each trained on only one of the distortions, and compare it with a single model, trained either solely on data that is both noisy and reverberated, or trained on data comprising subsets of purely noisy, of purely reverberated, and of noisy reverberant speech. Tests are performed both on artificially generated and real recordings of noisy and/or reverberant data. The results show that, when using the cascade of models, satisfactory results are only achieved if they are applied in the order of the dominating distortion. If only a single model is desired that can operate on all distortion scenarios, the best compromise appears to be a model trained on the aforementioned three subsets of degraded speech data.

[453] A Framework for Robust Speaker Verification in Highly Noisy Environments Leveraging Both Noisy and Enhanced Audio

Adam Katav, Yair Moshe, Israel Cohen

Main category: eess.AS

TL;DR: A novel Siamese neural network framework that combines speaker embeddings from both noisy and enhanced speech to improve speaker verification robustness in noisy environments, without distorting speaker-specific information.

Details

Motivation: Speech enhancement methods using generative DNNs can improve audio quality but often distort speaker-specific characteristics, leading to degraded speaker verification performance in challenging acoustic environments.

Method: Proposes a lightweight Siamese architecture that extracts and combines speaker embeddings from both noisy and enhanced speech, leveraging complementary information from both sources to enhance verification robustness.

Result: Experimental results demonstrate superior performance of the proposed framework in maintaining speaker verification accuracy under severe noise conditions.

Conclusion: The proposed framework effectively addresses the trade-off between speech enhancement and speaker verification, providing a robust solution that works with various state-of-the-art techniques without modification.

Abstract: Recent advancements in speaker verification techniques show promise, but their performance often deteriorates significantly in challenging acoustic environments. Although speech enhancement methods can improve perceived audio quality, they may unintentionally distort speaker-specific information, which can affect verification accuracy. This problem has become more noticeable with the increasing use of generative deep neural networks (DNNs) for speech enhancement. While these networks can produce intelligible speech even in conditions of very low signal-to-noise ratio (SNR), they may also severely alter distinctive speaker characteristics. To tackle this issue, we propose a novel neural network framework that effectively combines speaker embeddings extracted from both noisy and enhanced speech using a Siamese architecture. This architecture allows us to leverage complementary information from both sources, enhancing the robustness of speaker verification under severe noise conditions. Our framework is lightweight and agnostic to specific speaker verification and speech enhancement techniques, enabling the use of a wide range of state-of-the-art solutions without modification. Experimental results demonstrate the superior performance of our proposed framework.

[454] MOSA: Mixtures of Simple Adapters Outperform Monolithic Approaches in LLM-based Multilingual ASR

Junjie Li, Jing Peng, Yangui Fang, Shuai Wang, Kai Yu

Main category: eess.AS

TL;DR: MOSA (Mixture of Simple Adapters) improves multilingual ASR by using lightweight adapters with Mixture-of-Experts mechanism to better share cross-lingual knowledge and handle data scarcity.

Details

Motivation: Traditional multilingual ASR faces data scarcity issues, and existing LLM-based approaches with single complex projectors struggle to effectively capture both shared and language-specific features across languages.

Method: Proposes MOSA framework that leverages Mixture-of-Experts mechanism to combine lightweight adapters that separately learn shared and language-specific linguistic knowledge, enabling better knowledge transfer from high-resource to low-resource languages.

Result: MOSA-Base achieves 15.4% relative reduction in average WER compared to baseline, outperforms across all languages, and works effectively even with only 60% of baseline parameters. MOSA-Large shows better robustness to data imbalance.

Conclusion: A mixture of simple adapters is more effective than a single complex adapter design for LLM-based ASR, enabling better cross-lingual knowledge sharing and handling of data scarcity issues.

Abstract: End-to-end multilingual ASR aims to transcribe speech from different languages into corresponding text, but is often limited by scarce multilingual data. LLM-based ASR aligns speech encoder outputs with LLM input space via a projector and has achieved notable success. However, prior work mainly improves performance by increasing data, with little focus on cross-lingual knowledge sharing. Moreover, a single complex projector struggles to capture both shared and language-specific features effectively. In this work, we propose MOSA (Mixture of Simple Adapters), leveraging a Mixture-of-Experts mechanism to combine lightweight adapters that learn shared and language-specific knowledge. This enables better utilization of high-resource language data to support low-resource languages, mitigating data scarcity issues. Experimental results show that MOSA-Base achieves a 15.4% relative reduction in average WER compared to the Baseline-Base and consistently outperforms it across all languages. Remarkably, MOSA-Base surpasses the Baseline-Base even when trained with only 60% of its parameters. Similarly, MOSA-Large outperforms the Baseline-Large in average WER and demonstrates greater robustness to data imbalance. Ablation studies further indicate that MOSA is more effective at handling individual languages and learning both language-specific and shared linguistic knowledge. These findings support that, in LLM-based ASR, a mixture of simple adapters is more effective than a single, complex adapter design.
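
To make the mixture-of-adapters idea concrete, here is a minimal sketch in plain Python. It assumes each adapter is a single linear map and that a dense softmax gate weights their outputs; the abstract does not specify MOSA's adapter or routing design, so `mixture_of_adapters`, the gate matrix and the toy experts are illustrative assumptions.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows of weights) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def softmax(z: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def mixture_of_adapters(x: Vector, experts: List[Matrix], gate: Matrix) -> Vector:
    """Combine the outputs of simple linear adapters with softmax gate weights."""
    weights = softmax(matvec(gate, x))           # one weight per adapter
    outputs = [matvec(w, x) for w in experts]    # each adapter projects x
    dim = len(outputs[0])
    return [sum(wk * out[i] for wk, out in zip(weights, outputs)) for i in range(dim)]


# Toy example: two 2x2 adapters and a gate that scores them equally.
experts = [
    [[1.0, 0.0], [0.0, 1.0]],   # hypothetical "shared" adapter
    [[2.0, 0.0], [0.0, 2.0]],   # hypothetical "language-specific" adapter
]
gate = [[0.0, 0.0], [0.0, 0.0]]
projected = mixture_of_adapters([1.0, 2.0], experts, gate)
```

With an all-zero gate, both adapters receive equal weight, so the output is the average of the two projections.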

[455] CLEAR: Continuous Latent Autoregressive Modeling for High-quality and Low-latency Speech Synthesis

Chun Yat Wu, Jiajun Deng, Guinan Li, Qiuqiang Kong, Simon Lui

Main category: eess.AS

TL;DR: CLEAR is a zero-shot TTS framework that directly models continuous audio representations instead of discrete tokens, achieving high-quality speech synthesis with low latency and competitive performance.

Details

Motivation: Conventional AR-based TTS systems using discrete audio tokens suffer from lossy compression during tokenization, requiring longer sequences that increase inference latency and complicate AR modeling.

Method: Proposes Continuous Latent Autoregressive model (CLEAR) with enhanced variational autoencoder with shortcut connections for high compression ratio, and lightweight MLP-based rectified flow head to model continuous latent probability distribution in a single-stage framework.

Result: Achieves SOTA results on LibriSpeech test-clean with 1.88% word error rate and 0.29 RTF, enables streaming synthesis with 96ms first-frame delay while maintaining high-quality speech.

Conclusion: CLEAR provides a unified zero-shot TTS framework that directly models continuous audio representations, delivering competitive performance in robustness, speaker similarity and naturalness with lower latency compared to SOTA models.

Abstract: Autoregressive (AR) language models have emerged as powerful solutions for zero-shot text-to-speech (TTS) synthesis, capable of generating natural speech from a few seconds of audio prompts. However, conventional AR-based TTS systems relying on discrete audio tokens face the challenge of lossy compression during tokenization, requiring longer discrete token sequences to capture the same information as continuous ones, which adds inference latency and complicates AR modeling. To address this challenge, this paper proposes the Continuous Latent Autoregressive model (CLEAR), a unified zero-shot TTS framework that directly models continuous audio representations. More specifically, CLEAR introduces an enhanced variational autoencoder with shortcut connections, which achieves a high compression ratio to map waveforms into compact continuous latents. A lightweight MLP-based rectified flow head that operates independently for each hidden state is presented to model the continuous latent probability distribution, and trained jointly with the AR model within a single-stage framework. Experiments show that the proposed zero-shot CLEAR TTS can synthesize high-quality speech with low latency. Compared to state-of-the-art (SOTA) TTS models, CLEAR delivers competitive performance in robustness, speaker similarity and naturalness, while offering a lower real-time factor (RTF). In particular, CLEAR achieves SOTA results on the LibriSpeech test-clean dataset, with a word error rate of 1.88% and an RTF of 0.29. Moreover, CLEAR facilitates streaming speech synthesis with a first-frame delay of 96ms, while maintaining high-quality speech synthesis.
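
For readers unfamiliar with the metric, the real-time factor is the ratio of synthesis time to the duration of the generated audio, so values below 1 mean faster-than-real-time synthesis. The helper names below are hypothetical and the 10-second utterance is an illustrative input, not a figure from the paper.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(rtf: float, audio_seconds: float) -> float:
    """Time needed to synthesize audio_seconds of speech at a given RTF."""
    return rtf * audio_seconds


# At the reported RTF of 0.29, a 10-second utterance needs about 2.9 seconds.
needed = synthesis_time(0.29, 10.0)
```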

[456] MDD: a Mask Diffusion Detector to Protect Speaker Verification Systems from Adversarial Perturbations

Yibo Bai, Sizhou Chen, Michele Panariello, Xiao-Lei Zhang, Massimiliano Todisco, Nicholas Evans

Main category: eess.AS

TL;DR: MDD is a novel adversarial detection and purification framework for speaker verification using text-conditioned masked diffusion models, achieving state-of-the-art performance without requiring adversarial examples or large pretraining.

Details

Motivation: Speaker verification systems are vulnerable to adversarial attacks in security-sensitive applications, requiring robust detection and purification methods.

Method: Uses text-conditioned masked diffusion model that applies partial masking to Mel-spectrograms, adds noise through forward diffusion, and reconstructs clean speech conditioned on input transcription.

Result: Achieves strong adversarial detection performance, outperforms prior state-of-the-art methods, and effectively purifies adversarial speech to restore verification performance close to clean conditions.

Conclusion: Demonstrates the potential of diffusion-based masking strategies for building secure and reliable speaker verification systems.

Abstract: Speaker verification systems are increasingly deployed in security-sensitive applications but remain highly vulnerable to adversarial perturbations. In this work, we propose the Mask Diffusion Detector (MDD), a novel adversarial detection and purification framework based on a \textit{text-conditioned masked diffusion model}. During training, MDD applies partial masking to Mel-spectrograms and progressively adds noise through a forward diffusion process, simulating the degradation of clean speech features. A reverse process then reconstructs the clean representation conditioned on the input transcription. Unlike prior approaches, MDD does not require adversarial examples or large-scale pretraining. Experimental results show that MDD achieves strong adversarial detection performance and outperforms prior state-of-the-art methods, including both diffusion-based and neural codec-based approaches. Furthermore, MDD effectively purifies adversarially-manipulated speech, restoring speaker verification performance to levels close to those observed under clean conditions. These findings demonstrate the potential of diffusion-based masking strategies for secure and reliable speaker verification systems.
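
The training-time corruption described above (partial masking followed by forward diffusion) might look roughly like the sketch below. It assumes the standard DDPM forward process, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, a contiguous block of masked time frames, and a linear noise schedule; none of these details are stated in the abstract, and all function names are hypothetical.

```python
import math
import random
from typing import List

Spectrogram = List[List[float]]  # indexed as [mel_bin][time_frame]


def mask_frames(mel: Spectrogram, start: int, width: int, fill: float = 0.0) -> Spectrogram:
    """Replace a contiguous block of time frames with a fill value."""
    return [
        [fill if start <= t < start + width else v for t, v in enumerate(row)]
        for row in mel
    ]


def alpha_bar(t: int, betas: List[float]) -> float:
    """Cumulative product of (1 - beta) up to and including step t."""
    prod = 1.0
    for beta in betas[: t + 1]:
        prod *= 1.0 - beta
    return prod


def forward_diffuse(mel: Spectrogram, t: int, betas: List[float], rng: random.Random) -> Spectrogram:
    """Sample x_t from x_0 in one shot using the closed-form forward process."""
    a = alpha_bar(t, betas)
    signal, noise = math.sqrt(a), math.sqrt(1.0 - a)
    return [[signal * v + noise * rng.gauss(0.0, 1.0) for v in row] for row in mel]


# Toy 4-bin, 6-frame "spectrogram" and a 10-step linear noise schedule.
mel = [[1.0] * 6 for _ in range(4)]
betas = [1e-4 + i * (0.02 - 1e-4) / 9 for i in range(10)]
masked = mask_frames(mel, start=2, width=2)
noisy = forward_diffuse(masked, t=9, betas=betas, rng=random.Random(0))
```

A reverse model would then be trained to recover `mel` from `noisy`, conditioned on the transcription.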

[457] Interpolating Speaker Identities in Embedding Space for Data Expansion

Tianchi Liu, Ruijie Tao, Qiongqiong Wang, Yidi Jiang, Hardik B. Sailor, Ke Zhang, Jingru Lin, Haizhou Li

Main category: eess.AS

TL;DR: INSIDE synthesizes new speaker identities by interpolating between existing speaker embeddings in a pretrained space, then uses text-to-speech to generate the corresponding speech, yielding relative speaker verification improvements of roughly 3-5%.

Details

Motivation: Collecting large-scale, diverse speaker identity data is expensive, challenging, and limited by privacy concerns, creating a need for synthetic data expansion methods.

Method: Select pairs of nearby speaker embeddings, compute intermediate embeddings using spherical linear interpolation, feed them to a text-to-speech system to generate speech waveforms, and combine the result with the original dataset for training.

Result: Models trained with INSIDE-expanded data outperform real-data-only models, with 3.06% to 5.24% relative improvements in speaker verification and a 13.44% relative improvement in gender classification.

Conclusion: INSIDE is an effective data expansion method that synthesizes new speaker identities, improves performance, is compatible with other augmentation techniques, and serves as a flexible addition to training pipelines.

Abstract: The success of deep learning-based speaker verification systems is largely attributed to access to large-scale and diverse speaker identity data. However, collecting data from more identities is expensive, challenging, and often limited by privacy concerns. To address this limitation, we propose INSIDE (Interpolating Speaker Identities in Embedding Space), a novel data expansion method that synthesizes new speaker identities by interpolating between existing speaker embeddings. Specifically, we select pairs of nearby speaker embeddings from a pretrained speaker embedding space and compute intermediate embeddings using spherical linear interpolation. These interpolated embeddings are then fed to a text-to-speech system to generate corresponding speech waveforms. The resulting data is combined with the original dataset to train downstream models. Experiments show that models trained with INSIDE-expanded data outperform those trained only on real data, achieving 3.06% to 5.24% relative improvements. While INSIDE is primarily designed for speaker verification, we also validate its effectiveness on gender classification, where it yields a 13.44% relative improvement. Moreover, INSIDE is compatible with other augmentation techniques and can serve as a flexible, scalable addition to existing training pipelines.
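
Spherical linear interpolation between two embeddings can be sketched as follows. This is the textbook slerp formula applied to unit-normalized vectors; how INSIDE selects pairs or chooses the interpolation weight `t` is not specified here, so the inputs are purely illustrative.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def slerp(a: List[float], b: List[float], t: float, eps: float = 1e-8) -> List[float]:
    """Interpolate along the great circle between the directions of a and b."""
    a, b = normalize(a), normalize(b)
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)  # angle between the two embeddings
    if omega < eps:
        return list(a)      # directions coincide; nothing to interpolate
    if math.pi - omega < eps:
        raise ValueError("slerp is undefined for opposite directions")
    sin_omega = math.sin(omega)
    wa = math.sin((1.0 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return [wa * x + wb * y for x, y in zip(a, b)]


# Midpoint between two orthogonal toy "speaker embeddings".
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike plain linear interpolation, the result stays on the unit sphere, which matters when the embedding space is compared by cosine similarity.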

[458] Revisiting SSL for sound event detection: complementary fusion and adaptive post-processing

Hanfang Cui, Longfei Song, Li Li, Dongxing Xu, Yanhua Long

Main category: eess.AS

TL;DR: Systematic evaluation of SSL models for sound event detection, proposing fusion strategies and adaptive post-processing that improve performance.

Details

Motivation: Self-supervised learning models offer powerful representations for sound event detection but their synergistic potential remains underexplored, requiring guidance for optimal model selection and integration.

Method: Proposed framework combining heterogeneous SSL representations (BEATs, HuBERT, WavLM) through three fusion strategies: individual SSL embedding integration, dual-modal fusion, and full aggregation. Introduced normalized sound event bounding boxes (nSEBBs) for adaptive post-processing.

Result: Dual-modal fusion (e.g., CRNN+BEATs+WavLM) achieves complementary performance gains. CRNN+BEATs alone delivers best results among individual SSL models. nSEBBs improve PSDS1 by up to 4% for standalone SSL models.

Conclusion: SSL architectures show compatibility and complementarity, providing guidance for task-specific fusion and robust SED system design.

Abstract: Self-supervised learning (SSL) models offer powerful representations for sound event detection (SED), yet their synergistic potential remains underexplored. This study systematically evaluates state-of-the-art SSL models to guide optimal model selection and integration for SED. We propose a framework that combines heterogeneous SSL representations (e.g., BEATs, HuBERT, WavLM) through three fusion strategies: individual SSL embedding integration, dual-modal fusion, and full aggregation. Experiments on the DCASE 2023 Task 4 Challenge reveal that dual-modal fusion (e.g., CRNN+BEATs+WavLM) achieves complementary performance gains, while CRNN+BEATs alone delivers the best results among individual SSL models. We further introduce normalized sound event bounding boxes (nSEBBs), an adaptive post-processing method that dynamically adjusts event boundary predictions, improving PSDS1 by up to 4% for standalone SSL models. These findings highlight the compatibility and complementarity of SSL architectures, providing guidance for task-specific fusion and robust SED system design.

[459] DG-SED: Domain Generalization for Sound Event Detection with Heterogeneous Training Data

Yang Xiao, Han Yin, Jisheng Bai, Rohan Kumar Das

Main category: eess.AS

TL;DR: DG-SED method improves sound event detection domain generalization using mean-teacher framework with mixstyle and adaptive normalization techniques.

Details

Motivation: To advance sound event detection adaptability to real-world scenarios by addressing domain generalization challenges when integrating heterogeneous training data.

Method: Mean-teacher framework with mixstyle applied to frequency dimension, adaptive residual normalization using instance normalization, and sound event bounding boxes for post-processing.

Result: Improved PSDS on DESED dataset and macro-average pAUC on MAESTRO dataset compared to baselines in DCASE 2024 Challenge Task 4.

Conclusion: The proposed DG-SED method effectively enhances domain generalization for sound event detection across different datasets.

Abstract: This work explores domain generalization (DG) for sound event detection (SED), advancing adaptability to real-world scenarios. Our approach employs a mean-teacher framework with domain generalization named DG-SED to integrate heterogeneous training data while preserving the SED model performance across the datasets. Specifically, we first apply mixstyle to the frequency dimension to adapt the mel-spectrograms from different domains. Next, we use the adaptive residual normalization method to generalize features across multiple domains by applying instance normalization in the frequency dimension. Lastly, we use the sound event bounding boxes method for post-processing. We evaluate the proposed approach DG-SED on the DCASE 2024 Challenge Task 4, measuring PSDS on the DESED dataset and macro-average pAUC on the MAESTRO dataset. The results indicate that the proposed DG-SED method improves both PSDS and macro-average pAUC compared to the baselines. The code will be released in due course.
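
The frequency-wise MixStyle step could be sketched as below. It follows the general MixStyle recipe (normalize each sample, then re-style it with statistics mixed from another sample in the batch), with the statistics computed per frequency bin over time. The Beta parameter 0.3, the single-channel layout and the function names are assumptions, not details taken from the paper.

```python
import math
import random
from typing import List, Tuple

Spectrogram = List[List[float]]  # indexed as [frequency_bin][time_frame]


def freq_stats(spec: Spectrogram, eps: float = 1e-6) -> Tuple[List[float], List[float]]:
    """Mean and standard deviation of each frequency bin, taken over time."""
    means, stds = [], []
    for row in spec:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        means.append(mean)
        stds.append(math.sqrt(var + eps))
    return means, stds


def freq_mixstyle(batch: List[Spectrogram], perm: List[int], lam: float) -> List[Spectrogram]:
    """Re-style sample i with statistics mixed between i and perm[i]."""
    stats = [freq_stats(spec) for spec in batch]
    mixed_batch = []
    for i, spec in enumerate(batch):
        mu_i, sd_i = stats[i]
        mu_j, sd_j = stats[perm[i]]
        mixed = []
        for f, row in enumerate(spec):
            mu = lam * mu_i[f] + (1.0 - lam) * mu_j[f]
            sd = lam * sd_i[f] + (1.0 - lam) * sd_j[f]
            mixed.append([(v - mu_i[f]) / sd_i[f] * sd + mu for v in row])
        mixed_batch.append(mixed)
    return mixed_batch


# Two toy spectrograms standing in for samples from different domains.
rng = random.Random(0)
batch = [[[0.0, 2.0], [1.0, 3.0]], [[10.0, 14.0], [20.0, 28.0]]]
perm = [1, 0]                      # pair each sample with the other one
lam = rng.betavariate(0.3, 0.3)    # mixing weight; alpha = 0.3 is an assumption
mixed = freq_mixstyle(batch, perm, lam)
```

Setting `lam` to 1 leaves each sample unchanged, while 0 transfers the partner's per-frequency statistics entirely.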

[460] Leveraging Content and Acoustic Representations for Speech Emotion Recognition

Soumya Dutta, Sriram Ganapathy

Main category: eess.AS

TL;DR: CARE proposes a dual encoding scheme for speech emotion recognition that combines semantic and acoustic representations from unsupervised raw speech, achieving state-of-the-art performance across 8 diverse datasets.

Details

Motivation: Speech emotion recognition is challenging due to difficulty extracting emotional representations from speech and scarcity of labeled datasets, which causes large models to overfit.

Method: Dual encoding scheme with semantic encoder trained via distillation from text representations and acoustic encoder trained to predict low-level frame-wise speech features. Base-sized model trained only on unsupervised raw speech.

Result: CARE achieves best average performance across 8 diverse datasets compared to other self-supervised models and large-language model approaches.

Conclusion: The proposed dual encoding scheme effectively captures both semantic and acoustic factors for emotion recognition, providing superior performance with simple lightweight classification.

Abstract: Speech emotion recognition (SER), the task of identifying the expression of emotion from spoken content, is challenging due to the difficulty in extracting representations that capture emotional attributes from speech. The scarcity of labeled datasets further complicates the challenge where large models are prone to over-fitting. In this paper, we propose CARE (Content and Acoustic Representations of Emotions), where we design a dual encoding scheme which emphasizes semantic and acoustic factors of speech. While the semantic encoder is trained using distillation from utterance-level text representations, the acoustic encoder is trained to predict low-level frame-wise features of the speech signal. The proposed dual encoding scheme is a base-sized model trained only on unsupervised raw speech. With a simple light-weight classification model trained on the downstream task, we show that the CARE embeddings provide effective emotion recognition on a variety of datasets. We compare the proposal with several other self-supervised models as well as recent large-language model based approaches. In these evaluations, the proposed CARE is shown to be the best performing model based on average performance across 8 diverse datasets. We also conduct several ablation studies to analyze the importance of various design choices.

[461] Fusion of Modulation Spectrogram and SSL with Multi-head Attention for Fake Speech Detection

Rishith Sadashiv T N, Abhishek Bedge, Saisha Suresh Bore, Jagabandhu Mishra, Mrinmoy Bhattacharjee, S R Mahadeva Prasanna

Main category: eess.AS

TL;DR: Proposes SSL+MS fusion representation with AASIST back-end for fake speech detection, achieving significant performance improvements in both in-domain and cross-dataset scenarios.

Details

Motivation: Address poor generalizability of current fake speech detection systems on out-of-domain samples due to lack of diverse training data.

Method: Novel speech representation combining self-supervised speech embeddings and Modulation Spectrogram features, fused and passed to AASIST back-end network.

Result: 37% relative improvement on ASVspoof 2019, 20% on MLAAD in-domain; 36% improvement in cross-dataset evaluation; consistent outperformance across all languages.

Conclusion: Proposed SSL+MS fusion representation significantly enhances domain generalization for fake speech detection in both monolingual and multilingual scenarios.

Abstract: Fake speech detection systems have become a necessity to combat speech deepfakes. Current systems exhibit poor generalizability on out-of-domain speech samples due to a lack of diverse training data. In this paper, we attempt to address the domain generalization issue by proposing a novel speech representation using self-supervised (SSL) speech embeddings and the Modulation Spectrogram (MS) feature. A fusion strategy is used to combine both speech representations to introduce a new front-end for the classification task. The proposed SSL+MS fusion representation is passed to the AASIST back-end network. Experiments are conducted on monolingual and multilingual fake speech datasets to evaluate the efficacy of the proposed model architecture in cross-dataset and multilingual cases. The proposed model achieves a relative performance improvement of 37% and 20% on the ASVspoof 2019 and MLAAD datasets, respectively, in in-domain settings compared to the baseline. In the out-of-domain scenario, the model trained on ASVspoof 2019 shows a 36% relative improvement when evaluated on the MLAAD dataset. Across all evaluated languages, the proposed model consistently outperforms the baseline, indicating enhanced domain generalization.

eess.IV

[462] Stress-testing cross-cancer generalizability of 3D nnU-Net for PET-CT tumor segmentation: multi-cohort evaluation with novel oesophageal and lung cancer datasets

Soumen Ghosh, Christine Jestin Hannan, Rajat Vashistha, Parveen Kundu, Sandra Brosda, Lauren G. Aoude, James Lonie, Andrew Nathanson, Jessica Ng, Andrew P. Barbour, Viktor Vegh

Main category: eess.IV

TL;DR: Cross-cancer evaluation of nnU-Net on PET-CT shows that combining diverse datasets (oesophageal and lung cancer from different demographics) provides the most robust tumor segmentation, outperforming single-dataset models and demonstrating that dataset diversity is more important than model complexity for clinical generalization.

Details

Motivation: To achieve robust generalization for deep learning-based tumor segmentation in clinical PET-CT workflows where anatomical sites, scanners, and patient populations vary widely, addressing the need for models that perform well across different cancer types and demographics.

Method: Trained and tested 3D nnUNet models under three paradigms: target-only (oesophageal cancer), public-only (AutoPET dataset), and combined training using two novel expert-annotated whole-body datasets (279 oesophageal cancer patients from Australia and 54 lung cancer patients from India) complemented by the public AutoPET dataset.

Result: Target-only model achieved best in-domain accuracy (mean DSC 57.8) but failed externally (mean DSC <3.4). Public-only model generalized better (mean DSC 63.5 on AutoPET, 51.6 on Indian lung) but underperformed on oesophageal cohort (26.7). Combined approach provided most balanced results (mean DSC: lung 52.9, oesophageal 40.7, AutoPET 60.9) with reduced boundary errors and improved robustness.

Conclusion: Dataset diversity, particularly multi-demographic, multi-center, and multi-cancer integration, outweighs architectural novelty as the key driver of robust generalization. Diversity in training data is more critical than model complexity for clinically robust segmentation.

Abstract: Robust generalization is essential for deploying deep learning based tumor segmentation in clinical PET-CT workflows, where anatomical sites, scanners, and patient populations vary widely. This study presents the first cross-cancer evaluation of nnU-Net on PET-CT, introducing two novel, expert-annotated whole-body datasets: 279 patients with oesophageal cancer (Australian cohort) and 54 with lung cancer (Indian cohort). These cohorts complement the public AutoPET dataset and enable systematic stress-testing of cross-domain performance. We trained and tested 3D nnU-Net models under three paradigms: target only (oesophageal), public only (AutoPET), and combined training. On the test sets, the oesophageal-only model achieved the best in-domain accuracy (mean DSC, 57.8) but failed on the external Indian lung cohort (mean DSC less than 3.4), indicating severe overfitting. The public-only model generalized more broadly (mean DSC, 63.5 on AutoPET, 51.6 on the Indian lung cohort) but underperformed on the oesophageal Australian cohort (mean DSC, 26.7). The combined approach provided the most balanced results (mean DSC: lung 52.9, oesophageal 40.7, AutoPET 60.9), reducing boundary errors and improving robustness across all cohorts. These findings demonstrate that dataset diversity, particularly multi-demographic, multi-center and multi-cancer integration, outweighs architectural novelty as the key driver of robust generalization. This work presents a demography-based, cross-cancer evaluation of deep learning segmentation and highlights dataset diversity, rather than model complexity, as the foundation for clinically robust segmentation.
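
The Dice similarity coefficient (DSC) reported above measures overlap between a predicted and a reference mask as 2|A ∩ B| / (|A| + |B|). The sketch below works on flattened binary masks; returning 1.0 when both masks are empty is a common convention assumed here, and the summary's values appear to be on a 0-100 scale, whereas this function returns a value in [0, 1].

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice overlap between two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Toy masks: one overlapping voxel out of two foreground voxels each.
score = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
```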

[463] ModAn-MulSupCon: Modality-and Anatomy-Aware Multi-Label Supervised Contrastive Pretraining for Medical Imaging

Eichi Takaya, Ryusei Inamori

Main category: eess.IV

TL;DR: ModAn-MulSupCon uses modality and anatomy metadata for multi-label supervised contrastive pretraining, achieving superior fine-tuning performance on medical imaging tasks compared to traditional methods.

Details

Motivation: Expert annotations are scarce for large-scale supervised pretraining in medical imaging, while ubiquitous metadata like modality and anatomical region remain underutilized as potential training signals.

Method: Encodes each image’s modality and anatomy as multi-hot vectors, uses ResNet-18 encoder pretrained on miniRIN dataset with Jaccard-weighted multi-label supervised contrastive loss, evaluated on three binary classification tasks.

Result: Achieved best AUC on ACL tear (0.964) and thyroid nodule malignancy (0.763), second best on breast lesion malignancy (0.926). Superior with fine-tuning but underperformed SimCLR/ImageNet with frozen encoder.

Conclusion: Modality/anatomy metadata provides practical pretraining signal that improves downstream accuracy when fine-tuning is feasible. Best for label-scarce clinical settings with task adaptation, while SimCLR/ImageNet better for frozen deployments.

Abstract: Background and objective: Expert annotations limit large-scale supervised pretraining in medical imaging, while ubiquitous metadata (modality, anatomical region) remain underused. We introduce ModAn-MulSupCon, a modality- and anatomy-aware multi-label supervised contrastive pretraining method that leverages such metadata to learn transferable representations. Method: Each image’s modality and anatomy are encoded as a multi-hot vector. A ResNet-18 encoder is pretrained on a mini subset of RadImageNet (miniRIN, 16,222 images) with a Jaccard-weighted multi-label supervised contrastive loss, and then evaluated by fine-tuning and linear probing on three binary classification tasks–ACL tear (knee MRI), lesion malignancy (breast ultrasound), and nodule malignancy (thyroid ultrasound). Result: With fine-tuning, ModAn-MulSupCon achieved the best AUC on MRNet-ACL (0.964) and Thyroid (0.763), surpassing all baselines ($p<0.05$), and ranked second on Breast (0.926) behind SimCLR (0.940; not significant). With the encoder frozen, SimCLR/ImageNet were superior, indicating that ModAn-MulSupCon representations benefit most from task adaptation rather than linear separability. Conclusion: Encoding readily available modality/anatomy metadata as multi-label targets provides a practical, scalable pretraining signal that improves downstream accuracy when fine-tuning is feasible. ModAn-MulSupCon is a strong initialization for label-scarce clinical settings, whereas SimCLR/ImageNet remain preferable for frozen-encoder deployments.
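
One way to read "Jaccard-weighted" is that each pair of images contributes to the contrastive loss in proportion to the Jaccard similarity of their multi-hot metadata vectors. The sketch below follows that reading; the exact loss in the paper may differ, and the label layout, temperature value and function names are assumptions.

```python
import math
from typing import List, Sequence


def jaccard(a: Sequence[int], b: Sequence[int]) -> float:
    """Jaccard similarity |a AND b| / |a OR b| of two multi-hot vectors."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0


def weighted_supcon_loss(
    embeddings: List[List[float]], labels: List[List[int]], temperature: float = 0.1
) -> float:
    """Contrastive loss where positives are weighted by label Jaccard similarity.

    Embeddings are assumed to be L2-normalized already.
    """
    total, anchors = 0.0, 0
    for i, anchor in enumerate(embeddings):
        others = [j for j in range(len(embeddings)) if j != i]
        logits = {
            j: sum(x * y for x, y in zip(anchor, embeddings[j])) / temperature
            for j in others
        }
        peak = max(logits.values())
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        weights = {j: jaccard(labels[i], labels[j]) for j in others}
        weight_sum = sum(weights.values())
        if weight_sum == 0.0:
            continue  # this anchor shares no label with any other sample
        total += -sum(w * (logits[j] - log_denom) for j, w in weights.items()) / weight_sum
        anchors += 1
    return total / anchors if anchors else 0.0


# Hypothetical label layout: [MRI, CT, ultrasound, knee, breast].
mri_knee = [1, 0, 0, 1, 0]
mri_breast = [1, 0, 0, 0, 1]
pair_weight = jaccard(mri_knee, mri_breast)  # shared modality, different anatomy
```

Under this weighting, two images sharing only the modality attract each other less strongly than two images sharing both modality and anatomy.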

[464] HOTSPOT-YOLO: A Lightweight Deep Learning Attention-Driven Model for Detecting Thermal Anomalies in Drone-Based Solar Photovoltaic Inspections

Mahmoud Dhimish

Main category: eess.IV

TL;DR: HOTSPOT-YOLO is a lightweight AI model for detecting thermal anomalies in solar PV systems using drone-based inspections, achieving 90.8% mAP with real-time performance.

Details

Motivation: Thermal anomaly detection is crucial for maintaining solar PV system efficiency and reducing maintenance costs, requiring specialized models for drone-based inspections of small, subtle anomalies.

Method: Developed HOTSPOT-YOLO model integrating efficient CNN backbone with attention mechanisms, specifically designed for detecting hotspots and defective modules in thermal imagery.

Result: Achieved 90.8% mean average precision, significant improvement over baseline models, with reduced computational load and robustness across diverse environmental conditions.

Conclusion: Provides scalable, reliable solution for large-scale PV inspections, demonstrating successful integration of advanced AI techniques with practical renewable energy applications.

Abstract: Thermal anomaly detection in solar photovoltaic (PV) systems is essential for ensuring operational efficiency and reducing maintenance costs. In this study, we developed HOTSPOT-YOLO, a lightweight artificial intelligence (AI) model that integrates an efficient convolutional neural network backbone and attention mechanisms to improve object detection. This model is specifically designed for drone-based thermal inspections of PV systems, addressing the unique challenges of detecting small and subtle thermal anomalies, such as hotspots and defective modules, while maintaining real-time performance. Experimental results demonstrate a mean average precision of 90.8%, reflecting a significant improvement over baseline object detection models. With a reduced computational load and robustness under diverse environmental conditions, HOTSPOT-YOLO offers a scalable and reliable solution for large-scale PV inspections. This work highlights the integration of advanced AI techniques with practical engineering applications, revolutionizing automated fault detection in renewable energy systems.

[465] Federative ischemic stroke segmentation as alternative to overcome domain-shift multi-institution challenges

Edgar Rangel, Fabio Martinez

Main category: eess.IV

TL;DR: A federated learning framework for ischemic stroke lesion segmentation that achieves better performance than centralized approaches while preserving data privacy across multiple healthcare centers.

Details

Motivation: Stroke lesion analysis is highly variable due to different patient demographics, scanner vendors, and expert annotations. Current computational approaches lack generalization across institutions and many centers lack sufficient labeled data for training.

Method: Developed a collaborative federated learning framework (FedAvg) for segmenting ischemic stroke lesions in DWI sequences by sharing knowledge from deep center-independent representations across 14 healthcare centers with 2031 studies.

Result: FedAvg achieved DSC of 0.71±0.24, AVD of 5.29±22.74, ALD of 2.16±3.60 and LF1 of 0.70±0.26, outperforming centralized and other federated approaches. Showed strong generalization with uniform performance across lesion categories and reliable performance in out-of-distribution centers (DSC 0.64±0.29 without additional training).

Conclusion: The federated learning framework successfully addresses the variability in stroke lesion analysis across different healthcare centers while maintaining data privacy and demonstrating strong generalization capabilities.

Abstract: Stroke is the second leading cause of death and the third leading cause of disability worldwide. Clinical guidelines establish diffusion resonance imaging (DWI, ADC) as the standard for localizing, characterizing, and measuring infarct volume, enabling treatment support and prognosis. Nonetheless, such lesion analysis is highly variable due to different patient demographics, scanner vendors, and expert annotations. Computational support approaches have been key to helping with the localization and segmentation of lesions. However, these strategies are dedicated solutions that learn patterns from only one institution, lacking the variability to generalize geometrical lesion shape models. Even worse, many clinical centers lack sufficient labeled samples to adjust these dedicated solutions. This work developed a collaborative framework for segmenting ischemic stroke lesions in DWI sequences by sharing knowledge from deep center-independent representations. From 14 emulated healthcare centers with 2031 studies, the FedAvg model achieved a general DSC of $0.71 \pm 0.24$, AVD of $5.29 \pm 22.74$, ALD of $2.16 \pm 3.60$ and LF1 of $0.70 \pm 0.26$ over all centers, outperforming both the centralized and other federated rules. Interestingly, the model demonstrated strong generalization properties, showing uniform performance across different lesion categories and reliable performance in out-of-distribution centers (with DSC of $0.64 \pm 0.29$ and AVD of $4.44 \pm 8.74$ without any additional training).
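
As a rough illustration of the FedAvg aggregation rule named in this entry, the sketch below averages per-center model parameters, weighting each center by its number of studies. The function name `fedavg`, the flat-list parameter layout, and the toy numbers are assumptions made for illustration; this is not the authors' implementation.

```python
from typing import Dict, List


def fedavg(client_weights: List[Dict[str, List[float]]],
           client_sizes: List[int]) -> Dict[str, List[float]]:
    """Sample-size-weighted average of client parameters (FedAvg aggregation step)."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    averaged: Dict[str, List[float]] = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        # Each center contributes in proportion to how much data it holds.
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


# Toy example: two hypothetical centers holding 100 and 300 studies.
center_a = {"conv.weight": [1.0, 2.0], "conv.bias": [0.0]}
center_b = {"conv.weight": [3.0, 6.0], "conv.bias": [4.0]}
global_weights = fedavg([center_a, center_b], [100, 300])
print(global_weights)  # {'conv.weight': [2.5, 5.0], 'conv.bias': [3.0]}
```

Only parameters leave each center in this scheme, which is what lets the images and annotations stay local.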

[466] Lossless 4:2:0 Screen Content Coding Using Luma-Guided Soft Context Formation

Hannah Och, André Kaup

Main category: eess.IV

TL;DR: Extends soft context formation coder to support YCbCr 4:2:0 format by analyzing mutual information between planes, enhancing chroma prediction with luminance data, and adding side-information about luma-chroma combinations.

Details

Motivation: The original soft context formation coder only supports RGB 4:4:4 format and cannot handle YCbCr 4:2:0 format commonly used in video compression, limiting its applicability.

Method: Successively code Y and CbCr planes using normalized mutual information analysis, enhance chroma prediction based on luminance plane, and transmit side-information about luma-chroma combinations for better probability modeling.

Result: Outperforms HEVC-SCC by achieving 5.66% lower bitrate on average across a large screen content image dataset.

Conclusion: The proposed extensions successfully adapt the soft context formation coder to YCbCr 4:2:0 format while maintaining excellent compression performance for screen content images.

Abstract: The soft context formation coder is a pixel-wise state-of-the-art lossless screen content coder using pattern matching and color palette coding in combination with arithmetic coding. It achieves excellent compression performance on screen content images in RGB 4:4:4 format with few distinct colors. In contrast to many other lossless compression methods, it codes entire color pixels at once, i.e., all color components of one pixel are coded together. Consequently, it does not natively support image formats with downsampled chroma, such as YCbCr 4:2:0, which is an often used chroma format in video compression. In this paper, we extend the soft context formation coding capabilities to 4:2:0 image compression, by successively coding Y and CbCr planes based on an analysis of normalized mutual information between image planes. Additionally, we propose an enhancement to the chroma prediction based on the luminance plane. Furthermore, we propose to transmit side-information about occurring luma-chroma combinations to improve chroma probability distribution modelling. Averaged over a large screen content image dataset, our proposed method outperforms HEVC-SCC, with HEVC-SCC needing 5.66% more bitrate compared to our method.
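
To make the 4:2:0 format concrete: each chroma plane is stored at half the luma resolution in both directions, so a 2x2 group of luma samples shares one Cb and one Cr sample. The sketch below downsamples a chroma plane by averaging each 2x2 block. The averaging filter and the name `subsample_420` are assumptions for illustration; the abstract does not say how the 4:2:0 content was produced.

```python
from typing import List

Plane = List[List[float]]


def subsample_420(chroma: Plane) -> Plane:
    """Halve a chroma plane in both directions by averaging each 2x2 block."""
    height, width = len(chroma), len(chroma[0])
    if height % 2 or width % 2:
        raise ValueError("this sketch assumes even plane dimensions")
    return [
        [
            (chroma[y][x] + chroma[y][x + 1]
             + chroma[y + 1][x] + chroma[y + 1][x + 1]) / 4.0
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


# A 4x4 full-resolution Cb plane becomes a 2x2 plane.
cb_full = [
    [10, 10, 20, 20],
    [10, 10, 20, 20],
    [30, 34, 40, 40],
    [30, 34, 40, 44],
]
cb_420 = subsample_420(cb_full)
print(cb_420)  # [[10.0, 20.0], [32.0, 41.0]]
```

Because the chroma planes no longer align sample-for-sample with luma, a coder that handles whole color pixels at once cannot be applied directly, which is the gap this entry addresses.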

[467] Random forest-based out-of-distribution detection for robust lung cancer segmentation

Aneesh Rangnekar, Harini Veeraraghavan

Main category: eess.IV

TL;DR: RF-Deep uses random forest classifier with deep features from pretrained transformer to detect out-of-distribution CT scans and improve cancer segmentation reliability.

Details

Motivation: Transformer-based models degrade when applied to out-of-distribution CT datasets, requiring a solution to enhance segmentation reliability across different medical imaging domains.

Method: Combines Swin Transformer encoder pretrained with masked image modeling on 10,432 unlabeled 3D CT scans with convolution decoder, and uses random forest classifier to detect OOD scans based on deep features.

Result: Achieved FPR95 of 18.26% on PE, 27.66% on COVID-19, and <0.1% on abdominal CTs for OOD detection, consistently outperforming established OOD approaches.

Conclusion: RF-Deep provides a simple and effective approach to enhance cancer segmentation reliability in both in-distribution and out-of-distribution scenarios.

Abstract: Accurate detection and segmentation of cancerous lesions from computed tomography (CT) scans is essential for automated treatment planning and cancer treatment response assessment. Transformer-based models with self-supervised pretraining can produce reliably accurate segmentation from in-distribution (ID) data but degrade when applied to out-of-distribution (OOD) datasets. We address this challenge with RF-Deep, a random forest classifier that utilizes deep features from a pretrained transformer encoder of the segmentation model to detect OOD scans and enhance segmentation reliability. The segmentation model comprises a Swin Transformer encoder, pretrained with masked image modeling (SimMIM) on 10,432 unlabeled 3D CT scans covering cancerous and non-cancerous conditions, with a convolution decoder, trained to segment lung cancers in 317 3D scans. Independent testing was performed on 603 3D CT public datasets that included one ID dataset and four OOD datasets comprising chest CTs with pulmonary embolism (PE) and COVID-19, and abdominal CTs with kidney cancers and healthy volunteers. RF-Deep detected OOD cases with a FPR95 of 18.26%, 27.66%, and less than 0.1% on PE, COVID-19, and abdominal CTs, consistently outperforming established OOD approaches. The RF-Deep classifier provides a simple and effective approach to enhance reliability of cancer segmentation in ID and OOD scenarios.
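
FPR95 is commonly read as the false positive rate measured at the threshold where the true positive rate reaches 95%. The sketch below computes it from two lists of detector scores, assuming higher scores mean "more positive". Which class counts as positive (ID or OOD) varies between papers and is not stated in this summary, so the function is written in neutral terms; the name `fpr_at_tpr` and the toy scores are hypothetical.

```python
import math
from typing import Sequence


def fpr_at_tpr(pos_scores: Sequence[float], neg_scores: Sequence[float],
               tpr: float = 0.95) -> float:
    """False positive rate at the highest threshold that still reaches the target TPR."""
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(pos_scores, reverse=True)
    # Smallest number of positives that must be accepted to reach the target TPR.
    needed = max(1, math.ceil(tpr * len(ranked)))
    threshold = ranked[needed - 1]
    false_positives = sum(1 for s in neg_scores if s >= threshold)
    return false_positives / len(neg_scores)


# Toy scores: 20 positives scored 1..20, 5 negatives.
positives = list(range(1, 21))
negatives = [0, 1, 1, 3, 10]
rate = fpr_at_tpr(positives, negatives)
print(rate)  # 0.4
```

Lower is better: a value below 0.1%, as reported for the abdominal CTs, means almost no negatives pass the threshold that accepts 95% of positives.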

[468] MOSformer: Momentum encoder-based inter-slice fusion transformer for medical image segmentation

De-Xing Huang, Xiao-Hu Zhou, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Zhen-Qiu Feng, Zhi-Chao Lai, Zeng-Guang Hou

Main category: eess.IV

TL;DR: MOSformer introduces dual encoders with momentum update and inter-slice fusion transformer to effectively leverage multi-scale inter-slice information for superior medical image segmentation performance.

Details

Motivation: Existing 2.5D segmentation models use single encoders that fail to effectively fuse inter-slice information, leading to suboptimal segmentation performance.

Method: Proposes MOSformer with dual encoders (one momentum-averaged) to enhance feature distinguishability and an inter-slice fusion transformer module to fuse multi-scale features across slices.

Result: Achieves state-of-the-art results on three benchmarks: 85.63% DSC on Synapse, 92.19% on ACDC, and 85.43% on AMOS.

Conclusion: MOSformer demonstrates competitive performance in medical image segmentation by effectively leveraging inter-slice information through dual encoders and fusion transformer.

Abstract: Medical image segmentation takes an important position in various clinical applications. 2.5D-based segmentation models bridge the computational efficiency of 2D-based models with the spatial perception capabilities of 3D-based models. However, existing 2.5D-based models primarily adopt a single encoder to extract features of target and neighborhood slices, failing to effectively fuse inter-slice information, resulting in suboptimal segmentation performance. In this study, a novel momentum encoder-based inter-slice fusion transformer (MOSformer) is proposed to overcome this issue by leveraging inter-slice information from multi-scale feature maps extracted by different encoders. Specifically, dual encoders are employed to enhance feature distinguishability among different slices. One of the encoders is moving-averaged to maintain consistent slice representations. Moreover, an inter-slice fusion transformer (IF-Trans) module is developed to fuse inter-slice multi-scale features. MOSformer is evaluated on three benchmark datasets (Synapse, ACDC, and AMOS), achieving a new state-of-the-art with 85.63%, 92.19%, and 85.43% DSC, respectively. These results demonstrate MOSformer’s competitiveness in medical image segmentation.
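
The abstract says one encoder is "moving-averaged". A common way to do this is an exponential moving average (EMA) of the other encoder's parameters, sketched below. The EMA form, the coefficient `m`, and the dictionary-of-scalars parameter layout are assumptions for illustration; the summary does not give the exact update rule used in MOSformer.

```python
from typing import Dict


def momentum_update(momentum_params: Dict[str, float],
                    online_params: Dict[str, float],
                    m: float = 0.99) -> Dict[str, float]:
    """Return m * theta_momentum + (1 - m) * theta_online for every parameter."""
    if not 0.0 <= m <= 1.0:
        raise ValueError("momentum coefficient must lie in [0, 1]")
    return {
        name: m * momentum_params[name] + (1.0 - m) * online_params[name]
        for name in momentum_params
    }


# The momentum encoder drifts slowly toward the gradient-trained encoder.
online = {"w": 1.0, "b": -2.0}
slow = {"w": 0.0, "b": 0.0}
slow = momentum_update(slow, online, m=0.5)
print(slow)  # {'w': 0.5, 'b': -1.0}
```

With `m` close to 1 the momentum encoder changes slowly between steps, which is the usual motivation for keeping its representations consistent.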

[469] TimeFlow: Temporal Conditioning for Longitudinal Brain MRI Registration and Aging Analysis

Bailiang Jian, Jiazhen Pan, Yitong Li, Fabian Bongratz, Ruochen Li, Daniel Rueckert, Benedikt Wiestler, Christian Wachinger

Main category: eess.IV

TL;DR: TimeFlow is a learning-based framework for longitudinal brain MRI registration that models neuroanatomy as a continuous function of age, enabling accurate deformation field estimation and future brain state prediction from just two scans.

Details

Motivation: Existing longitudinal brain MRI registration methods are limited by reliance on densely sampled time series, trade-offs between accuracy and temporal smoothness, and inability to prospectively forecast future brain states.

Method: Uses a U-Net backbone with temporal conditioning to model neuroanatomy as a continuous function of age. Incorporates inter-/extra-polation consistency constraints on deformation fields and deformed images to preserve temporal consistency without explicit smoothness regularizers.

Result: Outperforms state-of-the-art methods in future timepoint forecasting and registration accuracy. Enables differentiation of neurodegenerative trajectories from normal aging without requiring segmentation or annotations.

Conclusion: TimeFlow provides an accurate, data-efficient, and annotation-free framework for longitudinal brain aging analysis, capable of forecasting brain changes beyond observed study periods.

Abstract: Longitudinal brain analysis is essential for understanding healthy aging and identifying pathological deviations. Longitudinal registration of sequential brain MRI underpins such analyses. However, existing methods are limited by reliance on densely sampled time series, a trade-off between accuracy and temporal smoothness, and an inability to prospectively forecast future brain states. To overcome these challenges, we introduce \emph{TimeFlow}, a learning-based framework for longitudinal brain MRI registration. TimeFlow uses a U-Net backbone with temporal conditioning to model neuroanatomy as a continuous function of age. Given only two scans from an individual, TimeFlow estimates accurate and temporally coherent deformation fields, enabling non-linear extrapolation to predict future brain states. This is achieved by our proposed inter-/extra-polation consistency constraints applied to both the deformation fields and deformed images. Remarkably, these constraints preserve temporal consistency and continuity without requiring explicit smoothness regularizers or densely sampled sequential data. Extensive experiments demonstrate that TimeFlow outperforms state-of-the-art methods in terms of both future timepoint forecasting and registration accuracy. Moreover, TimeFlow supports novel biological brain aging analyses by differentiating neurodegenerative trajectories from normal aging without requiring segmentation, thereby eliminating the need for labor-intensive annotations and mitigating segmentation inconsistency. TimeFlow offers an accurate, data-efficient, and annotation-free framework for longitudinal analysis of brain aging and chronic diseases, capable of forecasting brain changes beyond the observed study period.

[470] Image Coding for Machines via Feature-Preserving Rate-Distortion Optimization

Samuel Fernández-Menduiña, Eduardo Pavez, Antonio Ortega

Main category: eess.IV

TL;DR: A method for optimizing image/video compression for both visual quality and computer vision task performance by using feature distance as distortion metric, with block-wise approximations to make it computationally practical.

Details

Motivation: Many images/videos are processed by computer vision algorithms with only occasional human inspection, requiring compression methods that optimize for both visual quality and downstream task performance.

Method: Use feature distance as distortion metric in rate-distortion optimization, approximate with Taylor expansion and block-wise input-dependent squared error (IDSE) using Jacobian sketches, combined with SSE for visual quality.

Result: Up to 17% bit-rate savings for same task accuracy compared to SSE-based RDO, with no decoder complexity overhead and 7.86% encoder complexity increase.

Conclusion: The proposed method effectively optimizes compression for computer vision tasks while maintaining visual quality, with minimal computational overhead.

Abstract: Many images and videos are primarily processed by computer vision algorithms, involving only occasional human inspection. When this content requires compression before processing, e.g., in distributed applications, coding methods must optimize for both visual quality and downstream task performance. We first show theoretically that an approach to reduce the effect of compression for a given task loss is to perform rate-distortion optimization (RDO) using the distance between features, obtained from the original and the decoded images, as a distortion metric. However, directly optimizing such a rate-distortion objective is computationally impractical because it requires iteratively encoding and decoding the entire image, plus feature evaluation, for each possible coding configuration. We address this problem by simplifying the RDO formulation to make the distortion term computable using block-based encoders. We first apply Taylor’s expansion to the feature extractor, recasting the feature distance as a quadratic metric involving the Jacobian matrix of the neural network. Then, we replace the linearized metric with a block-wise approximation, which we call input-dependent squared error (IDSE). To make the metric computable, we approximate IDSE using sketches of the Jacobian. The resulting loss can be evaluated block-wise in the transform domain and combined with the sum of squared errors (SSE) to address both visual quality and computer vision performance. Simulations with AVC and HEVC across multiple feature extractors and downstream networks show up to 17% bit-rate savings for the same task accuracy compared to RDO based on SSE, with no decoder complexity overhead and a small (7.86%) encoder complexity increase.
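
The linearization step described in this abstract can be written out as follows. The notation is assumed here rather than taken from the paper: $f$ is the feature extractor, $x$ the original image, $\hat{x}$ the decoded image, $J_f(x)$ the Jacobian of $f$ at $x$, and $S$ a sketching matrix with far fewer rows than $J_f(x)$.

```latex
\begin{align}
f(\hat{x}) &\approx f(x) + J_f(x)\,(\hat{x} - x), \\
\lVert f(x) - f(\hat{x}) \rVert_2^2 &\approx (x - \hat{x})^\top J_f(x)^\top J_f(x)\,(x - \hat{x}), \\
\lVert J_f(x)\,(x - \hat{x}) \rVert_2^2 &\approx \lVert S\,J_f(x)\,(x - \hat{x}) \rVert_2^2 .
\end{align}
```

The first line is the first-order Taylor expansion, the second follows by substituting it into the feature distance, and the third is one plausible reading of the "sketches of the Jacobian" mentioned above. The quadratic form depends on the input through $J_f(x)$, which is consistent with the name input-dependent squared error.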

[471] Uni-AIMS: AI-Powered Microscopy Image Analysis

Yanhui Hong, Nan Wang, Zhiyi Xia, Haoyi Tao, Xi Fang, Yiming Li, Jiankun Wang, Peng Jin, Xiaochen Cai, Shengyu Li, Ziqi Chen, Zezhong Zhang, Guolin Ke, Linfeng Zhang

Main category: eess.IV

TL;DR: Systematic solution for intelligent microscopy image recognition with data engine, robust segmentation model, and automatic scale bar detection, validated in real applications.

Details

Motivation: To address the challenges of microscopy image analysis including diverse data needs, object detection in cluttered environments, and quantitative analysis requirements.

Method: Developed a data engine combining experimental image collection, synthetic data generation, and human-in-the-loop annotation. Created segmentation model for detecting both small/large objects and separating closely situated targets. Implemented automatic scale bar recognition.

Result: Built comprehensive intelligent analysis platform validated in real-world applications. The solution effectively handles thousands of targets in cluttered environments and supports precise quantitative analysis.

Conclusion: The study advances automatic microscopy recognition with scalable, generalizable tools across multiple domains, providing an online application for researchers to access automated analysis services.

Abstract: This paper presents a systematic solution for the intelligent recognition and automatic analysis of microscopy images. We developed a data engine that generates high-quality annotated datasets by combining the collection of diverse experimental microscopy images, synthetic data generation, and a human-in-the-loop annotation process. To address the unique challenges of microscopy images, we propose a segmentation model capable of robustly detecting both small and large objects. The model effectively identifies and separates thousands of closely situated targets, even in cluttered visual environments. Furthermore, our solution supports the precise automatic recognition of image scale bars, an essential feature in quantitative microscopic analysis. Building upon these components, we have constructed a comprehensive intelligent analysis platform and validated its effectiveness and practicality in real-world applications. This study not only advances automatic recognition in microscopy imaging but also ensures scalability and generalizability across multiple application domains, offering a powerful tool for automated microscopic analysis in interdisciplinary research. An online application is made available for researchers to access and evaluate the proposed automated analysis service.

[472] MorphSAM: Learning the Morphological Prompts from Atlases for Spine Image Segmentation

Dingwei Fan, Junyong Zhao, Chunlin Li, Mingliang Wang, Qi Zhu, Haipeng Si, Daoqiang Zhang, Liang Sun

Main category: eess.IV

TL;DR: MorphSAM enhances spine image segmentation by learning morphological information from anatomical atlases through two prompt learning networks, achieving state-of-the-art performance on CT and MR spine segmentation tasks.

Details

Motivation: Spine image segmentation is challenging due to complex spine structure and high morphological similarity between vertebrae and discs. Existing SAM models struggle to effectively capture and utilize morphological information for improved segmentation performance.

Method: Proposes MorphSAM with two fully automatic prompt learning networks: 1) anatomical prompt learning network that learns morphological information directly from anatomical atlases, and 2) semantic prompt learning network that derives morphological information from text descriptions converted from atlases. Both prompts are fed into SAM to boost segmentation.

Result: Validated on two spine image segmentation tasks (spine anatomical structure segmentation with CT images and lumbosacral plexus segmentation with MR images). Achieves superior segmentation performance compared to state-of-the-art methods.

Conclusion: MorphSAM effectively enhances spine image segmentation by explicitly learning morphological information from atlases, demonstrating significant performance improvements over existing methods on both CT and MR imaging modalities.

Abstract: Spine image segmentation is crucial for clinical diagnosis and treatment of spine diseases. The complex structure of the spine and the high morphological similarity between individual vertebrae and adjacent intervertebral discs make accurate spine segmentation a challenging task. Although the Segment Anything Model (SAM) has been proposed, it still struggles to effectively capture and utilize morphological information, limiting its ability to enhance spine image segmentation performance. To address these challenges, in this paper, we propose a MorphSAM that explicitly learns morphological information from atlases, thereby strengthening the spine image segmentation performance of SAM. Specifically, the MorphSAM includes two fully automatic prompt learning networks, 1) an anatomical prompt learning network that directly learns morphological information from anatomical atlases, and 2) a semantic prompt learning network that derives morphological information from text descriptions converted from the atlases. Then, the two learned morphological prompts are fed into the SAM model to boost the segmentation performance. We validate our MorphSAM on two spine image segmentation tasks, including a spine anatomical structure segmentation task with CT images and a lumbosacral plexus segmentation task with MR images. Experimental results demonstrate that our MorphSAM achieves superior segmentation performance when compared to the state-of-the-art methods.

[473] Analise de Desaprendizado de Maquina em Modelos de Classificacao de Imagens Medicas (Analysis of Machine Unlearning in Medical Image Classification Models)

Andreza M. C. Falcao, Filipe R. Cordeiro

Main category: eess.IV

TL;DR: Evaluation of SalUn unlearning model on medical image datasets shows it achieves performance close to full retraining, making it suitable for medical applications where data privacy is crucial.

Details

Motivation: Machine unlearning techniques have not been explored in medical image classification despite the need to remove private/sensitive data from pre-trained models while maintaining model robustness in healthcare applications.

Method: Conducted experiments using the SalUn unlearning model on PathMNIST, OrganAMNIST, and BloodMNIST datasets, and analyzed the impact of data augmentation on unlearning quality.

Result: SalUn achieves performance close to full retraining, demonstrating efficient unlearning capabilities for medical image data.

Conclusion: SalUn provides an effective solution for machine unlearning in medical image classification, offering a practical approach for handling sensitive healthcare data while preserving model performance.

Abstract: Machine unlearning aims to remove private or sensitive data from a pre-trained model while preserving the model’s robustness. Despite recent advances, this technique has not been explored in medical image classification. This work evaluates the SalUn unlearning model by conducting experiments on the PathMNIST, OrganAMNIST, and BloodMNIST datasets. We also analyse the impact of data augmentation on the quality of unlearning. Results show that SalUn achieves performance close to full retraining, indicating an efficient solution for use in medical applications.

[474] A Deep Learning Application for Psoriasis Detection

Anna Milani, Fábio S. da Silva, Elloá B. Guedes, Ricardo Rios

Main category: eess.IV

TL;DR: Comparative study of ResNet50, Inception v3 and VGG19 for psoriasis skin lesion classification, with Inception v3 showing best performance (97.5% accuracy and F1-Score).

Details

Motivation: To evaluate and compare the performance of different CNN architectures for automated diagnosis support of psoriasis from skin lesion images.

Method: Used three CNN models (ResNet50, Inception v3, VGG19) trained and validated on skin lesion images from specialized platforms, with techniques applied to adjust evaluation metrics.

Result: Inception v3 achieved the best performance with 97.5% ± 0.2 accuracy and F1-Score, outperforming ResNet50 and VGG19.

Conclusion: Inception v3 is identified as a valuable tool for supporting psoriasis diagnosis due to its high classification accuracy and F1-Score performance.

Abstract: In this paper a comparative study of the performance of three Convolutional Neural Network models, ResNet50, Inception v3 and VGG19 for classification of skin images with lesions affected by psoriasis is presented. The images used for training and validation of the models were obtained from specialized platforms. Some techniques were used to adjust the evaluation metrics of the neural networks. The results found suggest the model Inception v3 as a valuable tool for supporting the diagnosis of psoriasis. This is due to its satisfactory performance with respect to accuracy and F1-Score (97.5% ${\pm}$ 0.2).

[475] A Closer Look at Edema Area Segmentation in SD-OCT Images Using Adversarial Framework

Yuhui Tao, Yizhe Zhang, Qiang Chen

Main category: eess.IV

TL;DR: Novel weakly-supervised macular edema segmentation method using retinal layer guidance and test-time adaptation to bridge performance gap with fully-supervised approaches.

Details

Motivation: Expert-annotated pixel-level datasets for macular edema analysis are expensive to collect, and current weakly-supervised methods underperform compared to fully-supervised approaches.

Method: Leverages correlation between edema area and retinal layers in SD-OCT images. Enhances adversarial framework with layer-structure-guided post-processing and test-time adaptation strategy to confirm intersection points between edema contour and retinal layers.

Result: Extensive experiments on two public datasets show improved accuracy and robustness in edema area segmentation, narrowing the performance gap between weakly-supervised and fully-supervised models.

Conclusion: Incorporating retinal layer information and test-time adaptation significantly enhances weakly-supervised macular edema segmentation performance while reducing reliance on expensive expert annotations.

Abstract: The development of artificial intelligence models for macular edema (ME) analysis always relies on expert-annotated pixel-level image datasets which are expensive to collect prospectively. While anomaly-detection-based weakly-supervised methods have shown promise in edema area (EA) segmentation task, their performance still lags behind fully-supervised approaches. In this paper, we leverage the strong correlation between EA and retinal layers in spectral-domain optical coherence tomography (SD-OCT) images, along with the update characteristics of weakly-supervised learning, to enhance an off-the-shelf adversarial framework for EA segmentation with a novel layer-structure-guided post-processing step and a test-time-adaptation (TTA) strategy. By incorporating additional retinal layer information, our framework reframes the dense EA prediction task as one of confirming intersection points between the EA contour and retinal layers, resulting in predictions that better align with the shape prior of EA. Besides, the TTA framework further helps address discrepancies in the manifestations and presentations of EA between training and test sets. Extensive experiments on two publicly available datasets demonstrate that these two proposed ingredients can improve the accuracy and robustness of EA segmentation, bridging the gap between weakly-supervised and fully-supervised models.

[476] Understanding Benefits and Pitfalls of Current Methods for the Segmentation of Undersampled MRI Data

Jan Nikolas Morshuis, Matthias Hein, Christian F. Baumgartner

Main category: eess.IV

TL;DR: This paper provides the first unified benchmark comparing 7 approaches for segmenting undersampled MRI data, finding that simple two-stage methods with data-consistency outperform complex specialized one-stage methods.

Details

Motivation: MRI acquisition is time-consuming and costly, but perfect reconstruction may not be necessary when the goal is downstream tasks like segmentation. Existing segmentation methods for accelerated MRI lack unified comparison and evaluation standards.

Method: The study compares 7 approaches on two MRI datasets with multi-coil k-space data and human-annotated segmentation ground-truth. Focuses on comparing one-stage (combined reconstruction+segmentation) vs two-stage (reconstruction followed by segmentation) methods.

Result: Simple two-stage methods that incorporate data-consistency achieved the best segmentation scores, outperforming complex specialized methods developed specifically for this task.

Conclusion: For segmenting accelerated MRI data, straightforward two-stage approaches that maintain data-consistency are more effective than complex unified models, providing practical guidance for clinical applications.

Abstract: MR imaging is a valuable diagnostic tool that allows non-invasive visualization of patient anatomy and pathology with high soft-tissue contrast. However, MRI acquisition is typically time-consuming, leading to patient discomfort and increased costs to the healthcare system. Recent years have seen substantial research effort into the development of methods that allow for accelerated MRI acquisition while still obtaining a reconstruction that appears similar to the fully-sampled MR image. However, for many applications a perfectly reconstructed MR image may not be necessary, particularly when the primary goal is a downstream task such as segmentation. This has led to growing interest in methods that aim to perform segmentation directly on accelerated MRI data. Despite recent advances, existing methods have largely been developed in isolation, without direct comparison to one another, often using separate or private datasets, and lacking unified evaluation standards. To date, no high-quality, comprehensive comparison of these methods exists, and the optimal strategy for segmenting accelerated MR data remains unknown. This paper provides the first unified benchmark for the segmentation of undersampled MRI data comparing 7 approaches. A particular focus is placed on comparing \textit{one-stage approaches}, which combine reconstruction and segmentation into a unified model, with \textit{two-stage approaches}, which utilize established MRI reconstruction methods followed by a segmentation network. We test these methods on two MRI datasets that include multi-coil k-space data as well as a human-annotated segmentation ground-truth. We find that simple two-stage methods that consider data-consistency lead to the best segmentation scores, surpassing complex specialized methods that are developed specifically for this task.
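
"Data-consistency" in MRI reconstruction usually means that the reconstruction must agree with the k-space samples that were actually acquired. One simple form is hard replacement: transform the reconstruction to k-space, overwrite the sampled positions with the measurements, and transform back. The sketch below shows this on a 1D single-coil toy signal with a naive DFT. All of it is an assumption for illustration (the benchmark uses multi-coil 2D data, and the evaluated methods may enforce consistency differently); `network_output` is a stand-in for a learned reconstruction.

```python
import cmath
from typing import List, Sequence


def dft(signal: Sequence[complex], inverse: bool = False) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform, adequate for a toy example."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(signal[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def enforce_data_consistency(reconstruction: Sequence[complex],
                             measured_kspace: Sequence[complex],
                             mask: Sequence[bool]) -> List[complex]:
    """Overwrite sampled k-space entries of a reconstruction with the measurements."""
    predicted = dft(reconstruction)
    merged = [meas if sampled else pred
              for pred, meas, sampled in zip(predicted, measured_kspace, mask)]
    return dft(merged, inverse=True)


truth = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0, 2.0, 0.0]
# Hypothetical sampling pattern: only three of eight k-space positions acquired.
mask = [True, True, False, False, False, False, False, True]
measured = [v if m else 0j for v, m in zip(dft(truth), mask)]
network_output = [0.5] * 8  # stand-in for a learned reconstruction
consistent = enforce_data_consistency(network_output, measured, mask)
```

After this step the result matches the scanner data wherever data exists, and the network only fills in the positions that were skipped.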

[477] RDDM: Practicing RAW Domain Diffusion Model for Real-world Image Restoration

Yan Chen, Yi Wen, Wei Li, Junchao Liu, Yong Guo, Jie Hu, Xinghao Chen

Main category: eess.IV

TL;DR: RDDM is an end-to-end diffusion model that directly restores photo-realistic images from sensor RAW data, bypassing traditional ISP pipelines and achieving superior fidelity compared to sRGB-domain methods.

Details

Motivation: Existing sRGB-domain diffusion models face a dilemma between high fidelity and realistic generation due to processing lossy sRGB inputs and ignoring the accessibility of sensor RAW data in edge devices, leading to suboptimal performance.

Method: Proposes RAW-domain VAE (RVAE) for optimal latent representations, differentiable Post Tone Processing (PTP) for joint RAW-sRGB optimization, scalable degradation pipeline for dataset synthesis, and configurable multi-bayer LoRA module for handling diverse RAW patterns.

Result: Extensive experiments show RDDM’s superiority over state-of-the-art sRGB diffusion methods, producing higher fidelity results with fewer artifacts.

Conclusion: RDDM effectively addresses out-of-distribution issues in RAW domain adaptation and provides an end-to-end solution for direct RAW image restoration, outperforming conventional two-stage ISP + IR pipelines.

Abstract: We present the RAW domain diffusion model (RDDM), an end-to-end diffusion model that restores photo-realistic images directly from the sensor RAW data. While recent sRGB-domain diffusion methods achieve impressive results, they are caught in a dilemma between high fidelity and realistic generation. This is because these models process lossy sRGB inputs and neglect the sensor RAW images that are accessible in many scenarios, e.g., image and video capture on edge devices, which results in sub-optimal performance. RDDM bypasses this limitation by directly restoring images in the RAW domain, replacing the conventional two-stage image signal processing (ISP) + IR pipeline. However, a simple adaptation of pre-trained diffusion models to the RAW domain runs into out-of-distribution (OOD) issues. To this end, we propose: (1) a RAW-domain VAE (RVAE) learning optimal latent representations, and (2) a differentiable Post Tone Processing (PTP) module enabling joint RAW and sRGB space optimization. To compensate for the deficiency in the dataset, we develop a scalable degradation pipeline synthesizing RAW LQ-HQ pairs from existing sRGB datasets for large-scale training. Furthermore, we devise a configurable multi-bayer (CMB) LoRA module handling diverse RAW patterns such as RGGB, BGGR, etc. Extensive experiments demonstrate RDDM’s superiority over state-of-the-art sRGB diffusion methods, yielding higher fidelity results with fewer artifacts.
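To make the RGGB and BGGR patterns concrete, the sketch below samples an RGB image into a single-channel Bayer mosaic, where each pixel keeps only the colour named by its position in a repeating 2x2 tile. This is not the paper's degradation pipeline or its CMB LoRA module; the function name, the nested-list image layout, and the toy pixel values are assumptions made for illustration.

```python
CHANNEL_INDEX = {"R": 0, "G": 1, "B": 2}


def mosaic(rgb, pattern="RGGB"):
    """Sample an H x W image of (R, G, B) tuples into a Bayer mosaic.

    `pattern` lists the colours of the 2x2 tile in row-major order.
    """
    if len(pattern) != 4 or any(c not in CHANNEL_INDEX for c in pattern):
        raise ValueError("pattern must be four letters drawn from R, G, B")
    raw = []
    for y, row in enumerate(rgb):
        raw_row = []
        for x, pixel in enumerate(row):
            # Pick the tile entry for this pixel's position within the 2x2 tile.
            letter = pattern[(y % 2) * 2 + (x % 2)]
            raw_row.append(pixel[CHANNEL_INDEX[letter]])
        raw.append(raw_row)
    return raw


# Toy 2x2 image: one full Bayer tile.
image = [[(10, 20, 30), (11, 21, 31)],
         [(12, 22, 32), (13, 23, 33)]]
rggb = mosaic(image, "RGGB")
bggr = mosaic(image, "BGGR")
```

The two calls differ only in which colour lands at each position of the tile, which is the kind of variation a pattern-configurable module would have to handle.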

Table of Contents

cs.CL

[1] Semantic Attractors and the Emergence of Meaning: Towards a Teleological Model of AGI

[2] LLMs Can’t Handle Peer Pressure: Crumbling under Multi-Agent Social Interactions

[3] Not All Visitors are Bilingual: A Measurement Study of the Multilingual Web from an Accessibility Perspective

[4] Language-Specific Layer Matters: Efficient Multilingual Enhancement for Large Vision-Language Models

[5] Tailored Teaching with Balanced Difficulty: Elevating Reasoning in Multimodal Chain-of-Thought via Prompt Curriculum

[6] Backprompting: Leveraging Synthetic Production Data for Health Advice Guardrails

[7] Integral Transformer: Denoising Attention, Not Too Much Not Too Little

[8] Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning

[9] Can Out-of-Distribution Evaluations Uncover Reliance on Shortcuts? A Case Study in Question Answering

[10] How Reliable are LLMs for Reasoning on the Re-ranking task?

[11] Emotion Omni: Enabling Empathetic Speech Response Generation through Large Language Models

[12] Integrating gender inclusivity into large language models via instruction tuning

[13] Principled Detection of Hallucinations in Large Language Models via Multiple Testing

[14] VibeVoice Technical Report

[15] COMET-poly: Machine Translation Metric Grounded in Other Candidates

[16] The Mind’s Eye: A Multi-Faceted Reward Framework for Guiding Visual Metaphor Generation

[17] What do language models model? Transformers, automata, and the format of thought

[18] A New NMT Model for Translating Clinical Texts from English to Spanish

[19] EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Spoken Dialogue Systems

[20] Scaling Laws for Task-Stratified Knowledge in Post-Training Quantized Large Language Models

[21] Thinking Before You Speak: A Proactive Test-time Scaling Approach

[22] Breaking the Trade-Off Between Faithfulness and Expressiveness for Large Language Models

[23] An Agentic System for Rare Disease Diagnosis with Traceable Reasoning

[24] Knowing or Guessing? Robust Medical Visual Question Answering via Joint Consistency and Contrastive Learning

[25] Attention2Probability: Attention-Driven Terminology Probability Estimation for Robust Speech-to-Text System

[26] Filtering for Creativity: Adaptive Prompting for Multilingual Riddle Generation in LLMs

[27] EMMM, Explain Me My Model! Explainable Machine Generated Text Detection in Dialogues

[28] Beyond Quality: Unlocking Diversity in Ad Headline Generation with Large Language Models

[29] M3HG: Multimodal, Multi-scale, and Multi-type Node Heterogeneous Graph for Emotion Cause Triplet Extraction in Conversations

[30] Chronological Passage Assembling in RAG framework for Temporal Question Answering

[31] ThinkDial: An Open Recipe for Controlling Reasoning Effort in Large Language Models

[32] Harnessing Rule-Based Reinforcement Learning for Enhanced Grammatical Error Correction

[33] Controllable Conversational Theme Detection Track at DSTC 12

[34] LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination

[35] LLM-based Contrastive Self-Supervised AMR Learning with Masked Graph Autoencoders for Fake News Detection

[36] Arrows of Math Reasoning Data Synthesis for Large Language Models: Diversity, Complexity and Correctness

[37] ConfTuner: Training Large Language Models to Express Their Confidence Verbally

[38] ReflectivePrompt: Reflective evolution in autoprompting algorithms

[39] Empowering Computing Education Researchers Through LLM-Assisted Content Analysis

[40] Affective Polarization across European Parliaments

[41] Diverse And Private Synthetic Datasets Generation for RAG evaluation: A multi-agent framework

[42] Interpretable by AI Mother Tongue: Native Symbolic Reasoning in Neural Models

[43] Automatic Prompt Optimization with Prompt Distillation

[44] MovieCORE: COgnitive REasoning in Movies

[45] HiPlan: Hierarchical Planning for LLM-Based Agents with Adaptive Global-Local Guidance

[46] “Where does it hurt?” – Dataset and Study on Physician Intent Trajectories in Doctor Patient Dialogues

[47] It’s All About In-Context Learning! Teaching Extremely Low-Resource Languages to LLMs

[48] Retrieval-Augmented Generation for Natural Language Art Provenance Searches in the Getty Provenance Index

[49] Beyond the Black Box: Integrating Lexical and Semantic Methods in Quantitative Discourse Analysis with BERTopic

[50] Do LVLMs Know What They Know? A Systematic Study of Knowledge Boundary Perception in LVLMs

[51] Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning

[52] Evaluating the Evaluators: Are readability metrics good measures of readability?

[53] Generative Interfaces for Language Models

[54] A Survey on Data Selection for LLM Instruction Tuning

[55] HateDebias: On the Diversity and Variability of Hate Speech Debiasing

[56] Exploring the Robustness of Language Models for Tabular Question Answering via Attention Analysis

[57] ChatGPT Doesn’t Trust Chargers Fans: Guardrail Sensitivity in Context

[58] Recognizing Limits: Investigating Infeasibility in Large Language Models

[59] Label Set Optimization via Activation Distribution Kurtosis for Zero-shot Classification with Generative Models

[60] From Intents to Conversations: Generating Intent-Driven Dialogues with Contrastive Learning for Multi-Turn Classification

[61] Adapting Large Language Models to Log Analysis with Interpretable Domain Knowledge

[62] TL-Training: A Task-Feature-Based Framework for Training Large Language Models in Tool Use

[63] Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements

[64] Bridging the Editing Gap in LLMs: FineEdit for Precise and Targeted Text Modifications

[65] Collaborative Evaluation of Deepfake Text with Deliberation-Enhancing Dialogue Systems

[66] SmartBench: Is Your LLM Truly a Good Chinese Smartphone Assistant?

[67] Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence

[68] An Ontology-Driven Graph RAG for Legal Norms: A Hierarchical, Temporal, and Deterministic Approach

[69] Improving Multilingual Language Models by Aligning Representations through Steering

[70] Truth or Twist? Optimal Model Selection for Reliable Label Flipping Evaluation in LLM-based Counterfactuals

[71] Can Pruning Improve Reasoning? Revisiting Long-CoT Compression with Capability in Mind for Better Reasoning

[72] sudoLLM: On Multi-role Alignment of Language Models

[73] RePPL: Recalibrating Perplexity by Uncertainty in Semantic Propagation and Language Generation for Explainable QA Hallucination Detection

[74] ELSPR: Evaluator LLM Training Data Self-Purification on Non-Transitive Preferences via Tournament Graph Reconstruction

[75] Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models

[76] Measuring Sycophancy of Language Models in Multi-turn Dialogues

[77] Subjective Perspectives within Learned Representations Predict High-Impact Innovation